Yatrosin Yatrosin - 2 months ago 5
R Question

Missing levels in legend using autoplot () + scale_color_manual () functions

I'm trying to plot three different bed files (one with SNP data, another one with deletions and a third one with duplication data), but I cannot manage the legend to contain the values of the three layers unless I put the data altogether in the same file. The problem about combining the three files into a single one is that I'm not able to set ylims for each of the levels of the variable.

This is an example of one of my input files (the one containing the SNP info):

chr10 47000019 47000019 rs150696937 2 +
chr11 1017064 1017064 NA 2 +
chr11 1017280 1017280 rs199539548 2 +
chr11 1017294 1017294 NA 2 +
chr11 1017756 1017756 NA 2 +
chr13 31898038 31898038 rs200460848 2 +
chr13 40298639 40298639 NA 2 +
chr13 48996928 48996928 rs530812916 2 +
chr13 50204777 50204777 rs117251022 2 +
chr14 20216005 20216005 rs566685404 2 +
chr14 20404076 20404076 rs114526346 2 +
chr21 10944668 10944668 rs138088406 2 +


I'm using the score column to specify the kind of variant type that I want to plot in the following way: "1" = Deletion; "2" = SNP and "3" = Duplication.

These are the libraries I'm using:

## Load libraries and required databases
library(ggbio)
data(hg19IdeogramCyto, package = "biovizBase")

library(GenomicRanges)
hg19 <- keepSeqlevels(hg19IdeogramCyto, paste0("chr", c(1:22, "X", "Y")))

biovizBase::isIdeogram(hg19)

data("hg19IdeogramCyto", package = "biovizBase")

data("hg19Ideogram", package = "biovizBase")


I use the Bed2GRanges function available at this website: http://davetang.org/muse/2015/02/04/bed-granges/ to convert my bed file into a GRanges object.

# Required Bed2GRanges function

# BED to GRanges
#
# This function loads a BED-like file and stores it as a GRanges object.
# The tab-delimited file must be ordered as 'chr', 'start', 'end', 'id', 'score', 'strand'.
# The minimal BED file must have the 'chr', 'start', 'end' columns.
# Any columns after the strand column are ignored.
#
# @param file Location of your file
# @keywords BED GRanges
# @export
# @examples
# bed_to_granges('my_bed_file.bed')

bed_to_granges <- function(file){
df <- read.table(file,
header=F,
stringsAsFactors=F)

if(length(df) > 6){
df <- df[,-c(7:length(df))]
}

if(length(df)<3){
stop("File has less than 3 columns")
}

header <- c('chr','start','end','id','score','strand')
names(df) <- header[1:length(names(df))]

if('strand' %in% colnames(df)){
df$strand <- gsub(pattern="[^+-]+", replacement = '*', x = df$strand)
}

library("GenomicRanges")

if(length(df)==3){
gr <- with(df, GRanges(chr, IRanges(start, end)))
} else if (length(df)==4){
gr <- with(df, GRanges(chr, IRanges(start, end), id=id))
} else if (length(df)==5){
gr <- with(df, GRanges(chr, IRanges(start, end), id=id, score=as.character(score)))
} else if (length(df)==6){
gr <- with(df, GRanges(chr, IRanges(start, end), id=id, score=as.character(score), strand=strand))
}
return(gr)
}


I import my bedfile:

## Import bed files as GRanges file
SNP <- bed_to_granges("SNPs.bed")
seqlengths(SNP) <- seqlengths(hg19Ideogram)[names(seqlengths(SNP))]
SNP_dn <- keepSeqlevels(SNP, paste0("chr", c(1:22, "X", "Y")))


I plot my data:

#Plotting SNP_dn according to score column
test <- autoplot(SNP_dn, aes(color = score)) +
scale_color_manual("Variant type",
values = score <- c("black", "red", "blue"),
breaks = c("2","1","3"),
drop = FALSE,
labels = c("SNP", "Deletion", "Duplication")) +
theme(legend.position = "right")

test


Even though I specify the option
drop = FALSE
, I always miss the levels "Deletion" and "Duplication" in the legend.

I have been struggling with this issue for a few days, but I'm not able to figure out how to solve it.

I would like to have a legend that includes the three levels that I have specified with the scale_color_manual () function (i.e. "SNP", "Deletion", "Duplication"), even if there are none of them in the bed file.

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] biovizBase_1.20.0 ggbio_1.20.2 GenomicRanges_1.24.3 GenomeInfoDb_1.8.7 IRanges_2.6.1
[6] S4Vectors_0.10.3 ggplot2_2.1.0 BiocGenerics_0.18.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 lattice_0.20-34 Rsamtools_1.24.0
[4] Biostrings_2.40.2 digest_0.6.10 mime_0.5
[7] R6_2.1.3 plyr_1.8.4 chron_2.3-47
[10] acepack_1.3-3.3 RSQLite_1.0.0 BiocInstaller_1.22.3
[13] httr_1.2.1 zlibbioc_1.18.0 GenomicFeatures_1.24.5
[16] data.table_1.9.6 rpart_4.1-10 Matrix_1.2-7.1
[19] labeling_0.3 splines_3.3.1 BiocParallel_1.6.6
[22] AnnotationHub_2.4.2 stringr_1.1.0 foreign_0.8-67
[25] RCurl_1.95-4.8 biomaRt_2.28.0 munsell_0.4.3
[28] shiny_0.14 httpuv_1.3.3 rtracklayer_1.32.2
[31] htmltools_0.3.5 nnet_7.3-12 SummarizedExperiment_1.2.3
[34] gridExtra_2.2.1 interactiveDisplayBase_1.10.3 Hmisc_3.17-4
[37] XML_3.98-1.4 reshape_0.8.5 GenomicAlignments_1.8.4
[40] bitops_1.0-6 RBGL_1.48.1 grid_3.3.1
[43] xtable_1.8-2 GGally_1.2.0 gtable_0.2.0
[46] DBI_0.5-1 magrittr_1.5 scales_0.4.0
[49] graph_1.50.0 stringi_1.1.1 XVector_0.12.1
[52] reshape2_1.4.1 latticeExtra_0.6-28 Formula_1.2-1
[55] RColorBrewer_1.1-2 ensembldb_1.4.7 tools_3.3.1
[58] dichromat_2.0-0 OrganismDbi_1.14.1 BSgenome_1.40.1
[61] Biobase_2.32.0 survival_2.39-5 AnnotationDbi_1.34.4
[64] colorspace_1.2-6 cluster_2.0.4 VariantAnnotation_1.18.7


Thank you very much,

Best,

Answer

One option is to make sure your factor contains all the levels you want to plot. This will allow drop = FALSE to be effective.

You can do this via factor and the levels argument. For example, if I wanted to add the level 5 to mtcars::cyl:

mtcars$cyl = factor(mtcars$cyl, levels = c("4", "5", "6", "8"))

Another option is to replace breaks with limits in scale_color_manual. This approach does not rely on the actual factor levels in the data (so drop = FALSE doesn't do anything).

scale_color_manual("Variant type",
                values = c("black", "red", "blue"),
                limits = c("2","1","3"),
                labels = c("SNP", "Deletion", "Duplication"))
Comments