A D A D - 3 months ago 18
R Question

Understanding an RLE coverage value

Using R and bioconductor.

I'm not sure how to interpret an integer Rle like the one you get from functions such as coverage(), for example:

integer-Rle of length 3312 with 246 runs
Lengths: 25 34 249 16 7 11 16 ... 2 32 2 26 34 49
Values : 0 1 0 1 2 3 2 ... 1 2 1 0 1 0


Okay, so I get that it represents coverage of one range by other ranges, in this case reads of an experiment over a given range. What do the 'runs' mean? What about the 'Lengths' and 'Values'? I thought that maybe Lengths represent a position and Values represent the number of times it's covered, but then why would there be multiples of the same position, such as 2 above? Why would they be out of order?

I ask because I'm using

sum(coverage)


to compare the coverage of one range to another of a different length and I was wondering if that was appropriate.

Answer

Probably it's better to ask about Bioconductor packages on the Bioconductor support site.

The interpretation is that there is a run of 25 nucleotides with 0 coverage, then a run of 34 nucleotides with coverage 1 (i.e., a single read), then another run of 249 nucleotides with no coverage; after that things get more interesting as multiple reads overlap positions. So Lengths are run lengths, not positions, which is why the same number can appear more than once and in any order. From the summary line at the top of the output, the coverage vector spans 3312 nucleotides in 246 runs, maybe from a single transcript? If you were to

plot(as.integer(coverage))

you'd get a quick plot of how coverage varies along the length of the transcript.
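To make the run-length idea concrete, here is a minimal sketch in base R, so it runs without Bioconductor. It uses only the first seven runs printed in the question; the variable names are made up for illustration. Base R's rle() is a plain-list analogue of the Bioconductor Rle class, not the same object.

```r
# First seven runs copied from the printed Rle in the question
run_lengths <- c(25, 34, 249, 16, 7, 11, 16)
run_values  <- c(0, 1, 0, 1, 2, 3, 2)

# Expand the runs into one coverage value per nucleotide position
per_base <- rep(run_values, times = run_lengths)

length(per_base)   # number of positions described by these seven runs
per_base[25:27]    # last position of the 0 run, then the start of the run of 1s

# Base R's rle() compresses the per-position vector back into runs
compressed <- rle(per_base)
compressed$lengths
compressed$values
```

Each (length, value) pair says "the next `length` positions all have coverage `value`", so the position of a run is the cumulative sum of the lengths before it.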

Maybe sum(coverage) is appropriate; a more usual metric is to count reads rather than coverage, e.g., with GenomicAlignments::summarizeOverlaps() (formerly in GenomicRanges), illustrated in this DESeq2 workflow in the context of RNA-seq.
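On comparing ranges of different lengths: the sum of a run-length encoded coverage vector is the sum of length times value over all runs, so it grows with the length of the range. One option, sketched below in base R, is to divide by the number of positions to get a mean per-base coverage. Range A is the first seven runs from the question; range B is entirely hypothetical, invented only to show the effect. Whether a mean is the right metric for your experiment is a separate question.

```r
# Range A: first seven runs from the question
a_lengths <- c(25, 34, 249, 16, 7, 11, 16)
a_values  <- c(0, 1, 0, 1, 2, 3, 2)

# Range B: hypothetical shorter range with deeper coverage (made-up numbers)
b_lengths <- c(10, 40)
b_values  <- c(0, 3)

# Sum of coverage = total covered bases = sum(length * value)
total_a <- sum(a_lengths * a_values)
total_b <- sum(b_lengths * b_values)

# Number of positions in each range
n_a <- sum(a_lengths)
n_b <- sum(b_lengths)

# Mean per-base coverage removes the dependence on range length
mean_a <- total_a / n_a
mean_b <- total_b / n_b

c(total_a = total_a, total_b = total_b)
c(mean_a = mean_a, mean_b = mean_b)
```

Here the two sums are close even though B is covered much more deeply per base, which is the situation where the raw sum can mislead. On an actual Bioconductor Rle, sum() and mean() should work directly on the object without expanding it.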

Comments