Stephen Williams Stephen Williams - 9 months ago 148
R Question

R: "label" a row based on conditions from another data.table

I have a data.table (A) that is over 100,000 rows long. There are 3 columns.


chrom start end
1: chr1 6484847 6484896
2: chr1 6484896 6484945
3: chr1 6484945 6484994
4: chr1 6484994 6485043
5: chr1 6485043 6485092
---

183569: chrX 106893605 106893654
183570: chrX 106893654 106893703
183571: chrX 106893703 106893752
183572: chrX 106893752 106893801
183573: chrX 106893801 106894256


I'd like to generate a new column named "gene" that provides a label for each row based annotations from another data.table which has ~90 rows (B). Seen below:


chrom start end gene
1: chr1 6484847 6521004 ESPN
2: chr1 41249683 41306124 KCNQ4
3: chr1 55464616 55474465 BSND
42: chrX 82763268 82764775 POU3F4
43: chrX 100600643 100603957 TIMM8A
44: chrX 106871653 106894256 PRPS1


If the row start value in data.table A is within the row start and end values of data.table B I need the row in A to be labeled with the correct gene accordingly.

For example the resulting complete data.table A would be


chrom start end gene
1: chr1 6484847 6484896 ESPN
2: chr1 6484896 6484945 ESPN
3: chr1 6484945 6484994 ESPN
4: chr1 6484994 6485043 ESPN
5: chr1 6485043 6485092 ESPN
---

183569: chrX 106893605 106893654 TIMM8A
183570: chrX 106893654 106893703 TIMM8A
183571: chrX 106893703 106893752 TIMM8A
183572: chrX 106893752 106893801 TIMM8A
183573: chrX 106893801 106894256 TIMM8A


I've attempted some nested loops to do this but that seems like it would take WAY too long. I think there must be a way to do this with the data.table package but I can't seem to figure it out.

Any and all suggestions would be greatly appreciated.

Answer Source

While it's certainly possible to do this in base R (or potentially using data.table), I would highly recommend using GenomicRanges; it's a very powerful and flexible R/Bioconductor library that's been designed for these kind of tasks.

Here is an example using GenomicRanges::findOverlaps:

# Sample data
df1 <- read.table(text =
    "chrom     start      end
     chr1   6484847   6484896
     chr1   6484896   6484945
     chr1   6484945   6484994
     chr1   6484994   6485043
     chr1   6485043   6485092", sep = "", header = T, stringsAsFactors = F);

df2 <- read.table(text =
    "chrom     start       end     gene
     chr1   6484847   6521004     ESPN
     chr1  41249683  41306124     KCNQ4
     chr1  55464616  55474465     BSND
     chrX  82763268  82764775     POU3F4
     chrX 100600643 100603957     TIMM8A
     chrX 106871653 106894256     PRPS1", sep = "", header = TRUE, stringsAsFactors = F);

# Convert to GRanges objects
gr1 <- with(df1, GRanges(chrom, IRanges(start = start, end = end)));
gr2 <- with(df2, GRanges(chrom, IRanges(start = start, end = end), gene = gene));

# Find features from gr1 that overlap with gr2
m <- findOverlaps(gr1, gr2);

# Add gene annotation as metadata to gr1
mcols(gr1)$gene[queryHits(m)] <- mcols(gr2)$gene[subjectHits(m)];
gr1;
#GRanges object with 5 ranges and 1 metadata column:
#      seqnames             ranges strand |        gene
#         <Rle>          <IRanges>  <Rle> | <character>
#  [1]     chr1 [6484847, 6484896]      * |        ESPN
#  [2]     chr1 [6484896, 6484945]      * |        ESPN
#  [3]     chr1 [6484945, 6484994]      * |        ESPN
#  [4]     chr1 [6484994, 6485043]      * |        ESPN
#  [5]     chr1 [6485043, 6485092]      * |        ESPN
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download