Tamara Dominguez Poncelas Tamara Dominguez Poncelas - 3 months ago 14
R Question

Set starting position for aggregate function?

I'm counting the number of occurrences for each 100kb section (bins) in my data, using

start
and
end
columns. I have used
aggregate
for that purpose.

Data sample:

chr start end length abs_summit pileup X.log10.pvalue. fold_enrichment
1 chr1 10004 10586 583 10076 288 262.84540 19.37227
2 chr1 28946 29387 442 29309 59 37.01597 8.33123
3 chr1 384456 386620 2165 385473 38 30.88671 10.66657
4 chr1 544777 546003 1227 545467 46 29.03529 7.95905
5 chr1 546962 547834 873 547696 37 21.86056 6.93344
6 chr1 564682 565377 696 565177 2396 801.42346 4.73626
7 chr1 565647 565859 213 565768 2225 677.54956 4.33460
8 chr1 566082 566749 668 566207 2363 767.32574 4.60286
9 chr1 567264 567682 419 567385 2559 900.85590 4.98421
10 chr1 569289 569585 297 569395 1994 535.04041 3.88158
11 chr1 603864 605365 1502 604917 28 20.02823 8.06871
12 chr1 713780 714492 713 714122 80 62.03205 12.10543
13 chr1 726303 726397 95 726331 35 20.22534 6.65208
14 chr1 726902 727015 114 726956 38 23.21246 7.27584
15 chr1 762303 763398 1096 762976 50 28.46482 7.09851
16 chr1 894589 894800 212 894677 58 28.29763 6.05185
17 chr1 912206 912835 630 912372 60 25.23332 5.16066
18 chr1 1013683 1014743 1061 1013926 67 28.39317 5.28122
19 chr1 1051254 1052109 856 1051607 76 45.31027 8.12284
20 chr1 1092833 1093509 677 1092949 50 21.65445 5.17642


Code to make bins each 100kb:

normal_count1 = aggregate(end ~ chr + start%/%100000, data=normal, FUN=length)


Which results in:

chr x100Kb occurrences_norm
1 chr1 0 2
39 chr1 3 1
56 chr1 5 7
67 chr1 6 1
79 chr1 7 4
91 chr1 8 1
102 chr1 9 1


Where
x100kb
is the bin number.

However, I would like to add a new 'reading frame' starting in the position 50000 (so bin number 1 would go from 50kb to 150kb,bin number 2 from 150 to 250, etc). I have tried using
aggregate
again, but this hasn't worked (it's adding 50000 to the start value, I think):

normal_count2 = aggregate(end ~ chr + (start+50000)%/%100000, data=normal, FUN=length)


Is there any way of doing this with
aggregate
or is there a more appropiate function I should use?

42- 42-
Answer

I haven't figured out what the issue you are seeing but instead tried to follow your description of what was wanted. I used a modified dataset with values that spanned the first desired boundary:

normal <-   read.table(text="  chr   start    end   length  abs_summit pileup   X.log10.pvalue. fold_enrichment
1  chr1   10004   10586    583      10076    288       262.84540        19.37227
2  chr1   28946   29387    442      29309     59        37.01597         8.33123
3  chr1  49999  386620   2165     385473     38        30.88671        10.66657
4  chr1  50000  546003   1227     545467     46        29.03529         7.95905
5  chr1  50001  547834    873     547696     37        21.86056         6.93344
6  chr1  564682  565377    696     565177   2396       801.42346         4.73626
7  chr1  565647  565859    213     565768   2225       677.54956         4.33460
8  chr1  566082  566749    668     566207   2363       767.32574         4.60286
9  chr1  567264  567682    419     567385   2559       900.85590         4.98421
10 chr1  569289  569585    297     569395   1994       535.04041         3.88158
11 chr1  603864  605365   1502     604917     28        20.02823         8.06871
12 chr1  713780  714492    713     714122     80        62.03205        12.10543
13 chr1  726303  726397     95     726331     35        20.22534         6.65208
14 chr1  726902  727015    114     726956     38        23.21246         7.27584
15 chr1  762303  763398   1096     762976     50        28.46482         7.09851
16 chr1  894589  894800    212     894677     58        28.29763         6.05185
17 chr1  912206  912835    630     912372     60        25.23332         5.16066
18 chr1 1013683 1014743   1061    1013926     67        28.39317         5.28122
19 chr1 1051254 1052109    856    1051607     76        45.31027         8.12284
20 chr1 1092833 1093509    677    1092949     50        21.65445         5.17642
", header=TRUE)
normal$bin=(normal$start+50000)%/%100000
normal$bin=(normal$start+50000)%/%100000
(normal_count1 = aggregate(end ~ chr + bin, data=normal, FUN=length))
#---
   chr bin end
1 chr1   0   3
2 chr1   1   2
3 chr1   6   6
4 chr1   7   3
5 chr1   8   1
6 chr1   9   2
7 chr1  10   1
8 chr1  11   2

If the problem is the appearance of the "0"-bin then subsetting is needed and the aggregate.formula function does have a 'subset' argument, so it would just be: normal_count1 = aggregate(end ~ chr + bin, data=normal, FUN=length, subset= (start >= 50000) )

Comments