agatha agatha - 2 months ago 20
R Question

Peak detection in Manhattan plot

The attached Manhattan plot shows chromosome positions along the genome on the x axis and -log10(p) on the y axis, where p is the p-value associated with the variant at that position.

I generated it with the following R code, which uses the gap package:

library(gap)
# Number of markers on each of the 22 chromosomes
affy <- c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273,
          24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
# Simulated data: one uniform random p-value per marker
test <- data.frame(chr = rep(1:n.chr, affy), pos = 1:n.markers, p = runif(n.markers))
# par() returns the previous settings, so they can be restored afterwards
oldpar <- par(cex = 0.6)
# Cycle through seven colors so that each chromosome gets one
colors <- rep(c("red", "blue", "green", "cyan", "yellow", "gray", "magenta"), length.out = n.chr)
mhtplot(test, control = mht.control(colors = colors), pch = 19, bg = colors)
par(oldpar)
> head(test)
  chr pos          p
1   1   1 0.79296584
2   1   2 0.96675136
3   1   3 0.43870076
4   1   4 0.79825513
5   1   5 0.87554143
6   1   6 0.01207523


I would like to get the coordinates of the peaks in the plot that lie above a given -log10(p) threshold.

Answer

If you want the indices of the values above the 99th percentile:

# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)

...and then extract the matching rows from the original data frame test:

peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
    chr pos           p    log_p
5     1   5 0.002798126 2.553133
135   1 135 0.003077302 2.511830
211   1 211 0.003174833 2.498279
586   1 586 0.005766859 2.239061
598   1 598 0.008864987 2.052322
790   1 790 0.001284629 2.891222
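
The same filter can be written with a fixed cutoff instead of a percentile. The following is only a sketch: the toy data frame and the cutoff of 3 are made up for illustration, and the column names simply mirror those of test above.

```r
# Made-up toy data in the same shape as `test` (chr, pos, p)
toy <- data.frame(
  chr = c(1, 1, 1, 2, 2, 2),
  pos = 1:6,
  p   = c(0.5, 0.0001, 0.2, 0.00001, 0.03, 0.0005)
)
toy$log_p <- -log10(toy$p)

# Hypothetical fixed cutoff on the -log10(p) scale
threshold <- 3

# Rows above the cutoff; chr and pos are the coordinates of the hits
hits <- toy[toy$log_p > threshold, c("chr", "pos", "log_p")]
hits
```

Here rows 2, 4 and 6 are returned, since only their -log10(p) values exceed 3.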

You can use this with any threshold. Note that I have not calculated the first derivative; see this question for some pointers:

How to calculate first derivative of time series

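After calculating the first derivative, you can find the peaks by looking for points in the series where the first derivative is (almost) zero. Strictly, a peak is a point where the derivative changes sign from positive to negative, since a zero derivative also occurs at a trough. After identifying these peaks, you can check which ones are above the threshold.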
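
As a rough sketch of the derivative idea, the first difference from diff() can stand in for the derivative, and a peak is then a point where the difference switches from positive to negative. The helper name find_peaks and the example vector are hypothetical and not part of the gap package or of the original data.

```r
# Hypothetical helper: indices of local maxima of y that exceed `threshold`
find_peaks <- function(y, threshold) {
  s <- sign(diff(y))  # sign of the first difference, length(y) - 1
  # Index i is a peak if y rises into it (s[i - 1] > 0) and falls after it (s[i] < 0)
  idx <- which(s[-length(s)] > 0 & s[-1] < 0) + 1
  idx[y[idx] > threshold]
}

# Made-up -log10(p) values along one chromosome
y <- c(0.1, 0.5, 2.4, 1.0, 0.3, 0.9, 3.1, 2.0, 0.2)
find_peaks(y, threshold = 2)  # 3 7
```

For real data this would probably be applied per chromosome, for example to test$log_p within each value of chr. With unsmoothed, noisy values a large share of points are local maxima, so some smoothing beforehand may be needed. Flat-topped peaks with tied values are not detected by this simple version.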