Paul Paul - 3 months ago 19
R Question

Estimating p-value thresholds from a distribution plot

My data is in the following format and includes a particular statistic:

site LRStat
1 3.580728
2 2.978038
3 5.058644
4 3.699278
5 4.349046

This is just a sample of the data.

I then obtained the null LR distribution by permuting random pairs of data, and used it to plot a histogram with frequency on the y-axis and the LR statistic on the x-axis. How can I determine the critical p-value cut-off points based on the null distribution (as shown in the figure below)?

[Figure: histogram of the null LR distribution (x-axis: LR statistic, y-axis: frequency) with p-value cut-off points marked]

42- 42-

You now have a sampling distribution of LR values under the null. The quantile function in R will give you an estimate of whatever "critical value" you prefer. If, for instance, you decided you wanted the conventional 0.05 "p-value", you could take the data frame holding the permuted (null) values, named LR_df for illustration, and issue this command:

quantile(LR_df[, 'LRStat'], 0.95)

If you wanted all of the "probabilities" shown on the figure, you would use a vector of their complements to unity (1 - p). This gives you the LRStat values above which the given proportion of the sample lies.

quantile( LR_df[ , 'LRStat'] , c(0.9, 0.95, 0.99, 0.999, 0.9999) ) 
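As a minimal sketch of the two calls above, assuming the permuted null values sit in a data frame called LR_df: the chi-squared draws below are only a stand-in for real permutation output, and p_values and cutoffs are names made up for illustration. It also marks the cut-offs on the histogram, which is roughly what the figure in the question seems to show.

```r
# Stand-in null distribution (an assumption): replace with your permuted LR statistics.
set.seed(1)
LR_df <- data.frame(LRStat = rchisq(100000, df = 1))

# Upper-tail p-values of interest; quantile() needs their complements (1 - p).
p_values <- c(0.1, 0.05, 0.01, 0.001, 0.0001)
cutoffs  <- quantile(LR_df[, "LRStat"], probs = 1 - p_values)
names(cutoffs) <- paste0("p=", p_values)
print(cutoffs)

# Draw the null histogram and mark each cut-off with a dashed vertical line.
hist(LR_df[, "LRStat"], breaks = 100,
     xlab = "LR statistic", ylab = "Frequency",
     main = "Null LR distribution")
abline(v = cutoffs, lty = 2)
```

The more extreme cut-offs (0.001, 0.0001) rest on very few permuted values, so they are only as stable as the number of permutations allows.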

The p-values are just tail areas of the sampling distribution of a test statistic under a null hypothesis. Your null hypothesis in this case is that the LRStats are uniformly distributed. (I know it sounds strange to put it that way, but if you want to argue with the statisticians then get a copy of .) The choice of p-value for the cutoff will depend on the scientific or business setting. If you were assessing an investment opportunity the cutoff might be 0.15, but if you are trying to find new scientific knowledge, I think it should be lower (i.e. stricter). The field of molecular genetics has a lot of junk (i.e. results that fail to reproduce) in its literature because researchers were not strict enough in their statistical methods.
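To make the link between the null distribution and a p-value concrete, here is a hedged sketch that turns each observed LRStat into an empirical (permutation) p-value: the share of null values at least as large as the observed one. The null_LR vector is simulated as a placeholder, obs reuses the five sample rows from the question, and empirical_p is a hypothetical helper. The +1 in the numerator and denominator is a common convention to avoid reporting a p-value of exactly zero; it is an assumption of this sketch, not something taken from the question.

```r
# Placeholder null distribution (an assumption): substitute your permuted LR statistics.
set.seed(1)
null_LR <- rchisq(100000, df = 1)

# Observed statistics: the sample rows shown in the question.
obs <- data.frame(
  site   = 1:5,
  LRStat = c(3.580728, 2.978038, 5.058644, 3.699278, 4.349046)
)

# Hypothetical helper: upper-tail empirical p-value with a +1 correction.
empirical_p <- function(x, null_values) {
  (sum(null_values >= x) + 1) / (length(null_values) + 1)
}

# One p-value per observed site.
obs$p_emp <- vapply(obs$LRStat, empirical_p, numeric(1), null_values = null_LR)
print(obs)
```

A site is then called significant at a chosen cutoff when its p_emp falls below it, which is equivalent to its LRStat exceeding the matching quantile of the null distribution.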