Paul - 5 months ago 33

R Question

My data is in the following format and includes a particular statistic:

    site  LRStat
    1     3.580728
    2     2.978038
    3     5.058644
    4     3.699278
    5     4.349046

This is just a sample of the data.

I then obtained the null LR distribution by permuting random pairs of data, and used it to plot a histogram with frequency on the y-axis and the LR statistic on the x-axis. How can I determine the critical p-value cut-off points from the null distribution (as shown in the figure below)?

Answer

You now have a sampling distribution of LR values. The `quantile` function in R will give you an estimate of whatever "critical value" you prefer. If, for instance, you decided you wanted the conventional 0.05 "p-value", you could take your data frame, named `LR_df` for illustration, and issue this command:

```
quantile( LR_df[ , 'LRStat'] , 0.95)
```

If you wanted all of the "probabilities" shown on the figure, you would use a vector of values complementary to unity (one minus each p-value). This gives you the `LRStat` values that the corresponding proportion of the sample exceeds.

```
quantile( LR_df[ , 'LRStat'] , c(0.9, 0.95, 0.99, 0.999, 0.9999) )
```
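As a rough sketch of how those cutoffs could be computed and marked on the histogram: the null distribution below is simulated from a chi-square with 1 degree of freedom purely as a stand-in (an assumption, since the real permuted values are not shown), and `p_levels` and `cutoffs` are illustrative names.

```r
# Hypothetical stand-in for the permutation null: 10,000 chi-square(1) draws.
# Replace this with the data frame holding your own permuted LR statistics.
set.seed(1)
LR_df <- data.frame(LRStat = rchisq(10000, df = 1))

# Upper-tail p-values of interest; quantile() needs the complementary probabilities.
p_levels <- c(0.1, 0.05, 0.01, 0.001, 0.0001)
cutoffs <- quantile(LR_df[, "LRStat"], probs = 1 - p_levels)
names(cutoffs) <- paste0("p=", p_levels)
print(cutoffs)

# Draw the histogram and mark each cutoff (skipped in non-interactive sessions).
if (interactive()) {
  hist(LR_df[, "LRStat"], breaks = 100,
       xlab = "LR statistic", main = "Null LR distribution")
  abline(v = cutoffs, lty = 2)
  text(x = cutoffs, y = par("usr")[4] * 0.9,
       labels = names(cutoffs), srt = 90, cex = 0.7)
}
```

The most extreme cutoffs (0.001, 0.0001) rest on very few null values, so they are only as stable as the number of permutations allows.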

A p-value is read off the sampling distribution of a test statistic under a null hypothesis. Under the null hypothesis, the p-values derived from the LR statistics are uniformly distributed. (I know it sounds strange to put it that way, but if you want to argue with the statisticians then get a copy of http://amstat.tandfonline.com/doi/pdf/10.1198/000313008X332421 .) The choice of p-value cutoff will depend on the scientific or business setting. If you were assessing an investment opportunity the cutoff might be 0.15, but if you are trying to find new scientific knowledge, I think it should be stricter (that is, a smaller p-value and a higher critical value). The field of molecular genetics has a lot of junk (i.e. results that fail to reproduce) in its literature because researchers were not strict enough in their statistical methods.
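To make the link between the null sampling distribution and a p-value concrete, here is a minimal sketch that assigns an empirical p-value to each observed site. The null values are again simulated as a placeholder (an assumption), `emp_p` is a hypothetical helper, and the `+1` correction is one common convention rather than anything taken from the question.

```r
# Hypothetical stand-in for the permuted null LR statistics.
set.seed(1)
null_LR <- rchisq(10000, df = 1)

# The five sample rows from the question.
observed <- data.frame(
  site   = 1:5,
  LRStat = c(3.580728, 2.978038, 5.058644, 3.699278, 4.349046)
)

# Empirical p-value: share of null values at least as large as the observed one.
# The +1 in numerator and denominator keeps the estimate from being exactly zero.
emp_p <- function(x, null) (sum(null >= x) + 1) / (length(null) + 1)
observed$p_value <- vapply(observed$LRStat, emp_p, numeric(1), null = null_LR)

# Flag sites against an example cutoff; choose alpha to suit your setting.
alpha <- 0.05
observed$significant <- observed$p_value <= alpha
print(observed)
```

Comparing each p-value with `alpha` is equivalent to comparing the statistic with the matching `quantile` cutoff, apart from small differences due to interpolation and the `+1` correction.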