bee guy - 1 month ago 5
R Question

Normal/rankit scores from non-normal data with ties for regression in [R]

I would like to see whether pathogen levels influenced certain bumblebee colony development parameters such as the number of queen pupae produced.

Because the data are not normally distributed, I would like to apply a rankit transformation, as suggested by Bishara & Hittner (2012).

To define this transformation, let x_r be the ascending rank of x, such
that x_r = 1 for the lowest value of x. The RIN transformation function
used here is

f(x)= Φ^(-1) ((x_r-0.5)/n)

where Φ^(-1) is the inverse normal cumulative distribution function and
n is the sample size (Bliss, 1967).

I could not figure out how to do the rankit transformation in R. I tried this:

my.df$queen.pupae_rankit = qnorm((rank(my.df$queen.pupae)-0.5)/length(my.df$queen.pupae))

However, the ties seem to prevent a normal distribution of the rankit scores:


[Figure: distribution of the rankit scores]

Therefore, I would like to know

  1. How can I get rankit scores from data with ties in R?

  2. Is the qnorm function actually the correct function to get the inverse cumulative distribution function?

  3. Bishara & Hittner (2012) used the rankit scores in Pearson correlations rather than regressions. As far as I understand, in a regression only the independent variable has to be normally distributed. Should I nevertheless also transform the dependent variable, as Bishara & Hittner (2012) did?

PS. I also looked into rntransform {GenABEL}, rankInverseNormalDataFrame {FRESA.CAD} and qNormScore {SuppDists}, but I could not figure out how to use them to get the rankit scores I want.

The data looks like this:

structure(list(queen.pupae = c(0L, 0L, 131L, 9L, 0L, 0L, 24L,
0L, 141L, 1L, 0L, 0L, 0L, 0L, 11L, 45L, 0L, 1L, 0L, 5L, 84L,
5L, 5L, 1L, 0L, 0L, 116L, 0L, 0L, 0L, 0L, 0L, 13L, 92L, 1L, 45L,
120L, 137L, 40L, 100L, 119L, 74L, 8L, 41L, 19L, 1L, 52L, 32L,
123L, 0L, 0L, 5L, 162L, 68L, 10L, 0L, 20L, 229L, 2L, 87L, 219L,
143L, 82L, 1L), worker.adults = c(146L, 185L, 181L, 145L, 244L,
185L, 152L, 114L, 254L, 337L, 210L, 290L, 162L, 186L, 84L, 166L,
295L, 107L, 229L, 203L, 125L, 183L, 246L, 217L, 22L, 106L, 150L,
112L, 45L, 116L, 120L, 152L, 66L, 78L, 65L, 160L, 149L, 247L,
60L, 193L, 255L, 184L, 300L, 41L, 96L, 101L, 37L, 45L, 291L,
353L, 158L, 243L, 146L, 128L, 40L, 390L, 129L, 59L, 77L, 663L,
295L, 498L, 254L, 449L), pathogen1.dna = c(0, 318111.127271693,
0, 68623.2739754326, 1574.45287019555, 34424.6122347574, 2400.58041860919,
43515.3059302234, 4832293.58571446, 8799.05541479988, 0, 28825.2443389828,
0, 1523.13350414953, 8865474.42623986, 0, 0, 521807.198120121,
5174641.18054382, 0, 15904014.4954482, 43560.4440044516, 0, 25389.0067977301,
388996.478514811, 95206.2277317915, 11828659.0129974, 807202.672709897,
5359061.63083682, 0, 21041.1231283436, 31817666.6056002, 4545923.10675542,
10685.8600591283, 16115.7029438609, 0, 67887826.6688623, 16943.6858267549,
1492919.02988919, 49436.4613189687, 711743.102574896, 0, 23651052.7433696,
76175.2980832307, 21563.8738983475, 76520.1382493025, 164861.507683675,
2203260.57078847, 24348427.1595032, 134749.527642678, 276476.323303274,
10329030.0039368, 93822.2696353729, 12872122.4242484, 31680707.4838652,
6701547.09356281, 2369578.88255313, 1413650.78332731, 522467.993244771,
989515.406542198, 3837021.29623798, 1020067.61286839, 37534060.9859563,
43371163.4363934)), .Names = c("queen.pupae", "worker.adults",
"pathogen1.dna"), class = "data.frame", row.names = c(NA, -64L))

  1. You could use a different method for dealing with ties in the ranks. For example:

     rank(my.df$queen.pupae, ties.method = "random")

     Have a look at ?rank for more options.

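  2. Yes. qnorm is the quantile function of the normal distribution, i.e. the inverse of its cumulative distribution function (pnorm), so it is the right function here.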

  3. I don't think so, but I'm not sure. You could try asking on Cross Validated.
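
To make points 1 and 2 concrete, here is a minimal sketch. The helper name rankit and the toy vector x are my own inventions for illustration (they are not from any package, and x is not your colony data). It wraps the formula from the question and lets you pick the tie-handling method that rank() offers.

```r
# Hypothetical helper (not from any package): rank-based inverse normal
# transformation using the rankit offset of 0.5 from the question.
rankit <- function(x, ties.method = "average") {
  r <- rank(x, ties.method = ties.method, na.last = "keep")  # NAs stay NA
  n <- sum(!is.na(x))                                        # sample size without NAs
  qnorm((r - 0.5) / n)
}

# Made-up vector with ties, only to show the behaviour
x <- c(0, 0, 0, 1, 5, 5, 9, 24)

rankit(x)                          # default: tied values share one score
set.seed(1)                        # "random" breaks ties at random, so fix the seed
rankit(x, ties.method = "random")  # every value gets its own score

# qnorm() undoes pnorm(), the normal cumulative distribution function
all.equal(qnorm(pnorm(1.5)), 1.5)
```

With my.df from the question this would be used as my.df$queen.pupae_rankit <- rankit(my.df$queen.pupae, ties.method = "random"). Keep in mind that random tie-breaking gives different scores to observations that are actually equal, and the result changes with the seed. With the default "average" method the many tied zeros still share a single score, so the spike in the distribution remains.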