L.Sobble - 10 months ago 53

R Question

I have a data frame with 96 observations and 1106 variables.

I would like to run logistic regression on the observations by leaving one out at a time. So for the first set of observations there would be 95 total with the first observation removed, the second set of observations there would be 95 total with the second observation removed, and so forth so that there are 95 sets of observations that each have one observation left out.

In addition, I would like to run each set of these observations on only one variable at a time. After running the regression for 95 observations on one variable, I would like to extract the p-value (leaving out the intercept p-value).

I have been able to do all of this manually, one at a time. However it is very tedious to do this 96 times and I'm sure there must be a way to automate this with a loop or multiple loops.

Here is a demonstration of how I have been doing this manually for 10 observations.

`## Create 10 data frames by removing one observation from each ##`

di.1 <- mainDF [-1,]

di.2 <- mainDF [-2,]

di.3 <- mainDF [-3,]

di.4 <- mainDF [-4,]

di.5 <- mainDF [-5,]

di.6 <- mainDF [-6,]

di.7 <- mainDF [-7,]

di.8 <- mainDF [-8,]

di.9 <- mainDF [-9,]

di.10 <- mainDF [-10,]

## Create data frames to put each p-value result in ##

dt.1 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.2 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.3 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.4 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.5 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.6 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.7 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.8 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.9 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

dt.10 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

## Run logistic regression on each data frame with one one obs. left out ##

## GLM run on one variable at a time##

## Extract p-values and put in separate dfs ##

for (i in 2:1106)

{

formulas <- glm(response ~ di.1[,i], data=di.1, family= "binomial")

dt.1[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.2[,i], data=di.2, family= "binomial")

dt.2[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.3[,i], data=di.3, family= "binomial")

dt.3[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.4[,i], data=di.4, family= "binomial")

dt.4[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.5[,i], data=di.5, family= "binomial")

dt.5[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.6[,i], data=di.6, family= "binomial")

dt.6[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.7[,i], data=di.7, family= "binomial")

dt.7[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.8[,i], data=di.8, family= "binomial")

dt.8[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.9[,i], data=di.9, family= "binomial")

dt.9[i,] <- coef(summary(formulas))[,4]

}

for (i in 2:1106)

{

formulas <- glm(response ~ di.10[,i], data=di.10, family= "binomial")

dt.10[i,] <- coef(summary(formulas))[,4]

}

## Remove intercept p-values ##

dt.1<- dt.1[-c(1)]

dt.2<- dt.2[-c(1)]

dt.3<- dt.3[-c(1)]

dt.4<- dt.4[-c(1)]

dt.5<- dt.5[-c(1)]

dt.6<- dt.6[-c(1)]

dt.7<- dt.7[-c(1)]

dt.8<- dt.8[-c(1)]

dt.9<- dt.9[-c(1)]

dt.10<- dt.10[-c(1)]

## Export data frames, then manually copy and paste them into one CSV ##

write.csv(dt.1, file = "MyData.csv")

write.csv(dt.2, file = "MyData2.csv")

write.csv(dt.3, file = "MyData3.csv")

write.csv(dt.4, file = "MyData4.csv")

write.csv(dt.5, file = "MyData5.csv")

write.csv(dt.6, file = "MyData6.csv")

write.csv(dt.7, file = "MyData7.csv")

write.csv(dt.8, file = "MyData8.csv")

write.csv(dt.9, file = "MyData9.csv")

write.csv(dt.10, file = "MyData10.csv")

I would like to know how I could do all of this work without having to go through each observation one at a time.

Here is a chunk of the data that I am using:

`Response X1 X2 X3 X4 X5 X6 X7 X8 X9 X10`

P1 N 1 1 1 0 1 0 1 0 2 2

P2 N 2 1 1 0 2 2 1 2 2 2

P3 N 2 1 2 1 1 0 1 1 0 1

P4 Y 1 1 2 0 1 0 0 1 1 1

P5 N 2 2 1 1 1 0 0 0 1 1

P6 N 2 1 2 1 1 0 0 0 2 1

P7 Y 2 1 1 0 2 0 0 0 2 0

P8 Y 2 1 1 0 2 0 0 1 0 2

P9 N 1 1 1 0 2 0 0 0 1 0

P10 N 2 1 2 1 1 0 1 0 0 2

Thank you very much for your time!

Answer Source

As I said earlier in my comment, I won't use `glm`

and `summary.glm`

as that will be too slow for your task, given that you are going to fit `96 * 1106`

GLM. I will use `glm.fit`

, and compute p-values for regression coefficients myselft. The function `f`

below does this. It takes a 1D-vector `x`

as covariate (no `NA`

allowed) and another 1D-vector `y`

as response (no `NA`

allowed). Since Logistic regression is done, it is required that `y`

is a factor of two levels (or 0-1 binary values).

```
f <- function (x, y) {
## call `glm.fit`
fit <- glm.fit(cbind(1,x), y, family = binomial())
## estimated regression coefficients
beta <- unname(fit$coefficients)
## since there are only two coefficients, I don't bother using `chol2inv`
## then extract square root of diagonals for standard errors
se <- sqrt(diag(chol2inv(fit$qr$qr, size = fit$qr$rank)))
## deal with possible rank-deficient case
if (length(se) < 2L) se <- c(se, NA_real_)
## z-score
z <- beta / se
## p-value (0.05 significance level)
2 * pnorm(-abs(z))
}
```

We can have a test on this function, if you don't trust it for correctness. Take your example data frame `dat`

as an example and we do `Response ~ X1`

:

```
dat <-
structure(list(Response = structure(c(1L, 1L, 1L, 2L, 1L, 1L,
2L, 2L, 1L, 1L), .Label = c("N", "Y"), class = "factor"), X1 = c(1L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), X2 = c(1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L), X3 = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L), X4 = c(0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L), X5 = c(1L,
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), X6 = c(0L, 2L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), X7 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L), X8 = c(0L, 2L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L), X9 = c(2L,
2L, 0L, 1L, 1L, 2L, 2L, 0L, 1L, 0L), X10 = c(2L, 2L, 1L, 1L,
1L, 1L, 0L, 2L, 0L, 2L)), .Names = c("Response", "X1", "X2",
"X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10"), row.names = c("P1",
"P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"), class = "data.frame")
## code response into factor
dat[[1]] <- factor(dat[[1]])
## call `f`
f(dat[[2]], dat[[1]])
# [1] 0.8559137 0.8804148
## call `glm` + `summary.glm`
coef(summary(glm(Response ~ X1, data = dat, family = binomial())))
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -0.4700036 2.588435 -0.1815783 0.8559137
#X1 -0.2231436 1.483239 -0.1504434 0.8804148
```

So p-values match!

We now need another function `g`

to organize what you plan to do as a double-nested loop. The outer loop controls "leave-one-out", while the inner loop is arranged by `lapply`

to loop through data frame columns. In the end of each iteration of outer loop, the resulting data frame of p-values are written to a ".csv" file.

```
g <- function (dat) {
## convert response to factor (if it is not readily is)
y <- as.factor(dat[[1]])
## leave-one-out
for (i in 1:nrow(dat)) {
## covariates data frame
covariates <- dat[-i, -1]
## response vector
response <- y[-i]
## call `f` to get a data frame of p-values
result <- as.data.frame(lapply(covariates, f, y = response))
## write data frame to file
write.csv(result, file = paste0(i,".csv"), row.names = FALSE)
}
}
```

When I run `g(dat)`

, I get ten ".csv" files as expected.

**Follow-up:**

I am still grasping how to do loops in R so seeing this is very helpful. In applying this to my data, would I put the name of the data frame I'd like to use in the

`dat`

? And do I need to specify the data frame in the`glm.fit`

function portion?

No. `glm.fit`

(and `lm.fit`

, too) has no formula interface. It takes only numerical matrices without missing values for direct matrix computations to get estimation. This is exactly why it is faster than `glm`

. It does not take and digest a data frame. You can read `?glm.fit`

to see what arguments it takes.

Your data frame `dat`

does not have to have column names. As said above, we have in nowhere used formula interface. The function `g`

assumes that the first column of `dat`

is response, while all other columns are independent variables. Also, `g`

does not check missing values / `NA`

, so you should ensure that `dat`

has no incomplete cases. These are only requirements for `g`

and `f`

.

The only advantage of having column names in `dat`

, is that those column names will be written as header in the exported ".csv" files, which may increase readability.