Raag Agrawal - 1 month ago
R Question

How to iterate GLMs in H2O

My dataset looks like this:

rownum | genes | samples | y | tissue
1      | A     | a       | 1 | Muscle
2      | B     | a       | 1 | Brain
3      | C     | a       | 1 | Brain
4      | D     | a       | 0 | Muscle
5      | E     | a       | 0 | Brain
6      | F     | a       | 0 | Muscle


I want to create one H2OFrame per tissue, like this:

Muscle:

rownum | genes | samples | y | tissue
1      | A     | a       | 1 | Muscle
2      | D     | a       | 0 | Muscle
3      | F     | a       | 0 | Muscle


Brain:

rownum | genes | samples | y | tissue
1      | B     | a       | 1 | Brain
2      | C     | a       | 1 | Brain
3      | E     | a       | 0 | Brain


While I am currently doing it manually, that becomes difficult when I add more tissues to the dataset.

I then want to pass each of those H2OFrames to h2o.glm and save each resulting model in turn.

"INSERT TISSUE NAME HERE" = h2o.glm(y = "y", x =
c("genes","samples"),
training_frame = ITERATE H2O FRAMES HERE, family = 'poisson')


and then save the model

INSERT TISSUE NAME HERE <- h2o.saveModel(object = INSERT TISSUE NAME HERE, force = TRUE)


I would appreciate any help or advice you might have. I know I could use interaction terms in a single GLM instead, but for now I would like to fit a separate model per tissue.

Answer

Since you did not provide the data directly, I copied your example from above as an R data.frame.

# Load H2O and start (or connect to) a local cluster
library(h2o)
h2o.init()

# Example data as an R data.frame
df <- data.frame(genes = c("A","B","C","D","E","F"),
                 samples = c("a","a","a","a","a","a"),
                 y = c(1,1,1,0,0,0),
                 tissue = c("Muscle","Brain","Brain","Muscle","Brain","Muscle"))

# Convert R data.frame to H2OFrame
hf <- as.h2o(df)

However, I assume you have this data in a CSV on your computer, so in reality, what you'd do is this:

# Load data from disk directly into H2O cluster
hf <- h2o.importFile("tissue_samples.csv")
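
After importing, it may be worth checking how H2O parsed each column, since the parser guesses column types and text columns can come back as strings rather than categoricals. The following is a minimal sketch, assuming a local cluster; the temporary CSV is a stand-in for tissue_samples.csv that I write out here only so the example is self-contained.

```r
library(h2o)
h2o.init()

# Toy data from the question, written to a temporary CSV
# (stand-in for the real tissue_samples.csv; the path is an assumption)
df <- data.frame(genes = c("A","B","C","D","E","F"),
                 samples = c("a","a","a","a","a","a"),
                 y = c(1,1,1,0,0,0),
                 tissue = c("Muscle","Brain","Brain","Muscle","Brain","Muscle"))
csv_path <- file.path(tempdir(), "tissue_samples.csv")
write.csv(df, csv_path, row.names = FALSE)

# Load the file into the H2O cluster and inspect the parsed column types
hf <- h2o.importFile(csv_path)
h2o.describe(hf)

# Convert the text columns to categoricals explicitly
for (col in c("genes", "samples", "tissue")) {
  hf[, col] <- as.factor(hf[, col])
}
```

If the columns are already reported as enum, the conversion loop changes nothing and can be dropped.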

Now that you have the data in an H2OFrame, there are only a few more steps:

# List of unique tissue types
tissue_types <- as.list(h2o.unique(hf$tissue))

# Create list of frames (one for each tissue type)
frames <- lapply(tissue_types, function(t) hf[(hf[,"tissue"] %in% t),])

# Set up h2o.glm arguments
x <- c("genes", "samples")
y <- "y"

# List of glms (one for each tissue type)
glms <- lapply(frames, function(fr) h2o.glm(x = x, y = y,
                       family = "poisson", training_frame = fr))

# Save the models
model_names <- sapply(glms, function(m) h2o.saveModel(m, path = "/Users/me/", force = TRUE))

# Look at model names
print(model_names)
# [1] "/Users/me/GLM_model_R_1497937770060_222"
# [2] "/Users/me/GLM_model_R_1497937770060_223"
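
The saved models above get auto-generated IDs, so the tissue a model belongs to is not visible from its file name. Since the question asked for models named after the tissue, here is a hedged variant of the same loop that keys everything by tissue name. It is a sketch, not the only way to do this: the `glm_` prefix, the `out_dir` location, and the variable names are my own choices, and it assumes a local cluster so that the save path is on the same machine as R.

```r
library(h2o)
h2o.init()

# Toy data from the question; stringsAsFactors = TRUE so text columns become categoricals
df <- data.frame(genes = c("A","B","C","D","E","F"),
                 samples = rep("a", 6),
                 y = c(1,1,1,0,0,0),
                 tissue = c("Muscle","Brain","Brain","Muscle","Brain","Muscle"),
                 stringsAsFactors = TRUE)
hf <- as.h2o(df)

# Pull the unique tissue labels back into R as a plain character vector
tissue_types <- as.character(as.data.frame(h2o.unique(hf$tissue))[[1]])

# Hypothetical output directory (an assumption; use any path the cluster can write to)
out_dir <- file.path(tempdir(), "tissue_glms")
dir.create(out_dir, showWarnings = FALSE)

glms <- list()
model_paths <- character(0)
for (tt in tissue_types) {
  # Rows for this tissue only
  fr <- hf[hf$tissue == tt, ]
  # model_id sets the model's name in the cluster and, in turn, its file name on save
  glms[[tt]] <- h2o.glm(x = c("genes", "samples"), y = "y",
                        training_frame = fr, family = "poisson",
                        model_id = paste0("glm_", tt))
  model_paths[tt] <- h2o.saveModel(glms[[tt]], path = out_dir, force = TRUE)
}

# Named vector: tissue name -> path of the saved model
print(model_paths)

# A saved model can later be reloaded by its path
muscle_glm <- h2o.loadModel(model_paths[["Muscle"]])
```

Keeping `glms` and `model_paths` keyed by tissue means adding a new tissue to the data needs no code changes; for example, `glms[["Brain"]]` returns the Brain model directly.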