Fredrik Karlsson - 1 year ago
R Question

# L1 penalized regression fails to predict from model

This question may be too package-specific, but I would value input on what could be wrong in my use of the `predict` function on my data set.

The procedure I'm using is the following:

``````
library(penalized)
library(dplyr)   # provides %>% and sample_frac()
# neg contains negative data
# pos contains positive data
``````

Now, the procedure below aims to construct comparable (balanced in terms of positive and negative cases) training and validation data sets.

``````
# 50% negative training set
negSamp <- neg %>% sample_frac(0.5) %>% as.data.frame()
# Negative validation set
negCompl <- neg[setdiff(row.names(neg),row.names(negSamp)),]
# 50% positive training set
posSamp <- pos %>% sample_frac(0.5) %>% as.data.frame()
# Positive validation set
posCompl <- pos[setdiff(row.names(pos),row.names(posSamp)),]
# Combine sets
validat <- rbind(negSamp,posSamp)
training <- rbind(negCompl,posCompl)
``````

Ok, so here we now have two comparable sets.

``````
[1] FALSE  TRUE
> dim(training)
[1] 1061  381
> dim(validat)
[1] 1060  381
> identical(names(training),names(validat))
[1] TRUE
``````

I can fit the model to the training set without a problem (and I have tried a range of `lambda1` values here). However, predicting on the validation data set fails with an odd error message.

``````
> fit <- penalized(VoiceTremor,training[-1],data=training,lambda1=40,standardize=TRUE)
# nonzero coefficients: 13
> fit2 <- predict(fit, penalized=validat[-1], data=validat)
Error in .local(object, ...) :
row counts of "penalized", "unpenalized" and/or "data" do not match
``````

Just to make sure that this is not due to some NA's in the data set:

``````
> identical(validat,na.omit(validat))
[1] TRUE
``````
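Another sanity check in the same spirit is to confirm that the two halves of a split are disjoint and together cover the source data. The following is only a minimal base R sketch on toy data: the `id` column, the `toy` data frame, and the `check_split` helper are hypothetical names introduced for illustration, not part of the original data or of the `penalized` package.

```r
# Toy stand-in for `neg`; the `id` column is a hypothetical unique key.
toy <- data.frame(id = 1:10, x = seq(0.5, 5, by = 0.5))

# Split by sampled row indices, so the complement is exact by construction.
set.seed(1)
idx <- sample(nrow(toy), 5)
half_a <- toy[idx, ]
half_b <- toy[-idx, ]

# Hypothetical helper: are the halves disjoint, and do they cover `full`?
check_split <- function(full, a, b, key = "id") {
  c(disjoint = length(intersect(a[[key]], b[[key]])) == 0,
    complete = setequal(union(a[[key]], b[[key]]), full[[key]]))
}

res <- check_split(toy, half_a, half_b)
print(res)
```

Running the same kind of check on `negSamp`/`negCompl` and `posSamp`/`posCompl` would require a key column that uniquely identifies each row, which is an assumption about the real data.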

Oddly enough, I can generate some new data that is comparable to the real data set:

``````
> data.frame(VoiceTremor="NVT",matrix(rnorm(380000),nrow=1000,ncol=380) ) -> neg
> data.frame(VoiceTremor="VT",matrix(rnorm(380000),nrow=1000,ncol=380) ) -> pos
> dim(pos)
[1] 1000  381
> dim(neg)
[1] 1000  381
``````

and run the procedure above, and then the second fit works!
How come? What could be wrong with my second (not training) data set?

OK, I found the solution to this problem. The problem was in how I constructed the complementary data sets.

``````
neg[setdiff(row.names(neg),row.names(negSamp)),]
``````

does not do the right thing, but

``````
# rownames_to_column() and column_to_rownames() come from the tibble package
neg %>%
  rownames_to_column() %>%
  filter(! rowname %in% row.names(negSamp)) %>%
  column_to_rownames() %>%
  data.frame()
``````

does. With this change, along with using `data.frame` instead of `as.data.frame`, it all works.
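One way a row-name-based complement can go wrong is sketched below in base R on toy data. It assumes that some step between sampling and the `setdiff` call resets the row names; that reset is simulated explicitly with `rownames(samp) <- NULL` and is not a claim about what happened in the original data. The `id` column and all object names are hypothetical.

```r
# Toy data with default row names "1".."10"; `id` is a hypothetical unique key.
df <- data.frame(id = 1:10, x = (1:10) * 1.5)

# Take five rows, then reset the row names. The reset is an assumption here:
# it stands in for any processing step that does not keep the original names.
samp <- df[c(2, 4, 6, 8, 10), ]
rownames(samp) <- NULL            # row names are now "1".."5"

# Row-name-based complement, as in the question.
compl_by_name <- df[setdiff(row.names(df), row.names(samp)), ]

# The complement has the expected size, but it shares rows with the sample.
overlap <- intersect(samp$id, compl_by_name$id)

# Complement keyed on the id column instead of on row names.
compl_by_id <- df[!df$id %in% samp$id, ]
shared <- intersect(samp$id, compl_by_id$id)

print(nrow(compl_by_name))
print(overlap)
print(shared)
```

In this sketch the row counts look right in both cases, so checking `dim()` alone would not reveal the overlap. Keying the complement on an explicit column, or on the sampled indices themselves, does not depend on row names surviving intermediate steps.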
