Sam Marshal - 2 months ago
R Question

Error in setting up and cleaning a dataframe R

I am attempting to generate out-of-sample predictions, and I get this message after running the code below:

Error: variable 'dummygen' was fitted with type "numeric" but type "factor" was supplied

I checked the str() output to verify that the two variables I am using are both numeric, and they appear to be. I did a bunch of hunting around on here and think this might be somewhat related, but I haven't been able to get the suggestions to work.

Here is the code I have so far.

library(foreign)
library(plyr)
library(rvest)
library(stringi)
library(purrr)
library(XLConnect)
library(splitstackshape)
library(tidyr)
library(dplyr)

donner_raw <- read.csv("donner.txt", sep="\t", header = FALSE)
colnames(donner_raw) <- c("age_gen", "survive")

donner_raw <- separate(donner_raw, age_gen, into = c("age", "gender"), "(?<=\\d)(?=[A-Za-z])")

dummygen <- as.numeric(donner_raw$gender == "M")
donner_raw <- cbind(donner_raw, dummygen)

donner_raw <- transform(donner_raw, age = as.numeric(age))
donner_raw <- transform(donner_raw, dummygen = as.numeric(dummygen))

logit <- glm(survive ~ age + dummygen,family=binomial(link='logit'),data=donner_raw)

newlogit <- data.frame(age=seq(1,6, length=20), dummygen=("0"))
ooslogit <- predict.glm(logit, newlogit, se.fit=TRUE)


I'm not sure where in the process of what I've done I messed up. Here is a reproducible part of the data.

dput(droplevels(head(donner_raw)))
structure(list(age = structure(c(6L, 4L, 5L, 3L, 2L, 1L), .Label = c("13", "3", "4", "45", "6", "60"), class = "factor"), gender = c("M", "F", "F", "F", "F", "F"), dummygen = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("age", "gender", "survive", "dummygen"), row.names = c(NA, 6L), class = "data.frame")


Additionally, here is the output from sessionInfo():


R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ggplot2_2.1.0 dplyr_0.5.0 tidyr_0.5.1
[4] splitstackshape_1.4.2 data.table_1.9.6 XLConnect_0.2-12
[7] XLConnectJars_0.2-12 purrr_0.2.2 stringi_1.1.1
[10] rvest_0.3.2 xml2_1.0.0 plyr_1.8.4
[13] foreign_0.8-66

loaded via a namespace (and not attached):
[1] Rcpp_0.12.6 magrittr_1.5 munsell_0.4.3 colorspace_1.2-6
[5] R6_2.1.2 httr_1.2.1 tools_3.3.1 grid_3.3.1
[9] gtable_0.2.0 DBI_0.4-1 assertthat_0.1 tibble_1.1
[13] rJava_0.9-8 gender_0.5.1 scales_0.4.0 chron_2.3-47

Answer

Let's simply read and think about the error message:

Error: variable 'dummygen' was fitted with type "numeric" but type "factor" was supplied

This error occurs after the line:

ooslogit <- predict.glm(logit, newlogit, se.fit=TRUE)

(Presumably, at least, because your question isn't very clear about this and provides lots of code that doesn't seem related.)

So R is telling you that when the model was fit the variable dummygen was numeric, but now you've given it a factor.

So let's look:

str(newlogit)
'data.frame':   20 obs. of  2 variables:
 $ age     : num  1 1.26 1.53 1.79 2.05 ...
 $ dummygen: Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...

Yep!
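
The same check can be done programmatically by comparing the classes the model recorded at fit time with the classes in the new data. This is only a minimal sketch: the toy data frame below is hypothetical (made-up values standing in for donner_raw), and stringsAsFactors = TRUE is set explicitly to reproduce the pre-4.0 default behaviour seen in the question.

```r
# Hypothetical toy data standing in for donner_raw (values are made up).
toy <- data.frame(
  survive  = c(0, 1, 1, 0, 1, 0, 1, 0, 1, 0),
  age      = c(45, 13, 30, 25, 6, 60, 22, 40, 50, 10),
  dummygen = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 0)
)
fit <- glm(survive ~ age + dummygen, family = binomial(link = "logit"), data = toy)

# Classes the model saw when it was fitted.
fitted_classes <- attr(terms(fit), "dataClasses")

# Prediction frame built as in the question; the quoted "0" becomes a factor.
newdata <- data.frame(age = seq(1, 6, length.out = 20), dummygen = "0",
                      stringsAsFactors = TRUE)
supplied_classes <- sapply(newdata, class)

# Report every predictor whose class differs between fit and newdata.
shared <- intersect(names(fitted_classes), names(supplied_classes))
mismatch <- shared[fitted_classes[shared] != supplied_classes[shared]]
print(mismatch)
```

Any variable name printed here is one that predict() would reject with the error above.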

So your problem was that you inexplicably created the data frame newlogit by specifying:

newlogit <- data.frame(age=seq(1,6, length=20), dummygen=("0"))

which makes dummygen a character string that data.frame() converts to a factor by default (stringsAsFactors = TRUE in this version of R), so it is not numeric. Either remove the quotes in the first place or convert the column back. For example:

newlogit <- data.frame(age=seq(1,6, length=20), dummygen= 0)

or

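# go through as.character(): as.numeric() on a factor returns the level codes (1 here, not 0)
newlogit$dummygen <- as.numeric(as.character(newlogit$dummygen))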
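
One caveat when converting a factor column: as.numeric() applied directly to a factor returns the internal level codes rather than the labels, so a factor whose only level is "0" converts to 1. A minimal sketch of the difference, assuming the pre-4.0 default where strings become factors (forced here with stringsAsFactors = TRUE):

```r
# Build the prediction frame the way the question did, forcing factor conversion.
newlogit <- data.frame(age = seq(1, 6, length.out = 20), dummygen = "0",
                       stringsAsFactors = TRUE)

# Direct conversion returns the integer level codes of the factor.
codes <- as.numeric(newlogit$dummygen)

# Converting to character first recovers the value the label spells out.
values <- as.numeric(as.character(newlogit$dummygen))

print(unique(codes))   # level code
print(unique(values))  # labelled value
```

Going through as.character() is also harmless if the column is already a character vector, which is what data.frame() produces by default in R 4.0 and later.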