Vipin - 1 year ago
R Question

Spread a string to multiple columns in R

I am trying to one-hot encode the character data frame below in R.

x1 <- c('')
x2 <- c('A1,A2')
x3 <- c('A2,A3,A4')
test <- as.data.frame(rbind(x1,x2,x3))


I am trying to bring the data to the format:

x1 <- c(0,0,0,0)
x2 <- c(1,1,0,0)
x3 <- c(0,1,1,1)
result <- as.data.frame(rbind(x1,x2,x3))
names(result) = c('A1','A2','A3','A4')


The delimiter is a comma, and I can split on it using:

test$V1 = as.character(test$V1)
split_list = strsplit(test$V1, ",")


This gives me a list of character vectors, which cannot be coerced directly into a data frame. Is there a better way of doing this? I tried out "https://www.rdocumentation.org/packages/CatEncoders/versions/0.1.0/topics/OneHotEncoder.fit", but that package spread a single column rather than multiple columns as needed in this case.

Answer Source

A custom function to spread the unique string values into columns:

x1 <- c('')
x2 <- c('A1,A2')
x3 <- c('A2,A3,A4')
test <- data.frame(col1=rbind(x1,x2,x3), stringsAsFactors = FALSE) # test$col1 is a character column

cast_variables <- function(df, variable){
  df[[variable]][df[[variable]] == ""] <- "missing" # handling missingness in the target column
  # split each row on commas; spaces are dropped so "A1,A2" and "A1, A2" behave the same
  tokens <- strsplit(gsub(" ", "", as.character(df[[variable]])), ",")
  # setdiff() also works when no "missing" value is present
  # (negative indexing with an empty grep() result would drop every column)
  new_columns <- setdiff(sort(unique(unlist(tokens))), "missing")
  for (col in new_columns){
    # exact token match, so "A1" does not also match "A10"
    df[[col]] <- vapply(tokens, function(tk) as.numeric(col %in% tk), numeric(1))
  }
  return(df)
}

test <- cast_variables(test, "col1")
print(test)
#       col1 A1 A2 A3 A4
#x1  missing  0  0  0  0
#x2    A1,A2  1  1  0  0
#x3 A2,A3,A4  0  1  1  1
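
If only the indicator columns are needed, in the format shown in the question, the same idea can be written without a loop. The following is a minimal base-R sketch, not a tested general solution: the names `x`, `tokens` and `categories` are made up for illustration, and it assumes the input is a named character vector with no spaces around the commas and at least two distinct values.

```r
# Input as a named character vector (names become the row names of the result)
x <- c(x1 = "", x2 = "A1,A2", x3 = "A2,A3,A4")

# One character vector of tokens per row; the empty string yields character(0)
tokens <- strsplit(x, ",", fixed = TRUE)

# All distinct values, sorted, become the column names
categories <- sort(unique(unlist(tokens)))

# For each row, mark which categories are present (1) or absent (0).
# vapply() returns one column per row here, so transpose afterwards.
indicator <- vapply(tokens,
                    function(tk) as.numeric(categories %in% tk),
                    numeric(length(categories)))
result <- as.data.frame(t(indicator))
names(result) <- categories

print(result)
```

Because `%in%` compares whole tokens, a value such as "A1" is not confused with "A10", which a pattern match with `grepl()` would do.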