Iris Zhanfu - 3 months ago
R Question

How to split multiple values in one cell into several boolean feature columns? (R/Python)

I have a .csv dataset in which one column holds multiple comma-separated values.
I would like to split these values into separate feature columns, each holding a boolean that indicates whether a given row has that feature, e.g.:

| year_built | amenity                          |
-------------------------------------------------
| 1990       | Courtyard,                       |
| 2015       | Elevator,Pets - Cats ok,         |
| 1998       | Elevator,Pets - Cats ok,Post-War |


should be transformed into:

| year_built | amenity                          | Elevator | Pets - Cats ok | Post-War | Courtyard |
----------------------------------------------------------------------------------------------------
| 1990       | Courtyard,                       | 0        | 0              | 0        | 1         |
| 2015       | Elevator,Pets - Cats ok,         | 1        | 1              | 0        | 0         |
| 1998       | Elevator,Pets - Cats ok,Post-War | 1        | 1              | 1        | 0         |


I looked at the scikit-learn class 'Binarizer' in the preprocessing package. It can more or less achieve what I want, but before using it I still need some way to split these values and recognize them.

Is there a way to do this in R or Python?

Answer

Here is one approach in R, demonstrated on a fake dataset.

set.seed(123)
# Build 10 rows, each a random comma-separated subset of the letters a-j
d <- lapply(replicate(10, sample(letters[1:10], sample(1:10, 1))), function(x){
  paste(x, collapse = ",")
})
d <- data.frame(id = 1:10, features = unlist(d), stringsAsFactors = FALSE)
d
#    id            features
# 1   1               h,d,j
# 2   2 a,e,h,d,c,i,b,f,g,j
# 3   3   c,a,j,g,f,d,h,e,b
# 4   4     f,j,c,b,i,e,h,d
# 5   5                   e
# 6   6     c,j,b,a,i,f,h,g
# 7   7                 c,e
# 8   8               i,a,d
# 9   9     b,f,j,a,e,i,h,d
# 10 10                   d

Split up the features using the commas, and recover a vector of unique features:

d$split <- strsplit(d$features, ",")
features <- unique(unlist(d$split))
features
# [1] "h" "d" "j" "a" "e" "c" "i" "b" "f" "g"
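
Since the question also asks about Python, the same split-and-collect step can be sketched with the standard library alone. This is only an illustration under stated assumptions: the amenity strings are copied from the question's example table, and the helper name split_features is made up for this sketch. Unlike R's strsplit, Python's str.split keeps an empty string after a trailing comma, so the blanks are filtered out explicitly.

```python
# Amenity strings copied from the question's example table.
amenities = [
    "Courtyard,",
    "Elevator,Pets - Cats ok,",
    "Elevator,Pets - Cats ok,Post-War",
]


def split_features(cell):
    """Split a comma-separated cell (hypothetical helper).

    Blank pieces left behind by trailing commas are dropped.
    """
    return [part.strip() for part in cell.split(",") if part.strip()]


split_rows = [split_features(cell) for cell in amenities]

# Unique features in order of first appearance, similar to R's unique().
features = list(dict.fromkeys(f for row in split_rows for f in row))
print(features)
# ['Courtyard', 'Elevator', 'Pets - Cats ok', 'Post-War']
```

dict.fromkeys is used instead of set() so that the column order is stable from run to run.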

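For each observation, the %in% operator checks which of the unique features appear in that observation's split list. Row-binding the resulting logical vectors gives one row per observation and one column per feature: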

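# One logical vector per observation: TRUE where that feature is present
features_matrix <- do.call(rbind, lapply(d$split, function(x) features %in% x))
features_matrix <- as.data.frame(features_matrix)
# Label rows by id and columns by feature name
dimnames(features_matrix) <- list(d$id, features)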
features_matrix
#        h     d     j     a     e     c     i     b     f     g
# 1   TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# 2   TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
# 3   TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
# 4   TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
# 5  FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# 6   TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
# 7  FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
# 8  FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
# 9   TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
# 10 FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
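
The question asks for 0/1 columns next to the original data rather than a separate TRUE/FALSE table. A minimal Python sketch of that last step, again using only the standard library, could look like the following. The rows are copied from the question's example, while the names split_features and encoded are assumptions made for illustration, and the sketch assumes that no amenity is literally named year_built or amenity (such a name would overwrite the original column).

```python
# Rows copied from the question's example table.
rows = [
    {"year_built": 1990, "amenity": "Courtyard,"},
    {"year_built": 2015, "amenity": "Elevator,Pets - Cats ok,"},
    {"year_built": 1998, "amenity": "Elevator,Pets - Cats ok,Post-War"},
]


def split_features(cell):
    """Split a comma-separated cell, dropping blanks from trailing commas."""
    return [part.strip() for part in cell.split(",") if part.strip()]


# Unique features across all rows, in order of first appearance.
features = list(
    dict.fromkeys(f for row in rows for f in split_features(row["amenity"]))
)

encoded = []
for row in rows:
    present = set(split_features(row["amenity"]))
    # One 0/1 indicator per feature, mirroring the membership test in R.
    indicators = {name: int(name in present) for name in features}
    encoded.append({**row, **indicators})

for row in encoded:
    print(row)
```

If pandas is available, Series.str.get_dummies(sep=",") is built for this kind of column, and scikit-learn's MultiLabelBinarizer (rather than Binarizer, which thresholds numeric values) does the encoding once the cells have been split into lists.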