TemplateRex TemplateRex - 4 months ago 14
R Question

How to split a column of a variable number of concatenated tags into one column per tag?

Consider the following data:

library(tibble)

key <- c("a", "b", "c", "d", "e")
tags <- c("A,B", "B", "A,E", "C,D", "")
data <- tibble(key, tags)


Here,
key
could mean book title and
tags
could be genres, or
key
could be an email sender and
tags
could mean recipients. Essential is that the column
tags
can have a variable (possibly zero) number of different substrings.

For splitting a fixed number of concatenated tags (e.g. like a data) I can use
tidyr::spread
, and I can use string splitting to separate the
tags
column itself, but how to combine the two?

I would like the transformed data to look like this:

key A B C D E
a TRUE TRUE FALSE FALSE FALSE
b FALSE TRUE FALSE FALSE FALSE
c TRUE FALSE FALSE FALSE TRUE
d FALSE FALSE TRUE TRUE FALSE
e FALSE FALSE FALSE FALSE FALSE


I can see it's possible to do this in several steps by splitting
tags
, determining the unique substrings and loop over each of them and test if
tags
for each row contains the string. But I'd prefer to do this in a pipeline using the tidyverse.

Question: how can I split the variable number of concatened tags into one column per tag?

Answer

The separate_rows function from tidyr may help you get where you want. This splits the strings within tags into separate rows instead of separate columns, which sets you up to use spread.

To get the TRUE/FALSE result I created a new column of all TRUE to use as the value column, and then filled the missing with FALSE in spread. In the end,spread kept the blank cell as a column name, which I removed via select. There may be a better way to do this (maybe convert to NA?).

library(tidyr)
library(dplyr)

data %>%
    separate_rows(tags) %>%
    mutate(tagslog = TRUE) %>%
    spread(tags, tagslog, fill = FALSE) %>%
    select(-one_of(""))

    key     A     B     C     D     E
* <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1     a  TRUE  TRUE FALSE FALSE FALSE
2     b FALSE  TRUE FALSE FALSE FALSE
3     c  TRUE FALSE FALSE FALSE  TRUE
4     d FALSE FALSE  TRUE  TRUE FALSE
5     e FALSE FALSE FALSE FALSE FALSE

You can almost get where you want with just separate_rows and table, but I still had that extra blank column that would need to be removed.

data %>%
    separate_rows(tags) %>%
    with(., table(key, tags) == 1)

   tags
key           A     B     C     D     E
  a FALSE  TRUE  TRUE FALSE FALSE FALSE
  b FALSE FALSE  TRUE FALSE FALSE FALSE
  c FALSE  TRUE FALSE FALSE FALSE  TRUE
  d FALSE FALSE FALSE  TRUE  TRUE FALSE
  e  TRUE FALSE FALSE FALSE FALSE FALSE
Comments