Jazzmine Jazzmine - 1 month ago 7
R Question

In R using dplyr separate to split uneven number of variables in a column

I have data like this:

x <- c("France:4|Morroco:8|Italy:2", "Scotland:6|Mexico:2", "Scotland:2")
> player_country_info <- data.frame(x)
> setnames(player_country_info, "player_country_data")
> names(player_country_info)
[1] "player_country_data"
> is.data.frame(player_country_info)
[1] TRUE
> head(player_country_info)
country_data
1 France:4|Morocco:8|Italy:2
2 Scotland:6|Mexico:2
3 Scotland:2


I'd like an intermediary data frame that look like this:

player_country_data.1 player_country_data.2 player_country_data.3
France:4 Morocco:8 Italy:2
Scotland:6 Mexico:2 NA
Scotland:2 NA NA


I plan on then using the dplyr::separate function to separate the above to look like this, using this command for each column.

player_country_info %>% separate( col=player_country_data.1, into=c("country_name.1","player_count.1), sep=":")

country_name.1 player_count.1 country_name.2 player.2 country_name.3 player.3
France 4 Morocco 8 Italy 2
Scotland 6 Mexico 2
Scotland 2


Is there an more efficient way to do the above? Perhaps a command that does it in one step? Or should I process it with a for loop outside a while loop?

Thanks

Answer

With separate from the tidyr package:

library(tidyr)
country_info %>% 
  separate(country_data, 
           into = sprintf('%s.%s', rep(c('country','player.count'),3), rep(1:3, each=2)))

the result:

  country.1 player.count.1 country.2 player.count.2 country.3 player.count.3
1    France              4   Morroco              8     Italy              2
2  Scotland              6    Mexico              2      <NA>           <NA>
3  Scotland              2      <NA>           <NA>      <NA>           <NA>

Separate automatically recognizes : and | as characters on which it has to separate. If you want to separate on a specific character, you need to specify that with the sep argument. In this case you could use sep = '[:|]'. This also prevents misbehavior of the automatic detection when there are missing values (see discussion in the comments).

With sprintf you paste together the two vectors rep(c('country','player.count'),3) and rep(1:3, each=2) into a vector of column names where %s.%s tells sprintf to treat the two vectors are string-vectors and paste them together with a dot as separator. See ?sprintf for more info. The each argument tells rep not to repete the whole vector a number of times, but to repete each element of the vector a number of times.