T. M. T. M. - 3 months ago 14
R Question

Using R - Condensing multiple columns into a new column without repeating contents

I'm a botanist and very beginner R user. I wonder if you can help me on finding a solution for a writing a script. I've been using R for optimizing the process of creating texts from spreadsheets. For that I use the

MonographaR
package, and I'm fine with it. The problem itself is dealing with
data.frame
. My spreadsheet (CSV file) is basically composed of columns for species, rows for characters, and their intersection cells are state of characters. I want to have a final script which allows me to get 2 or more columns combined into a new column on the original spreadsheet. When the cells have different contents the new cell content must have the individual contents separate by coma+space
", "
. When the cells have equal contents the new cell must have the same content only once, without repeating it. The scripts I've tried to write using concatenation,
cbind
etc repeated cells content, and I wasn't happy with it.

My initial CSV looks like this,

cattleya.minor cattleya.maxima cattleya.pumila
colour red red red
surface sharp smooth sharp
leaves 1 3 4


and I'd like to have a final result like this

cattleya cattleya.minor cattleya.maxima cattleya.pumila
colour red red red red
surface sharp, smooth sharp smooth sharp
leaves 1, 3, 4 1 3 4


Thank you very much indeed.

Answer

As @alistaire commented, start with "tidy" data and things will be much easier.

# Starting data (which I've called "dat")
dat
        cattleya.minor cattleya.maxima cattleya.pumila
colour             red             red             red
surface          sharp          smooth           sharp
leaves               1               3               4
library(reshape2)
library(tibble)
library(dplyr)

# Make data tidy
dat.tidy = dat %>% 
  rownames_to_column(var="Characteristic") %>%                # Turn rownames into a data column
  melt(id.var="Characteristic", variable.name="Species") %>%  # Reshape to "long" format
  dcast(Species ~ Characteristic)                             # Cast back to wide so that each characteristic gets its own column

dat.tidy    
          Species colour leaves surface
1  cattleya.minor    red      1   sharp
2 cattleya.maxima    red      3  smooth
3 cattleya.pumila    red      4   sharp
# Summarize by genus
dat.tidy %>%
  group_by(Genus=gsub("(.*)\\..*","\\1",Species)) %>%       # Collapse to genus (remove species designation)
  summarise_all(funs(paste(unique(.), collapse=", "))) %>%  # For each charactreristic, paste together each unique value for a given genus
  select(-Species)
     Genus colour  leaves       surface
1 cattleya    red 1, 3, 4 sharp, smooth