sweetmusicality - 1 year ago
R Question

Find all unique strings in R

I am relatively new to R. I have a data frame df that looks like this (one character variable only; my actual df spans 100k+ rows, but for simplicity, let's look at just 5 rows):

V1
oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy


I want to be able to output every single unique string so that it looks like this:

V1
oximetry
hydrogen peroxide adverse effects
epoprostenol adverse effects
angioedema chemically induced
abo blood group system
imipramine poisoning
adverse effects
isoenzymes
myocardial infarction drug therapy
thrombosis drug therapy


Do I use the tm package? I tried using dtm, but my code was inefficient because it converted the dtm to a matrix, which requires a lot of memory with 100k+ rows.

Please advise. Thanks!

Answer Source

Try this:

library(tidyverse)  # loads dplyr, tidyr and stringr

# Sample data; the column is named V1 to match the question
df <- data.frame(V1 = c(
  'oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects',
  'angioedema chemically induced, angioedema chemically induced, oximetry',
  'abo blood group system, imipramine poisoning, adverse effects',
  'isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy',
  'thrombosis drug therapy'), stringsAsFactors = FALSE)

# Split each row on ', ' into a list column, expand it to one term per row,
# then drop duplicate rows
df %>%
  mutate(V1 = str_split(V1, ', ')) %>%
  unnest(V1) %>%
  distinct()
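
If you would rather not load the tidyverse, the same split-then-deduplicate idea can be sketched in base R. This is only a sketch, assuming the column is called V1 as in the question and that terms are separated by commas; the names terms and result are just illustrative. It works on a plain character vector, so no document-term matrix is ever built.

```r
# Sample data mirroring the question (column name V1 is taken from the question)
df <- data.frame(V1 = c(
  "oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
  "angioedema chemically induced, angioedema chemically induced, oximetry",
  "abo blood group system, imipramine poisoning, adverse effects",
  "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
  "thrombosis drug therapy"), stringsAsFactors = FALSE)

# Split every row on commas and flatten the resulting list into one vector
terms <- unlist(strsplit(df$V1, ",", fixed = TRUE))

# Strip the whitespace left around each term, then keep the first
# occurrence of each distinct string
terms <- unique(trimws(terms))

# Wrap the result back into a one-column data frame
result <- data.frame(V1 = terms, stringsAsFactors = FALSE)
```

Trimming with trimws() rather than splitting on ', ' means a row with inconsistent spacing after the commas should still collapse to the same term.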