Oli Paul - 2 months ago
R Question

Separating strings into one list to remove duplicates

I have a large text file (50,000 rows) from which I'm trying to remove duplicates and find unique words.

The rows/strings in the CSV vary such that three lines could look like the following:

I like cars
Ford
Cars go fast


I would like to first split each row/string into words and then combine them, so the rows above would give the following list:

I
like
cars
Ford
Cars
go
fast


Once that list is complete, it should be easy to change the case of each word and then remove duplicates, leaving a unique list of all the words in the document.
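A minimal sketch of what I'm after, assuming the rows are already in a character vector txt (the variable name is just for illustration):

txt <- c("I like cars", "Ford", "Cars go fast")
words <- unlist(strsplit(txt, " "))  # split each row on spaces, then flatten into one vector
unique(tolower(words))               # fold case, then drop duplicates
# [1] "i"    "like" "cars" "ford" "go"   "fast"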

Some rows are paragraphs, so Excel just can't handle the job. I'm guessing paste and paste(unique()) may be useful, but I'm having trouble using read.csv to get the words from the document in the desired format.
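One way to read it, assuming the file is a single column of text with no header (the file name and the column name s are assumptions, not from the original):

# file name, header = FALSE, and the column name "s" are assumptions
df <- read.csv("words.csv", header = FALSE, stringsAsFactors = FALSE, col.names = "s")
head(df$s)   # one row/paragraph per element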

These paragraphs may include punctuation, numbers, and random characters like @, so I may need to transform the strings first.
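One possible clean-up step, assuming only letters should be kept (the gsub pattern is just one option, not a requirement):

txt <- c("Cars go fast!!", "Email me @Ford, 24/7")
clean <- gsub("[^[:alpha:] ]", "", txt)  # drop everything except letters and spaces
unlist(strsplit(clean, " +"))            # split on runs of spaces
# [1] "Cars"  "go"    "fast"  "Email" "me"    "Ford"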

EDIT:

Three methods work but give different results. Here is a link to the CSV; any insight into why the results differ would be appreciated.

https://onedrive.live.com/redir?resid=61FAC513EBF4A4B9!296&authkey=!AMsiIuW4lCD_qrs&ithint=file%2ccsv

Answer

We can use scan:

# split the text into words on spaces and keep the first occurrence of each
df1 <- data.frame(words = unique(scan(text = as.character(df$s), what = "", sep = " ")))
df1
#  words
#1     I
#2  like
#3  cars
#4  Ford
#5  Cars
#6    go
#7  fast
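Note that cars and Cars both survive. If case should be ignored, one tweak (a sketch, assuming lower-cased output is acceptable) is to fold case before scanning:

# lower-casing first merges "cars" and "Cars"; lower-cased output is an assumption
df1 <- data.frame(words = unique(scan(text = tolower(as.character(df$s)), what = "", sep = " ")))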

Or a faster approach would be:

library(stringi)
# extract every maximal run of non-whitespace characters, then deduplicate
data.frame(words = unique(unlist(stri_extract_all(df$s, regex = "\\S+"))))
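As for why the methods in the edit disagree: without seeing the file this is a guess, but scan with sep = " " splits on single spaces only, so a run of spaces yields empty-string "words" and tabs stay inside tokens, and scan's default quoting of ' and " can also matter for text containing quotes, whereas the regex "\\S+" simply extracts every maximal run of non-whitespace. A small demonstration (the input string is made up):

x <- "Cars  go\tfast"
scan(text = x, what = "", sep = " ")      # the double space gives "", the tab is not a separator
# [1] "Cars"      ""          "go\tfast"
stri_extract_all(x, regex = "\\S+")[[1]]  # any whitespace separates words
# [1] "Cars" "go"   "fast"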