Vasile Vasile - 1 month ago 10
R Question

Applying str_split to a column in a data frame

I have the following df named i:

structure(list(price = c(11772, 14790, 2990, 1499, 21980, 27999
), fuel = c("diesel", "petrol", "petrol", "diesel", "diesel",
"petrol"), gearbox = c("manual", "manual", "manual", "manual",
"automatic", "manual"), colour = c("white", "purple", "yellow",
"silver", "red", "rising blue metalli"), engine_size = c(1685,
1199, 998, 1753, 2179, 1984), mileage = c(18839, 7649, 45058,
126000, 31891, 100), year = c("2013 hyundai ix35", "2016 citroen citroen ds3 cabrio",
"2007 peugeot 107 hatchback", "2007 ford ford focus hatchback", "2012 jaguar xf saloon",
"2016 volkswagen scirocco coupe"), doors = c(5, 2, 3, 5, 4, 3
)), .Names = c("price", "fuel", "gearbox", "colour", "engine_size",
"mileage", "year", "doors"), row.names = c(NA, 6L), class = "data.frame")


Some of the words in the column 'year' are duplicated, and I would like to remove them. As a first step, I would like to split the character string in this column into separate words.
I was able to do this for a single string, but when I try to apply it to the whole data frame it gives an error:

library(stringr)
unlist(str_split("2013 hyundai ix35", "[[:blank:]]"))


[1] "2013"    "hyundai" "ix35"

for (k in 1:nrow(i))
  i[k, 7] <- unlist(str_split(i[k, 7], "[[:blank:]]"))


Error in `[<-.data.frame`(`*tmp*`, k, 7, value = c("2013", "hyundai", :
  replacement has 3 rows, data has 1

Answer

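The loop fails because each cell of a data frame column holds a single value, while str_split returns several words per string. Instead, we can split on one or more whitespace characters (\\s+), loop over the resulting list with sapply(), and paste the unique elements of each piece back together: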

i$year <- sapply(strsplit(i$year, "\\s+"), function(x) paste(unique(x), collapse=' '))
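
To see what each step does, here is a minimal sketch on a small stand-alone vector. The names year, words, deduped and cars are made up for this illustration; the strings are copied from the question's data. The last part is an optional extra, assuming you also want to keep the split words themselves, in which case they can be stored as a list column.

```r
# Hypothetical stand-alone vector; values copied from the question's 'year' column
year <- c("2013 hyundai ix35",
          "2016 citroen citroen ds3 cabrio",
          "2007 ford ford focus hatchback")

# strsplit() returns a list: one character vector of words per input string
words <- strsplit(year, "\\s+")

# unique() drops repeated words and keeps the first occurrence of each, in order;
# paste(..., collapse = " ") joins the remaining words back into one string
deduped <- sapply(words, function(x) paste(unique(x), collapse = " "))

deduped
# elements: "2013 hyundai ix35", "2016 citroen ds3 cabrio", "2007 ford focus hatchback"

# Optional: keep the split words in the data frame as a list column.
# This works because the list has exactly one element per row.
cars <- data.frame(year = year, stringsAsFactors = FALSE)
cars$words <- words
lengths(cars$words)
# word counts per row: 3, 5, 5
```

Note that unique() removes every repeated word in a string, not only adjacent repeats, so it assumes a word never legitimately appears twice in the same entry.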