Andrew Nguyen - 2 months ago
R Question

Filter a SparkR DataFrame by column using regex

I have a SparkR DataFrame called Tweets with a column named bodyText.

I want to filter the DataFrame by a regex condition on bodyText, for example keeping only the tweets that have "rally" or "protest" in bodyText.

What I have tried so far is:

subset(twitter_df, grepl("(?<=\\b)rally", twitter_df$bodyText, ignore.case = TRUE))
filter(twitter_df, grepl("(?<=\\b)rally", twitter_df$bodyText, ignore.case = TRUE))


but in both cases I receive this error:


Error in as.character.default(x) :
no method for coercing this S4 class to a vector
Calls: main ... .local -> [ -> grepl -> as.character -> as.character.default

Answer

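You can convert the Spark DataFrame to an RDD, apply the filter there, and convert the result back. The RDD functions used here are not exported by SparkR, which is why they are called with ::: below: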

# set up a reproducible sample
df <- data.frame(id=c(1:4), bodyText=c("rally","protest","text1","text2"))
twitter_df <- as.DataFrame(df)
head(twitter_df)


# convert to rdd
twitter_df.rdd <- SparkR:::toRDD(twitter_df)
# filter rdd
twitter_df.rdd.filtered <- SparkR:::filterRDD(twitter_df.rdd, function(s) { grepl("(?<=\\b)rally", s$bodyText, ignore.case = TRUE, perl = TRUE) })
# convert to Spark data frame
twitter_df.filtered <- as.DataFrame(twitter_df.rdd.filtered)
head(twitter_df.filtered)

Note that perl = TRUE is required in the grepl call: the lookbehind (?<=\\b) is Perl-style syntax, and without perl = TRUE the expression is rejected as invalid.
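
To see the effect of perl = TRUE without a Spark session, here is a small base R sketch. The sample strings are made up for illustration and are not from the question's data.

```r
# Hypothetical sample strings (not from the original data).
texts <- c("Rally downtown", "protest today", "generally fine")

# With perl = TRUE the lookbehind is accepted. (?<=\\b) is zero-width, so the
# pattern only matches "rally" at the start of a word.
hits <- grepl("(?<=\\b)rally", texts, ignore.case = TRUE, perl = TRUE)
print(hits)  # TRUE FALSE FALSE

# Without perl = TRUE the default regex engine should reject the pattern;
# tryCatch turns that error into a readable message instead of stopping.
default_result <- tryCatch(
  grepl("(?<=\\b)rally", texts, ignore.case = TRUE),
  error = function(e) conditionMessage(e)
)
print(default_result)
```

"generally fine" does not match because the "rally" inside "generally" does not start at a word boundary.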

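As an alternative that avoids the internal RDD functions, the filter can probably be expressed directly on the column with rlike, which SparkR provides as a Column method. This is a different technique from the one above, not a rewrite of it. The sketch assumes Spark 2.x or later (for sparkR.session) and a working Spark installation; the pattern is written as a Java regular expression, so case-insensitivity comes from the (?i) flag rather than from ignore.case.

```r
library(SparkR)
sparkR.session()  # assumption: Spark 2.x or later

# Same sample data as in the answer above.
df <- data.frame(id = 1:4,
                 bodyText = c("rally", "protest", "text1", "text2"),
                 stringsAsFactors = FALSE)
twitter_df <- as.DataFrame(df)

# rlike evaluates the regex inside Spark, so the Column never has to be
# coerced to an R character vector (which is what made grepl fail).
pattern <- "(?i)\\b(rally|protest)"
twitter_df.filtered <- filter(twitter_df, rlike(twitter_df$bodyText, pattern))
head(twitter_df.filtered)
```

Because the matching happens on the Spark side, no rows are shipped to R until head is called.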