Andrew Nguyen - 1 year ago
R Question

SparkR DataFrame: filter by column using regex

I have a SparkR dataframe called Tweets with a column named bodyText.

What I am trying to do is filter the dataframe by a regex condition on bodyText. For example, keep only the tweets that have "rally" or "protest" in bodyText.

What I have tried so far is:

subset(twitter_df, grepl("(?<=\\b)rally", twitter_df$bodyText, ignore.case = TRUE))
filter(twitter_df, grepl("(?<=\\b)rally", twitter_df$bodyText, ignore.case = TRUE))

but in both cases I receive this error:

Error in as.character.default(x) :
no method for coercing this S4 class to a vector
Calls: main ... .local -> [ -> grepl -> as.character -> as.character.default

Answer

You can convert the Spark data frame to an RDD, apply the filter to the RDD, and convert the result back to a Spark data frame:

# set up a reproducible sample
df <- data.frame(id=c(1:4), bodyText=c("rally","protest","text1","text2"))
twitter_df <- as.DataFrame(df)

# convert to rdd
twitter_df.rdd <- SparkR:::toRDD(twitter_df)
# filter rdd
twitter_df.rdd.filtered <- SparkR:::filterRDD(twitter_df.rdd, function(s) { grepl("(?<=\\b)rally", s$bodyText, ignore.case = TRUE, perl = TRUE) })
# convert to Spark data frame
twitter_df.filtered <- as.DataFrame(twitter_df.rdd.filtered)

Note that the parameter perl must be set to TRUE; otherwise the lookbehind (?<=...) in the expression is rejected as an invalid regular expression.
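
The effect of perl can be checked in plain R, without a Spark session. The following is a minimal sketch, not part of the original answer: the sample strings and the variable names (texts, with_perl, without_perl, either_word) are made up for illustration, and the last pattern is just one assumed way to cover both "rally" and "protest" from the question.

```r
# Base R only; the sample strings are illustrative, not real tweets.
texts <- c("rally", "Rally today", "protest", "text1")
pattern <- "(?<=\\b)rally"

# With perl = TRUE the PCRE engine is used, which understands lookbehind.
with_perl <- grepl(pattern, texts, ignore.case = TRUE, perl = TRUE)
print(with_perl)  # TRUE TRUE FALSE FALSE

# Without perl = TRUE the default engine is used. Capture a possible error
# message instead of stopping the script, so the two cases can be compared.
without_perl <- tryCatch(
  suppressWarnings(grepl(pattern, texts, ignore.case = TRUE)),
  error = function(e) conditionMessage(e)
)
print(without_perl)

# Assumed variant: match either word with an alternation and a plain \b.
either_word <- grepl("\\b(rally|protest)", texts, ignore.case = TRUE, perl = TRUE)
print(either_word)  # TRUE TRUE TRUE FALSE
```

The same predicate can then be used inside the function passed to filterRDD above, applied to s$bodyText.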