dreww2 dreww2 - 2 months ago 9
R Question

String matching to data.frames of different sizes

I have two data.frames of different sizes and I'm looking for the most efficient way to match strings from one data.frame to another, and extract some relevant information.

Here is an example:

Two initial data.frames, a and b, and the desired result:

a = data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
age = c(30, 24, 52, 44, 73, 44, 33, 12),
visits = c(5, 1, 3, 2, 8, 5, 19, 3))

b = data.frame(string = c("the red ball went over the fence",
"sorry to see that your tent fell down",
"the ball fell into the red salad",
"serious people eat peanuts on Sundays"))

desired_result = data.frame(string = b$string,
num_matches = c(2, 1, 3, 0),
avg_age = c(37, 73, 32.66667, NA),
avg_visits = c(3.5, 8, 2.66667, NA))


Here are the data.frames in a more readable format:

> a
term age visits
1 red 30 5
2 salad 24 1
3 rope 52 3
4 ball 44 2
5 tent 73 8
6 plane 44 5
7 gift 33 19
8 meat 12 3

> b
string
1 the red ball went over the fence
2 sorry to see that your tent fell down
3 the ball fell into the red salad
4 serious people eat peanuts on Sundays

> desired_result
string num_matches avg_age avg_visits
1 the red ball went over the fence 2 37.00000 3.50000
2 sorry to see that your tent fell down 1 73.00000 8.00000
3 the ball fell into the red salad 3 32.66667 2.66667
4 serious people eat peanuts on Sundays 0 NA NA



  • num_matches is the number of "terms" in "string"

  • avg_age is the mean age of the "terms" found in "string"

  • avg_visits is the mean number of visits of the "terms" found in "string"



Any ideas on how to implement this in an efficient way?

Thank you.

Answer

Use data.table, process each row with by = string. save the match results in a list, then aggregate by the match results.

Note the matches column is a list of list, each cell holding a list. You need wrap the match results with .() which is actually another list() because data.table expect a list for normal columns.

library(data.table)
library(stringr)
a = data.table(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
  age = c(30, 24, 52, 44, 73, 44, 33, 12),
  visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b = data.table(string = c("the red ball went over the fence",
  "sorry to see that your tent fell down",
  "the ball fell into the red salad",
  "serious people eat peanuts on Sundays"))

b[, matches := vector("list", .N)]
b[, matches := .(list(str_detect(string, a[, term]))), by = string]
b[, num_matches := sum(unlist(matches)), by = string]
b[, avg_age := mean(a[unlist(matches), age]), by = string]
b[, avg_visits := mean(a[unlist(matches), visits]), by = string]
Comments