dreww2 - 1 year ago 59
R Question

# String matching to data.frames of different sizes

I have two data.frames of different sizes and I'm looking for the most efficient way to match strings from one data.frame to another, and extract some relevant information.

Here is an example:

Two initial data.frames, a and b, and the desired result:

``````a = data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
age = c(30, 24, 52, 44, 73, 44, 33, 12),
visits = c(5, 1, 3, 2, 8, 5, 19, 3))

b = data.frame(string = c("the red ball went over the fence",
"sorry to see that your tent fell down",
"the ball fell into the red salad",
"serious people eat peanuts on Sundays"))

desired_result = data.frame(string = b\$string,
num_matches = c(2, 1, 3, 0),
avg_age = c(37, 73, 32.66667, NA),
avg_visits = c(3.5, 8, 2.66667, NA))
``````

Here are the data.frames in a more readable format:

``````> a
term age visits
1   red  30      5
3  rope  52      3
4  ball  44      2
5  tent  73      8
6 plane  44      5
8  meat  12      3

> b
string
1      the red ball went over the fence
2 sorry to see that your tent fell down
3      the ball fell into the red salad
4 serious people eat peanuts on Sundays

> desired_result
string num_matches  avg_age avg_visits
1      the red ball went over the fence           2 37.00000    3.50000
2 sorry to see that your tent fell down           1 73.00000    8.00000
3      the ball fell into the red salad           3 32.66667    2.66667
4 serious people eat peanuts on Sundays           0       NA         NA
``````

• num_matches is the number of "terms" in "string"

• avg_age is the mean age of the "terms" found in "string"

• avg_visits is the mean number of visits of the "terms" found in "string"

Any ideas on how to implement this in an efficient way?

Thank you.

Use `data.table`, process each row with `by = string`. save the match results in a list, then aggregate by the match results.

Note the `matches` column is a list of list, each cell holding a list. You need wrap the match results with `.()` which is actually another `list()` because data.table expect a list for normal columns.

``````library(data.table)
library(stringr)
a = data.table(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
age = c(30, 24, 52, 44, 73, 44, 33, 12),
visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b = data.table(string = c("the red ball went over the fence",
"sorry to see that your tent fell down",
"the ball fell into the red salad",
"serious people eat peanuts on Sundays"))

b[, matches := vector("list", .N)]
b[, matches := .(list(str_detect(string, a[, term]))), by = string]
b[, num_matches := sum(unlist(matches)), by = string]
b[, avg_age := mean(a[unlist(matches), age]), by = string]
b[, avg_visits := mean(a[unlist(matches), visits]), by = string]
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download