user3354212 user3354212 - 3 months ago 16
R Question

How to assign text into a column based on another dataframe in r

I have a dataframe:

df = read.table(text="group X1 X2 X3 X4 X5 X6 X7
P1 H H H H H H H
P1 C D C D B C C
P1 D C B A C D H
P1 D C B A C D D
P1 C D C D B C D
P2 C D B D C D C
P2 H H H H H H H
P2 D C C A B C D
P3 C D C D B C C
P3 H H H H H H H
P3 C D C D B C C
P3 D C B A C D D", header=T, stringsAsFactors=F)


I have another dataframe:

df2 = read.table(text="Group col R S
P1 'X2 X4 X7' 'C A D' 'D D C'
P2 'X2 X3 X4 X6' 'C C A C' 'D B D D'
P3 'X3 X5 X6 X7' 'B C D D' 'C B C C'", header=T, stringsAsFactors=F)


I would like to add a column named "assign" to hold the assignment which is based on df2. For example, if df$group=="P1", then only concatenate columns in df shown in df2$col "P1" row, if all columns have the same letter "H", then assign "H" to the "assign" column; if match the string in df2$R column, assign "R"; if match the string in df2$S column, assign "S"; if not match any three cases as mentioned, then assign "U".
I have tested my script in the group "P1", but I don't know how to return the assigned value to the df and go through the loop. Appreciate any helps.
I expect the result as:

df = read.table(text="group 1 2 3 4 5 6 7 assign
P1 H H H H H H H H
P1 C D C D B C C S
P1 D C B A C D D U
P1 D C B A C D D R
P1 C D C D B C D U
P2 C D B D C D C S
P2 H H H H H H H H
P2 D C C A B C D R
P3 C D C D B C C S
P3 H H H H H H H H
P3 C D C D B C C S
P3 D C B A C D D R
", header=T, stringsAsFactors=F)

Answer

You can use data.table and solve your problem in three steps:

  1. merge the data.tables

  2. this is the key step, build a pattern to match later, the cool thing is that we can use a flexible number of .SDcols across the by groups in the data.table

  3. build the assign variable

Here is the code:

# data
require(data.table)
dt = data.table(df)
dt2 = data.table(df2)

# add col_int, a list(!) of col indices, to dt2 for each Group
dt3 = dt2[, list(col_name = strsplit(col, ' ')[[1]]), by = Group]
dt3 = dt3[, col_idx := match(col_name, names(dt))]
dt3 = dt3[, list(col_idx = list(col_idx)), by = Group]
dt2 = merge(dt2, dt3, by = 'Group')

# solution
dt = merge(x = dt,
           y = dt2,
           by = 'Group')

idx_matching_table = names(dt)

# a: using strings
dt[,
   j = pattern := {
     .SD[, do.call('paste', c(.SD)), .SDcols = strsplit(col, ' ')[[1]]]
   },
   by = list(Group, col)]

# b: using indices
dt[,
   j = pattern_2 := {
     # .SD has less cols (compared to dt), therefore find out what the integer index of col_idx in .SD is:
     col_idx_sd = match(idx_matching_table[col_idx[[1]]], names(.SD))
     .SD[, do.call('paste', c(.SD)), .SDcols = col_idx_sd]
   },
   by = list(Group, col)]
dt[, identical(pattern, pattern_2)] # TRUE

dt[, assign := 'U']
dt[pattern %like% '[H ]+H', assign := 'H']
dt[pattern == R, assign := 'R']
dt[pattern == S, assign := 'S']

EDIT I replaced apply(.SD, 1, paste, collapse = ' ') with do.call('paste', c(.SD)) to avoid coercion to matrix.

Comments