Atwp67 - 2 months ago
R Question

Quanteda kwic append data to output

I'd like to append some metadata to the kwic output, such as a customer ID (see below), so that it's easy to look up against a master file. I've tried appending the data with cbind, but nothing matches up correctly.

If this is possible, examples would be greatly appreciated.

docname  position contextPre           keyword contextPost         CustID
text3790 5        nothing at all looks good    and sounds great    1
text3801 11       think the offer is a good    value and has a lot 3
text3874 10       not so sure thats a  good    word to use         5


originating data.frame

CustID Comment
1 nothing at all looks good and sounds great
2 did not see anything that was very appealing
3 I think the offer is a good value and has a lot of potential
4 these items look terrible how are you still in business
5 not so sure thats a good word to use
6 having a hard time believing some place would sell an item so low
7 it may be worth investing in some additional equipment

Answer

At first I thought the ideal solution would be to use docvars, but kwic doesn't seem to have an option to show them, so I still need to merge a docname-to-CustID mapping table with the kwic result.

library(data.table)
library(quanteda)

s <- "CustID,   Comment
1,      nothing at all looks good and sounds great
2,      did not see anything that was very appealing
3,      I think the offer is a good value and has a lot of potential
4,      these items look terrible how are you still in business
5,      not so sure thats a good word to use
6,      having a hard time believing some place would sell an item so low
7,      it may be worth investing in some additional equipment"

# data.table is used here only to read the inline text easily.
dt <- fread(s)
df <- as.data.frame(dt)
# All operations below work on a plain data frame.
myCorpus <- corpus(df$Comment)
# The corpus and CustID come from the same data frame, in the same row order,
# which ensures the mapping is correct.
docvars(myCorpus, "CustID") <- df$CustID
summary(myCorpus)
# Build the mapping table of docname and CustID.
# docnames() returns the document names in corpus order, so this does not
# depend on the docvars row names, which may not hold the document names
# in every quanteda version.
dv_table <- docvars(myCorpus)
id_table <- data.frame(docname = docnames(myCorpus), CustID = dv_table$CustID)
# Find the keyword, then attach CustID to each hit by matching on docname.
result <- kwic(myCorpus, "good", window = 3, valuetype = "glob")
id_result <- merge(result, id_table, by = "docname")

result:

> id_result
  docname position   contextPre keyword      contextPost CustID
1   text1        5 at all looks    good and sounds great      1
2   text3        7   offer is a    good value and has         3
3   text5        6 sure thats a    good word to use           5
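
One thing to keep in mind: merge() sorts its result by the key column by default, and docnames sort as text (so "text10" comes before "text2"). If you want the hits to stay in their original order, a lookup with match() is an alternative. The sketch below uses only base R and a hand-built data frame as a stand-in for the kwic output (kwic_hits and its contents are made up for illustration, not produced by quanteda). It also shows why cbind could not work: cbind pairs rows by position, and there are fewer hits than documents.

```r
# Stand-in for the kwic() result: only the hits, one row per match.
# (Hypothetical data frame built by hand so the sketch runs without quanteda.)
kwic_hits <- data.frame(
  docname  = c("text1", "text3", "text5"),
  position = c(5, 7, 6),
  keyword  = "good",
  stringsAsFactors = FALSE
)

# Mapping table: one row per document in the corpus.
id_table <- data.frame(
  docname = paste0("text", 1:7),
  CustID  = 1:7,
  stringsAsFactors = FALSE
)

# cbind() would pair rows by position, so 3 hits cannot line up with
# 7 customers. match() pairs rows by docname instead and keeps the hits
# in their original order.
kwic_hits$CustID <- id_table$CustID[match(kwic_hits$docname, id_table$docname)]

print(kwic_hits)
```

With the real data you would apply the same match() line to the kwic result and the id_table built above.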