joep1 - 28 days ago
R Question

Coding Matrix with overlap counts in R

I am proficient in Python but a complete novice in R. I can't find an answer to this question elsewhere online, and whilst it's going to be a bit lengthy, I am hoping it will be useful to other users of the R library RQDA.

Essentially, RQDA is a qualitative research tool that is primarily used for assigning codes (themes) to text files. It's a bit like a highlighter pen that counts where it has highlighted.

If you put in a lot of files, you can code the text in different places with themes (e.g. a project interviewing people who work in cloth manufacturing might use "equipment", "sewing", "linen", "silk", "lighting", "lunch breaks", etc.). This enables you to count how many times different codes were used, and RQDA gives a table output as follows:

   rowid cid fid codename   filename index1 index2 CodingLength
1      1  12   1     silk 2010-01-28    409    939          530
2      2  21   1   cotton 2010-01-28   1008   1172          164
3      3  12   1     silk 2010-01-28   1173   1924          751
4      4  39   1   sewing 2010-01-28   1008   1250          751
5      5  38   1  weaving 2010-01-28   1173   1924          751
6      6  78   1    costs 2010-01-28    727    939          212
7      7  23   1    lunch 2010-01-28   1553   1788          235
8      9   7   2    lunch 2010-01-29   1001   1230          371
9     10   4   2  weaving 2010-01-29   1547   1724          135
10    11   6   2   social 2010-01-29   1001   1290          350
11    12   7   2     silk 2010-01-29   1926   2276          350
12    14  17   2   supply 2010-01-29   1926   2276          350
13    15  78   2    costs 2010-01-29   1926   2276          350
14    17  78   2  weaving 2010-01-29   1890   2106          212


codename = code the text was given (theme)

filename = filename of text (in this case, date of diary entry)

index1 = character position in file where code starts (highlighted text)

index2 = character position in file where code ends (highlighted text)

CodingLength = overall length of coded/highlighted text

What I'd like to do is iterate over the entire table (around 1,500 rows) using the full list of codes (codename in the table above, around 100 unique codes) to output a 2-way matrix of overlap counts between codes, for example (indicative only, with 6 codes):

             silk cotton sewing weaving lunch breaks socialising
silk            *      0      0       3            2           0
cotton          0      *      5       0            0           0
sewing          0      5      *       0            0           0
weaving         3      0      0       *            0           0
lunch breaks    2      0      0       0            *           5
socialising     0      0      0       0            5           *

Therefore, in R I need a bit of code that will iterate over the code list and count the number of instances where A) filename is the same and B) there is overlap in the range between index1 and index2 (CodingLength probably not important).

Apart from the following vague hunches I am lost as to exactly how to make this work:


  1. I probably need to assign the table to a variable, e.g.:

    coding_table <- getCodingTable()

  2. I probably need to make a list of the unique codes, e.g.:

    x = c("silk","cotton","weaving","sewing","lunch" ... etc. )

  3. I need a function that does the checks

  4. I need a for-loop for the rows

  5. I need a boolean test where the range and file name are checked, e.g. any(409:939 %in% 727:939) && filename == filename
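
The range check in step 5 does not need the positions expanded with `:` and `%in%`; comparing the endpoints is enough, and the comparison is vectorised over whole columns. A minimal sketch, assuming the character positions are inclusive and using a hypothetical helper name `ranges_overlap` (not part of RQDA):

```r
# Hypothetical helper, not part of RQDA: TRUE where [start1, end1] and
# [start2, end2] share at least one position (endpoints treated as inclusive).
ranges_overlap <- function(start1, end1, start2, end2) {
  start1 <= end2 & start2 <= end1
}

# "silk" (409-939) and "costs" (727-939) from the sample table overlap ...
ranges_overlap(409, 939, 727, 939)    # TRUE
# ... while "silk" (409-939) and "cotton" (1008-1172) do not.
ranges_overlap(409, 939, 1008, 1172)  # FALSE
```

Because `<=` and `&` work element-wise, the same function can be applied to the index1 and index2 columns of two paired-up sets of rows, which avoids an explicit for-loop.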



Based on this, can anyone see a way to produce a very short solution to this? I feel like the equivalent in Python would be 10 lines maximum, but given the extra bits required in R I am completely lost as to how to do this.

Answer

You can use the foverlaps function in the data.table package to create an edge list and then turn this into a weighted adjacency matrix.

Using a combination of data.table, dplyr, and igraph, I think this gets you what you want (can't verify without data, though).

First, convert your data frame to a data.table and set the key to filename, index1 and index2. Then foverlaps identifies pairs of rows in the same file whose index1 to index2 ranges overlap at all. After eliminating pairs that share a codename (which includes each row's overlap with itself), replace the row ids generated by foverlaps with the corresponding codenames from the data set. This creates an edge list. Pass this edge list to igraph to create a graph object and return it as an adjacency matrix.

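library(data.table); library(dplyr); library(igraph)

# Key by file and coding range so foverlaps() only pairs codings within the same file
el <- setkey(setDT(coding_table), filename, index1, index2) %>%
  foverlaps(., ., type = "any", which = TRUE) %>%
  # Drop pairs with the same codename (this also removes self-overlaps)
  .[coding_table$codename[xid] != coding_table$codename[yid]] %>%
  # Replace row ids with the corresponding codenames to form the edge list
  .[, `:=`(xid = coding_table$codename[xid], yid = coding_table$codename[yid])]

# Each overlapping pair appears in both directions, so the matrix is symmetric
m <- as.matrix(as_adjacency_matrix(graph_from_data_frame(el)))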

Of course, dplyr is totally optional; the piping just makes it a bit neater and avoids creating more objects in the environment.
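
If you want to sanity-check the counts without any packages, the same idea can be sketched in base R with a self-join. This is only an illustration under assumptions: the toy data are the 2010-01-28 rows from the question's sample table, the positions are treated as inclusive, and the name `m_base` is made up for this example.

```r
# Toy data: the 2010-01-28 rows from the question's sample table.
coding_table <- data.frame(
  codename = c("silk", "cotton", "silk", "sewing", "weaving", "costs", "lunch"),
  filename = "2010-01-28",
  index1   = c(409, 1008, 1173, 1008, 1173, 727, 1553),
  index2   = c(939, 1172, 1924, 1250, 1924, 939, 1788),
  stringsAsFactors = FALSE
)

# Self-join on filename: every coding is paired with every coding in the same file.
pairs <- merge(coding_table, coding_table, by = "filename", suffixes = c(".x", ".y"))

# Keep pairs of different codes whose ranges share at least one position.
hits <- pairs[pairs$codename.x != pairs$codename.y &
              pairs$index1.x <= pairs$index2.y &
              pairs$index1.y <= pairs$index2.x, ]

# Cross-tabulate; shared factor levels keep the matrix square even for
# codes that never overlap with anything.
codes <- sort(unique(coding_table$codename))
m_base <- table(factor(hits$codename.x, levels = codes),
                factor(hits$codename.y, levels = codes))
m_base
```

On this toy data, "silk" overlaps once each with "costs", "sewing", "weaving" and "lunch", and "cotton" overlaps only with "sewing". The self-join builds every within-file pair before filtering, so it should be fine for around 1,500 rows but will scale worse than foverlaps on much larger tables.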