panbar - 3 months ago
Perl Question

create contingency table

I have hundreds of text files, each containing a list of elements (in the thousands). Three simplified, representative files are given below (the row elements here are colours).

group1.txt

red
blue
red
green
pink
red


group2.txt

yellow
brown
cyan
yellow
brown
red
violet
orange


group3.txt

orange
violet
pink
cyan
grey


I can create a sorted count table across all files with the following command:

awk -F '\t' '{print $1}' * | sort | uniq -c | sort -nr


Output:

4 red
2 yellow
2 violet
2 pink
2 orange
2 cyan
2 brown
1 grey
1 green
1 blue


I would like to create a contingency table as follows:

Colour  group1  group2  group3
red     3       1       0
green   1       0       0
blue    1       0       0
yellow  0       2       0
orange  0       1       1
grey    0       0       1
violet  0       1       1
pink    1       0       1
brown   0       2       0
cyan    0       1       1


How can I create this contingency table using awk, Python, Perl or R?

Answer

Set up files (this is just so we have an example to work with - not part of the actual machinery for constructing the contingency table):

writeLines(c("red","blue","red","green","pink","red"),
           con="group1.txt")
writeLines(c("yellow","brown","cyan","yellow","brown","red",
             "violet","orange"),
           con="group2.txt")
writeLines(c("orange","violet","pink","cyan","grey"),
           con="group3.txt")

Most of the work is in reading in and arranging the data: let's say we know that the files are called groupNN.txt where NN is a number ...

flist <- list.files(pattern="^group[0-9]+\\.txt$")
grpnames <- gsub("\\.txt$","",flist)

Read colour files:

col_list <- lapply(flist,scan,what="character")

Matching vector of group IDs:

grpvec <- rep(grpnames,sapply(col_list,length))

Now just use table:

table(col=unlist(col_list),grp=grpvec)
##     grp
## col      group1 group2 group3
##   blue        1      0      0
##   brown       0      2      0
##   cyan        0      1      1
##   green       1      0      0
##   grey        0      0      1
##   orange      0      1      1
##   pink        1      0      1
##   red         3      1      0
##   violet      0      1      1
##   yellow      0      2      0

(This is ordered alphabetically; I'm not sure how important that is to you ...)
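
Since the question already uses awk, here is a rough awk sketch of the same idea, in case R is not an option. It is only one way to do it and rests on a few assumptions: one element per line in the first tab-separated field, file names of the form groupNN.txt, and no empty files (an empty file would not get a column). The function name contingency is made up for this example. Rows come out in order of first appearance rather than alphabetically, and columns follow the order of the files on the command line.

```shell
# Hypothetical helper: print a tab-separated element-by-file count table.
contingency() {
    awk -F '\t' '
        FNR == 1 {                      # first line of a new file: add a column
            name = FILENAME
            sub(/.*\//, "", name)       # drop any directory part
            sub(/\.txt$/, "", name)     # drop the .txt extension
            groups[++ng] = name
        }
        $1 == "" { next }               # ignore blank lines
        !($1 in seen) {                 # remember first-appearance order of rows
            seen[$1] = 1
            items[++ni] = $1
        }
        { count[$1, ng]++ }             # tally element per file
        END {
            printf "Colour"
            for (g = 1; g <= ng; g++) printf "\t%s", groups[g]
            printf "\n"
            for (i = 1; i <= ni; i++) {
                printf "%s", items[i]
                # missing combinations are unset and print as 0 with %d
                for (g = 1; g <= ng; g++) printf "\t%d", count[items[i], g]
                printf "\n"
            }
        }
    ' "$@"
}

# Example call, run in the directory holding the files:
#   contingency group*.txt
```

For the three example files the counts match the R table above. Note that the shell expands group*.txt in lexical order, so with many files group10.txt would come before group2.txt unless the names are zero-padded.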
