Lucia O Lucia O - 6 months ago 11
Bash Question

Create a table from the frequencies of taxa found in multiple CSV files

I have 12

.csv
files containing fetched taxonomic name and their frequencies (number of times each name was fetched). I've created a master
.txt
file listing all the unique taxa found at least once across the 12 files. I need to make a concatenated table in
csv
format where the headers of the rows are the name of each file, and the headers of the columns are all the unique taxa listed in the master
.txt
file. The table has to be populated with the frequencies next to each taxa in each
.csv
input file. Although the master list contains all taxa possible in the 12 files, not all files contain all the taxa. When the taxa is missing I need to place a '0'.

.csv
input:

$cat file_1

1,Salmo salar
12,Solanum pennellii
18,Staphylococcus xylosus
...

$cat file_2

1,Salmo salar
14,Staphylococcus xylosus
123,Strongyloides stercoralis
...

$cat file_3

123,Solanum pennellii
11,Staphylococcus xylosus
41,Strongyloides stercoralis
...


.txt
master list:

$cat master

Salmo salar
Solanum pennellii
Staphylococcus xylosus
Strongyloides stercoralis
...


.csv
output (what I need):

Sample,Salmo salar,Solanum pennellii,Staphylococcus xylosus,Strongyloides stercoralis
File_1,1,12,18,0
File_2,1,0,14,123
File_3,0,123,11,41


I previously tried making a small python script without a master list and using
.tsv
input files containing repeated names of taxa instead of frequencies. I wasn't able to obtain a table that showed missing taxa for each file, so I decided to create a master list and collapse the input files. I am very new at python, so any help will be appreciated.

Answer

For those about to awk, we salute you!

awk was created for such processing.

Give a try to this:

awk -F "," -v OFS="," '
 FNR==1 {samples[++fni]=FILENAME}
 {if (!taxakeys[$2]) {taxakeys[$2]=1; taxas[++ti]=$2};frequencies[samples[fni],$2]+=$1}
 END {
   printf("Sample"); for (j=1;j<=ti;j++) { printf("%s%s",OFS,taxas[j])}; printf("\n") 
   for (i=1; i<=fni; i++) {
     printf("%s",samples[i]); for (j=1;j<=ti;j++) { printf("%s%d",OFS,frequencies[samples[i],taxas[j]])}; printf("\n")
   }
 }'

The test:

$ awk -F "," -v OFS="," '
 FNR==1 {samples[++fni]=FILENAME}
 {if (!taxakeys[$2]) {taxakeys[$2]=1; taxas[++ti]=$2};frequencies[samples[fni],$2]+=$1}
 END {
   printf("Sample"); for (j=1;j<=ti;j++) { printf("%s%s",OFS,taxas[j])}; printf("\n") 
   for (i=1; i<=fni; i++) {
     printf("%s",samples[i]); for (j=1;j<=ti;j++) { printf("%s%d",OFS,frequencies[samples[i],taxas[j]])}; printf("\n")
   }
 }' file_*

Sample,Salmo salar,Solanum pennellii,Staphylococcus xylosus,Strongyloides stercoralis
file_1,1,12,18,0
file_2,1,0,14,123
file_3,0,123,11,41
Comments