Dale Pin Dale Pin - 4 months ago 9
Linux Question

Joining multiple files in Linux

I have the following files (underscores represent tab delimiter, filenames not included in file contents):

(sample001.file)

Name_____scores_____gender
Joey_____54_____Boy
Kyle_____87_____Girl
Sia______43_____Girl
Marge____87_____Girl


(sample002.file)

Name_____scores_____gender
Joey_____23_____Boy
Pedro____76_____Boy
Kyle_____76_____Girl


(sample003.file)

Name_____scores_____gender
Kyle_____34_____Girl
James____65_____Boy
Pedro____76_____Boy
Sia______65_____Girl
Marge____23_____Girl


I would like all of these files integrated into one, containing only first- and second-column data. It would look like this:

(integrate.file)

Name_____sample001____sample002_____sample003
Joey_____54_____23____0
Kyle_____87_____76____34
Sia______43_____0_____65
Marge____87_____0_____23
Pedro____0______76____76
James____0______0_____65


Basically, names should only have a single entry in the first column, and that if there is no data on any sample, it should be zero. Headers are not necessary, but they can be present.

Can anyone help me on this?

Answer

Using Bash and process substitution, you can do it it for three files in a single (fairly long) command pipeline:

join -e 0 -a 1 -a 2 -t $'\t' -o 0,1.2,2.2 \
     <(sed 1d sample001.file | sort) \
     <(sed 1d sample002.file | sort) |
join -e 0 -a 1 -a 2 -t $'\t' -o 0,1.2,1.3,2.2 \
   - <(sed 1d sample003.file | sort)

Note that join requires its inputs to be sorted on the join column, which is column 1 in this case. The sed 1d commands delete the heading line before sorting the data.

The -e0 says 'put 0 when values are missing. The -a1 and -a2 options say 'preserve all the rows from file 1 and file 2'. The -t $'\t' option uses Bash's ANSI C Quoting to generate a tab for the separator. If you omit the -t option, it 'works' but the output columns are separated by blanks, not tabs. The -o options specify the columns to print: 0 is the joining column (column 1 in each file); and 1.2 is column 2 from file 1, etc. The file name - in the second join means 'read standard input'.

The output from that on the sample data is:

James   0       0       65
Joey    54      23      0
Kyle    87      76      34
Marge   87      0       23
Pedro   0       76      76
Sia     43      0       65