Amanda Botelho Alvarenga Amanda Botelho Alvarenga - 4 months ago 13
Linux Question

How to transform column to row in Terminal Linux? But more complex

How I transform column to row in Terminal Linux? But more complex... The following is an example of my data:

SNP_Name ID_Animal Allele Chr Position
rs01 215 AB 1 100
rs02 215 AA 2 200
rs03 215 BA 3 300
rs04 215 AA 4 400
rs01 300 AB 1 100
rs02 300 BB 2 200
rs03 300 AA 3 300
rs04 300 AB 4 400
rs01 666 BB 1 100
rs02 666 AA 2 200
rs03 666 AB 3 300
rs04 666 AB 4 400


I want to transform this into the following:

SNP_Name Chr Position 215(ID_animal) 300(ID_Animal) 666(ID_Animal)
rs01 1 100 AB AB BB
rs02 2 200 AA BB AA
rs03 3 300 BA AA AB
rs04 4 400 AA AB AB


The row
ID_animal
change in column with respective allele. How I do this?
But I will work with 55,000 repetition per
ID_animal
. So, I want to be only 55,000 row and (animal
number
+
SNP_Name
+
Chr
+
Position
) of column.

Thank you.

Answer

The issue here is the amount of data, and I don't want to give a solution that reads everything into memory and then outputs it.

To do this, I'd like to parse and output the data for each SNP (rs number) in turn rather than for each animal in turn. But the data is given to us in the wrong order (it's sorted by animal).

So the first thing we need to do is to sort the data by SNP (the first column). I will also remove the header row at the same time as it is not needed for the data transformation.

I'm assuming that the data is stored in the file data.in:

$ sed '1d' data.in | sort -o data.tmp

We now have:

$ cat data.tmp
rs01            215               AB            1            100
rs01            300               AB            1            100
rs01            666               BB            1            100
rs02            215               AA            2            200
rs02            300               BB            2            200
rs02            666               AA            2            200
rs03            215               BA            3            300
rs03            300               AA            3            300
rs03            666               AB            3            300
rs04            215               AA            4            400
rs04            300               AB            4            400
rs04            666               AB            4            400

Then I run the following to produce the result:

$ awk -f script.awk data.tmp >data.new

The awk script is quite long, so it makes sense to have it in it's own script file rather than as a "one-liner":

FNR == 1    {
    rsid        = $1;
    chr         = $4;
    pos         = $5;

    c           = 0;
    aid[c]      = $2;
    all[c++]    = $3;

    do_header   = 1;

    next;
}

rsid == $1 {
    aid[c]      = $2;
    all[c++]    = $3;
    next;
}

{
    if (do_header) {
        printf("SNP_name\tChr\tPosition\t");
        for (c in aid) {
            printf("%d\t", aid[c]);
        }
        printf("\n");

        do_header = 0;
    }

    printf("%s\t%d\t%d\t", rsid, chr, pos);
    for (c in all) {
        printf("%s\t", all[c]);
    }
    printf("\n");

    rsid        = $1;
    chr         = $4;
    pos         = $5;

    c           = 0;
    aid[c]      = $2;
    all[c++]    = $3;
}

END {
    printf("%s\t%d\t%d\t", rsid, chr, pos);
    for (c in all) {
        printf("%s\t", all[c]);
    }
    printf("\n");
}

This produces the tab-delimited file data.new with the following contents, for the given input:

SNP_name    Chr Position    215 300 666
rs01    1   100 AB  AB  BB
rs02    2   200 AA  BB  AA
rs03    3   300 BA  AA  AB
rs04    4   400 AA  AB  AB