Oddish - 2 months ago
Bash Question

Constructing an edge list from the second column of a two-column file, with relationships derived from the first column

I have a file with several million lines arranged like so:

1 Protein_A
1 Protein_B
2 Protein_A
3 Protein_C
4 Protein_A
4 Protein_B
4 Protein_C
4 Protein_D
5 Protein_C
5 Protein_D


Column 1 indicates an interaction pathway and column 2 indicates the protein's ID. Can anyone recommend an efficient way to turn this into an edge list of only non-reciprocal interactions per network, e.g.:

1 Protein_A,Protein_B
4 Protein_A,Protein_B
4 Protein_A,Protein_C
4 Protein_A,Protein_D
4 Protein_B,Protein_C
4 Protein_B,Protein_D
4 Protein_C,Protein_D
5 Protein_C,Protein_D


Or give me an indication of where to look for such data?

I tried a shell script that slowly iterates through the file and deletes the newline at the end of a line, which results in the following:

1 Protein_A 1 Protein_B


This can then be processed into an edge; however, this doesn't work if there are more than two proteins in a network. I'm drawing a blank. Can anyone please help?

Thank you in advance.

Answer

This is fairly easy in Python with a couple of standard-library modules. I have embedded the file contents in a string here; to read from a file instead, replace it with data = open("input.txt"), which can be iterated line by line in the same way.

I build a dictionary with the pathway number as key and the list of proteins found for that number as value.

Once it is built, I use itertools.combinations with size 2 to generate every unordered pair within each list, printing the key along the way.
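As a quick illustration of why combinations of size 2 gives the non-reciprocal pairs the question asks for, here is a minimal sketch on a single list. The `proteins` list is just the four IDs the question lists under pathway 4, and the variable names are mine, not part of the script.

```python
import itertools

# Illustrative input: the four proteins listed under pathway 4 in the question.
proteins = ["Protein_A", "Protein_B", "Protein_C", "Protein_D"]

# combinations(..., 2) yields each unordered pair exactly once, in input order,
# so (Protein_A, Protein_B) appears but (Protein_B, Protein_A) does not.
pairs = list(itertools.combinations(proteins, 2))

for a, b in pairs:
    print(a, b)
```

Four proteins give six pairs, so a pathway with n proteins contributes n*(n-1)/2 edges. The script below applies the same call to each pathway's list.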

import collections
import itertools

data = """1    Protein_A
1    Protein_B
2    Protein_A
3    Protein_C
4    Protein_A
4    Protein_B
4    Protein_C
4    Protein_D
5    Protein_C
5    Protein_D""".split("\n")

# Map each pathway number to the list of proteins seen for it.
d = collections.defaultdict(list)

for l in data:
    fields = l.split()  # split on any run of whitespace
    if len(fields) < 2:
        continue  # skip blank or malformed lines
    d[int(fields[0])].append(fields[1])

# Every unordered pair of proteins within a pathway is one edge.
for k, v in d.items():
    for a, b in itertools.combinations(v, 2):
        print(k, a, b)

result (Python 3):

1 Protein_A Protein_B
4 Protein_A Protein_B
4 Protein_A Protein_C
4 Protein_A Protein_D
4 Protein_B Protein_C
4 Protein_B Protein_D
4 Protein_C Protein_D
5 Protein_C Protein_D
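
Since the question mentions several million lines, a streaming variant may be worth considering. The sketch below is not part of the answer above: it assumes that all rows of a pathway are adjacent in the file (as they are in the sample), which lets itertools.groupby process one pathway at a time instead of holding the whole dictionary in memory. The function name `edges_by_pathway` and the `io.StringIO` stand-in for the input file are made up for illustration. It prints the "number protein,protein" format shown in the question.

```python
import io
import itertools


def edges_by_pathway(lines):
    """Yield (pathway, protein_a, protein_b) for each unordered pair in a pathway.

    Assumption: all rows belonging to one pathway are adjacent in the input.
    """
    # Split each non-blank line into [pathway, protein].
    rows = (line.split() for line in lines if line.strip())
    # groupby only merges consecutive rows that share the same pathway key.
    for pathway, group in itertools.groupby(rows, key=lambda row: row[0]):
        proteins = [row[1] for row in group]
        for a, b in itertools.combinations(proteins, 2):
            yield pathway, a, b


# Stand-in for open("input.txt"); a subset of the question's sample data.
sample = io.StringIO(
    "1 Protein_A\n1 Protein_B\n2 Protein_A\n5 Protein_C\n5 Protein_D\n"
)

result = ["{} {},{}".format(p, a, b) for p, a, b in edges_by_pathway(sample)]
for line in result:
    print(line)
```

If the file is not grouped by the first column, sorting it on that column first (or falling back to the dictionary approach) would be needed, because groupby treats each separate run of a key as its own group.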