Clarissa - 1 month ago
Python Question

How to split files according to a field and edit content

I am not sure whether I can do this with Unix commands or whether I need something more involved, such as a Python script.

I have a big input file with 3 columns: an id, a sequence (second column), and the group that the sequence belongs to (third column).

Seq1 MVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN Group1
Seq2 PSLFIAGWLFVSTGLRPNEYFTESRQGIPLITDRFDSLEQLDEFSRSF Group1
Seq3 HQAPAPAPTVISPPAPPTDTTLNLNGAPSNHLQGGNIWTTIGFAITVFLAVTGYSF Group20


I would like to split this file according to the group id, creating a separate file for each group, and to edit the content of each file by adding a ">" sign at the beginning of each id and putting the sequence on a new line below it:

Group1.txt file
>Seq1
MVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN
>Seq2
PSLFIAGWLFVSTGLRPNEYFTESRQGIPLITDRFDSLEQLDEFSRSF

Group20.txt file
>Seq3
HQAPAPAPTVISPPAPPTDTTLNLNGAPSNHLQGGNIWTTIGFAITVFLAVTGYSF


How can I do that?

Answer

This shell script should do the trick:

#!/usr/bin/env bash

filename="data.txt"
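# Read the three whitespace-separated columns directly; -r keeps backslashes literal.
while read -r id sequence group; do
    # Skip blank lines so that no stray ".txt" file is created.
    [ -n "${group}" ] || continue
    # Use a fixed format string so the data is never interpreted as printf directives.
    printf '>%s\n%s\n' "${id}" "${sequence}" >> "${group}.txt"
done < "${filename}"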

where data.txt is the name of the file containing the original data.

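Importantly, the group files (Group1.txt, Group20.txt, and so on) should not exist before the script is run: the script appends with >>, so any existing contents would be kept and the new records added after them.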
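
Since the question also mentions Python, here is a minimal sketch of the same idea in Python. It is not part of the original answer. The function name split_by_group and the output_dir parameter are made up for illustration. It assumes every data line has exactly three whitespace-separated columns, and that the number of distinct groups is small enough to keep one output file open per group (with many groups, the operating system's limit on open files could be reached).

```python
from contextlib import ExitStack
from pathlib import Path


def split_by_group(input_path, output_dir="."):
    """Write one FASTA-style file per group; return the sorted group names.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    out_dir = Path(output_dir)
    handles = {}  # group name -> open output file
    # ExitStack closes every output file when the block ends.
    with ExitStack() as stack, open(input_path) as source:
        for line in source:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            seq_id, sequence, group = fields
            if group not in handles:
                # "w" mode truncates a group file left over from an earlier run.
                handles[group] = stack.enter_context(
                    open(out_dir / f"{group}.txt", "w")
                )
            handles[group].write(f">{seq_id}\n{sequence}\n")
    return sorted(handles)
```

Calling split_by_group("data.txt") would write Group1.txt, Group20.txt, and so on into the current directory. Because each group file is opened in "w" mode the first time its group is seen, this version overwrites files from a previous run instead of appending to them, and it reads the input line by line, so a big input file does not have to fit in memory.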