Haroon Haroon - 2 months ago 13
R Question

Print lines after grep till next pattern

I have a FASTA file (Input_fasta.txt) of the form:

>tr|A0A089QH62|A0A089QH62_MYCTU Histidine kinase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_00865 PE=4 SV=1
MTATASGIAATAPNCGEASINDVPIAESERRYLGARSASEYGQEIPLW
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
>tr|A0A089SBT4|A0A089SBT4_MYCTU Glycosyl transferase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_19775 PE=4 SV=1
MDTETHYSDVWVVIPAFNEAAVIGKVVTDVRSVFDHVVCVDDGSTDGTGDIARRSGAHLV
RHPINLGQGAAIQTGIEYARKQPGAQVFATFDGDGQHRVKDVAAMVDRLGAGDVDVVIGT
RFGRPVGKASASRPPLMKRIVLQTGARLSRRGRRLGLTDTNNGLRVFNKTVADGLNITMS
GMSHATEFIMLIAENHWRVAEEPVEVLYTEYSKSKGQPLLNGVNIIFDGFLRGRMPR
>tr|A0A089QKT1|A0A089QKT1_MYCTU TetR family transcriptional regulator OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_00800 PE=4 SV=1
MSLTAGRGPGRPPAAKADETRKRILHAARQVFSERGYDGATFQEIAVRADLTRPAINHYF
ANKRVLYQEVVEQTHELVIVAGIERARREPTLMGRLAVVVDFAMEADAQYPASTAFLATT
VLESQRHPELSRTENDAVRATREFLVWAVNDAIERGELAADVDVSSLAETLLVVLCGVGF
YIGFVGSYQRMATITDSFQQLLAGTLWRPPT
>tr|I6YAB3|I6YAB3_MYCTU Iron ABC transporter permease OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_07380 PE=4 SV=1
MARGLQGVMLRSFGARDHTATVIETISIAPHFVRVRMVSPTLFQDAEAEPAAWLRFWFPD
PNGSNTEFQRAYTISEADPAAGRFAVDVVLHDPAGPASSWARTVKPGATIAVMSLMGSSR
FDVPEEQPAGYLLIGDSASIPGMNGIIETVPNDVPIEMYLEQHDDNDTLIPLAKHPRLRV
RWVMRRDEKSLAEAIENRDWSDWYAWATPEAAALKCVRVRLRDEFGFPKSEIHAQAYWNA
GRAMGTHRATEPAATEPEVGAAPQPESAVPAPARGSWRAQAASRLLAPLKLPLVLSGVLA
ALVTLAQLAPFVLLVELSRLLVSGAGAHRLFTVGFAAVGLLGTGALLAAALTLWLHVIDA
RFARALRLRLLSKLSRLPLGWFTSRGSGSIKKLVTDDTLALHYLVTHAVPDAVAAVVAPV
GVLVYLFVVDWRVALVLFGPVLVYLTITSSLTIQSGPRIVQAQRWAEKMNGEAGSYLEGQ
PVIRVFGAASSSFRRRLDEYIGFLVAWQRPLAGKKTLMDLATRPATFLWLIAATGTLLVA
THRMDPVNLLPFMFLGTTFGARLLGIAYGLGGLRTGLLAARHLQVTLDETELAVREHPRE
PLDGEAPATVVFDHVTFGYRPGVPVIQDVSLTLRPGTVTALVGPSGSGKSTLATLLARFH
DVERGAIRVGGQDIRSLAADELYTRVGFVLQEAQLVHGTAAENIALAVPDAPAEQVQVAA
REAQIHDRVLRLPDGYDTVLGANSGLSGGERQRLTIARAILGDTPVLILDEATAFADPES
EYLVQQALNRLTRDRTVLVIAHRLHTITRADQIVVLDHGRIVERGTHEELLAAGGRYCRL
WDTGQGSRVAVAAAQDGTR
>tr|L0T545|L0T545_MYCTU PPE family protein PPE7 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=PPE7 PE=4 SV=1
MSVCVIYIPFKGCVKHVSVTIPITTEHLGPYEIDASTINPDQPIDTAFTQTLDFAGSGTV
GAFPFGFGWQQSPGFFNSTTTPSSGFFNSGAGGASGFLNDAAAAVSGLGNVFTETSGFFN
AGGVGIRASKTSATCCRAGRT


and another file (Pattern.txt) containing the patterns to match:

I6WXB4

I6WXC3

I6WXK8


I need output like this:

>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH


What I have done so far is:
grep -f Pattern.txt Input_fasta.txt


How can I extend the output to include the lines that follow each match, up to the next ">"?

I tried:
awk '/I6WXB4/{copy=1;next} />/{copy=0;next} copy' Input_fasta.txt

which gave this output:
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH

but the header line is missing.

Answer

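In awk, read the IDs from Pattern.txt into an array, then read Input_fasta.txt with ">" as the record separator and "|" as the field separator. Each record is then one whole FASTA entry and $2 is its accession. Because RS=">" strips the leading ">" from every record, printf puts it back; the NF test skips the empty record that precedes the first ">":

$ awk 'NR==FNR{a[$0]; next} NF && $2 in a{printf ">%s", $0}' Pattern.txt FS='|' RS='>' Input_fasta.txt
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1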
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
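
If you would rather keep awk's default line-by-line records, a flag-based variant close to the attempt in the question also works. This is only a sketch: the demo files are shortened, hypothetical stand-ins for Input_fasta.txt and Pattern.txt, and the variable names ids and keep are my own. The attempt in the question lost the header because `next` skipped the matching line before it could be printed; here the flag is set on the header line and the same line then falls through to the print rule.

```shell
#!/bin/sh
# Hypothetical, shortened demo files standing in for the real inputs.
tmpdir=$(mktemp -d)
cat > "$tmpdir/Pattern.txt" <<'EOF'
I6WXB4

I6WXC3
EOF
cat > "$tmpdir/Input_fasta.txt" <<'EOF'
>tr|A0A089QH62|A0A089QH62_MYCTU Histidine kinase
MTATASGIAATAPNCGEAS
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6
MRPYEIMVILDPTLDERTV
VVIDVKAAPATVSELDRQL
>tr|L0T545|L0T545_MYCTU PPE family protein PPE7
MSVCVIYIPFKGCVKHVSV
EOF

# First file: remember every non-empty line as an accession.
# Second file: each header line sets keep to 1 or 0 depending on
# whether its second "|"-separated field is a known accession;
# any line (header included) is printed while keep is non-zero.
out=$(awk -F'|' '
    NR == FNR { if ($0 != "") ids[$0]; next }
    /^>/      { keep = ($2 in ids) }
    keep
' "$tmpdir/Pattern.txt" "$tmpdir/Input_fasta.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

On the demo files this prints the I6WXB4 header followed by its two sequence lines. IDs are compared as exact strings, so trailing spaces or Windows line endings in Pattern.txt would prevent a match.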