Bio21 - 8 days ago
Python Question

Use Sed/Awk to extract first three unique instances of the line

I have a list of 20,000 probes. Is there a way to extract the first three lines (occurrences) for each probe using sed or awk?

Example of dataset:
Probe1 A GTTAGAGGAGGTGGAAGAGC
Probe1 B CTGAGGTCGGGACGGAGCAC
Probe1 C GATGTAGGCGGTTGGCGTGG
Probe1 D GTTGGCGAAGTCACATCTAG
Probe1 E CATGTCGCCGACTCCGTCGA
Probe1 F GTGATGTTCTGAGTACATAG

Probe3 A GATTGTAGGTTTCCTGCCAG
Probe3 L ACCCAGCCAGGGGAAAACCA
Probe3 Z GGAGATGTAGGCGGTTGGCG
Probe3 Y GGAGATGTAGGCCTTAAAAA
Probe3 D GATTGTAGGGGTCCTGCCAG


Desired output:

Probe1 A GTTAGAGGAGGTGGAAGAGC
Probe1 B CTGAGGTCGGGACGGAGCAC
Probe1 C GATGTAGGCGGTTGGCGTGG
Probe3 A GATTGTAGGTTTCCTGCCAG
Probe3 L ACCCAGCCAGGGGAAAACCA
Probe3 Z GGAGATGTAGGCGGTTGGCG

Answer

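awk to the rescue! The pattern below counts how many times each value of the first field has been seen and prints a line only while that count is still below 4: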

$ awk '++a[$1]<4' file

To also drop the empty lines, additionally require a non-zero field count (NF):

$ awk '++a[$1]<4 && NF' file
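
Since the question is tagged Python, here is a minimal sketch of the same counting idea in Python. It is not part of the original answer: the function name `first_n_per_key` and the variable `sample` are made up for illustration, and it assumes whitespace-separated fields with the probe name in the first column.

```python
from collections import defaultdict


def first_n_per_key(lines, n=3):
    """Yield the first n non-blank lines seen for each first-column value."""
    seen = defaultdict(int)  # probe name -> number of lines seen so far
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines, like the `&& NF` test in awk
            continue
        seen[fields[0]] += 1
        if seen[fields[0]] <= n:  # same condition as `++a[$1] < 4` when n is 3
            yield line.rstrip("\n")


# Hypothetical in-memory sample taken from the question's dataset.
sample = """Probe1 A GTTAGAGGAGGTGGAAGAGC
Probe1 B CTGAGGTCGGGACGGAGCAC
Probe1 C GATGTAGGCGGTTGGCGTGG
Probe1 D GTTGGCGAAGTCACATCTAG

Probe3 A GATTGTAGGTTTCCTGCCAG
Probe3 L ACCCAGCCAGGGGAAAACCA
Probe3 Z GGAGATGTAGGCGGTTGGCG
Probe3 Y GGAGATGTAGGCCTTAAAAA
"""

result = list(first_n_per_key(sample.splitlines()))
for row in result:
    print(row)
```

For a real file, passing an open file object (for example `first_n_per_key(open("file"))`) should work the same way, because the function only iterates over lines. Memory use grows with the number of distinct probes, not with the number of lines, just as with the awk array.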