user2300940 - 4 months ago
Perl Question

select specific columns from complex lines

I have a file whose lines have the following format. I would like to keep only the first column and the column containing a string of the form NC_XXXX.1

484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
.....
....
...


Desired output:

484-2117 NC_002188.1
787-1087 NC_004068.1

Answer

Something like this in Perl:

#!/usr/bin/env perl
use strict;
use warnings;

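while (<DATA>) {
   # Capture the leading ID (digits and dashes) and the NC_ identifier.
   # Skip lines that do not match, so no undefined values are printed.
   my ( $id, $nc ) = m/^([\d\-]+).*(NC_[\d\.]+)/ or next;
   print "$id $nc\n";
}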

__DATA__
484-2117    16  gi|9634679|ref|NC_002188.1| 188705  23  21M *   0   0   CGCGTACCAAAAGTAATAATT   IIIIIIIIIIIIIIIIIIIII   AS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0G20   YT:Z:UU
787-1087    16  gi|21844535|ref|NC_004068.1|    7006    23  20M *   0   0   CTATACAACCTACTACCTCA    IIIIIIIIIIIIIIIIIIII    AS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:19T0   YT:Z:UU

Output:

484-2117 NC_002188.1
787-1087 NC_004068.1

This reduces to a one-liner:

perl -ne 'm/^([\d\-]+).*(NC_[\d\.]+)/ and print "$1 $2\n"' yourfile

Note: this specifically matches a first column made up of digits and dashes. If your IDs can contain other characters, you could widen the pattern (for example, \S+ for the first column).
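
A minimal sketch of that wider match, assuming the first column is any run of non-whitespace characters and the target always looks like NC_, digits, a dot, and a version number. The sub name extract_id_and_nc and the sample lines (shortened from the question, plus two made-up ones) are illustrative only and not part of the original answer:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns (id, nc) for a matching line,
# or an empty list when the line contains no NC_ identifier.
sub extract_id_and_nc {
    my ($line) = @_;
    # ^(\S+)          first column: any non-whitespace characters
    # (NC_\d+\.\d+)   identifier with a version suffix, e.g. NC_002188.1
    return $line =~ m/^(\S+)\s.*?\b(NC_\d+\.\d+)/ ? ( $1, $2 ) : ();
}

# Illustrative input; the second ID and the third line are made up
# to show a non-numeric ID and a line without an NC_ identifier.
my @sample = (
    "484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0",
    "read_A.7 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0",
    "no-match-here 4 * 0 0 * * 0 0",
);

for my $line (@sample) {
    # An empty return list makes the assignment false, so the line is skipped.
    my ( $id, $nc ) = extract_id_and_nc($line) or next;
    print "$id $nc\n";
}
```

With these samples the script prints "484-2117 NC_002188.1" and "read_A.7 NC_004068.1" and silently skips the third line. To process a real file, the loop could iterate over <> instead of the inline array.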