Vinay Vinay - 2 months ago 13
Perl Question

Split array in Perl from new line to tab separated

I have data in the following format:

4.8e-38 140.9 4.3 5.8e-38 140.6 4.3 1.1 1 NP_001275340.1 ELF4-like protein [Solanum tuberosum]XP_0063
4.8e-38 140.9 4.3 5.8e-38 140.6 4.3 1.1 1 XP_015080718.1 PREDICTED: protein ELF4-LIKE 3-like isoform X
5.3e-38 140.7 4.4 6.3e-38 140.5 4.4 1.1 1 XP_016481343.1 PREDICTED: protein ELF4-LIKE 4-like [Nicotian
5.4e-38 140.7 5.1 6.6e-38 140.4 5.1 1.1 1 XP_009784404.1 PREDICTED: protein ELF4-LIKE 4-like [Nicotian


I have created a Perl array in which each element holds one line. For example, if I print $ARRAY[0], the output is:

4.8e-38 140.9 4.3 5.8e-38 140.6 4.3 1.1 1 NP_001275340.1 ELF4-like protein [Solanum tuberosum]XP_0063.


What I need is to split the array into columns, so that if I print "$ARRAY[8]", the output is the list of identifiers/accession numbers (NP_001275340.1, XP_015080718.1).

I have tried using the split function, but because the data is not uniformly separated (e.g., by a tab or a single space), I have not been able to do that. Any suggestions?

Answer

If it's not uniformly separated, it's useful to know that:

split with no arguments splits $_ on any run of whitespace (and ignores leading whitespace)

So you can just do:

#!/usr/bin/env perl

use strict;
use warnings;

while ( <DATA> ) {
    my @array = split;
    print $array[8],"\n";
}

__DATA__
4.8e-38  140.9   4.3    5.8e-38  140.6   4.3    1.1  1  NP_001275340.1  ELF4-like protein [Solanum tuberosum]XP_0063
4.8e-38  140.9   4.3    5.8e-38  140.6   4.3    1.1  1  XP_015080718.1  PREDICTED: protein ELF4-LIKE 3-like isoform X
5.3e-38  140.7   4.4    6.3e-38  140.5   4.4    1.1  1  XP_016481343.1  PREDICTED: protein ELF4-LIKE 4-like [Nicotian
5.4e-38  140.7   5.1    6.6e-38  140.4   5.1    1.1  1  XP_009784404.1  PREDICTED: protein ELF4-LIKE 4-like [Nicotian

But split also allows you to specify a regex:

my @array = split /(?:\t| +)/; 

This splits on a single tab or on one or more spaces, so a double tab still yields an empty field instead of being collapsed into one separator. Note that you need ?: to make the group non-capturing, because split adds whatever a capturing group matches to the list it returns.
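To illustrate the two points above, here is a minimal sketch. The sample strings and variable names are made up for the demonstration and are not part of your data:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample lines (not from the question's data).
my $single_tab = "a\tb";
my $double_tab = "a\t\tb";

# Capturing group: the matched separator is returned as an extra element.
my @with_capture = split /(\t| +)/, $single_tab;

# Non-capturing group: only the fields are returned.
my @without_capture = split /(?:\t| +)/, $single_tab;

# A double tab produces an empty middle field with this regex ...
my @keeps_empty = split /(?:\t| +)/, $double_tab;

# ... whereas the default whitespace split collapses both tabs.
my @collapses = split ' ', $double_tab;

print scalar(@with_capture),    "\n";    # 3 elements: "a", "\t", "b"
print scalar(@without_capture), "\n";    # 2 elements: "a", "b"
print scalar(@keeps_empty),     "\n";    # 3 elements: "a", "", "b"
print scalar(@collapses),       "\n";    # 2 elements: "a", "b"
```

Which behaviour you want depends on whether empty columns can occur in your file; for the sample shown in the question, the default split looks sufficient.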

split also lets you specify a field limit, which is useful here because your last field looks like a free-text description:

my @array = split ' ', $_, 10;

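This gives the same $array[8], but $array[9] now holds the whole description in one piece; for the second line, for example: "PREDICTED: protein ELF4-LIKE 3-like isoform X" (followed by the trailing newline, unless you chomp the line first).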
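As a small sketch of the field limit in action, the following takes the second line of your sample, chomps it, and pulls out the identifier and the description. The variable names ($line, $id, $description) are only illustrative:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Second line of the sample data, including its trailing newline.
my $line = "4.8e-38  140.9   4.3    5.8e-38  140.6   4.3    1.1  1  XP_015080718.1  PREDICTED: protein ELF4-LIKE 3-like isoform X\n";

# Remove the newline so it does not end up inside the last field.
chomp $line;

# At most 10 fields: the tenth keeps the rest of the line unsplit.
my @fields = split ' ', $line, 10;

# Array slice: field 8 is the accession, field 9 the description.
my ( $id, $description ) = @fields[ 8, 9 ];

print "$id\n";
print "$description\n";
```

Without the limit, the description would be broken into separate words spread over $fields[9], $fields[10] and so on.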

The real root of your problem though, is that if you've read all of the file into an array already - what you have is an array of lines.

You can transform this - either at input time (as in the above examples) or via map:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

my @input = <DATA>;
print Dumper \@input;
print join "\n", map { (split)[8] } @input;

__DATA__
4.8e-38  140.9   4.3    5.8e-38  140.6   4.3    1.1  1  NP_001275340.1  ELF4-like protein [Solanum tuberosum]XP_0063
4.8e-38  140.9   4.3    5.8e-38  140.6   4.3    1.1  1  XP_015080718.1  PREDICTED: protein ELF4-LIKE 3-like isoform X
5.3e-38  140.7   4.4    6.3e-38  140.5   4.4    1.1  1  XP_016481343.1  PREDICTED: protein ELF4-LIKE 4-like [Nicotian
5.4e-38  140.7   5.1    6.6e-38  140.4   5.1    1.1  1  XP_009784404.1  PREDICTED: protein ELF4-LIKE 4-like [Nicotian

In the above example, map iterates over each element of @input, splits it, selects field 8, and returns the results as a list.

So you could:

my @identifiers = map { (split)[8] } @input; 

Note that split still works the same way here: inside map, $_ is the current element, and split defaults to splitting $_ on whitespace.
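If you want every column available by index rather than just column 8, one possible approach (a sketch, not the only way) is to build an array of array references, one per column. The name @columns is hypothetical, and @input here holds just two lines of your sample:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Two lines of the sample data, as they would look after reading the file.
my @input = (
    "4.8e-38  140.9   4.3    5.8e-38  140.6   4.3    1.1  1  NP_001275340.1  ELF4-like protein [Solanum tuberosum]XP_0063\n",
    "4.8e-38  140.9   4.3    5.8e-38  140.6   4.3    1.1  1  XP_015080718.1  PREDICTED: protein ELF4-LIKE 3-like isoform X\n",
);

# $columns[N] will be a reference to the list of values in column N.
my @columns;
for my $line (@input) {
    chomp( my $copy = $line );            # work on a copy, leave @input intact
    my @fields = split ' ', $copy, 10;    # keep the description as one field
    push @{ $columns[$_] }, $fields[$_] for 0 .. $#fields;
}

# Column 8 holds the identifiers/accession numbers.
print join( ", ", @{ $columns[8] } ), "\n";
```

Here $columns[8] is an array reference, so it has to be dereferenced with @{ ... } before printing or looping over it.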
