I have the following AWK script that counts occurrences of the elements in field 1 and, when it finishes reading the entire file, prints each element and the number of times it appears.
awk '{a[$1]++} END{ for(i in a){print i"-->"a[i]} }' file
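For instance, on a made-up two-column input (the file name and contents are only for illustration; note that for (i in a) visits keys in an unspecified order):

    $ cat file
    apple 10
    pear 20
    apple 30
    $ awk '{a[$1]++} END{ for(i in a){print i"-->"a[i]} }' file
    apple-->2
    pear-->1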
I tried to write an equivalent one-liner in Perl, but it does not even compile:

perl -lane '$a{$F[1]}++ END{foreach $a {print $a} }' file

How can I do the same in Perl?
Comparing the suggested solutions on my file:

awk '{a[$1]++}END{for(i in a){print i"-->"a[i]}}' file   # --> approx. 2:45 (m:ss)
perl -lane '$a{$F[0]}++;END{foreach my $k (keys %a){ print "$k --> $a{$k}" } }' file   # --> approx. 7 min
perl -lanE'$a{$F[0]}++; END { say "$_ => $a{$_}" for keys %a }' file   # --> approx. 9 min
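Numbers like these can be gathered with the shell's time builtin, discarding the output so terminal printing does not skew the result:

    time awk '{a[$1]++}END{for(i in a){print i"-->"a[i]}}' file > /dev/null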
Equivalent to your awk line:

perl -lanE'$a{$F[0]}++; END { say "$_ => $a{$_}" for keys %a }' file

By -a the line is broken into fields in @F, so you want $F[0] as a key in a hash %a, with the value of the counter handled by ++. The hash is iterated over keys and printed in the END block.
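Spelled out, that one-liner corresponds roughly to the following script (a sketch of what the -n, -l, -a, and -E switches amount to, not literal B::Deparse output):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';              # -E enables say (among other features)

    my %a;
    while (my $line = <>) {         # -n wraps the body in this read loop
        chomp $line;                # -l strips the input record separator
        my @F = split ' ', $line;   # -a autosplits the line into @F
        $a{ $F[0] }++;              # count occurrences of the first field
    }
    say "$_ => $a{$_}" for keys %a; # the END block, after all input is read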
However, the question of efficiency comes up. One way to improve this is to not fetch all the fields on the line, which -a does, since only the first one is needed. Between the two ways that come to mind (each with the same END block as above)

perl -nE'$a{(/(\S+)/)[0]}++; END { ... }'

and

perl -nE'$a{(split " ", $_, 2)[0]}++; END { ... }'

the split is significantly faster, with its 3.63s vs 4.41s for the regex on an 8M-line file. (The third argument limits split to two pieces, so the remainder of the line is never broken up.) This is still behind the 1.99s for your awk line, so it seems that awk is faster for this task.
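To compare such variants in-process rather than timing whole runs, the core Benchmark module can be used. A minimal sketch, assuming the data fits in memory and the file is again called file:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Slurp the lines once, so only the per-line parsing is measured
    my @lines = do {
        open my $fh, '<', 'file' or die "Can't open file: $!";
        <$fh>;
    };

    cmpthese(-3, {   # run each variant for at least 3 CPU seconds
        regex => sub { my %a; $a{ (/(\S+)/)[0] }++ for @lines },
        split => sub { my %a; $a{ (split ' ', $_, 2)[0] }++ for @lines },
    });

cmpthese prints each variant's rate and their relative speed difference.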
Summary of my timings for an 8-million-line file (average of a few runs):

    awk  (question)    1.99s
    perl (split)       3.63s
    perl (regex)       4.41s
    perl (like awk)    5.61s
These timings vary over runs by a few tens of milliseconds (on the order of 0.01s).