JDnoble - 4 months ago
Perl Question

How to count the number of occurrences of n-length combinations of characters in a string

I am using the following one-liner to list the occurrences of every combination of ATCG that forms a string of length 6. It works fine, except that it prints nothing for combinations with 0 matches. Is there a way to change the regex, or another part of the pipeline, so that it prints something like "0 ATTTAG"?

#!/bin/bash
for file in e_coli.fa
do
base=$(basename $file .fa)
cat $file | perl -nE 'say for /(?<=([ATCG]{6}))/g' \
| sort | uniq -c >> ${base}_hexhits_6mer.txt
done

stdout:
465 AAAAAA
607 AAAAAC
661 AAAAAG
581 AAAAAT
563 AAAACA
807 AAAACC
770 AAAACG
373 AAAACT
663 AAAAGA
1213 AAAAGC

Answer

Since uniq -c counts how many times each line occurs in its input, it can never report 0: a hexamer that never appears produces no line to count. The requested change therefore requires a complete rewrite.

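perl -e'
   # Count every overlapping 6-mer. The lookahead consumes nothing,
   # so the next match can start one character later.
   while (<>) {
      ++$counts{$_} for /(?=([ATCG]{6}))/g;
   }

   # glob expands the braces into all 4**6 = 4096 hexamers, in sorted order.
   for my $seq (glob("{A,C,G,T}" x 6)) {
      printf("%7d %s\n", $counts{$seq} // 0, $seq);
   }
' "$file" >"${base}_hexhits_6mer.txt"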