Bio21 - 12 days ago
Python Question

Awk: how to compare two strings in one line

I have a dataset with 20,000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried the awk substr function, but didn't get the expected outcome. Here is the one-liner I tried:

awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'


Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/), but I can't find a way to match the probes in two columns using a regex. All suggestions and comments will be very much appreciated.

Example of dataset:

Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC


Desired output:

4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC

Answer

This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:

awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
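
As a quick check, here is a minimal sketch that runs the filter on the five sample rows from the question. The variable names sample and matched are only for illustration, and the rows are assumed to be whitespace-separated exactly as shown in the example:

```shell
# The five data rows from the question, held in a shell variable for the demo.
sample='4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC'

# Keep only rows whose 2nd and 4th fields end in the same character.
matched=$(printf '%s\n' "$sample" |
  awk 'substr($2, length($2), 1) == substr($4, length($4), 1)')

printf '%s\n' "$matched"
```

On this input the rows 4736, 4737 and 4740 should be printed, which is the desired output from the question. With a real file, the same awk program can be given the file name as its last argument instead of reading from a pipe.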

What I changed compared to your sample script:

  • Move the condition out of the if statement inside the { ... } block and use it as a pattern (filter)
  • Use length($2) and length($4) instead of hardcoding the value 21
  • The { print $0 } is not needed, as that is the default action for the matched lines
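
One place where the second change can matter is a header row. The sketch below assumes the file starts with a whitespace-separated header like the one in the example (the real file may differ, and this may or may not be what went wrong in the original attempt). The variable names are hypothetical:

```shell
# Header row plus two data rows taken from the question's example.
sample='Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT'

# Hard-coded position 21: on the header, $2 is "1" and $4 is "2", so both
# substr calls return an empty string and the two empty strings compare equal.
fixed=$(printf '%s\n' "$sample" |
  awk 'substr($2, 21, 1) == substr($4, 21, 1)')

# length-based position: on the header this compares "1" with "2",
# so the header row is filtered out.
by_length=$(printf '%s\n' "$sample" |
  awk 'substr($2, length($2), 1) == substr($4, length($4), 1)')

printf 'fixed:\n%s\n\nby_length:\n%s\n' "$fixed" "$by_length"
```

Here the hard-coded version lets the header row through along with row 4736, while the length-based version prints only row 4736. Fields shorter than 21 characters would behave the same way as the header.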