jonwells - 6 months ago 32

Python Question

I'm trying to implement the Smith-Waterman algorithm for local sequence alignment using the affine gap penalty function. I think I understand how to initiate and compute the matrices required for calculating alignment scores, but am clueless as to how to then traceback to find the alignment. To generate the 3 matrices required I have the following code

`for j in range(1, len2):`

for i in range(1, len1):

fxOpen = F[i][j-1] + gap

xExtend = Ix[i][j-1] + extend

Ix[i][j] = max(fxOpen, xExtend)

fyOpen = F[i-1][j] + gap

yExtend = Iy[i-1][j] + extend

Iy[i][j] = max(fyOpen, yExtend)

matchScore = (F[i-1][j-1] + simMatrixDict[seq1[i-1]+seq2[j-1]])

xScore = Ix[i-1][j-1] + simMatrixDict[seq1[i-1]+seq2[j-1]]

yScore = Iy[i-1][j-1] + simMatrixDict[seq1[i-1]+seq2[j-1]]

F[i][j] = max(0, matchScore, xScore, yScore)

I am unsure if I need a single matrix for traceback, or just 1? Any clarification on how to go about tracing back from the max score in F would be much appreciated.

Answer

The important thing to remember about traceback in Smith-Waterman is that the matrix a value is in determines the direction that you move. So, if you are in `F`

you're moving diagonally, if you're in `Ix`

, you're moving horizontally, and if you're in `Iy`

, you're moving vertically. This means that all you need to store in the pointer matrix is the matrix that you arrived at a square from. The matrix that you are coming from, not the one that you are going to, determines the dirction which to go.

**For example:**

Say you are at `F[5][5]`

:

- If pointer matrix says to go to
`Ix`

, go to`Ix[4][4]`

- If pointer matrix says to go to
`Iy`

, go to`Iy[4][4]`

- If pointer matrix says to go to
`F`

, go to`F[4][4]`

Whereas if you are at `Ix[5][5]`

:

- If pointer matrix says to go to
`Ix`

, go to`Ix[4][5]`

- If pointer matrix says to go to
`F`

, go to`F[4][5]`

Or if you are at `Iy[5][5]`

:

- If pointer matrix says to go to
`Iy`

, go to`Iy[5][4]`

- If pointer matrix says to go to
`F`

, go to`F[5][4]`

Assuming that the first index is the x coordinate and the second is the y coordinate.

Continue tracing back until you reach a cell with a maximum value of 0.

**Building the pointer matrix:**
You need one pointer matrix each for `F`

, `Ix`

, and `Iy`

. These matrices only need to indicate which matrix a value came from, because that tells you which direction you were moving in. So, as you're running through the dynamic programming phase of the algorithm, you should also be building up the pointer matrices. Every time you store a new maximum value in a cell in `F`

, `Ix`

, or `Iy`

, you should update the corresponding matrix to indicate where it came from. If, for instance, the highest value you can have in `F[5][5]`

comes by aligning the two next bases when you're in `F[4][4]`

, the Fpointer[5][5] should be set to `F`

, because you got there from the `F`

matrix.