|
The Needleman-Wunsch algorithm performs a global alignment on two sequences(called A and B here). It is commonly used by bioinformaticists to align protein or nucleotide sequences. The algorithm was proposed in 1970 by Saul Needleman and Christian Wunsch in their paper A general method applicable to the search for similarities in the amino acid sequence of two proteins.
The Needleman-Wunsch algorithm is based on the concept of Dynamic programming, and so is guaranteed to find the alignment with the maximum score. Needleman-Wunsch the first time it had been applied to biological sequence comparison.
The scoring system used gives scores for aligned characters, based on a similarity matrix. Here, <math>S(i, j)<math> is the similarity of characters i and j. It uses a linear gap penalty, here called d.
For example, if the similarity matrix was
| - | A | G | C | T |
| A | 10 | -1 | -3 | -4 |
| G | -1 | 7 | -5 | -3 |
| C | -3 | -5 | 9 | 0 |
| T | -4 | -3 | 0 | 8 |
then the alignment:
AGACTAGTTAC
CGA---GACGT
with a gap penalty of -5, would have the following score...
<math>S(A,C) + S(G,G) + S(A,A) + 3\times d + S(G,G) + S(T,A) + S(T,C) + S(A,G) + S(C,T)<math>
<math>= -3 + 7 + 10 - 3\times 5 + 7 + -4 + 0 + -1 + 0 = 1<math>
To find the alignment with the highest score, a two dimensional array(or matrix) is allocated. This matrix is often called the F matrix, and its i,jth entry is often denoted <math>F_{ij}<math> There is one column for each character in sequence A, and one row for each character in sequence B. This means that if we are aligning two sequences, one of size n and one of size m, this algorithm uses O(nm) of space(although there is a modification called Hirschberg's insight, which uses linear space and more computing time).
As the algorithm progresses, the <math>F_{ij}<math> will be assigned to be the optimal score for the alignment of the first i characters in A and the first j characters in B. The principle of optimality is then applied as follows...
Basis:
<math>F_{11} = 0<math>
<math>F_{1j} = 0<math>
<math>F_{i1} = 0<math>
Recursion, based on the principle of optimality:
<math>F_{ij} = \max(F_{i-1,j-1} + S(A_{i-1}, B_{j-1}), F_{i,j-1} + d, F_{i-1,j} + d)<math>
The pseudo-code for the algorithm to compute the F matrix therefore looks like this(array indexes start at 0)...
for i=0 to length(A)-1
F(i,0) <- 0
for j=0 to length(B)-1
F(0,j) <- 0
for i=1 to length(A)-1
for j = 1 to length(B)-1
{
Choice1 <- F(i-1,j-1) + S(A(i-1), B(j-1))
Choice2 <- F(i-1, j) - d
Choice3 <- F(i, j-1) - d
F(i,j) <- max(Choice1, Choice2, Choice3)
}
Once the F matrix is computed, note that the bottom right hand corner of the matrix is the maximum score for any alignments. To compute which alignment actually gives this score, you can start from the bottom left cell, and compare the value with the three possible sources(Choice1, Choice2, and Choice3 above) to see which it came from. If it was Choice1, then A(i) and B(i) are aligned, if it was Choice2 then A(i) is aligned with a gap, and if it was Choice3, then B(i) is aligned with a gap.
AlignmentA <- ""
AlignmentB <- ""
i <- length(A) - 1
j <- length(B) - 1
while (i > 0 AND j > 0)
{
Score <- F(i,j)
ScoreDiag <- F(i - 1, j - 1)
ScoreUp <- F(i, j - 1)
ScoreLeft <- F(i - 1, j)
if (Score - S(A(i), B(j)) == ScoreDiag)
{
AlignmentA <- A(i) + AlignmentA
AlignmentB <- B(j) + AlignmentB
i <- i - 1
j <- j - 1
}
else if (Score - d == ScoreLeft)
{
AlignmentA <- A(i) + AlignmentA
AlignmentB <- "-" + AlignmentB
i <- i - 1
}
otherwise (Score - d == ScoreUp)
{
AlignmentA <- "-" + AlignmentA
AlignmentB <- B(j) + AlignmentB
j <- j - 1
}
}
while (i >= 0)
{
AlignmentA <- A(i) + AlignmentA
AlignmentB <- "-" + AlignmentB
i <- i - 1
}
while (j >= 0)
{
AlignmentA <- "-" + AlignmentA
AlignmentB <- B(j) + AlignmentB
j <- j - 1
}
|