The art of aligning protein sequences
Part 1 Matrices

<-content | mod4_1 ->


Printer friendly Version Part1 -not yet-

Scoring matrices are essential in all analysis involving sequence comparisons icluding the database search algorithms BLAST and FASTA. The choice of matrix strongly influences the outcome of a pairwise or multiple sequence analysis. Implicitely modern scoring matrices such as the BLOSUM series of matrices represent a particular theory of evolution. Matrix selection very much depends upon an inferred evolutionary distance. Understanding the theory behind a given scoring matrix can help you making the right choice.


Part of this text is adapted from Geoffrey J. Barton in: Protein Structure prediction - a practical approach
Edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996, ISBN 0 19 963496 3.

There are two main reasons for comparing and aligning protein sequences:


Matrices Amino acid scoring schemes

All algorithms that compare protein sequences rely on some scheme to score individual alignments of each of the 210 possible pairs of amino acids. (i.e. 190 pairs of different amino acids + 20 pairs of identical amino acids). 

Most scoring schemes represent the 210 pairs of scores as a matrix of similarities where identical amino acids and those of similar character (for instance an  isoleucine in a certain position compared to leucine) give higher scores compared to those of different character (e.g. isoleucine compared with  aspartate). 

Since the first protein sequences were obtained, many different types of scoring scheme have been devised. The most commonly used are those based on observed substitution between (obviously) similar proteins and of these, the 1976 Dayhoff matrix for 250 PAMS [1] has until recently dominated. This scheme and other schemes are discussed in the following sections.

To start with a couple of simple but in every day practice not very useful scoring schemes:

Table: The Fitch substitution model - the genetic code matrix, the number of point mutations necessary to transform one codon into the other.

 A   B   C   D   E   F   G   H   I   K   L    M    N    P    Q    R    S    T    V    W    Y    Z   
3.0 2.0 1.0 2.0 2.0 1.0 2.0 1.0 1.0 1.0 1.0  1.0  1.0  2.0  1.0  1.0  2.0  2.0  2.0  1.0  1.0  2.0 A
    3.0 1.0 3.0 2.0 1.0 2.0 2.0 2.0 2.0 1.0  1.0  3.0  1.0  2.0  1.0  2.0  2.0  2.0  0.0  2.0  2.0 B
        3.0 1.0 0.0 2.0 2.0 1.0 1.0 0.0 1.0  0.0  1.0  1.0  0.0  2.0  2.0  1.0  1.0  2.0  2.0  0.0 C
            3.0 2.0 1.0 2.0 2.0 1.0 1.0 1.0  0.0  2.0  1.0  1.0  1.0  1.0  1.0  2.0  0.0  2.0  2.0 D
                3.0 0.0 2.0 1.0 1.0 2.0 1.0  1.0  1.0  1.0  2.0  1.0  1.0  1.0  2.0  1.0  1.0  3.0 E
                    3.0 1.0 1.0 2.0 0.0 2.0  1.0  1.0  1.0  0.0  1.0  2.0  1.0  2.0  1.0  2.0  0.0 F
                        3.0 1.0 1.0 1.0 1.0  1.0  1.0  1.0  1.0  2.0  2.0  1.0  2.0  2.0  1.0  2.0 G
                            3.0 1.0 1.0 2.0  0.0  2.0  2.0  2.0  2.0  1.0  1.0  1.0  0.0  2.0  2.0 H
                                3.0 2.0 2.0  2.0  2.0  1.0  1.0  2.0  2.0  2.0  2.0  0.0  1.0  1.0 I
                                    3.0 1.0  2.0  2.0  1.0  2.0  2.0  1.0  2.0  1.0  1.0  1.0  2.0 K
                                        3.0  2.0  1.0  2.0  2.0  2.0  2.0  1.0  2.0  2.0  1.0  2.0 L
                                             3.0  1.0  1.0  1.0  2.0  1.0  2.0  2.0  1.0  0.0  1.0 M
                                                  3.0  1.0  1.0  1.0  2.0  2.0  1.0  0.0  2.0  2.0 N
                                                       3.0  2.0  2.0  2.0  2.0  1.0  1.0  1.0  2.0 P
                                                            3.0  2.0  1.0  1.0  1.0  1.0  1.0  3.0 Q
                                                                 3.0  2.0  2.0  1.0  2.0  1.0  2.0 R
                                                                      3.0  2.0  1.0  2.0  2.0  1.0 S
                                                                           3.0  1.0  1.0  1.0  1.0 T
                                                                                3.0  1.0  1.0  2.0 V
                                                                                     3.0  1.0  1.0 W
                                                                                          3.0  1.0 Y
                                                                                               3.0 Z

The scheme has been used both in the construction of phylogenetic trees and in the determination of homology between protein sequences having similar three dimensional structures. However; for obvious? reasons, (see genetic code table) today it is rarely the first choice for scoring alignments of protein sequences.

  T C A G  


Phe Ser Tyr Cys T
Leu Stop Stop A
Trp G


Leu Pro His Arg T
Gln A
A Ile Thr Asn Ser T
Met Lys Arg A
G Val Ala Asp Gly T
Glu A

Universal genetic code. click here to view the structure of all 20.

Question. In the table the amino acids are color coded. What kind of properties are grouped?
In the genetic code matrix what is score what is the score from Leu to Val?

Chemical similarity scoring: The aim with chemical similarity scoring schemes is to give greater weight to the alignment of amino acids with similar physico-chemical properties , see . This is desirable since major changes in amino acid type could reduce the ability of the protein to perform its biological role and hence the protein would be selected against during the course of evolution. The intuitive scheme developed by McLachlan. He classified amino acids on the basis of polar or non-polar character, size, shape and charge and gives a score of 6 to interconversions between identical rare amino acids (e.g. F to F) reducing to 0 for substitutions between amino acids of quite different character (e.g.. F to E). Unfortunately, all these biophysical quantities suffer from the fact that they provide only a partial view of the picture - there is no guarantee, that any particular property  is a good predictor for conservation of amino acids between related proteins. 


Observed substitutions: Nature knows best. Proteins evolve through a succession of independent point mutations, that are accepted in a population and subsequently can be observed in a multiple alignment of homologous sequences. Remember by defininition homologous sequences have evolved from a commom ancestor.

Thus scoring schemes based on observed substitutions, are derived by analyzing the substitution frequencies seen in alignments of sequences. This is something of a chicken and egg problem, since in order to generate optimal alignments, one really needs a scoring scheme but in order to derive the scoring scheme one needs the alignments! Early schemes based on observed substitutions therefore were made from closely related sequences that could easily be aligned by eye. More recent schemes have had the benefit of the earlier substitution matrices to generate alignments between more deviated but still homologous proteins. Long experience with scoring schemes based on observed substitutions suggests that they are superior to simple identity, genetic code, or intuitive physico-chemical property schemes.

Log odds matrices with observed substitutions: transition probability matrices

Some theory in simple terms:
Suppose we are comparing a protein from an ape and a bird and we believe that the to the dino on the left is the common ancestor.
We determined the protein sequence of what we think is the homologous sequence in the the two present day species. Since the dino is extinct we do not have that sequence, however if we did have that sequence we could track all the mutations that occured during evolution. Since we do not have this information we must estimate in the most optimal alignment of the the two protein sequences how likely a difference, an observed substitution is between the two sequences is in terms of our evolutionary model. To calculate this probability of a mutation we must have theory of how evolution works.

As an example suppose we believe that mutations at the amino acid level occurs by random nucleotide changes at the DNA level. We can than apply the Fitch scheme (see above) to calculate the probabilities. In this scheme we need 3 succesive mutations if we change a Methionine (M) (ATG) into Tyrosine (Y)(TAC) but only 1 mutation is needed to change the Methionine in a Leucine (TTG).
In other words in our model of evolution a change from Methionine ATG to Leucine (and Isoleucine and Valine etc) is probable event, a "change" from Methionine to Methionine (it remains the same) is the most probably event since it requires 0 substitutions and a change from Methionine to Tyrosine has the lowest probability.
In this way we can possibly calculate probabilities for all 20 possible substitutions. The sum of all 20 probabilities is equal to 1.
So what is the probability of changing a Methionine to Leucine? Can we calculate this or are we still missing a parameter?

Before we go to some modern models describing biological relationships between amino acids there is another model to consider.
Suppose that the two protein sequences that we align are not homologous. In other words they do not have a common ancestor. Our optimal alignment is then a chance alignment. In a simple model (and in a random world) for a chance alignment we could simply state that since we have 20 amino acids the chance probability P for each aligned amino acid pair is 0.05 x 0.05.

In short in our alignment we are always comparing two models (the evolutionary model vs the chance alignment model) and if we want to know which of our models is more likely in a given alignment we apply the following formula for each of the amino acids aligned.

S is the log odds ratio of two probabilities: the probability that two residues. i and j are aligned by evolutionary descent and the probability that they are aligned by chance.

Some Log odds matrices with observed substitutions:
1 The Dayhoff mutation data matrix

Possibly the most widely used scheme for scoring amino acid pairs is that developed by Dayhoff and co-workers. The system arose out of a general model for the evolution of proteins.

Dayhoff and co workers examined alignments of closely similar sequences where the likelihood of a particular mutation  being the result of a set of successive mutations (eg. A-> X-> Y-> D) was low.

However since relatively few families were considered, the resulting matrix of accepted point mutations included a large number of entries equal to 0 or 1 (no or one mutation scored!). A complete picture of the mutation process including those amino acids which did not change was determined by calculating the average ratio of the number of changes a particular amino acid type underwent to the total number of amino acids of that type present in the database. This was combined with the point mutation data to give the mutation probability matrix where each element gives the probability of the amino acid in column mutating to the amino acid in row after a particular evolutionary time, for example after 2 PAM (Percentage of Acceptable point Mutations per years).

The mutation probability matrix is specific for a particular evolutionary distance, but may be used to generate matrices for greater evolutionary distances by multiplying it repeatedly by itself. At the level of 2,000 PAM Schwartz and Dayhoff suggest that all the information present in the matrix has degenerated except that the matrix element for Cys-Cys is 10% higher than would be expected by chance. Note that an evolutionary distance is not fixed in time. Some proteins evolve at a greater speed than others.

Fig. An example of a Dayhoff  PAM 256 (log odd) matrix. Note that identity scoring in a PAM matrix is the highest but not equal for the different amino acids. Color codes: Amino acids are grouped according to the chemistry of the side group: C-sulfhydryl, STPAG-small hydrophilic, NDEQ-acid, acid amide and hydrophilic, HRK-basic, MILV-small hydrophobic and FYW-aromatic. Each matrix value is calculated by first dividing the frequency of change, for each amino acid pair, in related proteins separated by one step in an evolutionary tree by the probability of a chance alignment based on the frequency of the amino acids. The ratios are expressed as logarithms to the base 2 multiplied by 2 (so-called 1/2 bit values) or [2x (ln ratio/ ln2)= 2x 2log ratio]. Use of the log-odds matrix provides a prediction of the reliability of a sequence alignment above chance. Using such a matrix, an alignment score of 6 between two amino acids means that the probability of an evolutionary correct alignment is 8 (23) times higher than a the probability that the alignment occurred by chance. The probability of alignment between two protein sequences above chance, called the sequence similarity score, is calculated by summing the scores for each aligned amino acid pair and subtracting a penalty score for gaps. The scoring matrix is often modified in an attempt to improve the alignment obtained. All scores for matching a particular amino acid may be normalized to the same mean and standard deviation, and all amino acid identities may be given the same score in order to provide an equal contribution for each amino acid in a sequence alignment.

Recipe for Calculating a PAM matrix

For a certain amino acid, we must know mutability and background frequency. 

So what are normal background frequencies. In a totally random world the relative frequencies of the 20 amino acids would be 5% or 0.05. The table below is a summation over background frequencies in proteins such as they were known in 1978 when the PAM matrices were created and a recount in 1991(PET91 see below) when a lot more proteins of the same family was known. In formula

So if we have three proteins with 1000 AA in total and we count 50 glutamate residues in total, the frequency for E is 0.050 In this table you can see that W (tryptophan) and C (cysteine) are relatively rare while L (leucine) and A (alanine) are over represented. This means that in any alignment (Optimal or Suboptimal), a chance alignment between two leucines (or a leucine aligned with a alanine) occurs much more regularly than a chance alignment between two cysteines.

Amino acid frequencies (1978: original values, 1991: updated values):

Questions: The table is one letter code. Can you convert the table to three letter code?
What is the frequency of the aliphatic amino acids in total? Study the Venn diagram!

Relative mutabilities of amino acids:Alanine=100%

Pair Exchange Frequencies: align each pair of sequences with more than 85% identity to identify accepted point mutations. The amino acid pair exchanges are tabulated into a 20x20 matrix. Aij is the number of accepted mutations observed where amino acid i replaces amino acid j.

Relative mutabilities : the chance, that a given amino acid i will mutate on average (independent of the context):

mi = fi *(number of times i is observed to change in a multiple alignment)

are also scored for each individual amino acid.

In the table above all; mi values are taken relative to alanine, which is arbitrarily set at 100. Note that the rare amino acids C and W are mutated at a low frequency while A (alanine) and S (serine) are readily mutated or substituted. Now we can construct the matrix. The probability that an amino acid in row i of the PAM matrix will replace the amino acid in column j = the mutability of amino acid j , multiplied by the pair exchange frequency for ij divided by the sum of all pair exchange frequencies for amino acid i:

The diagonal elements represent the probability, that the amino acid will remain unchanged:


To asses whether two sequences are related we create an (optimal) alignment and compare two models. 

In model 1 we assume an evolutionary relationship between the aligned amino acid pairs with a probability defined by the mutation probability matrix.

In model 2 we assume a random alignment: The probability that that same event is observed by random chance (pi ran) is simply given by the frequency of occurrence of amino acid i.
piran = fi


The relative odds that a given event is due to evolution, rather than chance are:

Model 1/Model2 


If a evolutionary relationship if more likely R>1 otherwise 1>R>0

Last step: the log-odds matrix: log Rij to base 2 or 10 or whatever is handy.  The relatedness odds for one aligned pair are given by the corresponding matrix element. The relatedness odds for a second pair are multiplied with those of the first pair to give the joint probability. This is done for all aligned pairs of the whole sequence. Since multiplication is a computationally expensive process, it is preferable to add the logarithms of the matrix elements. Since the Dayhoff matrix was taken as the log to base 10, a value of +1 would mean that the corresponding pair has been observed 10 times more frequently than expected by chance. A value of -0.2 would mean that the observed pair was observed 1.6 times less frequently than chance would predict.

PET91 - An updated Dayhoff matrix


BLOSUM - A matrix; derived from ungapped alignments.

The BLOSUM matrices: Henikoff & Henikoff (Henikoff, S. & Henikoff J.G. (1992) PNAS 89:10915-10919

Question: Are the evolutionary rates uniform over the whole of the protein sequence?
The answer is No.
Blosum matrices are made from Aligned sequence segments (motifs) from the Blocks database The most conserved ungapped regions in aligned sequences form the basis of this matrix (2000 blocks from 500 protein families)
Amino acid pair frequencies can be compiled from these blocks

Different evolutionary distances are incorporated into this scheme with a clustering procedure: two sequences that are identical to each other for more than a certain threshold T of positions are clustered.

More sequences are added to the cluster if they are identical to any sequence already in the cluster at the same level.
All sequences within a cluster are then simply averaged. (A consequence of this clustering is that the contribution of closely related sequences to the frequency table is reduced, if the identity requirement is reduced.)
This leads to a series of matrices, analogous to the PAM series of matrices. BLOSUM80: derived at the 80% identity level.

The BLOSUM substitution matrices are used for scoring protein sequence alignments. The BLOSUM matrices are based on an entirely different principle and a much larger data set than the Dayhoff PAM (percent accepted substitution) matrices, which are derived from the observed rate of mutation during the predicted evolutionary changes in a smaller number of protein families. The Blosum62 matrix detects more distant relationships in a BLAST search, and produced an alignment of diverged proteins more in agreement with three-dimensional structures, than did the corresponding PAM60 matrix. The Blosum62 matrix is, therefore, superior for sequence alignment and database searching.

The matrix values are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino acid patterns, called blocks. These blocks have been extracted from a database of protein sequences representing over 500 groups of related proteins (Henikoff and Henikoff 1992), and act as signatures of these protein families.  Each column in the aligned sequences  provided a set of possible amino acid substitutions. The types of substitutions were then scored for all aligned patterns in the database. The idea is that more common substitutions  represent a closer relationship between two amino acids in related proteins and thus receive a higher score in sequence alignment. Conversely, rare substitutions should have a low score.

Within each alignment in the database, the sequences were clustered into groups where the sequences are similar at some threshold value of percentage identity. This procedure, however, can lead to an over-representation of amino acid substitutions which occur in the most closely related members of the protein families. To reduce this dominant contribution from the most alike proteins, the sequences of these proteins were grouped together into one sequence before scoring the amino acid substitutions in the aligned blocks. The amino acid changes within these clustered sequences were then averaged.

Substitution frequencies for all pairs of amino acids were then calculated between the groups and this used to calculate a log odds BLOSUM (blocks substitution matrix) matrix.

Different matrices are obtained by varying the clustering threshold. Patterns which were 62% identical were grouped together to make the blosum62 substitution matrix (see figure), and those which were 80% identical were used to make the blosum80 matrix etc.This can be confusing if PAM and BLOSUM matrices are compared. A PAM250 matrix where approximately 80% of the amino acid positions are observed to have changed would be the equivalent of a BLOSUM20 matrix !

Figure: Blosum62 Amino Acid Substitution Matrix.
Color code: The amino acids in the table are grouped according to the chemistry of the side group: C-sulfhydryl, STPAG-small hydrophilic, NDEQ-acid, acid amide and hydrophilic, HRK-basic, MILV-small hydrophobic and FYW-aromatic. Each entry is the actual frequency of occurrence of the amino acid pair in the blocks database, clustered at the 62% level, divided by the expected probability of occurrence. The expected value is calculated from the frequency of occurrence of each of the two individual amino acids in the blocks database, and provides a measure of a chance alignment of the two amino acids. The actual/expected ratio is expressed as a log odds score in so-called half bit units, obtained by converting the ratio to a logarithm to the base 2, and then multiplying by 2. A zero score means that the frequency of the amino acid pair in the database was as expected by chance, a positive score that the pair was found more often than by chance, and a negative score that the pair was found less often than by chance. The accumulated score of an alignment of several amino acids in two sequences may be obtained by adding up the respective scores of each individual pair of amino acids.

The highest scoring matches are between amino acids which are in the same chemical group and the very highest scoring matches are for cysteine-cysteine matches and for matches among the aromatic amino acids. A similar variation is seen in the PAM matrices. Compared to the PAM160 matrix, however, the Blosum62 matrix gives a more positive score to mismatches with the rare amino acids e.g. cysteine, a more positive score to mismatches with hydrophobic amino acids, but a more negative score to mismatches with hydrophilic amino acids.


Matrices derived from tertiary structure alignments
The most reliable protein sequence alignments may be obtained when all the proteins have had their tertiary structures experimentally determined. Comparison of three-dimensional structures also allows much more distantly related proteins to be aligned accurately. Analysis of such alignments should therefore give the best substitution matrices. Accordingly, Risler et al. [10] derived substitution frequencies from 32 proteins structurally aligned in 11 groups. On similar lines, Overington et al. [11] aligned 7 families for which 3 or more proteins of known three-dimensional structure were known and derived a series of substitution matrices. Overington et al. also subdivided the substitution data by the secondary structure and environment of each amino acid, however this led to rather sparse matrices due to the lack of examples. Bowie et al. [12] have also derived substitution tables specific for different amino acid environments and secondary structures.


Which matrix should I use?

The general consensus is that matrices derived from observed substitution data, the Dayhoff or BLOSUM matrices) are superior to identity, genetic code or physical property matrices.
However, there are Dayhoff matrices of different PAM values and BLOSUM matrices of different percentage identity and which of these should be used?




Next: Part 2 sequence comparison algorithms.