Help - Interpreting FASTA output


Unlike BLAST, FASTA reports only a single match between your sequence and each database sequence; however, it allows gaps in the alignments that it generates. FASTA increases its speed by performing an initial scan of the database and performing a rigorous alignment only on the best matches. For this reason it is not as sensitive as programs such as BLAST and BLITZ.

The output from FASTA is divided into four sections. First is some information on the query sequence and the database searched. Second is a histogram that shows the score distribution graphically. Third is a list of the sequences matched, with some statistical information about the strength of each match. Fourth are the alignments themselves.

The histogram

       opt      E()
< 20   188     0:==
  22     0     0:           one = represents 109 library sequences
  24     0     0:
  26     2     1:*
  28     7    15:*
  30    28    91:*
  32   200   353:== *
  34   841   958:========*
  36  2217  1968:==================*==
  38  3746  3253:=============================*=====
  40  5360  4538:=========================================*========
  42  6055  5547:==================================================*=====
  44  6496  6119:========================================================*===
  46  5820  6232:======================================================   *
  48  5469  5966:===================================================   *
  50  4820  5444:=============================================    *
  52  4202  4787:=======================================    *
  54  3815  4089:===================================  *
  56  3271  3415:===============================*
  58  2755  2804:=========================*
  60  2268  2271:====================*
  62  1813  1821:================*
  64  1500  1448:=============*
  66  1233  1145:==========*=
  68   951   900:========*
  70   746   706:======*
  72   699   551:=====*=
  74   460   430:===*=
  76   337   335:===*
  78   287   260:==*
  80   244   202:=*=
  82   185   154:=*
  84   115   122:=*
  86   114    95:*=
  88    75    73:*          inset = represents 1 library sequences
  90    70    57:*
  92    48    44:*         :=======================================*
  94    26    34:*         :==========================       *
  96    33    26:*         :=========================*=======
  98    14    20:*         :==============     *
 100    10    16:*         :==========     *
 102     7    12:*         :=======    *
 104     6     9:*         :======  *
 106     5     7:*         :===== *
 108     2     6:*         :==   *
 110     2     4:*         :== *
 112     1     3:*         := *
 114     0     3:*         :  *
 116     0     2:*         : *
 118     0     2:*         : *
>120    27     1:*         :*==========================
FASTA works out an initial score for each of the database sequences matched against your sequence. The histogram gives a graphical representation of the distribution of these scores: each row shows a score bin, the number of sequences observed in that bin (drawn with = symbols) and the number expected by chance (marked by the asterisk). It should be expected that these scores would fall approximately into a normal distribution, and that any significant matches will fall outside the expected curve. You can see that at the bottom of the histogram, in the >120 row, there are 27 sequences where only 1 is expected; these fall outside the curve traced by the asterisks.
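
The comparison of observed and expected counts described above can be sketched in a few lines of Python. This is only an illustration, not part of FASTA: the function name parse_histogram_row and the regular expression are assumptions made for this example, and the rows are copied from the histogram shown earlier.

```python
import re

# Hypothetical helper, not part of FASTA: matches rows such as
# "  44  6496  6119:====" or ">120    27     1:*".
ROW = re.compile(r"^\s*([<>]?\s*\d+)\s+(\d+)\s+(\d+):")

def parse_histogram_row(line):
    """Return (bin_label, observed, expected), or None for non-histogram lines."""
    m = ROW.match(line)
    if m is None:
        return None
    label = m.group(1).replace(" ", "")  # "< 20" becomes "<20"
    return label, int(m.group(2)), int(m.group(3))

# A few rows copied from the example histogram.
rows = [
    "       opt      E()",
    "< 20   188     0:==",
    "  60  2268  2271:====================*",
    " 118     0     2:*         : *",
    ">120    27     1:*         :*==========================",
]

for row in rows:
    parsed = parse_histogram_row(row)
    if parsed is None:
        continue  # header or other text
    label, observed, expected = parsed
    # A large positive excess in the high-scoring bins suggests real matches.
    print(label, observed, expected, observed - expected)
```

In the >120 row the observed count exceeds the expected count by 26, which is the excess that the text above points out.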

The list of hits

The best scores are:                               initn init1 opt z-sc   E(66345)
MERR_PSEAE mercuric resistance operon regu  ( 144)  928  928  928 1129.8     0
MERR_SHIFL mercuric resistance operon regu  ( 144)  871  871  871 1061.3     0
MERR_SERMA mercuric resistance operon regu  ( 144)  810  810  810  988.1     0
MERR_STAAU mercuric resistance operon regu  ( 135)  292  172  298  373.6  3.5e-14
MERR_BACSR (strain rc607). mercuric resist  ( 132)  241  198  289  363.0  1.4e-13
YHDM_ECOLI hypothetical transcriptional re  ( 141)  175  175  276  347.0  1.1e-12
The first part of the line gives the database name of the matched sequence, followed by the first part of the description and, in parentheses, the length of the sequence. After this are three scores: initn, init1 and opt.

Some quick rules

The last two numbers on each line of the hit list are a statistical measure of the significance of the match. The first (z-score) is a measure (in standard deviations) of how far the score falls away from the mean. The second is an estimate of the likelihood of a similar match occurring by chance. The lower the second number, the less likely it is that the match is random. Generally, a figure of 0.01 or below is statistically very significant, and a figure of between 0.01 and 0.05 is borderline.
It is important to remember that a statistically significant match is not necessarily biologically significant, and conversely a match may be of biological significance without passing the statistical test. You should therefore use your knowledge of the biology of the system to help interpret these results.
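
As a rough illustration of these rules, the following Python sketch reads one line of the hit list and applies the 0.01 and 0.05 thresholds. The names HIT, parse_hit and classify are hypothetical, the regular expression only assumes the column layout shown in the example above, and the label used for values above 0.05 is an assumption, since the rules do not name that case.

```python
import re

# Hypothetical pattern for one line of the hit list:
# name, description, (length), initn, init1, opt, z-score, E value.
HIT = re.compile(
    r"^(\S+)\s+(.*?)\s*\(\s*(\d+)\)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+(\S+)$"
)

def parse_hit(line):
    """Split one hit-list line into a dictionary of named fields."""
    m = HIT.match(line.strip())
    if m is None:
        return None
    return {
        "name": m.group(1),
        "description": m.group(2),
        "length": int(m.group(3)),
        "initn": int(m.group(4)),
        "init1": int(m.group(5)),
        "opt": int(m.group(6)),
        "z_score": float(m.group(7)),
        "e_value": float(m.group(8)),
    }

def classify(e_value):
    """Apply the quick rules: 0.01 or below, 0.01 to 0.05, or above."""
    if e_value <= 0.01:
        return "significant"
    if e_value <= 0.05:
        return "borderline"
    return "not significant"  # assumed label for values above 0.05

line = "MERR_STAAU mercuric resistance operon regu  ( 135)  292  172  298  373.6  3.5e-14"
hit = parse_hit(line)
print(hit["name"], hit["e_value"], classify(hit["e_value"]))
```

The classification is statistical only; as noted above, it does not replace a judgement about biological significance.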

The alignments

>>MERR_STAAU mercuric resistance operon regulatory protei (135 aa)
 initn: 292 init1: 172 opt: 298 Z-score: 373.6 expect() 3.5e-14
Smith-Waterman score: 298;  36.923% identity in 130 aa overlap

               10        20        30        40        50        60
MerR   MENNLENLTIGVFAKAAGVNVETIRFYQRKGLLLEPDKPYGSIRRYGEADVTRVRFVKSA
              . :. .:::  :: ::.:.:.::::.  : .  .. : :.:  . ::::.:  
MERR_S      MGMKISELAKACDVNKETVRYYERKGLIAGPPRNESGYRIYSEETADRVRFIKRM
                    10        20        30        40        50     

               70          80        90       100       110        
MerR   QRLGFSLDEIAELLRL--EDGTHCEEASSLAEHKLKDVREKMADLARMEAVLSELVCACH
       ..: ::: ::  :. .  .:: .:..  ... .: :....:.  : :.. .: ::   : 
MERR_S KELDFSLKEIHLLFGVVDQDGERCKDMYAFTVQKTKEIERKVQGLLRIQRLLEELKEKCP
          60        70        80        90       100       110     

      120       130       140    
MerR   ARRGNVSCPLIASLQGGASLAGSAMP
        ...  .::.: .:.::         
MERR_S DEKAMYTCPIIETLMGGPDK      
         120       130        
Above is an example of the alignments. The database name is given, along with the one-line description of the database entry. Also given are some more statistical data, including the percentage of identical amino acids and the length of the overlap. Below this are the alignments themselves. Your sequence is shown at the top and the database sequence is given below. Identities are shown with the : symbol, and similarities with the . symbol. Where FASTA has introduced gaps to optimise the alignment, these are shown with - symbols in the sequence.
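
To show how the identity figure relates to the symbols, the short Python sketch below counts the : and . symbols in the three match lines of the example alignment and divides the number of identities by the overlap length reported in the header (130 aa). The variable names are illustrative only, and the match lines are copied from the example with their leading spaces removed.

```python
# Match lines copied from the example alignment (leading spaces removed).
match_lines = [
    ". :. .:::  :: ::.:.:.::::.  : .  .. : :.:  . ::::.:",
    "..: ::: ::  :. .  .:: .:..  ... .: :....:.  : :.. .: ::   :",
    "...  .::.: .:.::",
]

overlap = 130  # "130 aa overlap" from the alignment header

# ":" marks an identity and "." marks a similarity.
identities = sum(line.count(":") for line in match_lines)
similarities = sum(line.count(".") for line in match_lines)

percent_identity = 100.0 * identities / overlap
print(identities, similarities, round(percent_identity, 3))
```

Counting the : symbols gives 48 identities, and 48 out of 130 is the 36.923% identity reported in the header.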