version 3.5c

PROTDIST -- Program to compute distance matrix from protein sequences

(c) Copyright 1993 by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

This program uses protein sequences to compute a distance matrix, under three different models of amino acid replacement. The distance for each pair of species estimates the total branch length between the two species, and can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR. This is an alternative to use of the sequence data itself in the parsimony program PROTPARS.

The program reads in protein sequences and writes an output file containing the distance matrix. The three models of amino acid substitution are one which is based on the PAM matrixes of Margaret Dayhoff, one due to Kimura (1983) which approximates it based simply on the fraction of similar amino acids, and one based on a model in which the amino acids are divided up into groups, with change occurring based on the genetic code but with greater difficulty of changing between groups. The program correctly takes into account a variety of sequence ambiguities.

The three methods are:

(1) The Dayhoff PAM matrix. This uses Dayhoff's PAM 001 matrix from Dayhoff (1979), page 348. The PAM model is an empirical one that scales probabilities of change from one amino acid to another in terms of a unit which is an expected 1% change between two amino acid sequences. The PAM 001 matrix is used to make a transition probability matrix which allows prediction of the probability of changing from any one amino acid to any other, and also predicts equilibrium amino acid composition. The program assumes that these probabilities are correct and bases its computations of distance on them. The distance that is computed is scaled in units of expected fraction of amino acids changed.

(2) Kimura's distance. This is a rough-and-ready distance formula for approximating PAM distance by simply measuring the fraction of amino acids, p, that differs between two sequences and computing the distance as (Kimura, 1983)

     D = - log_e ( 1 - p - 0.2 p^2 ).

This is very quick to do but has some obvious limitations. It does not take into account which amino acids differ or to what amino acids they change, so some information is lost. The units of the distance measure are fraction of amino acids differing, as also in the case of the PAM distance. If the fraction of amino acids differing gets larger than 0.8541 the distance becomes infinite.

(3) The Categories distance. This is my own concoction. I imagined a nucleotide sequence changing according to Kimura's 2-parameter model, with the exception that some changes of amino acids are less likely than others. The amino acids are grouped into a series of categories. Any base change that does not change which category the amino acid is in is allowed, but if an amino acid changes category this is allowed only a certain fraction of the time. The fraction is called the "ease" and there is a parameter for it, which is 1.0 when all changes are allowed and near 0.0 when changes between categories are nearly impossible.
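As an illustration of method (2), Kimura's formula can be sketched in a few lines of Python. This is only a sketch and not the program's own code: the function name is hypothetical, and it assumes two aligned sequences of equal length with no gaps or ambiguity codes.

```python
import math

def kimura_distance(seq_a, seq_b):
    """Kimura (1983) distance D = -ln(1 - p - 0.2 p^2) for two aligned sequences.

    Assumption: no gaps or ambiguity codes; every position is compared.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    differing = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    p = differing / len(seq_a)          # fraction of amino acids differing
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:                    # happens once p exceeds about 0.8541
        return math.inf
    return -math.log(inner)

# Alpha and Beta from the test data set differ at 3 of 13 positions.
print(kimura_distance("AACGTGGCCACAT", "AAGGTCGCCACAC"))
```

The argument of the logarithm reaches zero at p of about 0.8541, which is why the distance is reported as infinite beyond that point.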

In this option I have allowed the user to select the Transition/Transversion ratio, which of several genetic codes to use, and which categorization of amino acids to use. There are three of them, a somewhat random sample:

(a) The George-Hunt-Barker (1988) classification of amino acids,

(b) A classification provided by my colleague Ben Hall when I asked him for one,

(c) One I found in an old "baby biochemistry" book (Conn and Stumpf, 1963), which contains most of the biochemistry I was ever taught, and all that I ever learned.

Interestingly enough, all of them are consistent with the same, linear, ordering of amino acids, which they divide up in different ways. For the Categories model I have set as default the George/Hunt/Barker classification with the "ease" parameter set to 0.457 which is approximately the value implied by the empirical rates in the Dayhoff PAM matrix.

The method uses, as I have noted, Kimura's (1980) 2-parameter model of DNA change. The Kimura "2-parameter" model allows for a difference between transition and transversion rates. Its transition probability matrix for a short interval of time is:

              To:     A        G        C        T
                   ---------------------------------
               A  | 1-a-2b     a         b       b
       From:   G  |   a      1-a-2b      b       b
               C  |   b        b       1-a-2b    a
               T  |   b        b         a     1-a-2b
where a is u dt, the product of the rate of transitions per unit time and the length dt of the time interval, and b is v dt, the product of half the rate of transversions (i.e., the rate of a specific transversion) and the length dt of the time interval.
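The table above can be written out as a small Python sketch. The function names are hypothetical, and the ratio helper assumes that the transition/transversion ratio is taken as a/(2b), the transition rate over the total rate of the two possible transversions.

```python
def k2p_short_interval(u, v, dt):
    """Short-interval transition probabilities, bases ordered A, G, C, T.

    u is the transition rate; v is the rate of each specific transversion.
    """
    a = u * dt
    b = v * dt
    stay = 1.0 - a - 2.0 * b            # probability of no change in dt
    return [
        [stay, a, b, b],                # from A
        [a, stay, b, b],                # from G
        [b, b, stay, a],                # from C
        [b, b, a, stay],                # from T
    ]

def ts_tv_ratio(u, v):
    """Assumed definition: transitions over all transversions, a / (2b)."""
    return u / (2.0 * v)

matrix = k2p_short_interval(u=2.0, v=1.0, dt=0.001)
print([sum(row) for row in matrix])     # each row should sum to 1
```

Under this assumed definition, equal rates for all three possible changes (u = v) give a ratio of 0.5.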

Each distance that is calculated is an estimate, from that particular pair of species, of the divergence time between those two species. The Kimura distance is straightforward to compute. The other two are considerably slower: they look at all positions and find the distance which makes the likelihood highest. This distance is in effect the length of the internal branch in a two-species tree that connects these two species. Its likelihood is just the product, under the model, of the probabilities of each position having the (one or) two amino acids that are actually found. This is fairly slow to compute.

The computation proceeds from an eigenanalysis (spectral decomposition) of the transition probability matrix. In the case of the PAM 001 matrix the eigenvalues and eigenvectors are precomputed and are hard-coded into the program in over 400 statements. In the case of the Categories model the program computes the eigenvalues and eigenvectors itself, which will add a delay. But the delay is independent of the number of species as the calculation is done only once, at the outset.
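To show what a computation from an eigenanalysis looks like, here is a minimal Python sketch for the 4 x 4 Kimura 2-parameter rate matrix rather than the 20 x 20 amino acid matrices the program actually uses. The eigenvalues and eigenvectors are written out by hand, in the spirit of the hard-coded PAM values. The function name is hypothetical.

```python
import math

def k2p_transition_matrix(u, v, t):
    """P(t) for the Kimura 2-parameter model, bases ordered A, G, C, T.

    u is the transition rate; v is the rate of each specific transversion.
    """
    s = 1.0 / math.sqrt(2.0)
    # Eigenvalues and orthonormal eigenvectors of the (symmetric) rate matrix.
    eigenvalues = [0.0, -4.0 * v, -2.0 * (u + v), -2.0 * (u + v)]
    eigenvectors = [
        [0.5, 0.5, 0.5, 0.5],
        [0.5, 0.5, -0.5, -0.5],
        [s, -s, 0.0, 0.0],
        [0.0, 0.0, s, -s],
    ]
    # Spectral decomposition: P(t) = sum_k exp(lambda_k t) * v_k v_k^T
    return [
        [
            sum(math.exp(lam * t) * vec[i] * vec[j]
                for lam, vec in zip(eigenvalues, eigenvectors))
            for j in range(4)
        ]
        for i in range(4)
    ]

P = k2p_transition_matrix(u=2.0, v=1.0, t=0.1)
print([round(sum(row), 12) for row in P])   # each row should sum to 1
```

Once the decomposition is available, evaluating the transition probabilities for any trial distance t needs only the exponentials of the eigenvalues, which is why the eigenanalysis has to be done only once.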

The actual algorithm for estimating the distance is in both cases a bisection algorithm which tries to find the point at which the derivative of the likelihood is zero. Some of the kinds of ambiguous amino acids like "glx" are correctly taken into account. However, gaps are treated as if they are unknown amino acids, which means those positions get dropped from that particular comparison, though they are not dropped from the whole analysis. You need not eliminate regions containing gaps, as long as you are reasonably sure of the alignment there.
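The bisection idea can be illustrated with a much simpler substitution model than either of the two the program uses. The sketch below assumes a 20-state model in which all amino acid changes are equally likely (an assumption made here purely for illustration, not the PAM or Categories model), because its maximum likelihood distance has a closed form against which the bisection can be checked. The function names and the "upper" search bound are hypothetical; "epsilon" plays the role of the accuracy parameter of the same name.

```python
import math

N_STATES = 20  # amino acids; equal-rates model assumed for illustration only

def slope(t, n_same, n_diff):
    """Derivative of the log-likelihood with respect to the distance t."""
    decay = math.exp(-N_STATES / (N_STATES - 1) * t)
    p_same = (1.0 + (N_STATES - 1) * decay) / N_STATES   # site unchanged
    p_diff = (1.0 - decay) / N_STATES                    # one specific change
    return decay * (n_diff / ((N_STATES - 1) * p_diff) - n_same / p_same)

def ml_distance(n_same, n_diff, epsilon=1e-8, upper=10.0):
    """Bisect for the distance at which the derivative is zero."""
    if n_diff == 0:
        return 0.0
    if slope(upper, n_same, n_diff) > 0.0:
        return math.inf          # likelihood still rising at the search bound
    lo, hi = 0.0, upper
    while hi - lo > epsilon:
        mid = 0.5 * (lo + hi)
        if slope(mid, n_same, n_diff) > 0.0:
            lo = mid             # likelihood still increasing: root is above
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(ml_distance(n_same=10, n_diff=3))
```

Making "epsilon" smaller adds bisection steps, each of which halves the interval that brackets the estimate.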

Note that there is an assumption that we are looking at all positions, including those that have not changed at all. It is important not to restrict attention to some positions based on whether or not they have changed; doing that would bias the distances by making them too large, and that in turn would cause the distances to misinterpret the meaning of those positions that had changed.

INPUT FORMAT AND OPTIONS

Input is fairly standard, with one addition. As usual the first line of the file gives the number of species and the number of sites. There follows the character W if the Weights option is being used.

Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.
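A minimal Python sketch of reading this layout follows. It handles only the simple case in which each species occupies a single line (so it does not distinguish interleaved from sequential input, and it ignores the W option); the function name is hypothetical.

```python
def parse_simple_input(text, name_length=10):
    """Read 'n_species n_sites' followed by one line per species."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    n_species, n_sites = int(header[0]), int(header[1])
    sequences = {}
    for line in lines[1:1 + n_species]:
        name = line[:name_length].rstrip()              # blank-filled name
        residues = line[name_length:].replace(" ", "")  # drop internal blanks
        if len(residues) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(residues)}")
        sequences[name] = residues
    if len(sequences) != n_species:
        raise ValueError("fewer species found than the first line declares")
    return sequences

example = (
    "    5   13\n"
    "Alpha     AACGTGGCCACAT\n"
    "Beta      AAGGTCGCCACAC\n"
    "Gamma     CAGTTCGCCACAA\n"
    "Delta     GAGATTTCCGCCT\n"
    "Epsilon   GAGATCTCCGCCC\n"
)
print(parse_simple_input(example)["Epsilon"])
```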

After that are the lines (if any) containing the information for the W option, as described below.

The options are selected using an interactive menu. The menu looks like this:

Protein distance algorithm, version 3.5c

Settings for this run:
  P  Use PAM, Kimura or categories model?  Dayhoff PAM matrix
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.

The options M and 0 are the usual ones. They are described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.

The P option selects one of the three distance methods. It toggles among the three methods. The default method, if none is specified, is the Dayhoff PAM matrix model. If the Categories distance is selected another menu option, T, will appear allowing the user to supply the Transition/Transversion ratio that should be assumed at the underlying DNA level, and another one, C, which allows the user to select among various nuclear and mitochondrial genetic codes. The transition/transversion ratio can be any number from 0.5 upwards.

The W (Weights) option is invoked in the usual way, with only weights 0 and 1 allowed. It selects a set of sites to be analyzed, ignoring the others. The sites selected are those with weight 1. If the W option is not invoked, all sites are analyzed.
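The effect of 0/1 weights can be pictured with a short Python sketch. It assumes the weights are available as a string of "0" and "1" characters, one per site; the function name is hypothetical.

```python
def select_sites(sequence, weights):
    """Keep only the sites whose weight is 1."""
    if len(sequence) != len(weights):
        raise ValueError("need exactly one weight per site")
    if set(weights) - {"0", "1"}:
        raise ValueError("only weights 0 and 1 are allowed")
    return "".join(res for res, w in zip(sequence, weights) if w == "1")

print(select_sites("AACGTGGCCACAT", "1111100000111"))
```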

OUTPUT FORMAT

As the distances are computed, the program prints on your screen or terminal the names of the species in turn, followed by one dot (".") for each other species for which the distance to that species has been computed. Thus if there are ten species, the first species name is printed out, followed by one dot, then on the next line the next species name is printed out followed by two dots, then the next followed by three dots, and so on. The pattern of dots should form a triangle. When the distance matrix has been written out to the output file, the user is notified of that.

The output file contains on its first line the number of species. The distance matrix is then printed in standard form, with each species starting on a new line with the species name, followed by the distances to the species in order. These continue onto a new line after every nine distances. The distance matrix is square with zero distances on the diagonal. In general the format of the distance matrix is such that it can serve as input to any of the distance matrix programs.
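The layout just described can be sketched in Python as follows. The field widths used here (a ten-character name and nine-character distances with five decimals) are assumptions chosen for illustration and may not match the program's exact spacing; the function name is hypothetical.

```python
def format_distance_matrix(names, matrix, per_line=9, name_width=10):
    """Square matrix, one species per row, wrapping after per_line distances."""
    lines = [f"{len(names):5d}"]                      # number of species
    for name, row in zip(names, matrix):
        for start in range(0, len(row), per_line):
            chunk = row[start:start + per_line]
            # Only the first line of each row carries the species name.
            prefix = f"{name:<{name_width}}" if start == 0 else ""
            lines.append(prefix + "".join(f"{d:9.5f}" for d in chunk))
    return "\n".join(lines) + "\n"

names = ["Alpha", "Beta"]
matrix = [[0.0, 0.47285], [0.47285, 0.0]]
print(format_distance_matrix(names, matrix), end="")
```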

If the option to print out the data is selected, the output file will precede the data with more complete information on the input and the menu selections. The output file begins by giving the number of species and the number of characters, and the identity of the distance measure that is being used.

In the Categories model of substitution, the distances printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change is set to 1.0. For the Dayhoff PAM and Kimura models the distances are scaled in terms of the expected numbers of amino acid substitutions per site. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes may occur in the same site and overlie or even reverse each other. The branch length estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the protein (or nucleotide) sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.
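The point about a branch of length 0.26 can be checked numerically with Kimura's formula, used here only as a convenient approximation. Inverting D = -log_e(1 - p - 0.2 p^2) for p gives the fraction of amino acids expected to differ at a given distance. The function name in this Python sketch is hypothetical.

```python
import math

def kimura_fraction_differing(distance):
    """Solve D = -ln(1 - p - 0.2 p^2) for p, given D >= 0."""
    target = 1.0 - math.exp(-distance)       # p + 0.2 p^2 must equal this
    # Positive root of 0.2 p^2 + p - target = 0
    return (-1.0 + math.sqrt(1.0 + 0.8 * target)) / 0.4

print(kimura_fraction_differing(0.26))
```

Under this approximation a distance of 0.26 corresponds to about 22% of positions differing rather than 26%, the shortfall being the changes that overlie one another.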

One problem that can arise is that two or more of the species can be so dissimilar that the distance between them would have to be infinite, as the likelihood rises indefinitely as the estimated divergence time increases. For example, with the Kimura model, if the two sequences differ in 85.41% or more of their positions then the estimate of divergence time would be infinite. Since there is no way to represent an infinite distance in the output file, the program regards this as an error, issues a warning message indicating which pair of species are causing the problem, and computes a distance of -1.0.
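Since -1.0 stands in for an infinite distance, it may be worth scanning a matrix for such entries before passing it to a distance matrix program. A minimal Python sketch, with a hypothetical function name and the matrix assumed to be already read into nested lists:

```python
def find_infinite_pairs(names, matrix):
    """Return the species pairs whose distance was written as negative."""
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))   # upper triangle only
        if matrix[i][j] < 0.0
    ]

names = ["Alpha", "Beta", "Gamma"]
matrix = [
    [0.0, 0.47285, -1.0],
    [0.47285, 0.0, 0.45192],
    [-1.0, 0.45192, 0.0],
]
print(find_infinite_pairs(names, matrix))
```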

PROGRAM CONSTANTS

The constants that are available to be changed by the user at the beginning of the program include "namelength", the length of species names in characters, and "epsilon", a parameter which controls the accuracy of the results of the iterations which estimate the distances. Making "epsilon" smaller will increase run times but result in more decimal places of accuracy. This should not be necessary.

The program spends most of its time doing real arithmetic. Any software or hardware changes that speed up that arithmetic will speed it up by a nearly proportional amount. For example, microcomputers that have a numeric coprocessor (such as an 8087, 80287, or 80387 chip) will run this program much faster than ones that do not, if the software calls it.

--------------------------------TEST DATA SET--------------------------

   5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC

------ CONTENTS OF OUTPUT FILE (with all numerical options on ) -----------
Name            Sequences
----            ---------

Alpha        AACGTGGCCA CAT
Beta         ..G..C.... ..C
Gamma        C.GT.C.... ..A
Delta        G.GA.TT..G .C.
Epsilon      G.GA.CT..G .CC



    5
Alpha          0.00000  0.47285  0.88304  1.29841  2.12269
Beta           0.47285  0.00000  0.45192  1.34185  0.84009
Gamma          0.88304  0.45192  0.00000  1.30693  1.21582
Delta          1.29841  1.34185  1.30693  0.00000  0.27536
Epsilon        2.12269  0.84009  1.21582  0.27536  0.00000