next up previous contents
Next: 4 SAM-T99 Up: SAM (Sequence Alignment and Previous: 2 New to this   Contents

Subsections


3 Quick overview

The Sequence Alignment and Modeling (SAM) suite of programs includes several tools for modeling, aligning, and discriminating related DNA, RNA, and protein sequences. Given a set of related sequences, the system can automatically train and use a linear HMM representing the family. The SAM-T99 method, available in SAM 3.0 and higher, is a more automated method for building and using HMMs. Readers may wish to read the T99 example before the current section. See Section 4.

Figure 1: A linear hidden Markov model and example alignment.
\begin{figure}\begin{center}
\parbox{0pt}{\psfig{figure=hmmexample.eps,width=0.9\textwidth}}\end{center}\end{figure}

SAM uses a linear hidden Markov model (Figure 1) to represent biological sequences. The model is a linear sequence of nodes, each of which includes match (square), insert (diamond), and delete (circle) states. Each match state has a distribution over the appropriate alphabet indicating which characters are most likely. The chain of match states forms a model of the family, or of columns of a multiple alignment. Some sequences may not have characters in specific positions -- delete states enable them to skip through a node without `using up' any characters. Other sequences may have extra characters, which are modeled with the insert states. Insertions are thus used when a small number of sequences have positions not found in most other sequences, while delete states are used when a small number of sequences do not have a character in a position found in most other sequences. As with the match states, all transition probabilities (the chance of having a delete or moving to an insertion state) are local, enabling, for example, the system to strongly penalize sequences that delete conserved regions.

The primary programs include:

buildmodel
Create a new model from a family of sequences, or refine an existing model.
align2model
Create a multiple alignment of sequences to an existing model. The prettyalign program will make align2model output more readable.
hmmscore
Calculate the negative log-likelihood (NLL) scores for a file of sequences given a model, as well log-odds scores and E-values in the case of reverse null models. This program is used for discrimination experiments. Sequences that score better than (or worse than) a threshold can be saved, as can their alignments or multiple domain alignments.
modelfromalign
Use an existing multiple alignment to create an initial model. Such a model is usually then refined using buildmodel.
target99
A script that uses SAM to iteratively create a model from a single protein sequence and its close homologues.

Figure 2: Overview of SAM.
\begin{figure}\centerline{\psfig{figure=overview.eps,width=0.8\textwidth}}\end{figure}

A basic flowchart for using SAM is shown in Figure 2.

As a simple example, consider the task of modeling the 10 tRNAs included in the file trna10.seq of the distribution. For this experiment, default program settings will be used: the many adjustable parameters are described Sections 6 and 12.

3.1 Building a model

To start, we need to create a model from the sequence file using buildmodel. This program always requires a name for the run: if the name is test, the system will create the model output file test.mod, which will include parameter settings, iteration statistics, and CPU usage, as well as the initial and final model.

Parameters to buildmodel are specified with hyphens. For this experiment, first we try the command:


buildmodel test -train trna10.seq -randseed 0
This specifies the run name and where the sequences for training can be found. Here, to hopefully make this example reproducible, a seed for the random number generator has also been specified.

Buildmodel then prints out a line on standard output such as:


-29.40  -26.34  -27.62   1.09  32 1 76 
This is a brief summary of the statistics provided in the output model file, and is discussed in more detail in Section 11.2. If no random seed had been specified, buildmodel would use the process number, and the statistics line would be different for multiple runs.

The run has generated a file called test.mod. This file contains various statistics, described later, as well as the final model. Statistics are printed to the file after each re-estimation so that the progress of a run can be readily checked.

3.2 Aligning sequences

To generate a multiple alignment, a command such as the following is used:


align2model trna10 -i test.mod -db trna10.seq
prettyalign trna10.a2m -l90 > trna10.pretty
This aligns each sequence to the model, places the alignment in the file trna10.a2m, and then cleans up the output and places it in the file trna10.pretty, with 90 characters per line. The alignment will look something like:


;  SAM: prettyalign v3.3.1 (December 20, 2001) compiled 12/27/01_15:15:58
;  (c) 1992-2001 Regents of the University of California, Santa Cruz
;
;          Sequence Alignment and Modeling Software System
;         http://www.cse.ucsc.edu/research/compbio/sam.html
;
; --------- Citations (SAM, SAM-T99, HMMs) ----------
; R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
;  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
; K. Karplus, C. Barrett, R. Hughey, Hidden Markov models for detecting
;  remote protein homologies, Bioinformatics 14(10):846-856, 1998.
; A. Krogh et al., Hidden Markov models in computational biology:
;  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
; -----------------------------------
               10        20         30        40         50          60        70       
                |         |          |         |          |           |         |       
TRNA1  GGGGAUGUAGCUCAG-UGGUA.GAGCGCA-UGCUUCGCAUGUAU.GAGGCCCcG.GGUUCGAUCCCC--GGCAUCUCCA
TRNA2  GCGGCCGUCGUCUAGUCUGGAuUAGGACGCUGGCCUCCCAAGCCA.GCAAUCCcG.GGUUCGAAUCCCGGCGGCC--GCA
TRNA3  GGCCCUGUGGC-UAGCUGGU-.CAAAGCGCCUGUCUAGUAAACAG.GAGAUCCuG.GGUUCGAAUCCCAGCGGGGCCUCCA
TRNA4  GGGCGAAUAGUGUCAGCGGG-.-AGCACACCAGACUUGCAAUCUG.GUAGGGA.G.GGUUCGAGUCCCUCUUUGUCCACCA
TRNA5  GCCGGGAUAGCUCAGUUGGUA.GAGCAGA-GGACUGAAAAUCCUcGUGUCAC.C.AGUUCAAAUC--UGGUUCCUGGCA
TRNA6  GGGGCCUUAGCUCAGCUGGGA.GAGCG-CCUGCUUUGCACGCAG.GAGGUCA.G.CGGUCGA-CCCGCUAGGCUCCACCA
TRNA7  GGGCACAUGGCGCAGUUGGU-.-AGCGCGCUUCCCUUGCAAGGAA.GAGGUCAuC.GGUUCGAUUCCG--GUUGCGUCCA
TRNA8  GGGCCCGUGGCCUAGUCUGGA.UACGGCACCGGCCUUCUAAGCCG.GGGAUCGgG.GGUUCAAAUCCCUCCGGGUCCG--
TRNA9  CGGCACGUAGCGCAGCCUGG-.UAGCGCACCGUCCUGGGGUUGCG.GGGGUCG.GaGGUUCAAAUCCUCUCGUGCCGACCA
TRNA10 UCCGUCGUAGUCUAGGUGGU-.UAGGAUACUCGGCUUUCACCCGA.GAGACCC.G.GGUUCAAGUCCCGGCGACGGAACCA

Here, hyphens indicate deletes while lower-case letters, and the corresponding periods (`.') indicate inserts. The column numbers refer to match states in the model, not to column numbers, so that insertions are disregarded in calculating these index points. It is important to remember that insertions are not aligned among them selves: the fact that two insertion characters are in the same printed column only means that they were generated by the same insertion state, not that they should be aligned.

Figure 3: Output of makelogo. Bar height corresponds to information in bits, relative character heights correspond to relative frequencies.
\begin{figure}\begin{center}
\makebox[0pt][c]{\parbox{6.5in}{\psfig{figure=test.eps,width=6.5in}}}
\end{center}\end{figure}

3.3 Examining models

Model structure can be quite interesting. SAM includes two programs for examing models: makelogo and drawmodel

The first, makelogo, shows the negative entropy in bits of each match state and the relative frequencies of the different letters in a sequence logo format, invented by Tom Schneider and Mike Stephens (NAR 18:6097-6100, 1990) http://www-lmmb.ncifcrf.gov/ toms/index.html. The command:


makelogo test -modelfile test.mod
creates the test.eps file of Figure 3. The program has many options. See Section 10.9.3.

Figure 4: Output of drawmodel. Match state histograms are in AGCU order.
\begin{figure}\begin{center}
\par This figure is relatively large. See the
\htm...
....cse.ucsc.edu/pub/protein/sam1.01_doc.ps.Z}
for a copy.
\end{center}\end{figure}

The second, drawmodel, program generates postscript drawings of models that include match-state histograms and transition line styles that correspond to their frequency of use. These drawings are most useful when derived from frequency counts, values that can be optionally included in the output file:


buildmodel test -train trna10.seq -randseed 0 -print_frequencies 1
The command drawmodel test.mod test.ps will run the drawmodel program, which scans the file finding a model and frequency count data. By selecting the frequency count data in `overall' mode, the postscript drawing in Figure 4 is generated.1 The histograms in the match states correspond to the columns of the multiple alignment above, the numbers in the diamond insert states correspond to the average number of insertions for each sequence that uses that state, and the node numbers are given in the circular delete states. Transitions that are not used by a significant number of the sequences are not drawn. See Section 10.9.1.


3.4 Scoring sequences

The scoring program, hmmscore, generates a file of the negative log-likelihood minus NULL model (NLL-NULL, or log-odds) scores for each sequence given the model. Let's see how the 10 tRNA sequences fit the model (test.mod):


hmmscore test -i test.mod -db trna10.seq -sw 2
The arguments are the name of the run, the model file, the database sequence file, and a specifier to use fully-local scoring similar to that of the Smith & Waterman algorithm (this is suggested for all scoring runs unless semi-local, domain, or global scoring is specifically required).

hmmscore produces the test.dist file of Figure 5.

Figure 5: Distance output from hmmscore
\begin{figure}\begin{showfile}
<tex2html_file> ...

The score file contains six columns. The first is the sequence identifier, followed by sequence length, the `NLL-NULL' score using a simple null model, the reverse sequence `NLL-NULL' and an E-value based on the size of the database and the reverse sequence score. The simple null model is just the product of the character probabilities in each sequence multiplied together. Thus, the NLL-NULL score show how much improvement modeling with an HMM gained versus modeling with just a single insertion state from the HMM. The reverse sequence null model is more expensive to calculate, but a much better indicator of sequence similarity to the model. The reverse sequence NLL-NULL score is the difference between the raw negative log-likelihood (NLL) sequence score from that of the raw NLL score of the reversed sequence. Thus, there for sequence null model matches exactly the length and composition of the sequence, but does not match the ordering of the residues.

Other options of hmmscore enable the selective output of sequences according to their NLL scores, NLL per base scores, or E-value. The hmmscore program enters an interactive mode when called with no command line arguments. See Section 10.2.

3.5 Parameter selection

As you will see, the SAM system, being an active research tool, has a vast number of parameters. This section briefly mentions a few of the parameters and ways we have found to make SAM work reasonably well.

At UCSC, our primary use of SAM is for modelling proteins. In this case, the SAM-T99 method (described next) is the method to use for creating a multiple alignment of database sequences, which can then be turned into an HMM using the, for example, w0.5 script.

We have had less experience with DNA and RNA uses of SAM, though we have had many reports of excellent results in these fields. The SAM DNA alignment page uses the following parameters to create an HMM from a file of aligned sequences:


buildmodel run -train train.a2m -modellength 0 -alphabet DNA     -alignfile train.a2m -aweight_method 2 -aweight_bits 0.3 -aweight_exponent 0.5

For the final alignment the command is:


hmmscore run -i run.mod -dpstyle 0 -adpstyle 5 -sw 2           -select_align 8  -db train.a2m -db database.seq  -alphabet DNA

A few of the guiding principles in these two commands are:


next up previous contents
Next: 4 SAM-T99 Up: SAM (Sequence Alignment and Previous: 2 New to this   Contents
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group