Next: 8 Regularizers and models Up: SAM (Sequence Alignment and Previous: 6 Parameter specification   Contents


7 Sequence formats

The SAM system understands several alphabets and many sequence formats.


7.1 Alphabets

The SAM system currently supports two nucleotide alphabets (`DNA' and `RNA'), one amino acid alphabet (`protein'), and four secondary structure alphabets (`DSSP', `EHL', `EHL2', and `EHTL'), as well as user-defined alphabets of up to 25 letters. The predefined alphabets can be specified by setting the alphabet variable. If no alphabet is chosen, the first sequence in a specified file will be examined using readseq (discussed below) to determine whether a nucleotide or protein alphabet should be used. If this method does not work, the protein alphabet is the default. SAM prints a warning message if it appears that an incorrect alphabet has been chosen.

The alphabets use standard characters. DNA sequences are composed of the characters ``AGCTRYN'' and RNA sequences of ``AGCURYN,'' where `R' is a purine (`G' or `A'), `Y' is a pyrimidine (`C', or `T' or `U' as appropriate), and `N' is a wildcard character that could be any of the four normal characters. SAM's sequence I/O routines can convert between DNA and RNA alphabets if the alphabet is specified incorrectly.
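
The wildcard rules and the DNA/RNA conversion described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function names are assumptions and are not taken from SAM's I/O routines.

```python
# Wildcard letters as described in the text; the layout is illustrative.
DNA_WILDCARDS = {"R": "GA", "Y": "CT", "N": "AGCT"}
RNA_WILDCARDS = {"R": "GA", "Y": "CU", "N": "AGCU"}

def expand(letter, wildcards):
    """Return the set of normal characters that a letter can stand for."""
    return set(wildcards.get(letter, letter))

def dna_to_rna(seq):
    """Replace T with U, leaving every other character untouched."""
    return seq.replace("T", "U")

def rna_to_dna(seq):
    """Replace U with T, leaving every other character untouched."""
    return seq.replace("U", "T")
```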

The protein alphabet is ``ACDEFGHIKLMNPQRSTVWYBZX.'' In addition to the twenty amino acids, `X' is the general wildcard character, `B' matches `N' or `D', and `Z' matches `Q' or `E.' Protein alignments (specified with alignfile to buildmodel or modelfromalign) converted to models can additionally use the letter `O' to indicate insertion of a free-insertion module. Except for these two instances, the `O' character is converted to the all-matching `X' wildcard.

The DSSP alphabet is ``EHTSGBICX'', including `E' (beta strand), `H' (alpha helix), `T' (turn), `S' (bend), `G' (3-10 helix), `B' (short beta bridge), `I' (pi helix), and `C' (random coil). The character `L' (loop) is an alias for `C'. The character `C' is used to indicate coils rather than the space character used by DSSP. The remaining secondary structure alphabets are subsets of the DSSP alphabet with various groups merged.
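
As a small illustration of the two coil conventions above, the following Python sketch rewrites a raw DSSP-style string so that coils are written as `C'. The function name is hypothetical, and the sketch does not attempt the group merging used by the reduced alphabets, which is not specified here.

```python
def normalize_dssp(ss):
    """Map DSSP's blank coil code and the 'L' alias onto 'C'."""
    return ss.replace(" ", "C").replace("L", "C")
```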

The secondary structure alphabets are not automatically detected; they must be specified with an alphabet command line option.

In all alphabets, unknown characters are converted to wildcards and a warning message is printed.
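
A minimal sketch of this behavior, assuming a raw (unaligned) protein sequence and using Python's warnings module in place of SAM's own messages; the function name and the warning text are hypothetical:

```python
import warnings

PROTEIN = "ACDEFGHIKLMNPQRSTVWYBZX"  # protein alphabet string from the text

def sanitize(seq, alphabet=PROTEIN, wildcard="X"):
    """Replace characters outside the alphabet with the wildcard."""
    cleaned = []
    for ch in seq.upper():
        if ch in alphabet:
            cleaned.append(ch)
        else:
            # Illustrative message only; SAM's wording is not reproduced.
            warnings.warn(f"unknown character {ch!r} converted to {wildcard!r}")
            cleaned.append(wildcard)
    return "".join(cleaned)
```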

When a model is created, a wildcard character's probability is the sum of the probabilities of the component letters. Thus, the `X' character will have unity probability, giving it no preference to one state over another. During the training process, wildcard character frequency counts are proportioned among the appropriate true characters according to the relative probabilities of those characters.
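
The two rules in this paragraph can be written out as a short Python sketch. The probabilities below are made-up values for a single state, and the function names are illustrative rather than part of SAM.

```python
def wildcard_probability(probs, components):
    """Probability of a wildcard: the sum over the letters it can match."""
    return sum(probs[c] for c in components)

def apportion_count(probs, components, count=1.0):
    """Split a wildcard's observed count among its component letters
    in proportion to their current probabilities."""
    total = wildcard_probability(probs, components)
    return {c: count * probs[c] / total for c in components}

# Hypothetical probabilities of 'N' and 'D' in one match state.
probs = {"N": 0.03, "D": 0.09}
print(wildcard_probability(probs, "ND"))  # probability assigned to 'B'
print(apportion_count(probs, "ND"))       # how one observed 'B' is shared
```

With these values, one observed `B' contributes three times as much to the count for `D' as to the count for `N', because `D' is three times as probable in that state.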


7.1.1 User-defined alphabets

SAM also supports user-defined alphabets of 2 to 25 user-selected letters (`A'-`Z') and one (required) wildcard letter. The restriction to alphabetic characters is a result of the need for both uppercase and lowercase letters in the sequence alignment format. As the system always requires an all-matching wildcard, only 25 letters are allowed.

User-defined alphabets are specified with the alphabet_def variable. As with the standard alphabets, the definition will be included in all resulting models, so future specification of the alphabet on the command line is not required.

For example, performing the commands


buildmodel text -train text.seq -alphabet_def "text QWERTYUIOPASDFGHJKLZCVBNMX"
align2model text -i text.mod -db text.seq

results in the alignment file:

>sentence1
THEQUICK--BROWNFOXJUMPEDOVERTHESLOW-LAZYDOG
>sentence2
THEQUICKERG-REENFOXHOPPEDOVERTHESLOWLUCKYPIG
>sentence3
THESLOWLAZYPIGW-ADDLEDINTOTHEQUICKPURPLEFOX
>sentence4
THEF--ASTBROWNFOXHOPPEDINTOTHEQUICKLAZYDOG

Note that the above example does not model the letter `X' because it is a wildcard: the `X' character was not trained and does not have a preference for any state over any other state.

A minimum of three characters, two normal letters and one wildcard, is required to define an alphabet. Default flat regularizers are created automatically, but users may wish to create their own alphabet-specific regularizers with regularizer_file.
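
The size limits on a user-defined alphabet can be checked with a short Python sketch. It assumes the definition string has the form "name letters" with the wildcard as the final letter, as in the example above; that layout, the distinctness check, and the function name are assumptions rather than documented SAM behavior.

```python
def parse_alphabet_def(definition):
    """Split '<name> <letters>' and check the limits stated in the text.

    Assumes the final letter is the wildcard, as in the example.
    """
    name, letters = definition.split()
    letters = letters.upper()
    if not (letters.isascii() and letters.isalpha()):
        raise ValueError("alphabet letters must be in A-Z")
    if len(set(letters)) != len(letters):
        # Distinct letters are an assumption of this sketch.
        raise ValueError("letters must be distinct")
    normal, wildcard = letters[:-1], letters[-1]
    if not 2 <= len(normal) <= 25:
        raise ValueError("need 2 to 25 normal letters plus one wildcard")
    return name, normal, wildcard
```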

As with the standard alphabets, models are tagged with the alphabet_def line, for example:


MODEL -  Final model for run text
alphabet_def text QWERTYUIOPASDFGHJKLZCVBNMx
GENERIC
1.886984 0.254944 0.376488
.....

See Section 8.4.

7.2 Sequences

SAM has three ways of reading sequences. SAM's own reader for FASTA files (and a2m-format alignments) is by far the quickest. SAM also includes an HSSP alignment file reader and, for many other formats, D. G. Gilbert's readseq package. Because SAM's FASTA reader is tuned both to SAM and to a single format, it can be up to 10 times faster than readseq. Users are advised to convert large databases into FASTA format using the readseq program.

SAM includes a modified version of the readseq package by D. G. Gilbert of Indiana University. The code is based on the February 1, 1993 release, and is included as a subdirectory of the SAM source directory. We are grateful that Gilbert has provided this useful package that may be used by anyone.

The readseq package can read most common formats; examples of all these formats are included in the readseq directory.

We usually use FASTA format, which looks like this:


; Comments are ignored.
>IDENTIFIER
LMLDQQTINI IKATVPVLKE HGVTITTTFY KNLFAKHPEV
RPLFDMGRQE SLEQPKALAM T
>SEQ2       Annotations after identifier are preserved
AKHPEVRPLFDMGRQESLEQPKALAMT

For information on other formats, please look through the test files and the Formats file in the readseq directory.

Sequence output will be in FASTA format regardless of the input file format.

Alignment output by align2model and hmmscore is in a FASTA-compatible format (a2m) in which uppercase letters indicate match states, lowercase letters indicate insertion states, and hyphens indicate deletion states (model positions for which the given sequence has no corresponding character). The prettyalign program can be used to line up the match columns of an a2m-format alignment file.
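
The character conventions of this format can be illustrated with a short Python sketch that tallies the states used by one aligned sequence. The function name is hypothetical.

```python
def a2m_counts(aligned):
    """Count match, insert, and delete characters in one a2m row."""
    counts = {"match": 0, "insert": 0, "delete": 0}
    for ch in aligned:
        if ch == "-":
            counts["delete"] += 1
        elif ch.isupper():
            counts["match"] += 1
        elif ch.islower():
            counts["insert"] += 1
        # any other character (such as a period) is not counted
    return counts
```

The match and delete counts together give the number of model positions, so that sum should be the same for every sequence aligned to the same model.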

Additionally, align2model can include periods so that its sequence outputs can be visually aligned without the use of prettyalign. If, for example, the longest sequence in a collection is 2000 characters long, all sequences will be filled (using periods) to that longest sequence's alignment length, which will be more than 2000 if any deletion states are used. Thus, allowing the periods to be printed can greatly expand the size of the alignment file. If periods are not desired, the parameter a2mdots can be set to 0. The prettyalign program will work whether or not the a2m-format alignment has periods.
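
Because the periods are only padding, removing them leaves the alignment itself unchanged. A small Python sketch, using a made-up aligned row and hypothetical function names:

```python
def strip_dots(aligned):
    """Remove period padding from an a2m row; the states are unchanged."""
    return aligned.replace(".", "")

def model_columns(aligned):
    """Number of model positions covered: match letters plus deletions."""
    return sum(1 for ch in aligned if ch.isupper() or ch == "-")

row = "AC..deF-G"  # made-up row with period padding
assert model_columns(row) == model_columns(strip_dots(row))
```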

SAM can also read HSSP files.


7.3 Training sets, test sets, and databases

The buildmodel program uses two sets of sequences: the training and the test set. Training is performed exclusively on the training set, and at the end of the model creation, all sequences in the test set are checked against the model, and the average NLL distance is reported for both the training and the test set.

Training and test sets can be specified in up to two files each: train, train2, test, and test2. At most Nseq sequences will be read from any one file, so at most four times Nseq sequences will be read if all four files are specified. The buildmodel program ignores zero-length sequences in the training set file(s) after printing a warning.
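
The per-file limit can be sketched in Python as follows. The sketch assumes each file has already been parsed into (identifier, sequence) pairs, and it assumes that a skipped zero-length sequence still counts toward the limit; the text does not say which way SAM counts, and the function name is hypothetical.

```python
import warnings
from itertools import islice

def load_set(sources, nseq):
    """Collect at most nseq sequences from each source, skipping empties.

    `sources` maps a label such as "train" or "train2" to an iterable
    of (identifier, sequence) pairs.
    """
    loaded = []
    for label, records in sources.items():
        for ident, seq in islice(records, nseq):
            if not seq:
                # Assumption: empties count toward the per-file limit.
                warnings.warn(f"{label}: ignoring zero-length sequence {ident}")
                continue
            loaded.append((ident, seq))
    return loaded
```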

The system can also randomly partition sequences into the training and the test set. If Ntrain is set, the system will randomly pick Ntrain sequences from all files specified (training and testing) using the random seed trainseed, and reserve the rest for the test set. By default, the seed is set to the process ID number, which is printed in the output file so that the partition can be reproduced. Sequence partitioning and model training use different random seeds, though both default to the process ID.
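
A seeded partition of this kind can be sketched in Python. The sketch uses Python's own random number generator, so it shows why a fixed seed makes the split reproducible but will not reproduce the split SAM makes for the same trainseed; the function name is illustrative.

```python
import random

def partition(sequences, ntrain, trainseed):
    """Randomly choose ntrain sequences for training; the rest are the test set."""
    rng = random.Random(trainseed)  # generator used only for partitioning
    chosen = set(rng.sample(range(len(sequences)), ntrain))
    train = [s for i, s in enumerate(sequences) if i in chosen]
    test = [s for i, s in enumerate(sequences) if i not in chosen]
    return train, test
```

Calling the function twice with the same seed returns the same two sets, which is the property the printed seed provides.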

Several other programs, such as hmmscore and align2model, take an arbitrary number of sequence database files specified as db. Unlike most variables, repeating the db declaration adds a new file to the list, rather than replacing the previous database file. Zero-length sequences are processed the same way as all other sequences.
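
The difference between an accumulating variable such as db and an ordinary variable can be illustrated with Python's argparse module. This is an analogy only and says nothing about how SAM parses its parameters; the file names are made up.

```python
import argparse

parser = argparse.ArgumentParser()
# Each use of -db adds another file to the list.
parser.add_argument("-db", action="append", default=[])
# An ordinary variable: a later setting replaces an earlier one.
parser.add_argument("-a2mdots", type=int, default=None)

args = parser.parse_args(["-db", "one.seq", "-a2mdots", "1",
                          "-db", "two.seq", "-a2mdots", "0"])
print(args.db)       # both database files are kept
print(args.a2mdots)  # only the last setting survives
```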


SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group