The SAM system understands several alphabets and many sequence formats.
The SAM system currently supports two nucleotide alphabets (`DNA' and `RNA'), one amino acid alphabet (`protein'), and four secondary structure alphabets (`DSSP', `EHL', `EHL2', and `EHTL'), as well as user-defined alphabets of up to 25 letters. The predefined alphabets can be specified by setting the alphabet variable. If no alphabet is chosen, the first sequence in a specified file will be examined using readseq (discussed below) to determine if a nucleotide or protein alphabet should be used. If this method does not work, the protein alphabet is the default. The SAM software includes several warning messages if it appears that an incorrect alphabet has been chosen.
The alphabets use standard characters. DNA sequences are composed of the characters ``AGCTRYN'' and RNA sequences of ``AGCURYN'', where `R' is a purine (`G' or `A'), `Y' is a pyrimidine (`C', or `T' or `U' as appropriate), and `N' is a wildcard character that could be any of the four normal characters. SAM's sequence I/O routines can convert between the DNA and RNA alphabets if the alphabet is specified incorrectly.
The protein alphabet is ``ACDEFGHIKLMNPQRSTVWYBZX.'' In addition to the twenty amino acids, `X' is the general wildcard character, `B' matches `N' or `D', and `Z' matches `Q' or `E.' Protein alignments (specified with alignfile to buildmodel or modelfromalign) converted to models can additionally use the letter `O' to indicate insertion of a free-insertion module. Except for these two instances, the `O' character is converted to the all-matching `X' wildcard.
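The nucleotide and protein wildcards described above can be summarized as lookup tables. The following Python fragment is only an illustrative sketch, not SAM code: the table names and the expand function are hypothetical, and the letter sets are transcribed from the text.

```python
# Wildcard tables transcribed from the alphabet descriptions above.
# All names here are illustrative; they are not part of SAM.
DNA_WILDCARDS = {"R": "GA", "Y": "CT", "N": "AGCT"}
RNA_WILDCARDS = {"R": "GA", "Y": "CU", "N": "AGCU"}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_WILDCARDS = {"B": "ND", "Z": "QE", "X": AMINO_ACIDS}


def expand(letter, wildcards):
    """Return the set of plain letters that `letter` can stand for.

    A letter that is not a wildcard simply stands for itself.
    """
    return set(wildcards.get(letter, letter))


if __name__ == "__main__":
    print(sorted(expand("R", DNA_WILDCARDS)))
    print(sorted(expand("Y", RNA_WILDCARDS)))
    print(len(expand("X", PROTEIN_WILDCARDS)))
```

The sketch deliberately leaves out the special `O' character, whose meaning depends on the program reading the alignment.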
The DSSP alphabet is ``EHTSGBICX'', including `E' (beta strand), `H' (alpha helix), `T' (turn), `S' (bend), `G' (3-10 helix), `B' (short beta bridge), `I' (pi helix), and `C' (random coil). The character `L' (loop) is an alias for `C'. The character `C' is used to indicate coils rather than the space character used by DSSP. The remaining secondary structure alphabets are subsets of the DSSP alphabet with various groups merged.
The secondary structure alphabets are not automatically detected; they must be specified with an alphabet command line option.
In all alphabets, unknown characters are converted to wildcards and a warning message is printed.
When a model is created, a wildcard character's probability is the sum of the probabilities of the component letters. Thus, the `X' character will have unity probability, giving it no preference to one state over another. During the training process, wildcard character frequency counts are proportioned among the appropriate true characters according to the relative probabilities of those characters.
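As a worked illustration of the two rules above, the following Python sketch computes a wildcard's probability as the sum of its components and splits a wildcard count in proportion to the component probabilities. The function names and the example state probabilities are made up for illustration; SAM's internal implementation may differ.

```python
def wildcard_probability(probs, components):
    """Probability of a wildcard: the sum over its component letters."""
    return sum(probs[c] for c in components)


def apportion_count(probs, components, count=1.0):
    """Split a wildcard's count among its component letters in
    proportion to the relative probabilities of those letters."""
    total = wildcard_probability(probs, components)
    if total <= 0:
        raise ValueError("component probabilities must sum to a positive value")
    return {c: count * probs[c] / total for c in components}


if __name__ == "__main__":
    # Hypothetical emission probabilities for one DNA match state.
    state = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
    print(wildcard_probability(state, "GA"))    # R = G or A
    print(apportion_count(state, "GA"))         # one R observation split over G and A
    print(wildcard_probability(state, "AGCT"))  # N covers every letter
```

With these example numbers, one observed `R' contributes four times as much to the `A' count as to the `G' count, because `A' is four times as probable in that state.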
SAM also supports user-defined alphabets of 2 to 25 user-selected letters (`A'-`Z') and one (required) wildcard letter. The restriction to alphabetic characters is a result of the need for both uppercase and lowercase letters in the sequence alignment format. As the system always requires an all-matching wildcard, only 25 letters are allowed.
User-defined alphabets are specified with the alphabet_def variable. As with the standard alphabets, the definition will be included in all resulting models, so future specification of the alphabet on the command line is not required.
For example, performing the commands
    buildmodel text -train text.seq -alphabet_def "text QWERTYUIOPASDFGHJKLZCVBNMX"
    align2model text -i text.mod -db text.seq
results in the alignment file:
    >sentence1
    THEQUICK--BROWNFOXJUMPEDOVERTHESLOW-LAZYDOG
    >sentence1
    THEQUICK--BROWNFOXJUMPEDOVERTHESLOW-LAZYDOG
    >sentence2
    THEQUICKERG-REENFOXHOPPEDOVERTHESLOWLUCKYPIG
    >sentence3
    THESLOWLAZYPIGW-ADDLEDINTOTHEQUICKPURPLEFOX
    >sentence4
    THEF--ASTBROWNFOXHOPPEDINTOTHEQUICKLAZYDOG
Note that the above example does not model the letter `X' because it is a wildcard: the `X' character was not trained and does not have a preference for any state over any other state.
A minimum of three characters, 2 normal and one wildcard, is required to define an alphabet. Default flat regularizers are created automatically, but users may wish to create their own alphabet-specific regularizers with regularizer_file.
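The constraints on user-defined alphabets can be checked with a few lines of code. The sketch below is a hypothetical helper, not part of SAM. It assumes that a definition consists of a name followed by the letters, and that the wildcard is the final letter, as `X' is in the example above; that ordering is an inference from the example rather than a documented rule.

```python
import string


def parse_alphabet_def(definition):
    """Split an alphabet definition into (name, normal letters, wildcard).

    Assumption: the last letter is the wildcard, as in the `text' example.
    """
    name, letters = definition.split()
    letters = letters.upper()
    if not all(ch in string.ascii_uppercase for ch in letters):
        raise ValueError("only the letters A-Z are allowed")
    if len(set(letters)) != len(letters):
        raise ValueError("each letter may appear only once")
    normal, wildcard = letters[:-1], letters[-1]
    if not 2 <= len(normal) <= 25:
        raise ValueError("need 2 to 25 normal letters plus one wildcard")
    return name, normal, wildcard


if __name__ == "__main__":
    print(parse_alphabet_def("text QWERTYUIOPASDFGHJKLZCVBNMX"))
```

The example definition uses all 26 letters: 25 normal letters and the wildcard `X', which is the largest alphabet the stated limits allow.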
As with alphabets, models are tagged with the alphabet_def line, for example
    MODEL - Final model for run text
    alphabet_def text QWERTYUIOPASDFGHJKLZCVBNMx
    GENERIC 1.886984 0.254944 0.376488 .....
See Section 8.4.
SAM has three ways of reading sequences. SAM's FASTA (and a2m-format alignment) format reader is by far the quickest. SAM also includes an HSSP alignment file reader and, for many other formats, D. G. Gilbert's readseq package. Because SAM's FASTA reader is tuned both to SAM and to a single format, it can be up to 10 times faster than readseq. Users are advised to convert large databases into FASTA format using the readseq program.
SAM includes a modified version of the readseq package by D. G. Gilbert of Indiana University. The code is based on the February 1, 1993 release, and is included as a subdirectory of the SAM source directory. We are grateful that Gilbert has provided this useful package that may be used by anyone.
The readseq package can read most common formats; examples of all these formats are included in the readseq directory.
We usually use FASTA format, which looks like this:
    ; Comments are ignored.
    >IDENTIFIER
    LMLDQQTINI IKATVPVLKE HGVTITTTFY KNLFAKHPEV RPLFDMGRQE SLEQPKALAM T
    >SEQ2 Annotations after identifier are preserved
    AKHPEVRPLFDMGRQESLEQPKALAMT
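To make the layout of the example concrete, here is a minimal Python sketch of a reader for this style of FASTA file. It is not SAM's reader. It assumes that lines beginning with `;' are comments, that the identifier is the first word after `>', and that blanks inside sequence lines are ignored. The function name read_fasta is hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text into (identifier, annotation, sequence) tuples."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith(">"):
            header = line[1:].split(None, 1)
            ident = header[0] if header else ""
            annotation = header[1] if len(header) > 1 else ""
            records.append([ident, annotation, []])
        elif records:
            # Sequence data: drop embedded blanks used for visual grouping.
            records[-1][2].append("".join(line.split()))
    return [(i, a, "".join(parts)) for i, a, parts in records]


EXAMPLE = """; Comments are ignored.
>IDENTIFIER
LMLDQQTINI IKATVPVLKE HGVTITTTFY KNLFAKHPEV RPLFDMGRQE SLEQPKALAM T
>SEQ2 Annotations after identifier are preserved
AKHPEVRPLFDMGRQESLEQPKALAMT
"""

if __name__ == "__main__":
    for ident, annotation, seq in read_fasta(EXAMPLE.splitlines()):
        print(ident, len(seq), annotation)
```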
For information on other formats, please look through the test files and the Formats file in the readseq directory.
Sequence output will be in FASTA format regardless of the input file format.
Alignment output by align2model and hmmscore is in a FASTA-compatible format (a2m) in which uppercase letters indicate match states, lowercase letters indicate insertion states, and hyphens indicate deletion states (model positions for which the given sequence has no corresponding character). The prettyalign program can be used to line up the match columns of an a2m-format alignment file.
Additionally, align2model can include periods so that its sequence outputs can be visually aligned without the use of prettyalign. If, for example, the longest sequence in a collection is 2000 characters long, all sequences will be filled (using periods) to that longest sequence's alignment length, which will be more than 2000 if any deletion states are used. Thus, allowing the periods to be printed can greatly expand the size of the alignment file. If periods are not desired, the parameter a2mdots can be set to 0. The prettyalign program will work whether or not the a2m-format alignment has periods.
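The character conventions of the a2m format can be illustrated with a short Python sketch. The helpers below are hypothetical and only classify the characters of one aligned row according to the rules stated above; they do not reproduce prettyalign or any other SAM program.

```python
def summarize_a2m(row):
    """Count match, insert, delete, and fill characters in one a2m row."""
    counts = {"match": 0, "insert": 0, "delete": 0, "fill": 0}
    for ch in row:
        if ch == "-":
            counts["delete"] += 1   # model position with no character
        elif ch == ".":
            counts["fill"] += 1     # padding period, carries no residue
        elif ch.isupper():
            counts["match"] += 1    # emitted by a match state
        elif ch.islower():
            counts["insert"] += 1   # emitted by an insertion state
        else:
            raise ValueError("unexpected character %r" % ch)
    return counts


def model_positions(row):
    """Model positions covered by a row: matches plus deletions."""
    counts = summarize_a2m(row)
    return counts["match"] + counts["delete"]


def strip_periods(row):
    """Remove the optional fill periods, leaving a compact a2m row."""
    return row.replace(".", "")


if __name__ == "__main__":
    row = "AC-Dgh.EF"  # made-up row for illustration
    print(summarize_a2m(row))
    print(model_positions(row))
    print(strip_periods(row))
```

Because every row aligned to the same model has the same number of match-plus-deletion columns, a count like model_positions is a simple consistency check on an alignment file.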
SAM can also read HSSP files.
The buildmodel program uses two sets of sequences: the training set and the test set. Training is performed exclusively on the training set, and at the end of the model creation, all sequences in the test set are checked against the model, and the average NLL distance is reported for both the training and the test set.
Training and test sets can be specified in up to two files each: train, train2, test, and test2. At most Nseq sequences will be read from any one file, so that at most 4Nseq sequences will be read in if four files are specified. The buildmodel program ignores zero-length sequences in the training set file(s) after printing a warning.
The system can also randomly partition sequences into the training and the test set. If Ntrain is set, the system will randomly pick Ntrain sequences from all files specified (training and testing) using the random seed trainseed, and reserve the rest for the test set. By default, the seed is set to the process ID number, which is printed in the output file so that the partition can be reproduced. Sequence partitioning and model training use different random seeds, though both default to the process ID.
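The idea of a reproducible, seeded partition can be sketched in Python as follows. This is an illustration of the concept only: the function name is hypothetical, and because Python's random number generator differs from the one SAM uses, the same seed value will not reproduce a partition made by buildmodel.

```python
import random


def partition(sequences, ntrain, trainseed):
    """Randomly choose `ntrain` sequences for training; the rest are the test set.

    The same seed always yields the same partition within this sketch.
    """
    rng = random.Random(trainseed)
    chosen = set(rng.sample(range(len(sequences)), ntrain))
    train = [s for i, s in enumerate(sequences) if i in chosen]
    test = [s for i, s in enumerate(sequences) if i not in chosen]
    return train, test


if __name__ == "__main__":
    pool = ["seq%d" % i for i in range(10)]  # stand-ins for sequences from all files
    train, test = partition(pool, ntrain=6, trainseed=12345)
    print(train)
    print(test)
```

Recording the seed alongside the results, as the output file does with the process ID, is what makes it possible to rebuild the same training and test sets later.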
Several other programs, such as hmmscore and align2model, take an arbitrary number of sequence database files specified as db. Unlike most variables, repeating the db declaration adds a new file to the list, rather than replacing the previous database file. Zero-length sequences are processed the same way as all other sequences.