General help for CLUSTAL W (1.8)

Clustal W is a general purpose multiple alignment program for DNA or proteins.

SEQUENCE INPUT: All sequences must be in one file, one after another. Seven formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file. All non-alphabetic characters (spaces, digits, punctuation marks) are ignored, except "-", which is used to indicate a GAP ("." in MSF-RSF).
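
The character filtering described above can be illustrated with a short Python sketch. This is not Clustal W's own code; the function name clean_sequence and its gap_chars parameter are hypothetical, and the sketch assumes plain ASCII input.

```python
def clean_sequence(raw, gap_chars="-"):
    """Keep ASCII letters and gap characters; drop everything else.

    Illustrative only: 'gap_chars' is a made-up parameter. Pass "-."
    to also keep "." when handling MSF-RSF style gaps.
    """
    return "".join(
        ch for ch in raw
        if (ch.isascii() and ch.isalpha()) or ch in gap_chars
    )


# Spaces, digits and punctuation are removed; the gap marker survives.
cleaned = clean_sequence("ACG T-12A.")
cleaned_msf = clean_sequence("ACG T-12A.", gap_chars="-.")
```

With the default setting the "." is discarded like any other punctuation mark; it is only kept when explicitly listed as a gap character.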

To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to INPUT them; go to menu item 2 to do the multiple alignment.

PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to add a new sequence to an old alignment, or to use secondary structure to guide the alignment process. GAPS in the old alignments are indicated using the "-" character. PROFILES can be input in ANY of the allowed formats; just use "-" (or "." for MSF-RSF) for each gap position.

PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in with "-" characters to indicate gaps) OR after a multiple alignment while the alignment is still in memory.

The program tries to automatically recognise the different file formats used and to guess whether the sequences are amino acid or nucleotide. This detection is not foolproof.

FASTA and NBRF-PIR formats are recognised by having a ">" as the first character in the file.

EMBL-SWISSPROT formats are recognised by the letters ID at the start of the file (the token for the entry name field).

CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.

GCG-MSF format is recognised by one of the following:

GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of the file.
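
As a rough illustration of the recognition rules listed above, the following Python sketch checks the start of a file against each token. It is not the program's actual implementation: the function name guess_format is hypothetical, skipping leading whitespace is an assumption, and GCG-MSF and GDE are left out because their tokens are not given here.

```python
def guess_format(text):
    """Guess a sequence file format from the start of its contents.

    Illustrative sketch; ignoring leading whitespace is an assumption.
    """
    head = text.lstrip()
    if head.startswith(">"):
        # ">" alone does not tell FASTA and NBRF-PIR apart.
        return "FASTA or NBRF-PIR"
    if head.startswith("ID"):
        return "EMBL-SWISSPROT"
    if head.startswith("CLUSTAL"):
        return "CLUSTAL"
    if head.startswith("!!RICH_SEQUENCE"):
        return "GCG-RSF"
    return "unknown"


fmt = guess_format("CLUSTAL W (1.8) multiple sequence alignment\n")
```

Because only the first few characters are inspected, a file that happens to begin with one of these tokens for another reason would be misclassified, which is one way the detection can fail.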

If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the sequence will be assumed to be nucleotide. This works in 97.3% of cases but watch out!
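
The 85% rule can be sketched in Python as follows. This is a minimal illustration, not the program's code: the function name guess_is_nucleotide is hypothetical, and counting only alphabetic characters (so that gaps and digits do not affect the ratio) is an assumption.

```python
NUCLEOTIDE_CODES = set("ACGTUN")


def guess_is_nucleotide(sequence, threshold=0.85):
    """Return True if at least 'threshold' of the letters are A, C, G, T, U or N."""
    # Assumption: only letters count towards the total; gaps are skipped.
    residues = [ch.upper() for ch in sequence if ch.isalpha()]
    if not residues:
        return False
    hits = sum(1 for ch in residues if ch in NUCLEOTIDE_CODES)
    return hits / len(residues) >= threshold


dna_guess = guess_is_nucleotide("ACGT-ACGTNN")      # all letters are nucleotide codes
protein_guess = guess_is_nucleotide("MKVLAAGIVG")   # only 4 of 10 letters match
```

A short protein rich in A, C, G, T or N residues would pass this test and be treated as nucleotide, which is the kind of case the warning refers to.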