Overview

Introduction

Multiple sequence alignment (MSA) is a prerequisite for virtually all comparative sequence analyses, including phylogeny reconstruction, functional motif or domain characterization, sequence-based structural alignment, inference of positive selection, and profile-based homology searches. All such analyses take the input MSA for granted, regardless of uncertainties in the alignment. Since errors in upstream stages tend to cascade downstream, alignment errors are an important concern in molecular data analysis.

The GUIDANCE web-server is a powerful and user-friendly tool for assigning a confidence score for each residue, column, and sequence in an alignment and for projecting these scores onto the MSA [1]. The server points to columns and sequences that are unreliably aligned and enables their automatic removal from the MSA, in preparation for downstream analyses.

Three algorithms for quantifying MSA uncertainties are implemented in the server. The GUIDANCE score is based on the robustness of the MSA to guide-tree uncertainty and relies on the bootstrap approach [2]. The Heads-or-Tails (HoT) score measures alignment uncertainty due to co-optimal solutions [3].

GUIDANCE2 is an integrative methodology that accounts for: (1) uncertainty in the process of indel formation; (2) uncertainty in the assumed guide tree (as GUIDANCE); (3) co-optimal solutions in the pair-wise alignments, used as building blocks in progressive alignment algorithms (as HoT).

What is GUIDANCE good for?

GUIDANCE is meant to be used for weighting, filtering or masking unreliably aligned positions in sequence alignments before subsequent analysis. For example, align codon sequences (nucleotide sequences that code for proteins) using PAGAN, remove columns with low GUIDANCE scores, and use the remaining alignment to infer positive selection using the branch-site dN/dS test. Other analyses where GUIDANCE filtering could be useful include phylogeny reconstruction, reconstruction of the history of specific insertion and deletion events, inference of recombination events, etc.

GUIDANCE2 also provides a set of alternative alignments, which can be used when adopting a statistical point of view, i.e. performing statistical analyses that rely on many possible alignments that are supported by the data.

What is GUIDANCE not good for?

GUIDANCE cannot tell you which alignment is better. For example, align the same sequences using either PRANK or MAFFT and assign GUIDANCE scores to both. If the PRANK alignment has an average score of 0.8 while the MAFFT alignment has a score of 1, this does not mean that the MAFFT alignment is more accurate. GUIDANCE measures the robustness of the alignment, so a perfect score means that MAFFT consistently aligns the sequences in the same way, regardless of the perturbations applied by GUIDANCE. Still, this one way may be consistently wrong. So GUIDANCE cannot be used to choose between alternative alignments. It can only be used to evaluate one given alignment produced by a certain alignment program and to identify columns where this aligner is less confident relative to other columns in the same alignment.

GUIDANCE is also not appropriate for evaluating an alignment produced by an approach other than the ones supported in GUIDANCE (MAFFT, MUSCLE, PRANK, PAGAN and CLUSTALW). For example, you should not run GUIDANCE on an alignment produced by T-COFFEE. Also, do not upload to GUIDANCE an alignment that you corrected manually, even if it was originally produced by one of the supported aligners. Similarly, alignments that used special features (e.g. a MAFFT alignment that uses RNA structure information) cannot be evaluated by GUIDANCE. In general, we recommend always uploading the sequences unaligned and avoiding the option to upload aligned sequences.

Input

The minimal input to the GUIDANCE server

DNA, RNA or protein sequences: The sequences should be in FASTA format only. Other sequence file formats such as Clustal and Phylip may be converted to FASTA using software such as READSEQ. The type of the sequences (nucleotides, codons, or amino-acids) should be indicated.
MSA algorithm according to which the sequences will be aligned: The same algorithm is then used to align the sequences while using bootstrap trees as guide-trees (see methodology). The server supports three progressive alignment algorithms: ClustalW [4], MAFFT [5], and PRANK [6].
The preferred methodology for quantifying MSA uncertainty (GUIDANCE2, GUIDANCE or HoT): The default is GUIDANCE2, which tends to outperform the other methods.
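As an illustration of the FASTA requirement above, the following minimal sketch reads FASTA text into a dictionary before submission. The function name parse_fasta and its error handling are hypothetical; the server's own input validation is not described here and may differ.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {sequence name: sequence}.

    Assumption: the first whitespace-delimited token of each header line
    is used as the sequence name, and names must be unique.
    """
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            name = header.split()[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    # Join the wrapped sequence lines of each record.
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1 human
ACGT
ACG
>seq2 mouse
ACGTTT
"""
sequences = parse_fasta(example)
```

Sequences wrapped over several lines are joined, so sequences["seq1"] holds the full seven-letter sequence.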

Advanced Options

Number of bootstrap repeats (not relevant to the HoT measure): The methodology is based on the bootstrap approach (see below). The higher this number is, the more accurate the confidence score, but the running time also increases linearly. The default value is 100.
Output order: This option defines the order of the sequences in the output alignment. Some alignment algorithms (e.g. ClustalW [4]) change the order of the sequences. By default, the order of the sequences corresponds to their order after being aligned using the MSA algorithm. The user may choose to set the order of the sequences in the output alignment according to the input sequences file.
Input MSA: The server allows users to upload their own MSA file instead of the unaligned sequence file. In this case, the input MSA is used as the base MSA and the confidence scores are calculated in the same way as usual (see Methodology below). This option should be used with caution. It is useful for analyzing an MSA of interest, for example, an MSA that was generated using a more accurate guide-tree than the standard neighbor joining tree. However, it is important to remember that even when the base MSA is given as input, the chosen alignment algorithm is applied many times in order to generate each of the perturbed MSAs. Therefore, supplying an MSA created by one program and inferring its confidence using another program may result in false predictions.
Advanced MAFFT/PRANK options: Advanced users can also alter the parameters passed on to the alignment program used. For example, by default, the server runs PRANK with the "+F" flag, but the experienced user may wish to remove that option in some cases (see http://wasabiapp.org/software/prank/). For MAFFT the user may enable the iterative refinement option and set the number of iterations in the MAXITERATE parameter. Additionally, an option to choose between the iterative refinement strategies genafpair, localpair, and globalpair is provided when running MAFFT. See the MAFFT website for a description of these options (http://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html).
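For readers who run the aligners locally, the options above correspond to command-line arguments of the stand-alone programs. The sketch below only assembles argument lists and does not run anything. The helper names mafft_command and prank_command are hypothetical, and how the server itself invokes the programs is not described here.

```python
MAFFT_STRATEGIES = {"genafpair", "localpair", "globalpair"}


def mafft_command(fasta_path, strategy=None, max_iterate=None):
    """Build a MAFFT argument list (MAFFT writes the alignment to stdout)."""
    cmd = ["mafft"]
    if strategy is not None:
        if strategy not in MAFFT_STRATEGIES:
            raise ValueError(f"unknown MAFFT strategy: {strategy}")
        cmd.append("--" + strategy)
    if max_iterate is not None:
        # Number of iterative refinement cycles.
        cmd += ["--maxiterate", str(max_iterate)]
    cmd.append(fasta_path)
    return cmd


def prank_command(fasta_path, output_prefix, use_plus_f=True):
    """Build a PRANK argument list, optionally with the +F flag."""
    cmd = ["prank", f"-d={fasta_path}", f"-o={output_prefix}"]
    if use_plus_f:
        cmd.append("+F")
    return cmd


mafft_args = mafft_command("seqs.fasta", strategy="localpair", max_iterate=1000)
prank_args = prank_command("seqs.fasta", "prank_out")
prank_args_no_f = prank_command("seqs.fasta", "prank_out", use_plus_f=False)
```

The resulting lists can be passed to subprocess.run on a machine where the programs are installed; the file names used here are placeholders.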

Methodology

What are the GUIDANCE2 and GUIDANCE scores?

GUIDANCE scores reflect the robustness of an alignment to perturbations.

To this end, a standard MSA is first generated, hereafter termed the "base MSA". The user may choose between ClustalW [4], MAFFT (the FFT-NS-1 variant) [5], and PRANK [6]. The main idea behind the GUIDANCE2 and GUIDANCE methodologies is to construct a set of alternative MSAs. GUIDANCE supplies bootstrap trees as guide-trees to the alignment algorithm and compares the resulting MSAs to the base MSA in order to estimate its confidence level (Figure 1). Similarly, GUIDANCE2 uses bootstrap trees, varies the gap penalty of the alignment program's scoring scheme, and employs the HoT methodology (see details below). Comparing the base alignment to the set of alternative alignments results in scores between 0 and 1 for each residue, residue-pair, column and sequence of the MSA.

An in-depth description of the algorithm behind GUIDANCE can be found in ref. [2].

Constructing the set of MSAs: Neighbor joining [7] bootstrap trees [8] are first constructed from the base MSA. Next, each bootstrap tree is given as an input guide tree to the alignment algorithm.
Calculation of the GUIDANCE scores: The method assigns a confidence score for each residue-pair in the base MSA, which is the proportion of MSAs where this pair is aligned together. The confidence score of each column/sequence is simply the average of the GUIDANCE scores over all pairs in it. The confidence score of each residue is calculated by averaging the GUIDANCE residue-pair scores over all pairs that include the residue in question.
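The score calculation described above can be sketched as follows. This is a simplified illustration and not the server's implementation. It assumes that each MSA is a list of equal-length gapped strings with the sequences in the same order, that "-" denotes a gap, and that columns holding fewer than two residues receive no score. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

GAP = "-"


def aligned_pairs_by_column(msa):
    """For each column, list the residue pairs aligned in that column.

    A residue is identified as (sequence index, index in the ungapped sequence).
    """
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        residues = []
        for s, row in enumerate(msa):
            if row[col] != GAP:
                residues.append((s, counters[s]))
                counters[s] += 1
        columns.append(list(combinations(residues, 2)))
    return columns


def guidance_like_scores(base_msa, alternative_msas):
    """Return residue-pair, column and residue scores for the base MSA."""
    base_columns = aligned_pairs_by_column(base_msa)
    alt_pair_sets = [
        {pair for column in aligned_pairs_by_column(m) for pair in column}
        for m in alternative_msas
    ]
    # Pair score: proportion of alternative MSAs in which the pair is aligned.
    pair_score = {
        pair: sum(pair in s for s in alt_pair_sets) / len(alt_pair_sets)
        for column in base_columns
        for pair in column
    }
    # Column score: average over all pairs in the column (None if no pairs).
    column_score = [
        sum(pair_score[p] for p in column) / len(column) if column else None
        for column in base_columns
    ]
    # Residue score: average over all pairs that include the residue.
    per_residue = defaultdict(list)
    for (r1, r2), score in pair_score.items():
        per_residue[r1].append(score)
        per_residue[r2].append(score)
    residue_score = {r: sum(v) / len(v) for r, v in per_residue.items()}
    return pair_score, column_score, residue_score


base_msa = ["AC-T", "A-GT", "ACGT"]
alternative_msas = [["AC-T", "A-GT", "ACGT"], ["A-CT", "A-GT", "ACGT"]]
pair_score, column_score, residue_score = guidance_like_scores(base_msa, alternative_msas)
```

In this toy example, the second base column is reproduced in only one of the two alternative MSAs, so it scores 0.5 while the other columns score 1.0.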

FIGURE 1 A schematic flowchart of the GUIDANCE algorithm. A base MSA is produced by any progressive alignment method. Bootstrap neighbor joining (NJ) trees are reconstructed and given as guide trees to the progressive alignment program, producing a set of MSAs. GUIDANCE scores are then calculated by comparing each MSA to the base MSA, and are color coded on each residue in the alignment.
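The bootstrap step of the flowchart can be illustrated by resampling alignment columns with replacement and computing a pairwise distance matrix for each replicate. This is a sketch under stated assumptions: the p-distance used here is a stand-in, since the distance measure used for the neighbor joining trees is not specified in this overview, and the tree-building step itself is omitted. All function names are hypothetical.

```python
import random


def bootstrap_columns(msa, rng):
    """Resample the columns of an MSA with replacement (one bootstrap replicate)."""
    n_cols = len(msa[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picks) for row in msa]


def p_distance(a, b):
    """Fraction of differing sites among columns where neither sequence has a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 1.0  # assumption: maximal distance when no site can be compared
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(msa):
    """Symmetric matrix of pairwise p-distances between the aligned sequences."""
    return [[p_distance(a, b) for b in msa] for a in msa]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
base_msa = ["ACGTAC-T", "ACGAACGT", "TCGTAC-T"]
replicates = [bootstrap_columns(base_msa, rng) for _ in range(100)]
matrices = [distance_matrix(rep) for rep in replicates]
```

Each matrix would then be passed to a neighbor joining implementation to obtain one bootstrap guide tree.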

What are the HoT scores?

HoT (Heads-or-Tails) scores measure the alignment uncertainty by generating a set of co-optimal MSAs and comparing them to the standard alignment. Co-optimal MSAs are a set of alignments that are given the same maximal score by the alignment algorithm. The set of co-optimal MSAs is constructed by reversing the sequences at each of the pairwise profile alignment steps of the progressive alignment algorithm [3]. The comparison results in scores between 0 and 1 for each residue, residue-pair, column and sequence of the MSA.
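The reversal idea can be demonstrated on a single pair of sequences. The sketch below uses a basic Needleman-Wunsch aligner with fixed tie-breaking and a simple scoring scheme (match 1, mismatch -1, gap -1), both chosen for illustration only. The actual HoT procedure applies the reversal at every pairwise profile alignment step of a progressive aligner, which is not reproduced here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment with a fixed tie-breaking order in the traceback."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback: prefer the diagonal move, then a gap in b, then a gap in a.
    i, j, row_a, row_b = n, m, [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


def heads_and_tails(a, b):
    """Align the sequences as given (heads) and reversed (tails)."""
    heads = needleman_wunsch(a, b)
    rev_a, rev_b, rev_score = needleman_wunsch(a[::-1], b[::-1])
    tails = (rev_a[::-1], rev_b[::-1], rev_score)
    return heads, tails


def aligned_pairs(row_a, row_b):
    """Set of (index in a, index in b) for residues aligned to each other."""
    pairs, i, j = set(), 0, 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


heads, tails = heads_and_tails("AAT", "AT")
agreement = aligned_pairs(*heads[:2]) & aligned_pairs(*tails[:2])
```

For "AAT" and "AT" the two runs return different alignments with the same score, and only the T-T pair is shared between them. In a HoT-like comparison that pair would be considered reliable and the A-A pairing uncertain.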

Running time

Running time depends on the dataset size (number and length of sequences) and (for GUIDANCE2 and GUIDANCE scores) on the number of bootstrap repeats. The major component of the running time is the multiple alignment program used; thus MAFFT runs will be fastest and PRANK runs slowest. Figure 2 compares the running time of GUIDANCE2 with that of other MSA reliability methods when using MAFFT as the alignment method. Note that GUIDANCE2 and GUIDANCE were run with the default 100 bootstrap repeats, but this number can be reduced to shorten the running time. HoT running time depends on the number of branches in the guide tree, which increases linearly with the number of sequences. Zorro, TCS, ALiScore, TrimAl and NOISY were run on a pre-calculated MSA.
Please note that the stand-alone versions of GUIDANCE and GUIDANCE2 support parallel computing, thus a significant reduction in running times is possible.

FIGURE 2: Time performance as a function of the sequence length. Sets of 40 simulated protein sequences with different lengths were aligned using MAFFT and analyzed by alignment reliability methods.

Output

GUIDANCE directs you to a web page called "GUIDANCE Job Status Page". This web page is automatically updated every 30 seconds, showing messages regarding the different stages of the server activity. When the calculation finishes, several links appear. For simplicity, we only describe the output of the GUIDANCE method. Similar output is produced by the HoT method, also implemented in this server. (for an example output page click here)

MSA colored according to the confidence score: This link is the main link for the GUIDANCE output, which is a projection of the confidence scores of each residue onto the MSA, using a color-scale. Shades of magenta indicate confidently aligned residues while shades of blue indicate uncertainly aligned residues. In addition, GUIDANCE column scores are plotted below the alignment.
MSA file: This links to a plain-text file of the base MSA, on which the colored results are being displayed.
GUIDANCE column score: Here you can find a table of GUIDANCE scores obtained for each column of the MSA. Note that the score of columns containing only one sequence cannot be estimated and is thus not presented.
GUIDANCE residue score: Here you can find a table of GUIDANCE scores obtained for each residue in the MSA. Note that the score of residues that are aligned to gaps only cannot be estimated, so these residues are not listed in the table.
GUIDANCE sequence score: Here you can find a table of GUIDANCE scores obtained for each sequence in the MSA.
GUIDANCE residue-pair score: Here you can find a table of GUIDANCE scores obtained for each residue-pair in the MSA.
Remove unreliable columns below a certain cutoff: The server provides a reduced MSA by removing unreliable columns according to the given cutoff. This MSA contains only columns with a GUIDANCE score (see "what are the GUIDANCE scores") higher than this cutoff, and is recommended for use in subsequent analyses in order to reduce errors caused by alignment errors. There is no specific recommended value for this cutoff because its effect on the alignment varies considerably among datasets. After the GUIDANCE calculation is finished, the user may select a confidence score from a drop-down list in order to remove unreliable columns below it. When selecting a confidence score, the user can see what percentage of the original columns remain in the MSA. After choosing the appropriate confidence level and clicking the "remove columns" button, the GUIDANCE server provides a hyperlink to a new reduced MSA composed of the confidently aligned columns only. The default value, 0.93, was optimized for the BAliBASE benchmark database as well as for simulation studies, and corresponds to a 12% false positive rate and a 78% true positive rate. The user is allowed to change this cutoff to retain more or fewer columns. The tradeoff, as for many other predictive tests, is between the sensitivity and specificity levels. Using a low cutoff is recommended for applications that require leaving as many accurate MSA columns as possible (i.e. high sensitivity). Other applications may require the use of confident columns only (i.e. high specificity), and thus using a high cutoff that removes many columns from the original MSA is recommended. A table describing the false-positive rate and the true-positive rate found in simulation studies for different cutoffs can be found here: "Table 1".
Remove unreliable sequences below a certain cutoff: According to this cutoff, the server enables the removal of sequences that cause errors in the MSA because their alignment with the rest of the sequences is unreliable. The reduced MSA contains only sequences with GUIDANCE score (see "what are the GUIDANCE scores") higher than this cutoff and can be used for subsequent analyses in order to reduce errors caused by alignment errors. It is possible to change this cutoff according to the proportion of sequences that the user wishes to retain. There is no specific recommended value for this cutoff because its effect on the alignment varies considerably among datasets. The web server provides a list of cutoffs with their respective effects on the remaining proportion of sequences and users are encouraged to experiment with several cutoffs. We recommend running GUIDANCE again using these sequences as input, in order to follow the improvement of the confidence level. This can be done by simply pressing the "run GUIDANCE on the confidently-aligned sequences only" button.
Mask specific residues below a certain cutoff: The GUIDANCE residue scores indicate specific residues whose alignment is unreliable (see "what are the GUIDANCE scores"). This allows masking of specific residues instead of the removal of whole columns or sequences. All residues with scores lower than the cutoff are replaced with "N" (for nucleotides) or "X" (for amino acids). This is useful, for example, to mask codons in a codon alignment before running a Ka/Ks analysis to look for positive selection (see application in ref. [9]).
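The column removal and residue masking options above can be reproduced offline from the downloaded score tables. The sketch below assumes the scores have already been loaded into Python lists. The function names are hypothetical, and keeping a column whose score equals the cutoff is an assumption, as is dropping columns that have no score.

```python
def remove_unreliable_columns(msa, column_scores, cutoff=0.93):
    """Keep columns scoring at least `cutoff`; unscored columns (None) are dropped.

    Returns the reduced MSA and the fraction of original columns retained.
    """
    keep = [c for c, s in enumerate(column_scores) if s is not None and s >= cutoff]
    reduced = ["".join(row[c] for c in keep) for row in msa]
    return reduced, len(keep) / len(column_scores)


def mask_unreliable_residues(msa, residue_scores, cutoff, mask_char):
    """Replace residues scoring below `cutoff` with `mask_char`.

    `residue_scores` has one list per sequence, aligned to the MSA columns,
    with None at gaps and at residues that have no score (left unchanged).
    """
    masked = []
    for row, scores in zip(msa, residue_scores):
        masked.append("".join(
            mask_char if ch != "-" and s is not None and s < cutoff else ch
            for ch, s in zip(row, scores)
        ))
    return masked


msa = ["ACGT", "A-GT", "ACGT"]
column_scores = [1.0, 0.5, 0.95, 0.9]
reduced_msa, retained = remove_unreliable_columns(msa, column_scores, cutoff=0.93)

residue_scores = [
    [1.0, 0.4, 1.0, 1.0],
    [1.0, None, 1.0, 1.0],
    [1.0, 0.9, 1.0, 0.2],
]
masked_msa = mask_unreliable_residues(msa, residue_scores, cutoff=0.5, mask_char="X")
```

For nucleotide alignments, mask_char would be "N" rather than "X". Trying several cutoffs and inspecting the retained fraction mirrors the drop-down list on the results page.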

Simulated benchmark

Among other benchmarks used to evaluate the performance of GUIDANCE2, we used a benchmark of 541 simulated protein sequences. Sequences were simulated using INDELible. In order to have realistic parameters for the simulations, we first selected 541 MSAs from the OrthoMaM database, for which CDS are available for all 40 mammals included in the database. This parameter setup resulted in MSAs similar to the OrthoMaM alignments (by visual comparison of alignment length and of the number and length of indels). The control files for INDELible and the resulting simulated MSAs used as the benchmark can be downloaded here.

References

1. Penn, O., E. Privman, H. Ashkenazy, G. Landan, D. Graur, and T. Pupko, GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Res, 2010. 38(Web Server issue): p. W23-W28. doi: 10.1093/nar/gkq443
2. Penn, O., E. Privman, G. Landan, D. Graur, and T. Pupko, An alignment confidence score capturing robustness to guide-tree uncertainty. Mol Biol Evol, 2010. 27(8): p. 1759-67. doi: 10.1093/molbev/msq066
3. Landan, G. and D. Graur, Local reliability measures from sets of co-optimal multiple sequence alignments. Pac Symp Biocomput, 2008. 13: p. 15-24.
4. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80.
5. Katoh, K., et al., MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, 2005. 33(2): p. 511-8.
6. Loytynoja, A. and N. Goldman, Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science, 2008. 320(5883): p. 1632-5.
7. Saitou, N. and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987. 4(4): p. 406-25.
8. Felsenstein, J., Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 1985. 39(4): p. 783-791.
9. Privman, E., O. Penn, and T. Pupko, Improving the performance of positive selection inference by filtering unreliable alignment regions. Mol Biol Evol, 2011. doi: 10.1093/molbev/msr177