GUIDANCE Overview



  • Introduction


  • What is GUIDANCE good for?
  • What is GUIDANCE not good for?


  • Input


  • Methodology


  • Output


    Introduction
    Multiple sequence alignment (MSA) is a prerequisite for virtually all comparative sequence analyses, including phylogeny reconstruction, functional motif or domain characterization, sequence-based structural alignment, inference of positive selection, and profile based homology searches. All such analyses take the MSA input for granted, regardless of uncertainties in the alignment. Since errors in upstream stages tend to cascade downstream, alignment errors are an important concern in molecular data analysis.

    The GUIDANCE web-server is a powerful and user-friendly tool for assigning a confidence score for each residue, column, and sequence in an alignment and for projecting these scores onto the MSA [1]. The server points to columns and sequences that are unreliably aligned and enables their automatic removal from the MSA, in preparation for downstream analyses.

    Two algorithms for quantifying MSA uncertainties are implemented in the server. The GUIDANCE score is based on the robustness of the MSA to guide-tree uncertainty and relies on the bootstrap approach [2]. The Heads-or-Tails (HoT) score measures alignment uncertainty due to co-optimal solutions [3].



    What is GUIDANCE good for?

    GUIDANCE is meant to be used for filtering or masking unreliably aligned positions in sequence alignments before subsequent analysis. For example, align codon sequences (nucleotide sequences that code for proteins) using PAGAN, remove columns with low GUIDANCE scores, and use the remaining alignment to infer positive selection using the branch-site dN/dS test. Other analyses where GUIDANCE filtering could be useful include phylogeny reconstruction, reconstruction of the history of specific insertion and deletion events, inference of recombination events, etc.



    What is GUIDANCE not good for?

    GUIDANCE cannot tell you which alignment is better. For example, align the same sequences using either PRANK or MAFFT and assign GUIDANCE scores to both. If the PRANK alignment has an average score of 0.8 while the MAFFT alignment scores 1, this does not mean that the MAFFT alignment is more accurate. GUIDANCE measures the robustness of the alignment, so a perfect score means that MAFFT consistently aligns the sequences in the same way, regardless of the perturbations introduced by GUIDANCE. Still, this one alignment may be consistently wrong. GUIDANCE therefore cannot be used to choose between alternative alignments. It can only be used to evaluate one given alignment produced by a certain alignment program and to identify columns where this aligner is less confident relative to other columns in the same alignment.

    GUIDANCE is also not appropriate for evaluating an alignment produced by an approach other than the ones supported in GUIDANCE (MAFFT, MUSCLE, PRANK, PAGAN and CLUSTALW). For example, you should not run GUIDANCE on an alignment produced by T-COFFEE. Also, do not upload to GUIDANCE an alignment that you corrected manually, even if it was originally produced by one of the supported aligners. Similarly, alignments that used special features (e.g. a MAFFT alignment that uses RNA structure information) cannot be evaluated by GUIDANCE. In general, we recommend always uploading the sequences unaligned and avoiding the option to upload aligned sequences.



    Input
    The minimal input to the GUIDANCE server consists of:
    1. DNA, RNA or protein sequences. The sequences should be in FASTA format only. Other sequence file formats such as Clustal and Phylip may be converted to FASTA using software such as READSEQ. The type of the sequences (nucleotides, codons, or amino-acids) should be indicated.
    2. MSA algorithm according to which the sequences will be aligned. The same algorithm is then used to align the sequences while using bootstrap trees as guide-trees (see Methodology). The server supports three progressive alignment algorithms: ClustalW [4], MAFFT [5], and PRANK [6].
    3. The preferred methodology for quantifying MSA uncertainty: GUIDANCE or HoT. The default is GUIDANCE, but see the section "When to use each methodology?".
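
    Before submitting, it can help to confirm that a file really parses as FASTA. The following is a minimal sketch, not part of the server: the function names `read_fasta` and `looks_like_nucleotides` are hypothetical, and the 90% threshold used to guess the sequence type is an arbitrary assumption.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def looks_like_nucleotides(sequence, threshold=0.9):
    """Rough guess (an assumption, not the server's rule): mostly A/C/G/T/U/N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    hits = sum(c in "ACGTUN" for c in letters)
    return hits / len(letters) >= threshold


example = """>seq1
ATGGCC
AAATAG
>seq2
ATGGCAAAATAG
"""
records = read_fasta(example)
all_nucleotide = all(looks_like_nucleotides(seq) for _, seq in records)
```

    Sequences split over several lines are joined into one string per record, which is the form most downstream tools expect.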


    Advanced Options
    • Number of bootstrap repeats (not relevant to the HoT measure)
      The methodology is based on the bootstrap approach (see below). The higher this number, the more accurate the confidence score, but the running time also increases linearly. The default value is 100.
    • Output order
      This option defines the order of the sequences in the output alignment. Some alignment algorithms (e.g. ClustalW [4]) change the order of the sequences. By default, the order of the sequences corresponds to their order after being aligned using the MSA algorithm. The user may choose to set the order of the sequences in the output alignment according to the input sequences file.

    • Input MSA
      The server allows users to upload their own MSA file instead of the unaligned sequence file. In this case, the input MSA is used as the base MSA and the confidence scores are calculated in the same way as usual (see Methodology below). This option should be used with caution. It is useful for analyzing an MSA of interest, for example, an MSA that was generated using a more accurate guide-tree than the standard neighbor joining tree. However, it is important to remember that even when the base MSA is given as input, the alignment algorithm chosen is applied many times in order to generate each of the perturbed MSAs. Therefore, supplying an MSA created by one program and inferring its confidence using another program may result in false predictions.

    • Advanced MAFFT/PRANK options
      Advanced users can also alter the parameters passed on to the alignment program. For example, by default the server runs PRANK with the "+F" flag, but the experienced user may wish to remove that option in some cases (see http://www.ebi.ac.uk/goldman-srv/prank/). For MAFFT, the user may enable the iterative refinement option and set the number of iterations with the MAXITERATE parameter. Additionally, an option to choose between the iterative refinement strategies genafpair, localpair, and globalpair is provided when running MAFFT. See the MAFFT website for a description of these options (http://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html).


    Methodology

    What are the GUIDANCE scores?

    GUIDANCE scores reflect the robustness of an alignment to guide-tree uncertainty.

    To this end, a standard MSA is first generated, hereafter termed the "base MSA". The user may choose between ClustalW [4], MAFFT (the FFT-NS-1 variant) [5], and PRANK [6]. The main idea behind the GUIDANCE method is to construct a set of MSAs, using bootstrap trees as guide-trees to the alignment algorithm, and compare them to the base MSA in order to estimate its confidence level (Figure 1). This comparison results in scores between 0 and 1 for each residue, residue-pair, column and sequence of the MSA, similar in practice to the bootstrap values that are used in phylogeny reconstruction.

    An in-depth description of the algorithm behind GUIDANCE can be found in ref. [2].

    • Constructing the set of MSAs
      Neighbor joining [7] bootstrap trees [8] are first constructed from the base MSA. Next, each bootstrap tree is given as an input guide tree to the alignment algorithm.

    • Calculation of the GUIDANCE scores
      The method assigns a confidence score for each residue-pair in the base MSA, which is the proportion of MSAs where this pair is aligned together. The confidence score of each column/sequence is simply the average of the GUIDANCE scores over all pairs in it. The confidence score of each residue is calculated by averaging the GUIDANCE residue-pair scores over all pairs that include the residue in question.
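
    The averaging described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the server's code: alignments are assumed to be lists of equal-length strings with "-" for gaps, a residue is identified by the hypothetical tuple (sequence index, position in the ungapped sequence), and all function names are made up for the example.

```python
from itertools import combinations


def column_residues(msa):
    """For each column, list (sequence index, residue index) of the non-gap cells."""
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        cells = []
        for s, row in enumerate(msa):
            if row[col] != "-":
                cells.append((s, counters[s]))
                counters[s] += 1
        columns.append(cells)
    return columns


def aligned_pairs(msa):
    """Set of residue pairs that share a column in this alignment."""
    pairs = set()
    for cells in column_residues(msa):
        pairs.update(combinations(cells, 2))
    return pairs


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else None


def guidance_scores(base_msa, perturbed_msas):
    perturbed_sets = [aligned_pairs(m) for m in perturbed_msas]
    # Residue-pair score: proportion of perturbed MSAs that keep the pair aligned.
    pair_score = {
        pair: sum(pair in ps for ps in perturbed_sets) / len(perturbed_sets)
        for pair in aligned_pairs(base_msa)
    }
    # Column score: average over all pairs in the column (None if there are none).
    column_score = [
        mean(pair_score[p] for p in combinations(cells, 2))
        for cells in column_residues(base_msa)
    ]
    # Residue and sequence scores: average over all pairs that involve them.
    by_residue, by_sequence = {}, {}
    for pair, score in pair_score.items():
        for residue in pair:
            by_residue.setdefault(residue, []).append(score)
            by_sequence.setdefault(residue[0], []).append(score)
    residue_score = {r: mean(v) for r, v in by_residue.items()}
    sequence_score = {s: mean(v) for s, v in by_sequence.items()}
    return pair_score, column_score, residue_score, sequence_score


base = ["AC-G", "ACTG", "A-TG"]
perturbed = [["AC-G", "ACTG", "A-TG"], ["A-CG", "ACTG", "A-TG"]]
pair_s, column_s, residue_s, sequence_s = guidance_scores(base, perturbed)
```

    In this toy example the second base column is reproduced by only one of the two perturbed alignments, so it receives a column score of 0.5 while the other columns score 1.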



    FIGURE 1: A schematic flowchart of the GUIDANCE algorithm. A base MSA is produced by any progressive alignment method. Bootstrap neighbor joining (NJ) trees are reconstructed and given as guide trees to the progressive alignment program, producing a set of MSAs. GUIDANCE scores are then calculated by comparing each MSA to the base MSA, and are color coded on each residue in the alignment.
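
    The bootstrap step in the flowchart rests on resampling the columns of the base MSA with replacement [8]. A minimal sketch of that resampling is given below; the function names are hypothetical, and the tree-building and realignment steps that would consume each replicate are deliberately left out.

```python
import random


def bootstrap_msa(msa, rng):
    """One bootstrap replicate: sample columns of the MSA with replacement."""
    n_cols = len(msa[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picks) for row in msa]


def bootstrap_replicates(msa, repeats=100, seed=0):
    """Generate `repeats` pseudo-alignments (100 mirrors the server default)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [bootstrap_msa(msa, rng) for _ in range(repeats)]


base = ["AC-G", "ACTG", "A-TG"]
replicates = bootstrap_replicates(base, repeats=100)
```

    Each replicate has the same dimensions as the base MSA and would be passed to a neighbor joining implementation to obtain one bootstrap guide tree.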



    What are the HoT scores?

    HoT (Heads-or-Tails) scores measure the alignment uncertainty by generating a set of co-optimal MSAs and comparing them to the standard alignment. Co-optimal MSAs are a set of alignments that are given the same maximal score by the alignment algorithm. The set of co-optimal MSAs is constructed by reversing the sequences at each of the pairwise-profile alignment steps of the progressive alignment algorithm. The comparison results in scores between 0 and 1 for each residue, residue-pair, column and sequence of the MSA.
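
    The effect of reversing the sequences can be illustrated with a toy pairwise case. The sketch below is an assumption-laden simplification, not the server's procedure: it uses a plain Needleman-Wunsch aligner with made-up scores (match 1, mismatch -1, gap -1) and a fixed tie-breaking order, and it reverses the whole input once rather than at each pairwise-profile step of a progressive alignment.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; ties are broken diagonal, then up, then left."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def aligned_pairs(row_a, row_b):
    """Pairs (i, j) of residue positions that share a column."""
    pairs, i, j = set(), 0, 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


seq_a, seq_b = "AAT", "AT"
# "Heads": align the sequences as given.
heads = needleman_wunsch(seq_a, seq_b)
# "Tails": align the reversed sequences, then reverse the result back.
tails = tuple(row[::-1] for row in needleman_wunsch(seq_a[::-1], seq_b[::-1]))
# Support for each heads pair: 1.0 if the tails alignment agrees, else 0.0.
tails_pairs = aligned_pairs(*tails)
support = {p: float(p in tails_pairs) for p in aligned_pairs(*heads)}
```

    Both alignments have the same score under this scheme, yet they place the gap differently, so only the pair of T residues is supported by both.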

    When to use each methodology?

    Generally, we recommend using GUIDANCE scores, which were demonstrated to be superior to HoT scores [2]. HoT is useful for comparison, as well as for analyzing specific and rather rare cases in which the guide-tree is highly robust. For example, it is theoretically possible to have 100% bootstrap support for all branches of the guide tree, so the GUIDANCE support will be 100% for every alignment column, while the HoT scores can still indicate alignment uncertainty. This is also the case when there is a single guide tree, such as in the alignment of two or three sequences. As a rule of thumb, we recommend using HoT for data sets of fewer than 8 sequences because of the typically small number of bootstrap trees in such cases.

    Running time

    Running time depends on the dataset size (number and length of sequences) and (for GUIDANCE scores) on the number of bootstrap repeats. The major component of the running time is the multiple alignment program used, thus MAFFT runs will be fastest and PRANK runs slowest. To aid users with estimating running time for their datasets, we include a plot of average GUIDANCE and HoT running times using either MAFFT or PRANK for several dataset sizes, from 100 to 350 sequences, roughly 300 amino acids in length (Figure 2). Note that GUIDANCE was run with the default 100 bootstrap repeats, but this number can be reduced to shorten the running time. HoT running time depends on the number of branches in the guide tree, which increases linearly with the number of sequences.


    FIGURE 2: Running-time performance as a function of the number of sequences. Simulated protein sequences roughly 300 amino acids long were aligned using MAFFT and analyzed by GUIDANCE (blue diamonds) or HoT (red squares). In addition, running time for GUIDANCE on PRANK alignments is plotted with green triangles. Each data point represents ten replicates.


    Output
    GUIDANCE directs you to a web page called "GUIDANCE Job Status Page". This web page is automatically updated every 30 seconds, showing messages regarding the different stages of the server activity. When the calculation finishes, several links appear. For simplicity, we only describe the output of the GUIDANCE method. Similar output is produced by the HoT method, also implemented in this server. (for an example output page click here)

  • MSA colored according to the confidence score
    This link is the main link for the GUIDANCE output, which is a projection of the confidence scores of each residue onto the MSA, using a color-scale. Shades of magenta indicate confidently aligned residues while shades of blue indicate uncertainly aligned residues. In addition, GUIDANCE column scores are plotted below the alignment.
  • MSA file
    This links to a plain-text file of the base MSA, on which the colored results are displayed.
  • GUIDANCE column score
    Here you can find a table of GUIDANCE scores obtained for each column of the MSA. Note that the score of columns containing only one sequence cannot be estimated and is therefore not presented.

  • GUIDANCE residue score
    Here you can find a table of GUIDANCE scores obtained for each residue in the MSA. Note that the score of residues that are aligned only to gaps cannot be estimated; such residues are not listed in the table.

  • GUIDANCE sequence score
    Here you can find a table of GUIDANCE scores obtained for each sequence in the MSA.

  • GUIDANCE residue-pair score
    Here you can find a table of GUIDANCE scores obtained for each residue-pair in the MSA.

  • Remove unreliable columns below a certain cutoff
    The server provides a reduced MSA by removing unreliable columns according to this given cutoff. This MSA contains only columns with GUIDANCE score (see "what are the GUIDANCE scores") higher than this cutoff, and is recommended to be used in subsequent analyses in order to reduce errors caused by alignment errors.

    There is no specific recommended value for this cutoff because its effect on the alignment varies considerably among datasets. After the GUIDANCE calculation is finished the user may select from a drop-down list to remove unreliable columns below a certain confidence score. When selecting a confidence score the user can see what percentage of the original columns remain in the MSA. After choosing the appropriate confidence level and clicking the "remove columns" button the GUIDANCE server provides a hyperlink to a new reduced MSA comprised of the confidently aligned columns only.

    The default value, 0.93, was optimized for the BAliBASE benchmark database as well as for simulation studies, and corresponds to a 12% false positive rate and a 78% true positive rate. The user is allowed to change this cutoff, to retain more or fewer columns. The tradeoff, as for many other predictive tests, is between the sensitivity and specificity levels. Using a low cutoff is recommended for applications that require leaving as many accurate MSA columns as possible (i.e., high sensitivity). Other applications may require the use of confident columns only (i.e., high specificity), and thus using a high cutoff that removes many columns from the original MSA is recommended. A table describing the false-positive rate and the true-positive rate found in simulation studies for different cutoffs can be found here: "Table 1".

  • Remove unreliable sequences below a certain cutoff
    According to this cutoff, the server enables the removal of sequences that cause errors in the MSA because their alignment with the rest of the sequences is unreliable.

    The reduced MSA contains only sequences with GUIDANCE score (see "what are the GUIDANCE scores") higher than this cutoff and can be used for subsequent analyses in order to reduce errors caused by alignment errors. It is possible to change this cutoff according to the proportion of sequences that the user wishes to retain. There is no specific recommended value for this cutoff because its effect on the alignment varies considerably among datasets. The web server provides a list of cutoffs with their respective effects on the remaining proportion of sequences and users are encouraged to experiment with several cutoffs. We recommend running GUIDANCE again using these sequences as input, in order to follow the improvement of the confidence level. This can be done by simply pressing the "run GUIDANCE on the confidently-aligned sequences only" button.

  • Mask specific residues below a certain cutoff
    The GUIDANCE residue scores indicate specific residues whose alignment is unreliable (see "what are the GUIDANCE scores"). This allows masking of specific residues instead of the removal of whole columns or sequences. All residues with scores lower than the cutoff are replaced with "N" (for nucleotides) or "X" (for amino acids). This is useful, for example, to mask codons in a codon alignment before running a Ka/Ks analysis to look for positive selection (see application in ref. [9]).
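
    The column-removal and residue-masking options described above amount to simple threshold operations on the score tables. The sketch below shows one way to reproduce them offline. It is not the server's code: the function names and the in-memory score layout (a list of column scores, and per-sequence lists of residue scores with None for gaps or unscored cells) are assumptions, as is the choice to drop columns whose score could not be estimated.

```python
def remove_columns(msa, column_scores, cutoff=0.93):
    """Keep only columns scoring higher than the cutoff.

    Returns the reduced MSA and the fraction of original columns retained.
    Columns with no score (None) are dropped, which is an assumption.
    """
    keep = [c for c, s in enumerate(column_scores)
            if s is not None and s > cutoff]
    reduced = ["".join(row[c] for c in keep) for row in msa]
    return reduced, len(keep) / len(column_scores)


def mask_residues(msa, residue_scores, cutoff, mask_char="X"):
    """Replace residues scoring lower than the cutoff with mask_char.

    Use "X" for amino acids and "N" for nucleotides; gaps are left untouched.
    """
    masked = []
    for row, scores in zip(msa, residue_scores):
        masked.append("".join(
            mask_char if ch != "-" and s is not None and s < cutoff else ch
            for ch, s in zip(row, scores)
        ))
    return masked


# Toy protein alignment with made-up scores, for illustration only.
msa = ["MKV-L", "MRVAL", "MK-AL"]
column_scores = [1.0, 0.95, 0.6, 0.9, 1.0]
residue_scores = [
    [1.0, 0.9, 0.4, None, 1.0],
    [1.0, 0.95, 0.7, 0.9, 1.0],
    [1.0, 0.9, None, 0.5, 1.0],
]

reduced, retained = remove_columns(msa, column_scores)          # default 0.93
masked = mask_residues(msa, residue_scores, cutoff=0.6)
```

    Masking keeps the alignment dimensions intact, whereas column removal shortens every sequence; which is preferable depends on the downstream analysis.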



    References


      1. Penn, O., E. Privman, H. Ashkenazy, G. Landan, D. Graur, and T. Pupko. (2010). GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Research, 2010 Jul 1; 38 (Web Server issue):W23-W28; doi: 10.1093/nar/gkq443
      2. Penn, O., E. Privman, G. Landan, D. Graur, and T. Pupko. (2010). An alignment confidence score capturing robustness to guide-tree uncertainty. Molecular Biology and Evolution, 2010 Aug;27(8):1759-67; doi:10.1093/molbev/msq066
      3. Landan, G. and D. Graur, Local reliability measures from sets of co-optimal multiple sequence alignments. Pac Symp Biocomput, 2008. 13: p. 15-24.
      4. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80.
      5. Katoh, K., et al., MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, 2005. 33(2): p. 511-8.
      6. Loytynoja, A. and N. Goldman, Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science, 2008. 320(5883): p. 1632-5.
      7. Saitou, N. and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987. 4(4): p. 406-25.
      8. Felsenstein, J., Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 1985. 39(4): p. 783-791.
      9. Privman, E., O. Penn, and T. Pupko, Improving the performance of positive selection inference by filtering unreliable alignment regions. Mol Biol Evol, 2011. doi: 10.1093/molbev/msr177.
