6.4B: Building Phylogenetic Trees - Biology


A phylogenetic tree sorts organisms into clades: groups of organisms that descended from a single ancestor. Maximum parsimony is one concept scientists use to choose among candidate trees.

Learning Objectives

  • Describe cladistics as a method used to create phylogenetic trees

Key Points

  • Phylogenetic trees sort organisms into clades: groups of organisms that descended from a single ancestor.
  • Organisms of a single clade are called a monophyletic group.
  • Scientists use the phrase “descent with modification” because genetic changes occur even though related organisms have many of the same characteristics and genetic codes.
  • A characteristic is considered a shared-ancestral character if it is found in the ancestor of a group and all of the organisms in the taxon or clade have that trait.
  • If only some of the organisms have a certain trait, it is called a shared-derived character because the trait arose at some point in the group's history and was not present in all of the ancestors in the clade.
  • Scientists often use a concept called maximum parsimony, which means that events occurred in the simplest, most obvious way, to aid in the tremendous task of describing phylogenies accurately.

Key Terms

  • monophyletic: of, or pertaining to, a group of organisms that consists of a single common ancestor and all of its descendants
  • derived: of, or pertaining to, conditions unique to the descendant species of a clade, and not found in earlier ancestral species
  • clades: groups of organisms that descended from a single ancestor
  • ancestral: of, pertaining to, derived from, or possessed by, an ancestor or ancestors
  • maximum parsimony: the preferred phylogenetic tree is the tree that requires the least evolutionary change to explain some observed data

Building Phylogenetic Trees

After the homologous and analogous traits are sorted, scientists often organize the homologous traits using a system called cladistics. This system sorts organisms into clades: groups of organisms that descended from a single ancestor. For example, all of the organisms in the orange region evolved from a single ancestor that had amniotic eggs. Consequently, all of these organisms also have amniotic eggs and make a single clade, also called a monophyletic group. Clades must include all of the descendants from a branch point.

Clades can vary in size depending on which branch point is being referenced. The important factor is that all of the organisms in the clade or monophyletic group stem from a single point on the tree. This can be remembered because monophyletic breaks down into “mono,” meaning one, and “phyletic,” meaning evolutionary relationship. Notice in the various examples of clades how each clade comes from a single point, whereas the non-clade groups show branches that do not share a single point.

Organisms evolve from common ancestors and then diversify. Scientists use the phrase “descent with modification” because even though related organisms have many of the same characteristics and genetic codes, changes occur. This pattern repeats as one goes through the phylogenetic tree of life:

  1. A change in the genetic makeup of an organism leads to a new trait which becomes prevalent in the group.
  2. Many organisms descend from this point and have this trait.
  3. New variations continue to arise: some are adaptive and persist, leading to new traits.
  4. With new traits, a new branch point is determined (go back to step 1 and repeat).

If a characteristic is found in the ancestor of a group, it is considered a shared-ancestral character because all of the organisms in the taxon or clade have that trait. Now, consider the amniotic egg characteristic in the same figure. Only some of the organisms have this trait; for those that do, it is called a shared-derived character because the trait arose at some point in the group's history and was not present in all of the ancestors in the tree. The tricky aspect of shared-ancestral and shared-derived characters is that these terms are relative. The same trait can be considered one or the other depending on the particular diagram being used. These terms help scientists distinguish between clades when building phylogenetic trees.

Choosing the Right Relationships

Imagine being the person responsible for organizing all of the items in a department store properly: an overwhelming task. Organizing the evolutionary relationships of all life on earth proves much more difficult: scientists must span enormous blocks of time and work with information from long-extinct organisms. Trying to decipher the proper connections, especially given the presence of homologies and analogies, makes the task of building an accurate tree of life extraordinarily difficult. Add to that the advancement of DNA technology, which now provides large quantities of genetic sequences to be used and analyzed. Taxonomy is a subjective discipline: many organisms have more than one connection to each other, so each taxonomist will decide the order of connections.

To aid in the tremendous task of describing phylogenies accurately, scientists often use a concept called maximum parsimony, which means that events occurred in the simplest, most obvious way. For example, if a group of people entered a forest preserve to go hiking, based on the principle of maximum parsimony, one could predict that most of the people would hike on established trails rather than forge new ones. For scientists deciphering evolutionary pathways, the same idea is used: the pathway of evolution probably includes the fewest major events that coincide with the evidence at hand. Starting with all of the homologous traits in a group of organisms, scientists look for the most obvious and simple order of evolutionary events that led to the occurrence of those traits.


One reliable method of building and evaluating trees, called parsimony, involves grouping taxa together in ways that minimize the number of evolutionary changes that had to have occurred in the characters. The idea here is that, all other things being equal, a simple hypothesis (e.g., just four evolutionary changes) is more likely to be true than a more complex hypothesis (e.g., 15 evolutionary changes). So, for example, based on the morphological data, the tree at left below requires only seven evolutionary changes and, based on the available evidence, is a better hypothesis than the tree at right, which requires nine evolutionary changes.

To find the tree that is most parsimonious, biologists use brute computational force. The idea is to build all possible trees for the selected taxa, map the characters onto the trees, and select the tree with the fewest number of evolutionary changes. It's a simple idea, but the first two steps require a lot of work — or a lot of computing power!

First, what is meant by "build all possible trees?" Imagine that we want to figure out the evolutionary relationships among just four taxa: A, B, C, and D. There are 15 different ways that those taxa could be related, shown below, and that number skyrockets as the number of taxa increases. For just 10 taxa, there are more than 34 million different possible trees! So the first step to building a tree using parsimony is not trivial. Because of the huge number of possible trees — far too many to be dealt with on paper — biologists use computer programs designed for this task.
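The tree counts quoted above can be checked with a short calculation. This is a minimal sketch that assumes the counts refer to rooted, fully bifurcating trees with labeled tips, for which the number of trees is the double factorial (2n − 3)!! = 3 × 5 × … × (2n − 3). The function name `count_rooted_trees` is chosen here for illustration only.

```python
def count_rooted_trees(n_taxa: int) -> int:
    """Count rooted, bifurcating, labeled trees for n_taxa tips: (2n - 3)!!"""
    if n_taxa < 2:
        return 1  # zero or one tip: only the trivial tree
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


for n in (4, 10):
    print(n, "taxa:", count_rooted_trees(n), "possible rooted trees")
```

Under that assumption the function gives 15 trees for four taxa and 34,459,425 for ten, which agrees with the figures in the text.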

Next, evolutionary transitions in each character are parsimoniously mapped onto each of the possible trees, and biologists select the tree that requires the fewest number of evolutionary changes. So for example, consider just three of the phylogenies shown above and a data matrix of two characters. The character data are mapped onto each tree in the most parsimonious way possible, but one of the trees is clearly more parsimonious than the others. Tree 1 requires just two changes in characters to account for the data, while Trees 2 and 3 require three changes to account for the data. If we were biologists using parsimony to select among these three trees, we would select the leftmost tree below as the most likely to be accurate because it hypothesizes the simplest evolutionary trajectory that accounts for the evidence we've collected. (Note that for the example above with four taxa and 15 possible trees, there are multiple character reconstructions on different trees that result in just two evolutionary changes, not just the leftmost tree shown above. In practice, biologists use many more than two characters to evaluate trees, and outgroups are used to constrain likely ancestral character states, resulting in fewer "ties" for most parsimonious tree.)
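One common way to count the minimum number of changes a tree requires is Fitch's algorithm for unordered characters. The sketch below is an illustration only, not necessarily the procedure used to produce the figures in this section. The data matrix is hypothetical (the original matrix appears only in a figure) and was chosen so that one tree needs two changes and the other two need three. Trees are written as nested tuples of taxon names.

```python
def fitch_changes(tree, states):
    """Return (possible ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):  # a tip: its observed state, no changes yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_changes(left, states)
    right_set, right_changes = fitch_changes(right, states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra change
        return common, left_changes + right_changes
    # children disagree: one change is needed on a branch below this node
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, matrix):
    """Sum the minimum number of changes over every character (column)."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        total += fitch_changes(tree, column)[1]
    return total


# Hypothetical two-character matrix for taxa A-D (not taken from the source).
matrix = {"A": "10", "B": "10", "C": "00", "D": "01"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
tree_3 = (("A", "D"), ("B", "C"))

for name, tree in [("Tree 1", tree_1), ("Tree 2", tree_2), ("Tree 3", tree_3)]:
    print(name, "requires", parsimony_score(tree, matrix), "changes")
```

With this made-up matrix, Tree 1 scores 2 and Trees 2 and 3 each score 3, so parsimony would prefer Tree 1.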

It's easy to see how complex this process could become with a large number of taxa and characters. Biologists often use data matrices with tens or hundreds of taxa and thousands of characters. Computer programs help them keep track of the huge number of possible trees and all the different ways that the character data could be mapped onto each tree.

III. Selecting individual genomes

1: Public or private genomes that are in the PATRIC database can be used to build a phylogenetic tree. Up to 100 genomes can be used in this service. To add a private genome, click on the Filter icon at the beginning of the text box underneath Select Genome.

2: This will open a drop-down box with a list of the types of genomes that can be filtered on.

3: Uncheck the boxes in front of Reference, Representative and All Other Public Genomes to enable filtering on private genomes that are in the researcher’s workspace.

4: Clicking on the drop-down box at the end of the text box under Select Genome will show the private genomes in the workspace that have most recently been annotated.

5: The list can also be filtered by beginning to type a name in the text box under Select Genome.

6: A genome of interest can be selected by double clicking on it.

7: This will auto-fill the name of the genome into the text box.

8: The genome needs to be added into the Selected Input Genome Table. Click the Add button at the end of the text box, and the genome will appear in the table.

9: To add a different type of genome (Reference, Representative, or All Other Public Genomes), click on the filter icon and check the boxes for the desired category.

10: The selected genomes will be moved to the Selected Genome Input Table by clicking on the name and then the Add button.

Phylogenetic Trees

What is a phylogenetic tree?
A phylogenetic tree is a visual representation of the relationship between different organisms, showing the path through evolutionary time from a common ancestor to different descendants. Trees can represent relationships ranging from the entire history of life on earth, down to individuals in a population.
The diagram below shows a tree of 3 taxa (a taxon, the singular of taxa, is a taxonomic unit that could be a species or a gene).

Terminology of phylogenetic trees

This is a bifurcating tree. The vertical lines, called branches, represent lineages, and nodes are where they diverge, each representing a speciation event from a common ancestor. The trunk at the base of the tree is called the root. The root node represents the most recent common ancestor of all of the taxa represented on the tree. Time is also represented, proceeding from the oldest at the bottom to the most recent at the top. What this particular tree tells us is that taxon A and taxon B are more closely related to each other than either taxon is to taxon C. The reason is that taxon A and taxon B share a more recent common ancestor than they do with taxon C. A group of taxa that includes a common ancestor and all of its descendants is called a clade. A clade is also said to be monophyletic. A group that excludes one or more descendants is paraphyletic; a group that excludes the common ancestor is said to be polyphyletic.

The image below shows several monophyletic groups (top row) versus a polyphyletic group (bottom left) and a paraphyletic group (bottom right).

The video below focuses on terminology and explores some misconceptions about reading trees:

Misconceptions and how to correctly read a phylogenetic tree

Trees can be confusing to read. A common mistake is to read the tips of the trees and think their order has meaning. In the tree above, the closest relative to taxon C is not taxon B. Both A and B are equally distant from, or related to, taxon C. In fact, switching the labels of taxa A and B would result in a topologically equivalent tree. It is the order of branching along the time axis that matters. The illustration below shows that one can rotate branches and not affect the structure of the tree, much like a hanging mobile:
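The point that rotating branches leaves the tree unchanged can be made concrete by comparing trees through the set of clades they contain, rather than through the left-to-right order of their tips. The sketch below is one simple way to do that for rooted trees written as nested tuples. The function names `clades` and `same_topology` are illustrative choices, not part of any particular library.

```python
def clades(tree):
    """Return the set of clades (as frozensets of tip names) in a rooted tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):  # a tip is a clade of one
            members = frozenset([node])
        else:  # an internal node groups all tips below it
            members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found


def same_topology(tree_a, tree_b):
    """Two rooted trees are equivalent if they contain exactly the same clades."""
    return clades(tree_a) == clades(tree_b)


original = (("A", "B"), "C")
rotated = ("C", ("B", "A"))     # branches swapped at both nodes
different = (("A", "C"), "B")   # a genuinely different branching order

print(same_topology(original, rotated))    # True
print(same_topology(original, different))  # False
```

Swapping A and B, or drawing C on the left, changes only the order of the tips. Grouping A with C instead of B changes which taxa share the more recent common ancestor, so the comparison fails.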

Hanging bird mobile by Charlie Harper

It can also be difficult to recognize how the trees model evolutionary relationships. One thing to remember is that any tree represents a minuscule subset of the tree of life.

Given just the 5-taxon tree (no dotted branches), it is tempting to think that taxon S is the most “primitive” or most like the common ancestor represented by the root node, because there are no additional nodes between S and the root. However, there were undoubtedly many branches off that lineage during the course of evolution, most leading to extinct taxa (99% of all species are thought to have gone extinct), and many to living taxa (like the purple dotted line) that are just not shown in the tree. What matters, then, is the total distance along the time axis (vertical axis, in this tree): taxon S evolved for 5 million years, the same length of time as any of the other 4 taxa. As the tree is drawn, with the time axis vertical, the horizontal axis has no meaning, and serves only to separate the taxa and their lineages. So none of the currently living taxa are any more “primitive” or any more “advanced” than any of the others; they have all evolved for the same length of time from their most recent common ancestor.

The time axis also allows us to measure evolutionary distances quantitatively. The distance between A and Q is 4 million years (A evolved for 2 million years since they split, and Q also evolved independently of A for 2 million years after the split). The distance between A and D is 6 million years, since they split from their common ancestor 3 million years ago.
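The arithmetic above follows one rule: on a tree where every tip is at the present day, the distance between two taxa is twice the age of their most recent common ancestor. The sketch below encodes only the ages stated in the text (A and Q split 2 Mya, A and D split 3 Mya, root at 5 Mya). Taxon X is left out because its position is not given in the text. Internal nodes are written as `(age_mya, left, right)`, a representation chosen here for illustration.

```python
# Internal nodes: (age in Mya, left child, right child); tips: strings.
# Taxon X is omitted because the text does not state where it branches.
tree = (5, (3, (2, "A", "Q"), "D"), "S")


def tips(node):
    """Return the set of tip names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return tips(left) | tips(right)


def mrca_age(node, a, b):
    """Age (Mya) of the most recent common ancestor of two different tips."""
    if isinstance(node, str):
        return None
    age, left, right = node
    for child in (left, right):
        below = tips(child)
        if a in below and b in below:  # both tips sit deeper in this child
            return mrca_age(child, a, b)
    here = tips(node)
    return age if a in here and b in here else None


def distance(a, b):
    """Total independent evolution separating a and b, in millions of years."""
    return 2 * mrca_age(tree, a, b)


print(distance("A", "Q"), distance("A", "D"), distance("A", "S"))
```

For the ages given, this gives 4 for A and Q and 6 for A and D, as in the text, and 10 for A and S because they share only the root.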

Phylogenetic trees can have different forms – they may be oriented sideways, inverted (most recent at bottom), or the branches may be curved, or the tree may be radial (oldest at the center). Regardless of how the tree is drawn, the branching patterns all convey the same information: evolutionary ancestry and patterns of divergence.

This video does a great job of explaining how to interpret species relatedness using trees, including describing some of the common incorrect ways to read trees:

Constructing phylogenetic trees

Many different types of data can be used to construct phylogenetic trees, including morphological data (such as structural features, types of organs, and specific skeletal arrangements) and genetic data (such as mitochondrial DNA sequences, ribosomal RNA genes, and any genes of interest).

These types of data are used to identify homology, which means similarity due to common ancestry. This is simply the idea that you inherit traits from your parents, only applied on a species level: all humans have large brains and opposable thumbs because our ancestors did; all mammals produce milk from mammary glands because their ancestors did.

Trees are constructed on the principle of parsimony, which is the idea that the most likely pattern is the one requiring the fewest changes. For example, it is much more likely that all mammals produce milk because they all inherited mammary glands from a common ancestor that produced milk from mammary glands, versus multiple groups of organisms each independently evolving mammary glands.

Misconceptions and how to correctly read a phylogenetic tree

Trees can be confusing to read. Below we enumerate some common misconceptions and how to correct your thinking.

  1. A common mistake is to read the tips of the trees and think their order has meaning. In the tree at the top of the page, the closest relative to taxon C is not taxon B. Both A and B are equally distant from, or related to, taxon C. In fact, switching the labels of taxa A and B would result in a topologically equivalent tree. It is the order of branching along the time axis that matters. The illustration below shows that one can rotate branches and not affect the structure of the tree, much like a hanging mobile:


2. Misconception: The tree you see includes all taxa in the clade. Reality: Taxa along the branches may be extinct or omitted. Also, the phyletic evolution that occurs along a branch is not usually included in the branching tree. Phyletic evolution is the evolutionary change along a branch that doesn’t result in speciation. Also, the root connects the tree you see out to the rest of the “tree of life.” Any tree represents a minuscule subset of the tree of life.

An ultrametric tree of 5 taxa (A, Q, D, X, S) with evolutionary time shown in millions of years ago (Mya). The purple dotted line represents an evolutionary lineage with currently living taxa not represented in the 5-taxon tree. The fine dotted lines indicate a few evolutionary lineages that have gone extinct. Diagram is original work of Jung Choi.

3. Misconception: The taxon with the longest branch back to a node of common ancestry must be the most primitive taxon in the tree. Reality: None of the currently living taxa are any more “primitive” or any more “advanced” than any of the others; they have all evolved for the same length of time from their most recent common ancestor. In the 5-taxon tree above, taxon S has the longest branch. While it is tempting to think that S is the most “primitive,” or most like the common ancestor represented by the root node, there were undoubtedly many branches off that lineage during the course of evolution, most leading to extinct taxa (99% of all species are thought to have gone extinct), and many to living taxa (like the purple dotted line) that are just not shown in the tree. Taxon S evolved for 5 million years, the same length of time as any of the other 4 taxa in that tree. As the tree is drawn, with the time axis vertical, the horizontal axis has no meaning, and serves only to separate the taxa and their lineages.

4. Misconception: Time is always oriented from recent at the top to old at the bottom. Reality: Phylogenetic trees can have different forms — they may be oriented sideways, inverted (most recent at bottom), or the branches may be curved, or the tree may be radial (oldest at the center). Regardless of how the tree is drawn, the tips are more recent in time, and branching patterns all convey the same information: evolutionary ancestry and patterns of divergence.

The vertebrate evolution phylogeny linked here shows time running from left to right, with the present day at the right. Because this phylogeny overlays a timescale, it is called an evogram. Notice also that in the phylogeny, some taxa are alive today (extant), but others are not (extinct); extinct taxa don’t extend to the present day, such as Tiktaalik at the bottom of the image. Key character states are indicated with small ticks along the branches. The immediate descendants have these shared, derived character states, and most of their descendants will have them as well, unless the traits are lost in a future branch of the lineage.


A phylogenetic tree is an estimate of the relationships among taxa (or sequences) and their hypothetical common ancestors (Nei and Kumar 2000; Felsenstein 2004; Hall 2011). Today most phylogenetic trees are built from molecular data: DNA or protein sequences. Originally, the purpose of most molecular phylogenetic trees was to estimate the relationships among the species represented by those sequences, but today the purposes have expanded to include understanding the relationships among the sequences themselves without regard to the host species, inferring the functions of genes that have not been studied experimentally (Hall et al. 2009), and elucidating mechanisms that lead to microbial outbreaks (Hall and Barlow 2006), among many others. Building a phylogenetic tree requires four distinct steps: (Step 1) identify and acquire a set of homologous DNA or protein sequences, (Step 2) align those sequences, (Step 3) estimate a tree from the aligned sequences, and (Step 4) present that tree in such a way as to clearly convey the relevant information to others.

Typically you would use your favorite web browser to identify and download the homologous sequences from a national database such as GenBank, then one of several alignment programs to align the sequences, followed by one of many possible phylogenetic programs to estimate the tree, and finally, a program to draw the tree for exploration and publication. Each program would have its own interface and its own required file format, forcing you to interconvert files as you moved information from one program to another. It is no wonder that phylogenetic analysis is sometimes considered intimidating!

MEGA5 (Tamura et al. 2011) is an integrated program that carries out all four steps in a single environment, with a single user interface, eliminating the need for interconverting file formats. At the same time, MEGA5 is sufficiently flexible to permit using other programs for particular steps if that is desired. MEGA5 is, thus, particularly well suited for those who are less familiar with estimating phylogenetic trees.

Step 1: Acquiring the Sequences

Ironically, the first step is the most intellectually demanding, but it often receives the least attention. If not done well, the tree will be invalid or impossible to interpret or both. If done wisely, the remaining steps are easy, essentially mechanical, operations that will result in a robust meaningful tree.

Often, the investigator is interested in a particular gene or protein that has been the subject of investigation and wishes to determine the relationship of that gene or protein to its homologs. The word “homologs” is key here. The most basic assumption of phylogenetic analysis is that all the sequences on a tree are homologous, that is, descended from a common ancestor. Alignment programs will align sequences, homologous or not. All tree-building programs will make a tree from that alignment. However, if the sequences are not actually descended from a common ancestor, the tree will be meaningless and may quite well be misleading. The most reliable way to identify sequences that are homologous to the sequence of interest is to do a Basic Local Alignment Search Tool (BLAST) search (Altschul et al. 1997) using the sequence of interest as a query.

Step 1.1

When you start MEGA5, it opens the main MEGA5 window. From the Align menu choose Do Blast Search. MEGA5 opens its own browser window to show a nucleotide BLAST page from National Center for Biotechnology Information (NCBI). There is a set of five tabs near the top of that page (blastn, blastp, blastx, tblastn, and tblastx). By default the blastn (Standard Nucleotide BLAST) tab is selected. If your sequence is that of a protein click the blastp tab to show the Standard Protein BLAST page.

Note that NCBI frequently changes the appearance of the BLAST page, so it may differ in some details from that described here.

There is a large text box (Enter accession number … ) where you enter the sequence of interest. You can paste the query sequence directly into that box. However, if your query sequence is already itself in one of the databases, you can paste its accession number or gi number. If your DNA sequence is part of a genome sequence, you can enter the genome's accession number then, in the boxes to the right (Query subrange) enter the range of bases that constitute your sequence. (You really do not want to use a several megabase sequence as your query!)

The middle section of the page allows you to choose the databases that will be searched and to constrain that search if you so desire. The default is Nucleotide collection (nr/nt), but the drop-down text box with triangle allows you to choose among a large number of alternatives, for example, Human Genomic or NCBI genomes.

The optional Organisms text box allows you to limit your search to a particular organism or to exclude a particular organism. For instance, if your sequence is from humans you might want to exclude Humans from the search, so that you do not pick up a lot of human variants when you are really interested in homologs in other species. To include more organisms click the little + sign next to the options box.

The Exclude option allows you to exclude, for instance, environmental samples.

Step 1.2: Which BLAST Algorithm to Use?

The bottom section of the page allows you to choose the particular variant of BLAST that best suits your purposes. For nucleotides, the choices are megablast for highly similar sequences, discontiguous megablast for more dissimilar sequences, or blastn for somewhat similar sequences. The default is blastn, but if you are only interested in identifying closely related homologs tick megablast. This is the first choice that really demands some thought. The sequences that will be on your tree are very much determined by the choice you make at this point.

At the very bottom of the page click the BLAST button to start the search; do not tick the “show results in a new window” box. A results window will appear, possibly with a graphic illustrating domains that have been identified, typically with a statement similar to “this page will be automatically updated in 5 seconds.” Eventually, the final results window will appear. The top panel summarizes the properties of the query sequences and a description of the database that was searched. Below that is a graphic that illustrates the alignment for the top 100 “hits” (sequences identified by the search). Scroll down below that to see the list of sequences producing significant alignment scores. For each sequence, there is an Accession number (a clickable link), a description, a Max Score (also a clickable link), a Total score, a Query coverage, an E value, and a Max ident. You use that information to decide which of those sequences to add to your alignment and thus to include on your tree.

The description helps you decide whether you are interested in that particular sequence. There may be several sequences from the same species; do you want all of those, or perhaps only one representative of a species, or even of a genus? If you are possibly interested in that sequence, look at Query coverage. Are you interested in a homolog that only aligns with 69% of the query? If not, ignore that sequence and move on. Are you interested in a sequence that is 100% identical to your query? If you are only interested in more distantly related homologs, you may not be. If you want the most inclusive tree possible, you may be. You must decide; there is no algorithm that can tell you what to include.
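Once you have made those decisions, it can help to write them down as explicit criteria so the same choices can be reapplied to a later search. The sketch below is only a way of recording choices you have already made; it does not choose for you. The `Hit` record, its field names, the accession strings, and the threshold values are all hypothetical and do not correspond to any BLAST output format or API.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One row of a hit list, copied by hand; all fields are hypothetical."""
    accession: str
    description: str
    query_coverage: float  # percent of the query that aligns
    e_value: float
    max_ident: float       # percent identity


def select_hits(hits, min_coverage=70.0, max_e_value=1e-5, keep_identical=True):
    """Keep hits that satisfy criteria the investigator has decided on."""
    chosen = []
    for hit in hits:
        if hit.query_coverage < min_coverage:
            continue  # aligns with too little of the query
        if hit.e_value > max_e_value:
            continue  # alignment too likely to arise by chance
        if not keep_identical and hit.max_ident >= 100.0:
            continue  # identical sequences add nothing for distant homologs
        chosen.append(hit)
    return chosen


# Made-up example rows.
hits = [
    Hit("HYPO_0001", "example protein, species 1", 100.0, 0.0, 100.0),
    Hit("HYPO_0002", "example protein, species 2", 98.0, 1e-80, 85.0),
    Hit("HYPO_0003", "partial match, species 3", 69.0, 1e-20, 60.0),
]

kept = select_hits(hits, min_coverage=70.0, keep_identical=False)
print([hit.accession for hit in kept])
```

With these made-up rows and settings, only `HYPO_0002` is kept: the first hit is excluded as identical to the query and the third because it covers only 69% of the query.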

If you decide that you are interested in a hit sequence, click the “Max score” link to take you down to the series of alignments. What you see depends on whether your query was a DNA sequence or a protein sequence.

Step 1.3: DNA Sequences

The alignment of the query to the hit begins with a link to the sequence file via its gi and accession numbers. If that link is to a genome sequence, or even to a large file that includes sequences of several genes, you will not want to include the entire sequence in your alignment. There are two ways to deal with the issue. (1) Look at the alignment itself and note the range of nucleotides in the subject. Be sure to notice whether the query aligns with the subject sequence itself (Strand = plus/plus) or with its complement (Strand = plus/minus). Click the link to bring up the sequence file. At the top right click the triangle in the gray Change region shown box, then enter the first and last nucleotides of the range, then click the Update View button. In the gray Customize view region, below, tick the Show sequence box, and if Strand = plus/minus also tick the Show reverse complement box, then click the Update View button. Finally, click the Add to Alignment button (a red cross) near the top of the window. (2) If your query is a coding sequence or is some other notable feature, you may see Features in this part of subject sequence: just below the sequence description, with a link to the feature. Click that feature link to bring up the sequence file already showing the region of interest. Check whether the sequence shown is the reverse complement of the query, and if it is, tick the Show reverse complement box in the Customize view region, update the view, then click the Add to Alignment button (a red cross) near the top of the window.
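What the Change region shown and Show reverse complement options do to a sequence can be sketched in a few lines. This is an illustration of the operation, not of how NCBI or MEGA5 implements it. It assumes the range is given as 1-based, inclusive positions, as in a BLAST alignment, and the function names are chosen here for illustration.

```python
# Map each base to its complement; any other character (e.g. N) is left as-is.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(seq: str, first: int, last: int, minus_strand: bool = False) -> str:
    """Cut out positions first..last (1-based, inclusive) from the subject.

    For a plus/minus alignment, pass minus_strand=True so that the region is
    returned in the same orientation as the query.
    """
    low, high = sorted((first, last))  # minus-strand ranges are often listed high-to-low
    region = seq[low - 1:high]
    return reverse_complement(region) if minus_strand else region


subject = "AAACCGTT"
print(extract_region(subject, 4, 6))                     # CCG
print(extract_region(subject, 6, 4, minus_strand=True))  # CGG
```

Forgetting the reverse complement for a plus/minus hit would put that sequence into the alignment in the wrong orientation, which is why the text stresses checking the strand.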

Step 1.31. When you click the Add to Alignment button, MEGA5's Alignment Explorer window opens and the sequence is added to that window. After adding a sequence to the Alignment Explorer use the back arrow in the BLAST window to return to the list of homologous sequences and add another sequence of interest.

Step 1.4: Protein Sequences

The main difference from nucleotide searches is that you may see accession number links to several protein sequence files. These all have the same amino acid sequence, although their underlying coding sequences may differ. Click any one of the links to bring up the protein sequence file, then click the Add to Alignment button.

You may find that all the hits that are returned from your search are from very closely related organisms; that is, if your query was an Escherichia coli protein, all the hits may be from E. coli, Salmonella, and closely related species. If the hits all show a high maximum identity and you are pretty sure the sequence occurs in more distantly related sequences, you have probably come up against the default maximum of 100 target sequences. Repeat the search, but before you click the BLAST button to start the search, notice that immediately below that button is a cryptic line “+ Algorithm Parameters.” Click the plus sign to reveal another section of the BLAST setup page. Set the Max Target Sequences to a larger value and repeat the search. You may also want to exclude some closely related species in the Choose Search Set section above. Enter a taxon, for example, E. coli, in the box and tick the Exclude box. If you want to exclude more than one species, click the plus sign to the right of Exclude to add another field. You can exclude up to 20 species.

When you try to return to the list of hits you may get a page that says “How Embarrassing! Error: −400 Cache Miss.” Click the circular arrow next to the Add to Alignment button. You will be sent to the main BLAST page but do not despair. At the top right of that page is a Your Recent Results section. The top link in the list is your most recent search. Just click that link to get back to your results.

When you have added all the sequences that you want to, just close the MEGA5 browser window.

In the Alignment Editor window save the alignment by choosing Save Session from the Data menu. I like to use a name such as Myfile_unaligned just to remind myself that the sequences have not been aligned. The file will have the extension .mas.

Step 1.5: Alternatives to MEGA5 for Identifying and Acquiring Sequences

Step 1.51. You can access NCBI BLAST through any web browser that NCBI supports. In the Basic BLAST section, click the nucleotide blast or protein blast link to get to a page identical to that described earlier. Everything is the same as when using MEGA5's browser except that you cannot click a convenient button to add the sequences to the Alignment Editor.

Step 1.52. Open a new file in a text editor. You can use MEGA5's built-in text editor by choosing Edit a Text File from the File menu. That editor has several functions for editing molecular sequences, including reverse complementing and converting to several common formats including Fasta. Alternatively, use Notepad for Windows or TextWrangler for Mac. Save the file with a meaningful name and the extension .fasta, for example, myfile.fasta. Do not use Microsoft Word, WordPad, TextEdit (Mac), or another word processor!

Step 1.53. When you have identified the sequence that you want to add and clicked the link to take you to the page for that sequence file, adjust the Region Shown and Customize View if necessary. Notice the Display Settings link near the top left of the page. The default setting is GenBank (full). Change that to Fasta (text), select everything, copy it, then paste it into the text editor file. As you add sequences to the file, it is convenient, but not necessary, to leave blank lines between the sequences.
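To make the file layout concrete, here is a minimal sketch of reading such a hand-assembled Fasta file in Python. The two records are made-up placeholders, and the parser is an illustration rather than a full implementation of every Fasta variant; it shows that blank lines between records are harmless.

```python
def parse_fasta(text):
    """Parse Fasta text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined end to end
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical file contents, as they might look after pasting two sequences.
example = """>seq1 hypothetical protein
MKTAYIAK
QRQISFVK

>seq2 another hypothetical protein
MKTAYLAK
"""
records = parse_fasta(example)
```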

Identifying and acquiring sequences is discussed in more detail in Chapter 3 of Phylogenetic Trees Made Easy, 4th edition (PTME4) (Hall 2011).

The next section explains how to import those sequences into MEGA5's alignment editor.

Step 2: Aligning the Sequences

If the Alignment Explorer window is not already open, in MEGA5's main window choose Open a File/Session from the File menu. Choose the MEGA5 alignment file (.mas) or the sequence file (.fasta) that you saved in Step 1. In the resulting dialog choose Align.

The Alignment Explorer shows a name for each sequence at the left, followed by the sequence, with colored residues. Typically the name is very long. That name is what will eventually appear on the tree, and long names are generally undesirable. This is the time to edit those names, in fact it is the only practical time to edit the names, so do not miss the opportunity. Simply double click each name and change it to something more suitable.

If your sequence is DNA you will see two tabs: DNA Sequences and Translated Protein Sequences. The DNA sequences tab is chosen by default. Click the Translated Protein Sequences tab to see the corresponding protein sequence.

Step 2.1

Now is the time to align the sequences. Two alignment methods are provided: ClustalW (Thompson et al. 1994) and MUSCLE (Edgar 2004a, 2004b). Either can be used, but in general MUSCLE is preferable. In the tool bar, near the top of the window, Clustal alignment is symbolized by the W button, and MUSCLE by an arm with clenched fist to “show a muscle.” Click one of those buttons or choose Clustal or Muscle from the Alignment menu. If your sequence is DNA you will see two choices: Align DNA and Align Codons. If your sequence is a DNA coding sequence, it is very important to choose Align Codons. That ensures that the sequences are aligned by codons, a much more realistic approach than direct alignment of the DNA sequences because it avoids introducing gaps at positions that would result in frame shifts in the real sequences.
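One common way to obtain a codon alignment is to align the translated proteins and then thread each coding sequence back onto its aligned protein, so that every gap spans a whole codon. The sketch below illustrates that idea only; whether MEGA5 does exactly this internally is not stated here, and the function name and sequences are hypothetical.

```python
def thread_codons(aligned_protein, coding_dna):
    """Place a three-base gap in coding_dna wherever aligned_protein has a gap."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(coding_dna) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the protein it encodes")
    codons, pos = [], 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")  # a gap always covers one full codon
        else:
            codons.append(coding_dna[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Made-up example: the protein alignment placed one gap after the second residue.
codon_aligned = thread_codons("MK-T", "ATGAAAACC")
```

Because gaps are only ever inserted in multiples of three bases, the reading frame downstream of each gap is preserved.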

Step 2.2

Choosing an alignment method opens a settings window for that method. For MUSCLE, I recommend that you accept the default settings. For ClustalW, the default settings are fine for DNA, but for proteins, I recommend changing the Multiple Alignment Gap Opening penalty to 3 and the Multiple Alignment Gap Extension penalty to 1.8.

Step 2.3

Click the OK button to start the alignment process. Depending on the number of sequences involved and the method you chose, alignment may take anywhere from a few seconds to a few hours. When the alignment is complete Save the session. I like to save the aligned sequences under a different name, thus if my original file was Myfile_unaligned.mas, I would save the aligned sequence as just Myfile.mas.

Step 2.4

MEGA5 cannot use the .mas file directly to estimate a phylogenetic tree, so you must also choose Export Alignment from the Data menu and export the file in MEGA5 format where it will get a .meg extension. You will be asked to input a title for the data. You can leave the title blank if you wish, but it is helpful to add some sort of title that is meaningful to you. If it is an alignment of DNA sequences you will also be asked whether they are coding sequences.

Alignment is discussed in more detail in Chapter 4 of PTME4 (Hall 2011).

Step 2.5: An Alternative to Aligning with MEGA5

Once the alignment is complete, you will see that gaps have been introduced into the sequences. Those gaps represent historical insertions or deletions, and their purpose is to bring homologous sites into alignment in the same column. It should be appreciated that just as a phylogenetic tree is an “estimate” of relationships among sequences, an alignment is just an estimate of the positions of historical insertions and deletions. The quality of the alignment can affect the quality of a phylogenetic tree, but MEGA5 offers no way to judge the quality of the alignment. The web-based program Guidance provides five different methods of alignment, but more importantly, it evaluates the quality of the alignment and identifies regions and sequences that reduce that quality. Discussion of Guidance (Penn et al. 2010) is beyond the scope of this article, but the topic is covered in detail in Chapter 12 of PTME4 (Hall 2011).

Guidance requires that the unaligned sequences are provided in a file in Fasta format. See Hall (2011) for a detailed description of the Fasta format. If you downloaded the sequences through your favorite web browser and saved them as a .fasta file that file can be used as the input for Guidance. If you used MEGA5 to download the sequences into the Alignment Explorer you can export the unaligned sequences in FASTA format by choosing Export Alignment from the Data menu, then choosing FASTA format. If you forgot to keep the unaligned sequences you can select all the sequences (Control-A), then choose Delete Gaps from the Edit menu before you export the sequences in FASTA format.
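If the aligned sequences are already in a script-friendly form, the Delete Gaps step amounts to removing the gap characters and writing the result back out in Fasta format. The sketch below assumes gaps are written as "-" and uses made-up sequences; the helper names are hypothetical.

```python
def strip_gaps(aligned_records):
    """Remove gap characters so each sequence is unaligned again."""
    return [(name, seq.replace("-", "")) for name, seq in aligned_records]


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as Fasta text, wrapping sequence lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical two-sequence alignment.
aligned = [("a", "AC--GT"), ("b", "ACTTGT")]
unaligned_fasta = to_fasta(strip_gaps(aligned))
```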

Step 3: Estimate the Tree

There are several widely used methods for estimating phylogenetic trees (Neighbor Joining, UPGMA, Maximum Parsimony, Bayesian Inference, and Maximum Likelihood [ML]), but this article will deal with only one: ML.

Step 3.1

In MEGA5's main window choose Open a File/Session from the File menu and open the .meg file that you saved in Step 2.

Step 3.2

ML uses a variety of substitution models to correct for multiple changes at the same site during the evolutionary history of the sequences. The number of models and their variants can be absolutely bewildering, but MEGA5 provides a feature that chooses the best model for you. From the Models menu choose Find Best DNA/Protein Models (ML) … . A preferences dialog will appear, but you are safe enough accepting the default settings. Click the Compute button to start the run. The analysis can take quite a while to work through all the available models, but a progress bar shows how things are coming along.

When complete a window appears that lists the models in order of preference. Note the preferred model, then estimate the tree using that model. For the examples below, the WAG + G + I model was the best.
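To give a feel for how such a ranking can be produced, the sketch below orders candidate models by the Bayesian Information Criterion (BIC), a widely used criterion in which lower scores are better. Treating BIC as the ranking criterion is an assumption made for illustration, and the log-likelihoods, parameter counts, and number of sites are invented numbers, not output from MEGA5.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: lower values indicate a preferred model."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


# Hypothetical (log-likelihood, number of free parameters) for each model.
candidates = {
    "JTT": (-5230.4, 37),
    "WAG": (-5215.9, 37),
    "WAG+G": (-5100.2, 38),
    "WAG+G+I": (-5080.7, 39),
}
n_sites = 300  # hypothetical alignment length

ranked = sorted(candidates, key=lambda model: bic(*candidates[model], n_sites))
for model in ranked:
    print(f"{model:8s} BIC = {bic(*candidates[model], n_sites):.1f}")
```

The penalty term grows with the number of parameters, so a richer model is preferred only when its likelihood improves enough to pay for the extra parameters.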

Step 3.3

From the Phylogeny menu choose Construct/Test Maximum Likelihood Tree … . A preferences dialog similar to that in figure 1 will appear.



Phylogenetic trees have become a standard tool in the study of adaptation, and such uses are often referred to as the “comparative method.” First, it is necessary to establish that a particular “adaptation” is distributed as an apomorphy within the group in question and then, if there are multiple origins, to determine if these origins are correlated with other characters and/or environmental variables. While numerous statistical approaches have been suggested for such studies, they all assume that multiple independent origins of characters correlated with environmental or historical factors are evidence of adaptation. Indeed, some workers maintain that it is only possible to discuss adaptation in a historical context, i.e., based on explicit phylogenetic trees. Undoubtedly continued work in these areas will result in improved statistical tests for adaptation based on character distributions on phylogenetic trees.


How to Draw a Phylogenetic Tree
(Using differences in molecular sequence)

A phylogenetic tree uses data to indicate relatedness of different species. This webpage explains how to construct a phylogenetic tree using differences in molecular sequences (such as differences in amino acids, or differences in nucleotides).

Numbers in the table below represent mutational differences in a particular gene. Higher numbers indicate more genetic differences between two species. The longer two species (or subspecies) are isolated, the more likely there will be an accumulation of mutational differences.

1. Identify the most different, or ancestral, species. This is the one that has the most mutational differences from the other species. In the chart above, the species with the most mutational differences (highest numbers) is Species A.

2. Select the next most different, or ancestral, species, the one that shares a common ancestor with the previous species (Species A). To do this, look at the Species A column and look for the species that has the fewest mutational differences, which is Species B with 27.

3. Begin drawing the phylogenetic tree. This is commonly done by drawing a line with branches indicating a possible shared common ancestor. The break (or node) of a branch indicates a common ancestor, and the branch itself indicates speciation. In a phylogenetic tree, line length does not necessarily indicate the age of a species, just relatedness and ancestry.

4. Add the next organism. To do this, look at the second organism's data (Species B), and look for the most genetically similar organism (for that particular gene). From the table, Species B may share a common ancestor with Species C (13 differences).

5. Add the next organism. Looking at the Species C row and column, find the most genetically similar organism, which is Species D (3 differences).

6. Add the remaining organisms. Looking at Species D, the lowest number is still the "3" from the mutation differences with Species C. What this may indicate is that Species D shares a common ancestor with Species C, but not with the remaining species (Species E and Species F). Looking at the Species E and Species F data, Species E is very similar to Species F, and Species E is similar to Species C. This suggests that Species E shared a common ancestor with Species C, not Species D. Species F then shares a common ancestor with Species E.

7. Check to confirm that your phylogenetic tree matches the data in the table.
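The lookups in the steps above can be sketched in a few lines of Python. Only three values come from the text (A to B = 27, B to C = 13, C to D = 3); every other number in the table below is a hypothetical filler chosen to be consistent with the description, and the function names are made up for this illustration.

```python
SPECIES = ["A", "B", "C", "D", "E", "F"]

# Pairwise mutational differences. Values other than A-B, B-C and C-D
# are hypothetical placeholders.
DIFFERENCES = {
    ("A", "B"): 27, ("A", "C"): 30, ("A", "D"): 32, ("A", "E"): 31, ("A", "F"): 33,
    ("B", "C"): 13, ("B", "D"): 15, ("B", "E"): 14, ("B", "F"): 16,
    ("C", "D"): 3, ("C", "E"): 5, ("C", "F"): 8,
    ("D", "E"): 7, ("D", "F"): 9,
    ("E", "F"): 4,
}


def differences(x, y):
    """Look up the number of differences regardless of pair order."""
    return DIFFERENCES[(x, y)] if (x, y) in DIFFERENCES else DIFFERENCES[(y, x)]


def most_different():
    """Step 1: the species with the largest total number of differences."""
    return max(SPECIES, key=lambda s: sum(differences(s, t) for t in SPECIES if t != s))


def most_similar(species, exclude=()):
    """Steps 2 to 6: the species with the fewest differences from `species`."""
    others = [t for t in SPECIES if t != species and t not in exclude]
    return min(others, key=lambda t: differences(species, t))


print(most_different())     # starting point for the tree
print(most_similar("A"))    # next species to add
print(most_similar("B", exclude=("A",)))
```

This only automates reading the table; deciding how the branches join, as in step 6, still requires comparing several values by eye or using a proper clustering method.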


Centre for Life’s Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, London, UK

Paschalia Kapli, Ziheng Yang & Maximilian J. Telford



K.P., Z.Y. and M.J.T. contributed to all aspects of the article.


Results and discussion

Running BLAST-Explorer

The entry page of BLAST-Explorer is a simplified BLAST form that receives a single FASTA-formatted query sequence as input and allows (i) the selection of BLASTN, BLASTP, TBLASTN, or BLASTX [4] as an alignment algorithm, (ii) the selection of a sequence database (Genbank NT for nucleotides; Genbank Non Redundant Protein, Ensembl, PDB, RefSeq, Uniprot and Swissprot for proteins), (iii) the selection of a BLAST E-value threshold and (iv) the option of filtering out low-complexity sequence segments. BLAST searches report a maximum of 5,000 hits.

Small scale selection mode

By default, the result page shows only the top-100 scoring BLAST hits, while the remaining hits are kept in memory and can be activated using the large-scale selection tools (next section). Small-scale selection tools apply only to the top-100 scoring BLAST hits. The central tool in this mode is the sequence similarity tree, which provides an approximate picture of the phylogenetic relationships between the query and the top BLAST hits (Fig. 1A). BLAST hits are renamed according to the species name. The similarity tree is annotated with meta-information including hit description (Fig. 1B), alignment coverage (Fig. 1C), and taxonomy-based coloring (Fig. 1D). The tree image allows navigation across the BLAST result page (clicking on an alignment coverage bar [Fig. 1C] leads to the corresponding pairwise alignment [Fig. 1E]), gives access to the database record (by clicking on the hit name), and allows selection of hits individually (check-boxes) or in bulk (by clicking on internal branches).

BLAST-Explorer main interface. BLAST-Explorer main interface showing the similarity tree (A), hit descriptions (B), a coverage diagram representing the alignment of the hit sequences on the query (C), the taxonomy color code (D), individual BLAST pairwise alignments (E), the small-scale (F) and large-scale (H) selection tool panels. The "Score histogram" tool (G) and "Selection on taxonomy" tool (I) are given as examples.

A dropdown menu (Fig. 1F) gives access to additional small-scale selection tools:

o The top-panel shows the number of gap-free sites in the BLAST-reconstructed multiple-alignment of selected sequences (see supplementary data). This number is dynamically updated when BLAST hits are added or removed from the selection.

o The "score histogram" tool shows the BLAST score values ranked in decreasing order. A score threshold can be applied by clicking on the histogram (e.g., Fig. 1G).

o Two "Update tree" options allow redrawing the similarity tree by setting the appropriate number of top-scoring BLAST hits or using a user-defined sequence selection. The tree is generated by combining ClustalW [6] and TreeDyn [7] using either all sites of the BLAST-reconstructed multiple-alignment or gap-free sites only (N.B., the initial tree is computed using all sites).

o The "Add sequences to tree" option allows incorporating up to five external sequences (supplied by users) into the current hit sequence selection. The similarity tree is then recalculated to show the phylogenetic position of the external sequences relative to the BLAST hit sequences.
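The gap-free site count shown in the top panel is easy to reproduce for any multiple alignment: it is the number of columns in which no sequence has a gap. The sketch below assumes gaps are written as "-" and uses a made-up three-sequence alignment; it is not taken from the BLAST-Explorer code.

```python
def count_gap_free_sites(alignment):
    """Count alignment columns in which no sequence has a gap character."""
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*alignment) iterates over the alignment column by column.
    return sum(1 for column in zip(*alignment) if "-" not in column)


# Hypothetical alignment: columns 2 and 4 each contain a gap.
selection = ["ACG-T", "A-GTT", "ACGTT"]
gap_free = count_gap_free_sites(selection)
```

Recomputing this count each time a sequence is added or removed shows how much usable signal a fragmentary hit would cost.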

At the end of the selection process, selected sequences can be imported in fasta format ("get selected sequence" button) or passed to one of the phylogenetic reconstruction pipelines available on the platform [5] ("One click mode" or "Advanced mode" buttons).

Large-scale selection mode

In the large-scale selection mode, several tools allow the sampling of homologous sequences among the entire set of BLAST hits (including those that are not shown in the top-100 BLAST subset) using global criteria. They are grouped in a dedicated panel (Fig. 1H) and comprise:

o A pull-down menu that allows changing the e-value threshold on BLAST hits

o Buttons showing the distributions of the BLAST hits according to three BLAST alignment statistics (i.e., BLAST scores, percentage of similarity, and alignment coverage). Bulk selection among the BLAST hits can then be done by selecting intervals of the distribution histogram.

o The "selection on taxonomy" tool enabling the selection of BLAST hits according to their taxonomic rank (e.g., Fig. 1I). The taxonomic information is presented as a hierarchical graph allowing users to adjust the level of details that is relevant to their needs.

Following the application of the selection rules, the result page (i.e., the similarity tree and individual pairwise alignments) is updated to account for changes in the list of the top-100 best BLAST hits.
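The combined effect of these selection rules can be sketched as a filter followed by a re-ranking. The record fields, thresholds, and example hits below are all hypothetical and are not taken from the BLAST-Explorer implementation; the sketch only illustrates filtering the full hit list on global criteria and then keeping the best-scoring subset for display.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    score: float      # BLAST bit score
    evalue: float
    coverage: float   # fraction of the query covered by the alignment
    lineage: tuple    # taxonomic ranks from broad to specific


def select_hits(hits, max_evalue=1e-5, min_coverage=0.0, taxon=None, top_n=100):
    """Apply global criteria to all hits, then keep the top_n by score."""
    kept = [
        h for h in hits
        if h.evalue <= max_evalue
        and h.coverage >= min_coverage
        and (taxon is None or taxon in h.lineage)
    ]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:top_n]


# Made-up hits for illustration.
hits = [
    Hit("hit1", 410.0, 1e-120, 0.98, ("Bacteria", "Proteobacteria")),
    Hit("hit2", 95.0, 1e-20, 0.40, ("Bacteria", "Firmicutes")),
    Hit("hit3", 150.0, 1e-35, 0.85, ("Archaea", "Euryarchaeota")),
    Hit("hit4", 30.0, 0.5, 0.90, ("Bacteria", "Proteobacteria")),
]
bacterial = select_hits(hits, min_coverage=0.5, taxon="Bacteria")
```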

Comparison with existing software

Several existing BLAST post-processors combine BLAST searches with automated phylogenetic analysis of the BLAST hits. However most of them do not pursue the same goal and therefore differ in the nature of the results. Also, the functionalities proposed to interact with the results vary greatly. Some of the applications allow filtering of the BLAST hits before phylogenetic reconstruction, others do not.

Phylogena is a standalone application for phylogenetic annotation of unknown sequences [8] and implements an automated intelligent filtering of BLAST hits before phylogenetic reconstruction. In contrast with BLAST-Explorer, the hit filtering method is optimized for sequence annotation and does not enable interactive and progressive refinement of the sequence dataset. Furthermore, Phylogena does not allow retrieving the selected sequences for external analysis.

Phylogenie is also a standalone application for automated phylome generation and analysis [9]. Because the principal force of Phylogenie is to automatically produce a large number of phylogenetic analyses in batch, it does not allow interactive filtering of BLAST hits before phylogenetic reconstruction. Phylogenie is a command-line driven pipeline, requiring at least some familiarity with UNIX and command line tools.

Phyloblast [10] and the NCBI BLAST server [11] are the two web services that have the most in common with BLAST-Explorer. They produce an enriched BLAST output and allow selection of hits using various criteria. The Phyloblast server is apparently no longer maintained. Phyloblast only allowed comparing a protein sequence against a protein database using BLASTP, whereas BLAST-Explorer allows nucleotide/nucleotide, protein/protein and translated nucleotide/protein comparisons. Its tools for selecting hits before phylogenetic reconstruction are less versatile than those proposed by BLAST-Explorer (selection based on species names and sequence description). The NCBI BLAST service also provides several tools for selecting and retrieving matching sequences from the BLAST output; a distance tree of the BLAST hits can also be calculated. Here again the hit selection tools are more limited than in BLAST-Explorer (simple check boxes beside sequence descriptions). Furthermore, the image of the distance tree does not allow interactive selection of the BLAST hits. This makes selection on phylogenetic criteria less straightforward.

The principal strength of BLAST-Explorer is the flexibility of the sequence selection process and the richness of the information displayed on screen. However, BLAST-Explorer does not offer pre-defined automated methods of hit selection such as those in Phylogena. Rather, BLAST hit selection is multi-dimensional and mainly human-driven through an interactive graphical interface, in order to accommodate a wide range of sequence selection strategies. Another feature that differentiates BLAST-Explorer from other software is that it is entirely web-based. Thus no installation on a personal computer and no regular update of the sequence databases are required.

The BLAST-Explorer output includes a phylogenetic representation of the BLAST hits (i.e., the similarity tree) that aims at helping in the hit selection process. It is important to note that this tree is not optimized for phylogenetic accuracy. Rather, we opted for a fast tree reconstruction strategy that is however sufficiently robust for providing an approximate phylogenetic position of the BLAST hits. Thus we advise users to use external specialized software if they want to improve or confirm the accuracy of the phylogenetic tree.

Finally, it is important to note that in some phylogenetic analyses, what matters is a correct distinction between orthologous and paralogous sequences