Sequencing the genomes of polyploid organisms


I've done some transcriptomics work in the past with a polyploid organism, and this presented some unique challenges in the data processing and analysis. Since then, I have been brainstorming about the technical challenges one may face when sequencing and assembling the genomes of a polyploid organism. As far as I am aware, there are no polyploids whose genomes have been sequenced.

If one wanted to sequence, for example, a tetraploid organism, one approach would be to prepare and sequence all of the DNA together and then rely on post-sequencing analysis to tease apart the two co-resident genomes. However, it would be difficult, if not impossible, with this approach to distinguish inter-genome variation from intra-genome variation.

An alternative approach would be to isolate DNA from both co-resident genomes separately, and then sequence and assemble the genomes separately, so that inter-genome variation and homology need not be considered. However, I'm thinking at a very high level and have little intuition as to the technical feasibility of this approach. When there are two or more co-resident genomes, is it possible to isolate DNA from only one of those genomes? What would this rely on (for example, would thorough cytogenetic/cytogenomic characterization help)? If this task is not possible, what types of limitations must be overcome to enable it?

Take a look at the strategies used to sequence the wheat genome. Wheat is hexaploid. The project is described at

For early work on the maize genome, we employed methyl filtration in order to reduce genome complexity and size - transposons are filtered out and genes + promoters and such remain. The gene sequences are different from the two genomes, so says the theory, and those can be distinguished. See for the reference.

Next-generation biology: Sequencing and data analysis approaches for non-model organisms

As sequencing technologies become more affordable, it is now realistic to propose studying the evolutionary history of virtually any organism on a genomic scale. However, when dealing with non-model organisms it is not always easy to choose the best approach given a specific biological question, a limited budget, and challenging sample material. Furthermore, although recent advances in technology offer unprecedented opportunities for research in non-model organisms, they also demand unprecedented awareness from the researcher regarding the assumptions and limitations of each method.

In this review we present an overview of the current sequencing technologies and the methods used in typical high-throughput data analysis pipelines. Subsequently, we contextualize high-throughput DNA sequencing technologies within their applications in non-model organism biology. We include tips regarding managing unconventional sample material, comparative and population genetic approaches that do not require fully assembled genomes, and advice on how to deal with low depth sequencing data.

The world's largest sequenced genome is just the start

Today the genome of the loblolly pine was published in Genome Biology – the largest yet sequenced. This paper is mostly important because the authors made real improvements to the process that scientists use to sequence large and complex genomes like that of the loblolly pine. Because let’s face it, they’re not likely to hold the record for long. Genome sequencing technologies are moving fast and there are hundreds of sequencing initiatives going on.

So, a little defining of terms. Sequencing is when you work out the exact code of DNA bases A, C, G and T that make up a genome. But you can estimate the number of bases in a genome without knowing what they are, so we have lots of information about the size of various genomes without knowing exactly what’s in them. It is like knowing how many pages are in a book without having read exactly what letters are on each page. Genome sizes are measured in base pairs, as DNA is a double helix, so the bases are always in pairs.

Our infographic comparing the largest to smallest known genomes. Where do we stand?

Just finding the figures to put together this infographic of some of the most interesting genomes out there was a challenge – it’s changing all the time! We looked for the ‘smallest’ genome listed on a table on Wikipedia, only to find it had been supplanted by the discovery of an even smaller genome. And the largest genome of any organism is contested – the 640,000,000,000 base pairs of the tiny amoeba Polychaos dubium is disputed because its size was estimated before modern techniques were developed, so it might be wrong! So the most likely candidate for biggest genome is actually Paris japonica.

Plants often have huge, complex genomes. This is sometimes because their genomes spontaneously double, so that instead of being in pairs (a diploid organism), their chromosomes are in groups of 4 or more – these are called polyploid, and they tend to have enormous genomes. The amazing thing about the loblolly pine, which is currently the largest genome sequenced, at 22.18 billion base pairs, is that it’s actually a diploid, so its size and complexity have nothing to do with chromosome doubling. Sequencing the genome revealed that actually, a lot of its bulk is down to repetitive bits of sequence.

While the human genome has had a huge effect on medical treatment and research, and was a huge step forward in sequencing technology, my personal favourite of all of these is the man-made bacterial genome created at the J. Craig Venter Institute in 2010. It was based on the Mycoplasma mycoides genome and is affectionately known as Mycoplasma mycoides JCVI-syn1.0. It is estimated that this synthetic genome cost US$40 million to make and took 20 people more than a decade of work. It was an amazing proof of principle that you can synthesise the genome of an organism and make it work within a living body. For bacteria at least.

The ethical and societal implications of this are huge, and so this was a really controversial development – something the Institute is well known for! Their current work includes amazing things like the Human Microbiome Project, and synthetic bacteria to tackle carbon levels. So they are just one place that is moving beyond what’s in the genome, to what we can do with it.

Genome sequencing is amazing, because it opens up so many new questions and possibilities for science. So while the records in this infographic, though accurate at the time of writing, are sure to be supplanted, we can be sure of one thing – the genome is just the start of the story.

Introduction to Polyploidy

The fusion of two or more genomes within one nucleus results in polyploidy, in which each cell contains more than two sets of homologous chromosomes. Polyploidy occurs in the majority of angiosperms and is important in agricultural crops that humans depend on for survival. Examples of important polyploid plants used for human food include Triticum aestivum (wheat), Arachis hypogaea (peanut), Avena sativa (oat), Musa sp. (banana), many agricultural Brassica species, Solanum tuberosum (potato), Fragaria ananassa (strawberry), and Coffea arabica (coffee). Autopolyploidy results from whole genome duplication, while an allopolyploid is characterized by interspecific or intergeneric hybridizations followed by chromosome doubling (Doyle et al., 2008; Chen, 2010). Genome duplication (autopolyploidy) can be a source of genes with novel functions leading to new phenotypes and novel mechanisms for adaptation (Crow and Wagner, 2005). Autopolyploids typically suffer from reduced fertility whereas allopolyploids have potential for heterosis or hybrid vigor (Ramsey and Schemske, 1998). Polyploidy generates great genetic, genomic, and phenotypic novelty (Soltis et al., 2016); however, the more complex relationship between genotype and phenotype in polyploids compared to diploid plants makes linking genotype to phenotype a challenging task. For example, allopolyploid plant cells have complex regulatory mechanisms in order to unify gene expression between the homeologs and define their relative contributions to the final phenotype. Hence, polyploidization is one of the major forces of plant evolution and is intimately linked to speciation and diversity (Bento et al., 2011). It is estimated that around 80% of all living plants are polyploids (Meyers and Levin, 2006), while many plant lineages including monocots (e.g., Oryza) and eudicots (e.g., Arabidopsis) have at least one paleo-polyploidy event in their history.


The genome sequence of a human individual can be modeled as 23 pairs of sequences of four nucleotide bases, A, C, G and T, representing the 22 pairs of autosomes and the sex chromosomes. However, ∼99.5% of any two individuals’ genome sequences is shared within a population. The ∼0.5% of the nucleotide bases varying within a population range from single-nucleotide polymorphisms (SNPs) to more complex structural changes, for example, deletions or insertions of genomic material. A sequence of genomic variants, typically SNPs, with the non-varying DNA removed is referred to as a haplotype.

Standard genome sequencing workflows produce contiguous DNA segments of an unknown chromosomal origin. De novo assemblies for genomes with two sets of chromosomes (diploid) or more (polyploid) produce consensus sequences in which the relative haplotype phase between variants is undetermined. The set of sequencing reads can be mapped to the phase-ambiguous reference genome and the diploid chromosome origin can be determined but, without knowledge of the haplotype sequences, reads cannot be mapped to the particular haploid chromosome sequence. As a result, reference-based genome assembly algorithms also produce unphased assemblies. However, sequence reads are derived from a single haploid fragment and thus provide valuable phase information when they contain two or more variants. The haplotype assembly problem aims to compute the haplotype sequences for each chromosome given a set of sequence reads aligned to the genome and variant information. The haplotype phase of variants is inferred from assembling overlapping sequence reads [Browning and Browning (2011); Halldórsson et al. (2003); Schwartz (2010)].

The input to the haplotype assembly problem is a matrix M whose rows correspond to aligned read fragments and whose columns correspond to SNPs (Fig. 1). The quality of M’s construction depends on the parameters of the sequencing workflow and the accuracy of the read alignment algorithms. Misaligned read fragments can introduce erroneous base calls or sampling biases, so the careful alignment of sequence reads is necessary for high-quality haplotype assemblies. Without read alignment or sequencing errors, the haplotype assembly problem can be solved in time linear in the size of M by partitioning the fragments into two sets such that no two fragments in the same set share an SNP and differ in the allele called. To address erroneous base calls or misplaced alignments, three primary haplotype assembly optimizations have been developed: minimum error correction (MEC), minimum SNP removal (MSR) and minimum fragment removal (MFR). The goal is to convert M into a state such that the fragments (rows of M) can be distributed into two sets corresponding to the two haplotypes. All fragments in a set must agree on the allele at each SNP site, and this is accomplished using the minimum number of SNP allele flips (0 to 1 or vice versa; MEC), SNP (columns of M) removals (MSR) or fragment (rows of M) removals (MFR).
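As a rough illustration of the error-free partition and of the MEC objective described above, here is a minimal Python sketch. The representation of M as a list of dictionaries, the function names, and the assumption that every SNP is heterozygous in a diploid (so the second haplotype is the complement of the first) are choices made for this example only. The pairwise conflict check is quadratic in the number of fragments, so it does not achieve the linear running time mentioned in the text.

```python
from collections import deque

# Each fragment (row of M) is a dict mapping SNP index -> allele (0 or 1).
# SNPs that a fragment does not cover are simply absent from its dict.

def conflict(f, g):
    """True if fragments f and g share an SNP and call different alleles there."""
    return any(s in g and g[s] != a for s, a in f.items())

def bipartition(fragments):
    """Try to split fragments into two conflict-free sets (error-free case).

    Returns a list of 0/1 labels, or None when no such split exists,
    which indicates sequencing or alignment errors in the input.
    """
    n = len(fragments)
    label = [None] * n
    for start in range(n):
        if label[start] is not None:
            continue
        label[start] = 0  # arbitrary choice for each connected component
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if i != j and conflict(fragments[i], fragments[j]):
                    if label[j] is None:
                        label[j] = 1 - label[i]
                        queue.append(j)
                    elif label[j] == label[i]:
                        return None  # odd cycle of conflicts
    return label

def mec_cost(fragments, h0):
    """Number of allele flips needed so each fragment matches h0 or its complement."""
    h1 = {s: 1 - a for s, a in h0.items()}  # assumes heterozygous diploid SNPs

    def mismatches(f, h):
        return sum(1 for s, a in f.items() if s in h and h[s] != a)

    return sum(min(mismatches(f, h0), mismatches(f, h1)) for f in fragments)
```

When `bipartition` returns `None`, the fragments cannot be explained by two haplotypes without corrections, and that is the situation the MEC, MSR and MFR optimizations are meant to handle.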

Construction of the input to the haplotype assembly problem


Lancia et al. (2001) and Rizzi et al. (2002) provide a theoretical foundation for the MFR and MSR optimizations and describe the fundamental SNP and fragment conflict graph structures. The first widely available haplotype assembly software package was presented in Panconesi and Sozio (2004), in which the authors describe the Fast Hare algorithm, which optimizes the ‘Min Element Removal’ problem. Bansal et al. (2008) describe a Markov chain model with Metropolis updating rules to sample a set of likely haplotypes under the MEC optimization. In a follow-up, the authors present a much faster algorithm on a related graph model that relates maximum cuts to SNP allele flips (in the MEC model) [Bansal and Bafna (2008)]. Still other authors have suggested reductions to the well-known maximum satisfiability problem [He et al. (2010); Mousavi et al. (2011)]. The Levy et al. (2007) algorithm is a well-known heuristic that was used to haplotype assemble the HuRef genome; it assigns fragments to haplotypes in a greedy fashion and iteratively refines the solution by comparing the set of fragments to the assembled haplotypes using majority rule phasings. In a recent survey, Geraci (2010) describes the Levy et al. (2007) algorithm as, arguably, the best performing algorithm tested.

The first extension of the haplotype assembly problem that addressed the simultaneous assembly of multiple diploid chromosomes was presented in Li et al. (2006); however, the benefits of multi-haplotype assembly are not clear for a set of unrelated individuals. Halldorsson et al. (2011) continued development of this theory by describing methods for assembling individuals who share a haplotype identical by descent (IBD) using relationships among the reads.

Aguiar and Istrail (2012) introduced a new graph data structure, algorithmic framework and the minimum weighted edge removal (MWER) optimization, which together have several advantages over existing methods. Recall that the rows of M correspond to sequence read fragments with the non-polymorphic bases removed such that only SNPs remain. The HapCompass model defined in Aguiar and Istrail (2012) is composed of the compass graph GC core data structure, which summarizes the rows of M using edge weights, and the MWER optimization, which aims to remove a minimum weighted set of edges from GC such that a unique phasing may be constructed. The algorithm operates on the spanning-tree cycle basis of GC to iteratively remove errors that are manifested through a particular type of simple cycle [Deo et al. (1982); Mac Lane (1937)].
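To make the idea of summarizing the rows of M with edge weights more concrete, the following Python sketch builds a small weighted graph over SNPs. The weighting scheme used here (add 1 when a fragment carries the same allele at both SNPs, subtract 1 when the alleles differ) is an assumption chosen for illustration and should not be read as the exact definition used by HapCompass; the function name and the data layout are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def compass_edge_weights(fragments):
    """Summarize fragments as weighted edges between pairs of SNPs.

    fragments: list of dicts mapping SNP index -> allele (0 or 1).
    Returns {(s, t): weight} with s < t. A positive weight means most
    fragments support the 00/11 phasing of s and t, a negative weight
    means most support 01/10 (illustrative scheme, see lead-in).
    """
    weights = defaultdict(int)
    for frag in fragments:
        for s, t in combinations(sorted(frag), 2):
            weights[(s, t)] += 1 if frag[s] == frag[t] else -1
    return dict(weights)
```

Under this scheme, an edge whose weight is close to zero is one where the fragments disagree about the phasing, which is the kind of evidence an edge-removal optimization would need to resolve.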

In this work, we prove a number of theoretical results for the previously described MWER optimization on compass graphs. The main result proves MWER is NP-hard and motivates the use of our heuristic algorithms. Further, we demonstrate how extensions to the generalized diploid HapCompass model can enable (i) the use of different optimizations, for example, MEC and MWER, in the local optimization step, (ii) simultaneous assembly of two individuals sharing a haplotype tract IBD and (iii) haplotype assembly of a single polyploid organism. Finally, we evaluate our methods on 1000 Genomes Project, Pacific Biosciences and simulated data.


Creating a contiguous chromosome-level genome assembly is clearly the ideal for a genome assembly. One of the major challenges to producing a contiguous chromosome-level genome assembly is the repetitive regions of genomes (Tørresen et al., 2019). Repetitive regions include expanded gene families, complex repeats, highly repetitive regions such as centromeres and telomeres, and sex chromosomes, or at least portions of them. Most large genomes (from any clade) are highly repetitive, and complex repeats are still a problem for the human genome despite massive resources devoted to this assembly (Chaisson et al., 2015). Heterozygosity between haplotypes in diploid and polyploid organisms is another major source of error in genome assemblies.

3.1 Repetitive regions

As technologies improve and read lengths increase, the ability to span across repetitive regions improves. To overcome the challenge of sequencing repetitive regions, reads must be long enough to anchor in nonrepetitive sequence and span across the repeat(s). If the read length is (substantially) longer than the repeat region, the repeats can be spanned and assembly of the region should be possible (for example, see Bongartz, 2019). Missing repetitive regions means possibly missing genes in the genome as well (Peona et al., 2018). Centromeres and telomeres pose unique challenges yet are important to genome biology (e.g., Bichet et al., 2020). In many organisms centromeres and telomeres are long, and telomeres cannot be anchored by nonrepetitive sequence on both sides given their location on chromosomes. Another class of chromosomes that presents a challenge is sex chromosomes. In many species, sex chromosomes have degenerated, with only highly repetitive sequence and a short pseudo-autosomal region remaining (Kejnovsky et al., 2009; Smith et al., 1987). Evolutionarily young sex chromosomes show the same trend (Bachtrog et al., 2019), suggesting sex chromosome assembly may be a challenge for many organisms. Successful assembly of the Y chromosome of the threespine stickleback, which is less than 26 million years old and at an intermediate stage of degeneracy, involved long-read sequencing and careful curation and partitioning of X- and Y-linked contigs, followed by Hi-C scaffolding (Peichel et al., 2020). Adequate assembly of degenerate sex chromosomes will be best approached with, and will ultimately require, long-read technology that spans the length of an entire chromosome.
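The anchoring condition described above can be written as a simple coordinate check. This is only a sketch: the function name, the half-open coordinate convention and the `min_anchor` threshold are assumptions made for the example, not values taken from any particular assembler.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=100):
    """Return True if a read crosses a repeat with unique sequence on both sides.

    Coordinates are half-open positions on the same sequence. The read must
    extend at least min_anchor bases (hypothetical threshold) into the
    nonrepetitive sequence flanking each end of the repeat.
    """
    left_anchored = read_start <= repeat_start - min_anchor
    right_anchored = read_end >= repeat_end + min_anchor
    return left_anchored and right_anchored
```

A telomeric repeat at the very end of a chromosome has no flanking sequence on one side, so no read can satisfy both conditions, which is the difficulty noted in the text.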

3.2 Ploidy

With regard to ploidy, haploid genomes are the most straightforward to assemble. Assuming reads span repeats, there is a single contiguous sequence without heterozygosity for the individual chosen for genome assembly. A large issue with diploids and polyploids is that there is heterozygosity between the two or more copies of the genome in a single individual. In some genome assemblies, there have been attempts to reduce heterozygosity before sequencing, for example by inbreeding (Zhang, Li, et al., 2020) or creating doubled haploid individuals (Berthelot et al., 2014; Linsmith et al., 2019). However, inbreeding or creating doubled haploids is essentially impossible for the vast majority of species occurring in natural environments. Because genome assembly information is typically stored as a single sequence, haplotypes are collapsed into one sequence. Higher levels of heterozygosity between the two homologous chromosomes (in the case of diploids) increase this challenge, ultimately leading to an inability to collapse the two haplotypes, an overestimation of genome size, and an overestimation of complexity. In polyploid taxa, the scale and complexity of heterozygosity-related assembly issues are further amplified (Kyriakidou et al., 2018). Another approach to resolve haplotypes is through trio binning. Trio binning is accomplished by sequencing the parents of an organism with short reads and then assigning long reads for the individual in question to a specific parent (Koren et al., 2018; Yen et al., 2020). Trio binning is a promising route for resolving haplotypes, especially in interspecific F1 hybrids; however, it is limited by the need for access to both parents and offspring, which is not possible for many species. An alternative based on similar principles is gamete binning, which uses single-cell sequencing of gametes to inform partitioning of reads into distinct haplotype sets for subsequent assembly (Campoy et al., 2020).
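The idea behind trio binning can be sketched with k-mers: collect k-mers that occur in the reads of only one parent, then assign each long read from the offspring to the parent whose private k-mers it shares more of. The Python below is a toy version under strong assumptions (exact matching, no reverse complements, no sequencing errors, a tiny k); the function names are hypothetical and this is not the procedure of any specific published tool.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat

def bin_read(read, mat_only, pat_only, k):
    """Assign an offspring read to the parent with more matching private k-mers."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie
```

Reads that fall entirely in regions where the two parents are identical carry no private k-mers and stay unassigned, which is one reason the approach works best when the parental genomes are divergent, as in interspecific F1 hybrids.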

3.3 Pan and core genomes

Genome assemblies are often limited to one individual from a species. Moreover, genome assemblies from “closely” related species (where lineages may have divergence times of several million years) are often used for mapping and as proxies for a de novo reference genome assembly from the target species. However, there is often a large amount of variation among individuals in a species (Audano et al., 2019). Generating multiple de novo assemblies for a species, for example one per population, lineage, or deme, would better capture genetic variation in a species, but generating a new reference genome for multiple individuals is often cost prohibitive and perhaps computationally prohibitive as well. The major benefit of multiple assemblies is to distinguish between the pan- and core-genomes within a species (Figure 2a). The pan-genome represents all of the DNA sequences that occur in a species, whereas the core-genome is the DNA that is shared among all sequenced individuals. For example, the comparison of eight chromosome-level assemblies of Arabidopsis thaliana accessions revealed a core-genome, shared by all accessions, of 24,000 genes, whereas the pan-genome was 135 Mb in length and included 30,000 genes (Jiao & Schneeberger, 2020), highlighting the vast amount of sequence data, including genes, that are missed by a single reference genome assembly. In the soybean pan-genome from 26 accessions, at least 48,249 genes were missing from at least one accession, which equates to approximately 20% of the genes in a single assembly being classified as dispensable or private (Liu et al., 2020). Pan-genomes are currently only available for model plant species, humans, and some bacterial species (Bayer et al., 2020; Sherman & Salzberg, 2020). The considerable loss of genome diversity and specific genomic regions in crop species following domestication and artificial selection from wild progenitors has been one of the major drivers for pan-genome construction, but the approaches are likely to find increasing application in molecular ecology settings, especially where the noncore component of the pan-genome is considerable. A recent study of the Mediterranean mussel found that over 30% of genes were subject to presence/absence variation when individuals from two populations were assessed (Gerdol et al., 2020). Generating pan-genomes will most likely be limited by cost and access to diverse specimens from a species (i.e., sampling throughout the range). There are also diminishing returns as the number of genomes sequenced increases, where at some point, depending on diversity, new genomes only add minimal new information (see Figure 2b). Another area of development for pan-genomes is the storage of genome assembly information in a nonlinear genome graph (Eizenga et al., 2020).
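At the level of gene content, the pan- and core-genome definitions above reduce to a set union and a set intersection, and the diminishing returns from adding genomes can be seen by tracking how the union grows. The sketch below assumes each accession is represented simply as a set of gene identifiers, which ignores how orthologous genes are matched between assemblies; the function names and the example identifiers are hypothetical.

```python
def pan_and_core(gene_sets):
    """Return (pan, core) gene sets from {accession: iterable of gene ids}."""
    sets = [set(genes) for genes in gene_sets.values()]
    pan = set().union(*sets)                          # present in at least one accession
    core = set.intersection(*sets) if sets else set()  # present in every accession
    return pan, core

def pan_growth(gene_sets, order):
    """Pan-genome size after adding each accession in the given order."""
    seen = set()
    sizes = []
    for name in order:
        seen |= set(gene_sets[name])
        sizes.append(len(seen))
    return sizes
```

For instance, with three accessions carrying `{g1, g2, g3}`, `{g1, g2, g4}` and `{g1, g2, g3}`, the pan-genome has four genes, the core has two, and the third accession adds nothing new. The dispensable component is `pan - core`.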

3.4 Limited input material (quality and quantity) may exclude some species from long-read sequencing

High molecular weight (HMW) DNA is a requirement for long-read and linked-read sequencing technologies and for many of the scaffolding approaches, with Hi-C optimally requiring intact cells for chromosome recovery. The length of reads is often ultimately determined by the length of the DNA from an extraction (Li & Harkess, 2018). Therefore, HMW DNA is a major limitation for generating de novo genome assemblies for species with limited input material, especially with respect to quality. Several methods have been developed for genome sequencing with small amounts of input (e.g., PacBio Low DNA input), and highly contiguous genomes have resulted from sequencing a single small individual (Kingan et al., 2019). However, methods optimized for small amounts of input DNA still rely on HMW DNA, which may not be obtainable for a small subset of organisms. For example, endangered species may be limited to noninvasive or minimally-invasive sampling (e.g., faecal, skin or hair samples where host DNA quality and amount can be low) or existing museum specimens that probably have not been preserved with DNA sequencing in mind (Carroll et al., 2018). While it may not be feasible to use long-read sequencing in such cases, even an assembly based on short-read data can provide coverage of a reasonable proportion of coding sequences (Colella et al., 2020).

NCBI at CSHL Biology of Genomes, May 11 – 14, 2021

NCBI staff will be presenting virtual posters at the Cold Spring Harbor Laboratory Biology of Genomes Meeting, May 11 – 14, 2021. The posters will cover the following topics: 1) a cloud-ready suite of tools (PGAP, RAPT, and SKESA) for assembling and annotating prokaryotic genomes, 2) Datasets, a new set of services for downloading genome assemblies and annotations, and 3) updates on NCBI RefSeq eukaryotic genome annotation and the Genome Data Viewer (GDV). Read more below for the full abstracts.

The virtual poster gallery opens Tuesday, May 11 at 9:00 a.m. with dedicated time for poster viewing and discussion at 1:00 to 2:00 p.m. through Slack each day. The poster gallery will be open for the entire conference and remain available for six weeks afterwards.

The NCBI tool suite for prokaryotic genomes: how RAPT, SKESA and PGAP can accelerate your research

Thibaud-Nissen F, Agarwala R, Arndt D, Hlavina W, Li W, Lu S, Meric P, Souvorov A, Sweeney D, Wagner L, Yang M

NCBI has developed a suite of publicly available tools for assembling, annotating and verifying the species assignment of bacterial and archaeal genomes. RAPT brings together SKESA, an efficient de Bruijn graph assembler for Illumina short reads, and PGAP, the pipeline used for the annotation of RefSeq prokaryotic genomes. Recent workflow changes have reduced the PGAP and RAPT runtime by half, so that a user can now assemble a genome from sequencing reads and annotate the structure and function of genes on the resulting assembly in minutes to a couple of hours, using a single command.

Docker images for PGAP and RAPT are available on dockerhub, and can run on a local computer, a private cluster or in a cloud environment, using intuitive command-line interfaces. The images contain the PGAP CWL workflow, all necessary binaries (including SKESA in the case of RAPT) and cwltool, the CWL reference implementation. All necessary reference data, including a variety of manually curated evidence are bundled and distributed with PGAP and RAPT.

A special implementation of RAPT for users of the Google Cloud Platform that makes use of the Google Life Sciences API is also available. With a single command from the Google Cloud Shell, GCP RAPT secures a virtual machine, downloads the Docker image and data needed, assembles, verifies the taxonomic assignment, annotates the genome, places the output in the desired bucket and shuts down the virtual machine.

Finally, we will present a pilot web service for RAPT, aimed at helping biologists without the technical skills or the access to compute resources answer their scientific questions, and at understanding their needs for prokaryotic genomics tools and data.

NCBI Datasets: Get the genome-related data you want, the way you want

VA Schneider, E Cox, PA Meric, JB Holmes, and NA O’Leary

For researchers performing genomic analyses, NCBI is recognized as one of the pre-eminent public archival collections from which sequences, assemblies, annotations, and metadata for organisms across the tree of life can be freely retrieved. As the volume and complexity of data grows, it becomes increasingly important to provide access mechanisms that allow researchers to find the data they need efficiently and effectively. Furthermore, researchers need infrastructure and data that adhere to FAIR principles (Findable, Accessible, Interoperable, Reusable) to ensure the usability of the data and the quality of their analyses. NCBI Datasets is a new resource focused on these needs, developed specifically to make it easy for researchers to get the data they want, so they can use it. We will show how Datasets offers web-based, command line and API access to genome and gene-related sequence content and metadata from all branches of the taxonomic tree. We will review the structure of genome datasets that include genome, transcript, and protein sequence, annotation, and a JSON-lines formatted data report of genome metadata. We will also introduce the dataformat tool that is provided to transform JSON-lines into a tabular report. We will present on other NCBI Datasets packages that are also available, including for genes and ortholog data, and for those studying SARS-CoV-2, a package that includes genomic, protein and CDS sequences, annotation and a comprehensive data report for all complete SARS-CoV-2 genomes. Finally, we’ll present the Datasets python and R libraries that allow researchers to access the APIs, facilitating their use in analysis workflows, and the companion Jupyter and R notebooks that are provided to help researchers get started with these tools. As a resource under active development, we’ll share the latest improvements and features.

Annotating genomes at NCBI RefSeq in the era of 3rd generation sequencing

Terence D Murphy, Françoise Thibaud-Nissen

Advances in sequencing technology over the last decade have led to a cornucopia of genome assemblies for multicellular eukaryotes. Many species have new, high-quality assemblies based on PacBio, Oxford Nanopore (ONT), or other technologies along with abundant RNA-seq datasets, generated by many researchers from around the world. To help maximize the utility of these genomes for the research community, NCBI’s Reference Sequence (RefSeq) project provides genome annotations for over 700 species spanning over 350 vertebrates, 200 invertebrates, and 100 plants. NCBI’s automated annotation pipeline provides rapid, high-quality gene annotations across many taxa, with consistent processing that benefits comparative genomic studies. Annotation sets typically exceed 97% completeness as measured by BUSCOv4, surpassing most other datasets. Annotations are available in NCBI’s Gene resource, BLAST databases, and Genome Data Viewer (GDV). Gene and GDV also provide access to other genomic information including orthologs, RNA-seq expression data, and whole genome alignments to previous assembly versions or assemblies from different strains. This presentation will explore the lessons we’ve learned from annotating a diverse collection of genomes, including the impacts of RNA-seq and assembly quality, demonstrate the high quality of the annotated gene sets, and give an overview of NCBI’s resources. The eukaryotic genome annotation and Genome Data Viewer pages provide more information.


Otto S P. The evolutionary consequences of polyploidy. Cell, 2007, 131: 452–462

Ohno S. Evolution by Gene Duplication. New York: Springer-Verlag, 1970

Holland P W H, Garcia-Fernàndez J, Williams N A, et al. Gene duplications and the origins of vertebrate development. Development (Supplement), 1994, 125–133

Meyer A, Van de Peer Y. From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). Bioessays, 2005, 27: 937–945

Dehal P, Boore J L. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol, 2005, 3: 1700–1708

Blomme T, Vandepoele K, De Bodt S, et al. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol, 2006, 7: R43

Amores A, Force A, Yan Y L, et al. Zebrafish hox clusters and vertebrate genome evolution. Science, 1998, 282: 1711–1714

Taylor J S, Van de Peer Y, Braasch I, et al. Comparative genomics provides evidence for an ancient genome duplication event in fish. Philos Trans R Soc, 2001a, 356: 1661–1679

Volff J N. Genome evolution and biodiversity in teleost fish. Heredity, 2005, 94: 280–294

Soltis D E, Soltis P S, Tate J A. Advances in the study of polyploidy since plant speciation. New Phytol, 2003, 161: 173–191

Comai L. The advantages and disadvantages of being polyploid. Nat Rev Genet, 2005, 6: 836–845

Kassahn K S, Dang V T, Wilkins S J, et al. Evolution of gene function and regulatory control after whole-genome duplication: comparative analyses in vertebrates. Genome Res, 2009, 19: 1404–1418

Soltis D E, Soltis P S. Molecular data and the dynamic nature of polyploidy. Crit Rev Plant Sci, 1993, 12: 243–273

Grant V. Plant Speciation, 2nd ed. New York: Columbia University Press, 1981

Ahuja M R, Neale D B. Evolution of genome size in conifers. Silvae Genet, 2005, 54: 126–137

Hair J B. The chromosomes of the Cupressaceae. I. Tetraclineae and Actinostrobeae (Callitroideae). New Zeal J Bot, 1968, 6: 277–284

Gates R R. The stature and chromosomes of Oenothera gigas De Vries. Arch F Zellforsch, 1909, 3: 525–552

Goldblatt P. Polyploidy in Angiosperms: Monocotyledons. In: Lewis W H, ed. Polyploidy: Biological Relevance. New York: Plenum Press, 1980. 219–239

Lewis W H. Polyploidy in Angiosperms: Dicotyledons. In: Lewis W H, ed. Polyploidy: Biological Relevance. New York: Plenum Press, 1980. 241–268

Masterson J. Stomatal size in fossil plants: evidence for polyploidy in majority of angiosperms. Science, 1994, 264: 421–423

Bowers J E, Chapman B A, Rong J K, et al. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature, 2003, 422: 433–438

Soltis D E, Albert V A, Leebens-Mack J, et al. Polyploidy and angiosperm diversification. Am J Bot, 2009, 96: 336–348

Veron A S, Kaufmann K, Bornberg-Bauer E. Evidence of interaction network evolution by whole-genome duplications: a case study in MADS-box proteins. Mol Biol Evol, 2007, 24: 670–678

Albert V A, Soltis D E, Carlson J E, et al. Floral gene resources from basal angiosperms for comparative genomics research. BMC Plant Biol, 2005, 5: 5–16

Cui L, Wall P K, Leebens-Mack J, et al. Widespread genome duplications throughout the history of flowering plants. Genome Res, 2006, 16: 738–749

Zhang F W, Wang Y R. Progress of polyploidy breeding technology applied in medical plants scale (in Chinese). Guiding J TCM, 2006, 12: 83–85

Paterson A H, Bowers J, Burow M, et al. Comparative genomics of plant chromosomes. Plant Cell, 2000, 12: 1523–1539

Severin A J, Cannon S B, Graham M M, et al. Changes in twelve homoeologous genomic regions in soybean following three rounds of polyploidy. Plant Cell, 2011, 23: 3129–3136

Jaillon O, Aury J M, Noel B, et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature, 2007, 449: 463–467

Ming R, Hou S, Feng Y, et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature, 2008, 452: 991–996

Otto S P, Whitton J. Polyploid incidence and evolution. Annu Rev Genet, 2000, 34: 401–437

Liu Z D. Yichuanxue (in Chinese). Beijing: Higher Education Press, 1991

Yang Y G, Zhuang Y, Chen L Z, et al. Vegetable polyploid and polyploidy breeding (in Chinese). Acta Agricult Univ Jiangxi, 2006, 28: 534–538

Hilu K W. Polyploidy and the evolution of domesticated plants. Am J Bot, 1993, 80: 1494–1499

Fedorov A. Chromosome Numbers of Flowering Plants. Leningrad: Acad Sci USSR Komarov Botanical Institute, 1969

Dai S L, Wang W K, Huang J P. Advances of researches on phylogeny of Dendranthema and origin of chrysanthemum (in Chinese). J Beijing Forest Univ, 2002, 24: 230–234

Jin X X, Zhang Q X. Advances in the studies of breeding Primula (in Chinese). Chin Bull Bot, 2005, 22: 738–745

Gregory T R, Mable B K. Polyploidy in Animals. In: Gregory T R, ed. The Evolution of the Genome. San Diego: Elsevier, 2005. 427–517

Wu M. Genetics and evolution of animal polyploid (in Chinese). Chin J Zool, 1988, 23: 48–51

Chen D W, Daoye M Y. Chromosomes and systematic classification of molluscs. Chin J Zool, 1988, 23: 48–51

Ye M W. The polyploidy phenomenon and formation in animals and plants (in Chinese). Bull Biol, 1998, 33: 21–23

Li S W. Polyploid insects (in Chinese). Entomol Knowledge, 2002, 39: 147–151

Naruse K, Tanaka M, Mita K, et al. A medaka gene map: the trace of ancestral vertebrate proto-chromosomes revealed by comparative gene mapping. Genome Res, 2004, 14: 820–828

Woods I G, Wilson C, Friedlander B, et al. The zebrafish gene map defines ancestral vertebrate chromosomes. Genome Res, 2005, 15: 1307–1314

Gui J F, Zhou L. Genetic basis and breeding application of clonal diversity and dual reproduction modes in polyploid Carassius auratus gibelio. Sci China Life Sci, 2010, 53: 409–415

Zhou L, Gui J F. Karyotypic diversity in polyploid gibel carp, Carassius auratus gibelio bloch. Genetica, 2002, 115: 223–232

Xiao J, Zou T M, Chen Y B, et al. Coexistence of diploid, triploid and tetraploid crucian carp (Carassius auratus) in natural waters. BMC Genet, 2011, 12: 20

Luo J R. Polyploid fishes and fish polyploidy breeding (in Chinese). Pearl River Fisheries, 1991, 17: 69–74

Lampert K P, Schartl M. The origin and evolution of a unisexual hybrid: Poecilia formosa. Philos Trans R Soc Lond B Biol Sci, 2008, 363: 2901–2909

Zan R G. The polyploids in fish and their roles in fish evolution (in Chinese). J Yunnan Univ, 1985, 7: 235–243

Abbas K, Li M Y, Wang W M, et al. First record of the natural occurrence of hexaploids loach Misgurnus anguillicaudatus in Hubei Province, China. J Fish Biol, 2009, 75: 435–441

Ráb P, Rábová M, Bohlen J, et al. Genetic differentiation of the two hybrid diploid-polyploid complexes of loaches, genus Cobitis (Cobitidae) involving C. taenia, C. elongatoides and C. spp. in the Czech Republic: karyotypes and cytogenetic diversity. Folia Zool, 2000, 49: S55–S66

Boroň A, Kotusz J. The preliminary data on diploid-polyploid complexes of the genus Cobitis in the Odra River basin, Poland (Pisces, Cobitidae). Folia Zool, 2000, 49: S79–S84

Li S S. Amphibians’s chromosomes and their evolution (in Chinese). Chin J Zool, 1991, 26: 47–52

Li S S. Parthenogenesis in reptiles (in Chinese). Chin J Zool, 1992, 27: 41–44

Li S S. Vertebrate’s polyploid (in Chinese). Chin J Zool, 1980, 2: 52–54

Ramsey J, Schemske D W. Pathways, mechanisms, and rates of polyploid formation in flowering plants. Annu Rev Ecol Syst, 1998, 29: 467–501

Newton W C F, Pellew C. Primula kewensis and its derivatives. J Genet, 1929, 20: 405–467

Liu S J, Qin Q B, Xiao J, et al. The formation of the polyploid hybrids from different subfamily fish crossing and its evolutionary significance. Genetics, 2007, 176: 1023–1034

Liu S J. Distant hybridization leads to different ploidy fishes. Sci China Life Sci, 2010, 53: 416–425

Karpechenko G D. The production of polyploid gametes in hybrids. Hereditas, 1927, 9: 349–368

Liu S J, Liu Y, Zhou G J, et al. The formation of tetraploid stocks of red crucian carp × common carp hybrids as an effect of interspecific hybridization. Aquaculture, 2001, 192: 171–186

Zhang C, Sun Y D, Liu S J, et al. Evidence of the unreduced diploid eggs generated from the diploid gynogenetic progeny of allotetraploid hybrids (in Chinese). Acta Genet Sin, 2005, 32: 136–144

Ullah Z, Lee C Y, DePamphilis M L. Cip/Kip cyclin-dependent protein kinase inhibitors and the road to polyploidy. Cell Div, 2009, 4: 10

Bretagnolle F, Thompson J D. Gametes with the somatic chromosome number: mechanisms of their formation and role in the evolution of autopolyploid plants. New phytol, 1995, 129: 1–22

Werner J E, Peloquin S J. Occurrence and mechanisms of 2n egg formation in 2x potato. Genome, 1991, 34: 975–982

Seehausen O. Hybridization and adaptive radiation. Trends Ecol Evol, 2004, 19: 198–207

Mallet J. Hybridization as an invasion of the genome. Trends Ecol Evol, 2005, 20: 229–237

Mallet J. Hybrid speciation. Nature, 2007, 446: 279–283

Yu X J, Zhou T, Li Y C. Chromosomes of Chinese Fresh-water Fishes (in Chinese). Beijing: Science Press, 1989

Meyer A, Salzburger W, Schartl M. Hybrid origin of a swordtail species (Teleostei: Xiphophorus clemenciae) driven by sexual selection. Mol Ecol, 2006, 15: 721–730

Saitoh K, Chen W J, Mayden R L. Extensive hybridization and tetrapolyploidy in spined loach fish. Mol Phylogenet Evol, 2010, 56: 1001–1010

Harlan J R, deWet J M J. On Ö. Winge and a prayer: the origins of polyploidy. Bot Rev, 1975, 41: 361–390

Belling J. The origin of chromosomal mutations in Uvularia. J Genet, 1925, 15: 245–266

McHale N A. Environmental induction of high frequency 2n pollen formation in diploid Solanum. Can J Genet Cytol, 1983, 25: 609–615

Mable B K. ’Why polyploidy is rarer in animals than in plants’: myths and mechanisms. Biol J Linn Soc, 2004, 82: 453–466

Comai L. Genetic and epigenetic interactions in allopolyploid plants. Plant Mol Biol, 2000, 43: 387–399

Chen Z J, Ni Z F. Mechanisms of genomic rearrangements and gene expression changes in plant polyploids. Bioessays, 2006, 28: 240–252

Song K, Lu P, Tang K, et al. Rapid genome change in synthetic polyploids of Brassica and its implications for polyploid evolution. Proc Natl Acad Sci USA, 1995, 92: 7719–7723

Kenton A, Parokonny A S, Gleba Y Y, et al. Characterization of the Nicotiana tabacum L. Genome by molecular cytogenetics. Mol Gen Genet, 1993, 240: 159–169

Jellen E N, Gill B S, Cox T S. Genomic in situ hybridization differentiates between A/D-and C-genome chromatin and detects intergenomic translocations in polyploid oat species (genus Avena). Genome, 1994, 37: 613–618

Kellogg E A. What happens to genes in duplicated genomes. Proc Natl Acad Sci USA, 2003, 100: 4369–4371

Se’mon M, Wolfe K H. Preferential subfunctionalization of slow-evolving genes after allopolyploidization in Xenopus laevis. Proc Natl Acad Sci USA, 2008, 105: 8333–8338

Lee H S, Chen Z J. Protein-coding genes are epigenetically regulated in Arabidopsis polyploids. Proc Natl Acad Sci USA, 2001, 98: 6753–6758

Chen Z J. Genetic and epigenetic mechanisms for gene expression and phenotypic variation in plant polyploids. Annu Rev Plant Biol, 2007, 58: 377–406

Liu B, Wendel J F. Epigenetic phenomena and the evolution of plant allopolyploids. Mol Phylogenet Evol, 2003, 29: 365–379

Madlung A, Masuelli R W, Watson B, et al. Remodeling of DNA methylation and phenotypic and transcriptional changes in synthetic Arabidopsis allotetraploids. Plant Physiol, 2002, 129: 733–746

Fedoroff N. Transposons and genome evolution in plants. Proc Natl Acad Sci USA, 2000, 97: 7002–7007

Doyle J J, Flagel L E, Paterson A H, et al. Evolutionary genetics of genome merger and doubling in plants. Annu Rev Genet, 2008, 42: 443–461

Liu B, Wendel J F. Non-mendelian phenomena in allopolyploid genome evolution. Curr Genomics, 2002, 3: 489–506

Ma X F, Gustafson J P. Genome evolution of allopolyploids: a process of cytological and genetic diploidization. Cytogenet Genome Res, 2005, 109: 236–249

De Bodt S, Maere S, Van de Peer Y. Genome duplication and the origin of angiosperms. Trends Ecol Evol, 2005, 20: 591–597

Ma H Y, Zhang J F, Li Z D. Research advances on plant polyploidy breeding techniques (in Chinese). Protect Forest Sci Technol, 2008, 1: 43–46

Wang T K, Zhang J Z, Qi Y S, et al. Advances on polyploid breeding of fruit crops in China (in Chinese). J Fruit Sci, 2004, 21: 592–597

Shun M H, Zhang S N. The application of polyploidy breeding in horticultural crops (in Chinese). Jiangsu Agricult Sci, 2004, 1: 68–72

Zhang X Y, Liu J F, Wang L P. Polyploidy breeding and its application research progress of medicinal plants (in Chinese). J Jilin Normal Univ (Nat Sci Ed), 2009, 4: 128–131

Yuan J M, Dang X M, Zhan Y F. Advances on polyploid breeding in watermelon (in Chinese). Chin J Tropical Agricult, 2009, 29: 65–70

Shen A L, Yao W Z. The proceeding on triploid breeding of aquatic animals (in Chinese). Reserv Fish, 2004, 24: 1–3

Liu Y, Liu S J, Sun Y D, et al. Polyploid hybrids of crucian carp× common carp (in Chinese). Rev China Agricult Sci Technol, 2003, 5: 3–6

Wu P. Research progress of fish polyploid breeding in China (in Chinese). J Shanghai Fish Univ, 2005, 14: 72–78

Hu L L, Li J E. The review of fish polyploid breeding research(in Chinese). Fish Sci Technol, 2009, 7–10

Yuan B J, Jiang N C, Lu J P, et al. A review of decapod crustacean multiploid breeding (in Chinese). Donghai Marine Sci, 1998, 16: 64–68

Wang Z P, Li K J, Yu R H, et al. Progress of tetraploid breeding in mollusks (in Chinese). J Ocean Univ China, 2004, 34: 195–200

Song L M, Yang Y, Wang W M, et al. Induction of triploidy in yellow catfish Pelteobagrus fulvidraco by heat shock (in Chinese). Fish Sci, 2010, 29: 352–355

Gui J F, Liang S C, Sun J M, et al. Studies on genome manipulation in fish I. Induction of triploid transparent colored crucian carp (Carassius auratus transparent colored variety) by hydrostatic pressure (in Chinese). Acta Hydrobiol Sin, 1990, 14: 336–344

Wu W X, Li C W, Liu G A, et al. Studies on tetraploid hybrid between red common carp (Cyprinus carpio) and grass carp (Ctenopharyngodon idellus) and its backcross triploid (in Chinese). Acta Hydrobiol Sin, 1988, 12: 355–363

Gui J F, Liang S C, Zhu L F, et al. Discovery and breeding potential of compound tetraploid allogynogenetic silver crucian carp in artificial population (in Chinese). Chin Sci Bull, 1992, 37: 646–648

Wu C, Ye Y, Chen R, et al. An artificial multiple triploid carp and its biological characteristics. Aquaculture, 1993, 111: 255–262

Luo K K, Xiao J, Liu S J, et al. Massive production of all-female diploids and triploids in the crucian carp. Int J Biol Sci, 2011, 7: 487–495

Hu W, Zhu Z Y. Integration mechanisms of transgenes and population fitness of GH transgenic fish. Sci China Life Sci, 2010, 53: 401–408

Hu W, Wang Y P, Zhu Z Y. Progress in the evaluation of transgenic fish for possible ecological risk and its containment strategies. Sci China Life Sci, 2007, 50: 573–579

Yu F, Xiao J, Liang X Y, et al. Rapid growth and sterility of growth hormone gene transgenic triploid carp. Chin Sci Bull, 2011, 56: 1679–1684

Qin Q B, He W G, Liu S J, et al. Analysis of 5S rDNA organization and variation in polyploid hybrids from crosses of different fish subfamilies. J Exp Zool (Mol Dev Evol), 2010, 314: 403–411


A database providing information on the structure of assembled genomes, assembly names and other meta-data, statistical reports, and links to genomic sequence data.

A collection of genomics, functional genomics, and genetics studies and links to their resulting datasets. This resource describes project scope, material, and objectives and provides a mechanism to retrieve datasets that are often difficult to find due to inconsistent annotation, multiple independent submissions, and the varied nature of diverse data types which are often stored in different databases.

The dbVar database has been developed to archive information associated with large scale genomic variation, including large insertions, deletions, translocations and inversions. In addition to archiving variation discovery, dbVar also stores associations of defined variants with phenotype information.

Contains sequence and map data from the whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles.

The Genome Reference Consortium (GRC) maintains responsibility for the human and mouse reference genomes. Members consist of The Genome Center at Washington University, the Wellcome Trust Sanger Institute, the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI). The GRC works to correct misrepresented loci and to close remaining assembly gaps. In addition, the GRC seeks to provide alternate assemblies for complex or structurally variant genomic loci. At the GRC website, the public can view genomic regions currently under review, report genome-related problems and contact the GRC.

A database of known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data.

A compilation of data from the NIAID Influenza Genome Sequencing Project and GenBank. It provides tools for flu sequence analysis, annotation and submission to GenBank. This resource also has links to other flu sequence resources, and publications and general information about flu viruses.

A project involving the collection and analysis of bacterial pathogen genomic sequences originating from food, environmental and patient isolates. Currently, an automated pipeline clusters and identifies sequences supplied primarily by public health laboratories to assist in the investigation of foodborne disease outbreaks and discover potential sources of food contamination.

A collection of nucleotide sequences from several sources, including GenBank, RefSeq, the Third Party Annotation (TPA) database, and PDB. Searching the Nucleotide Database will yield available results from each of its component databases.

Database of related DNA sequences that originate from comparative studies: phylogenetic, population, environmental and, to a lesser degree, mutational. Each record in the database is a set of DNA sequences. For example, a population set provides information on genetic variation within an organism, while a phylogenetic set may contain sequences, and their alignment, of a single gene obtained from several related organisms.

A public registry of nucleic acid reagents designed for use in a wide variety of biomedical research applications, together with information on reagent distributors, probe effectiveness, and computed sequence similarities.

A collection of resources specifically designed to support the research of retroviruses, including a genotyping tool that uses the BLAST algorithm to identify the genotype of a query sequence; an alignment tool for global alignment of multiple sequences; an HIV-1 automatic sequence annotation tool; and annotated maps of numerous retroviruses viewable in GenBank, FASTA, and graphic formats, with links to associated sequence records.

A summary of data for the SARS coronavirus (CoV), including links to the most recent sequence data and publications, links to other SARS related resources, and a pre-computed alignment of genome sequences from various isolates.

The Sequence Read Archive (SRA) stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.

A repository of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects.

A wide range of resources, including a brief summary of the biology of viruses, links to viral genome sequences in Entrez Genome, and information about viral Reference Sequences, a collection of reference sequences for thousands of viral genomes.

An extension of the Influenza Virus Resource to other organisms, providing an interface to download sequence sets of selected viruses, analysis tools, including virus-specific BLAST pages, and genome annotation pipelines.


This site contains genome sequence and mapping data for organisms in Entrez Genome. The data are organized in directories for single species or groups of species. Mapping data are collected in the directory MapView and are organized by species. See the README file in the root directory and the README files in the species subdirectories for detailed information.

Contains directories for each genome that include available mapping data for current and previous builds of that genome.

This site contains all nucleotide and protein sequence records in the Reference Sequence (RefSeq) collection. The "release" directory contains the most current release of the complete collection, while data for selected organisms (such as human, mouse and rat) are available in separate directories. Data are available in FASTA and flat file formats. See the README file for details.

This site contains SKY-CGH data in ASN.1, XML and EasySKYCGH formats. See the skycghreadme.txt file for more information.

This site contains next-generation sequencing data organized by the submitted sequencing project.

This site contains the trace chromatogram data organized by species. Data include chromatogram, quality scores, FASTA sequences from automatic base calls, and other ancillary information in tab-delimited text as well as XML formats. See the README file for details.

This site contains whole genome shotgun sequence data organized by the 4-digit project code. Data include GenBank and GenPept flat files, quality scores and summary statistics. See the README.genbank.wgs file for more information.


An online form that provides an interface for researchers, consortia and organizations to register their BioProjects. This serves as the starting point for the submission of genomic and genetic data for the study. The data does not need to be submitted at the time of BioProject registration.

A command-line program that automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences.

This link describes how submitters of SRA data can obtain a secure NCBI FTP site for their data, and also describes the allowed data formats and directory structures.

A single entry point for submitters to link to and find information about all of the data submission processes at NCBI. Currently, this serves as an interface for the registration of BioProjects and BioSamples and submission of data for WGS and GTR. Future additions to this site are planned.

This link describes how submitters of trace data can obtain a secure NCBI FTP site for their data, and also describes the allowed data formats and directory structures.


An interactive graphical viewer that allows users to explore variant calls, genotype calls and supporting evidence (such as aligned sequence reads) that have been produced by the 1000 Genomes Project.

Performs a BLAST search for similar sequences from selected complete eukaryotic and prokaryotic genomes.

Performs a BLAST search of the genomic sequences in the RefSeqGene/LRG set. The default display provides ready navigation to review alignments in the Graphics display.

This tool compares nucleotide or protein sequences to genomic sequence databases and calculates the statistical significance of matches using the Basic Local Alignment Search Tool (BLAST) algorithm.
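As a rough illustration of what "statistical significance of matches" means for a local alignment score, the sketch below uses the Karlin-Altschul relation commonly cited for BLAST, E = K·m·n·e^(−λS), and the corresponding bit score. The function names are hypothetical, and `lam` and `K` are placeholders: real values depend on the scoring system and are computed by BLAST itself, which also applies length corrections not modeled here.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, K):
    """Expected number of chance hits scoring at least raw_score.

    lam and K are scoring-system-dependent constants (assumed inputs here).
    """
    return K * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam, K):
    """Normalized score, so that E = query_len * db_len * 2 ** (-bits)."""
    return (lam * raw_score - math.log(K)) / math.log(2)

# Arbitrary illustrative constants, not taken from any BLAST run.
e = expect_value(raw_score=50, query_len=100, db_len=1_000_000, lam=0.3, K=0.1)
bits = bit_score(50, lam=0.3, K=0.1)
```

A higher raw score always gives a lower E-value under this relation, which is why hits are usually ranked by E-value or bit score rather than by raw score.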

A genome browser for interactive navigation of eukaryotic RefSeq genome assemblies with comprehensive inspection of gene, expression, variation and other annotations. GDV offers easy-to-load analytical track pre-configurations, a menu of data tracks for easy display and customization, and supports upload and analysis of user data. This browser also enables the production of displays for publishing.

An online tool that assists in the production of journal quality figures of annotations on an ideogram or sequence representation of an assembly.

NCBI's Remap tool allows users to project annotation data and convert locations of features from one genomic assembly to another or to RefSeqGene sequences through a base by base analysis. Options are provided to adjust the stringency of remapping, and summary results are displayed on the web page. Full results can be downloaded for viewing in NCBI's Genome Workbench graphical viewer, and annotation data for the remapped features, as well as summary data, is also available for download.

An integrated application for viewing and analyzing sequence data. With Genome Workbench, you can view data in publicly available sequence databases at NCBI, and mix these data with your own data.

Supports finding human phenotype/genotype relationships with queries by phenotype, chromosome location, gene, and SNP identifiers. Currently includes information from dbGaP, the NHGRI GWAS Catalog, and GTEx. Displays results on the genome, on sequence, or in tables for download.

A utility for computing alignment of proteins to genomic nucleotide sequence. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, ProSplign is accurate in determining splice sites and tolerant to sequencing errors.

Sequence Cytogenetic Conversion Service: An online tool that converts sequence and cytogenetic coordinates for human, rat, mouse and fruit fly genomic assemblies.

Sequence Viewer: Provides a configurable graphical display of a nucleotide or protein sequence and features that have been annotated on that sequence. In addition to use on NCBI sequence database pages, this viewer is available as an embeddable webpage component. Detailed documentation including an API Reference guide is available for developers wishing to embed the viewer in their own pages.

A utility for computing cDNA-to-Genomic sequence alignments. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, Splign is accurate in determining splice sites and tolerant to sequencing errors.
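For readers unfamiliar with the underlying idea, here is a minimal sketch of plain textbook Needleman-Wunsch global alignment. It is not Splign's or ProSplign's variation: it has no notion of introns or splice signals, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = needleman_wunsch("ACGT", "AGT")
```

A splice-aware aligner would, roughly speaking, add a separate and much cheaper penalty for long gaps that begin and end at plausible splice signals, so that introns are not scored like ordinary indels.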

Variation Viewer: A genomic browser to search and view genomic variations listed in dbSNP, dbVar, and ClinVar databases. Searches can be performed using chromosomal location, gene symbol, phenotype, or variant IDs from dbSNP and dbVar. The browser enables exploration of results in a dynamic graphical sequence viewer with annotated tables of variations.

Viral Genotyping Tool: This tool helps identify the genotype of a viral sequence. A window is slid along the query sequence and each window is compared by BLAST to each of the reference sequences for a particular virus.
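The sliding-window idea can be sketched in a few lines. This is only an illustration, not the tool's implementation: the per-position identity function is a hypothetical stand-in for the BLAST comparison, and it assumes the references are already aligned to the query and have the same length. All names and the window and step sizes are assumptions.

```python
def sliding_windows(seq, size, step):
    """Yield (start, window) pairs of fixed-size windows along seq."""
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        yield start, seq[start:start + size]

def identity(x, y):
    """Fraction of positions in x that match y (stand-in for a BLAST score)."""
    if not x:
        return 0.0
    return sum(c1 == c2 for c1, c2 in zip(x, y)) / len(x)

def genotype_by_window(query, references, size=4, step=4):
    """For each window, report the name of the best-matching reference.

    references maps a genotype name to a sequence assumed to be
    pre-aligned to the query (an assumption of this sketch).
    """
    calls = []
    for start, window in sliding_windows(query, size, step):
        best = max(
            references,
            key=lambda name: identity(window, references[name][start:start + size]),
        )
        calls.append((start, best))
    return calls

refs = {"A": "AAAAAAAA", "B": "CCCCCCCC"}
calls = genotype_by_window("AAAACCCC", refs, size=4, step=4)
```

Reporting a call per window rather than one call for the whole sequence is what lets this kind of approach flag recombinants, where different regions of the query best match different reference genotypes.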

NHGRI Targets 12 More Organisms for Genome Sequencing

BETHESDA, Md., Tues., Mar. 1, 2005 - The National Human Genome Research Institute (NHGRI), one of the National Institutes of Health (NIH), announced today that the Large-Scale Sequencing Research Network will begin sequencing 12 more strategically selected organisms, including the marmoset, a skate and several important insects, as part of its ongoing effort to expand understanding of the human genome.

The National Advisory Council for Human Genome Research, which is a federally chartered committee that advises NHGRI on program priorities and goals, recently approved a comprehensive plan that identified two groups of new sequencing targets on the basis of their collective scientific merits.

"Our sequencing strategy continues to focus on identifying the sets of organisms with the greatest potential to fill crucial gaps in biomedical knowledge," said Mark S. Guyer, Ph.D., director of NHGRI's Division of Extramural Research. "The most effective approach we currently have to identify the essential functional and structural components of the human genome is to compare it with the genomes of other organisms."

Two of the sequencing projects are aimed at gaining new insights into model organisms utilized in research on drug development and disease susceptibility. They are: sequencing the genome of a fellow primate, the marmoset (Callithrix jacchus), and identifying genetic variations (in the form of single nucleotide polymorphisms) in eight strains of rats.

The marmoset is a key model organism used in neurobiological studies of multiple sclerosis, Parkinson's disease and Huntington's disease. The marmoset is also an important model for research into infectious disease and pharmacology.

The marmoset was chosen also because of its unique position on the evolutionary tree, one step further removed from humans than other non-human primates already being sequenced, such as the chimpanzee (Pan troglodytes), the rhesus macaque (Macaca mulatta) and orangutan (Pongo pygmaeus). Obtaining the marmoset genome sequence will provide a powerful tool to illuminate the similarities and differences among these primate genomes.

The second project chosen for its considerable medical relevance to humans will identify 280,000 single nucleotide polymorphisms, known as "SNPs," in the genomes of eight different strains of laboratory rats. SNPs can be used as markers to zero in on genetic variations that may affect an individual's risk of developing common, complex illnesses such as heart diseases, diabetes and cancer. Building a catalog of rat SNPs will assist researchers trying to find genetic variations associated with common, complex diseases in rats, which can then be used to help identify similar genetic variations that may be involved in human disease.

The eight rat strains selected are: the PVG strain, commonly used as a healthy control in studies; the F344 strain, used in toxicological and pharmacological studies; the SS strain, used for cardiovascular disease studies; the LEW strain, often used in studies of transplants and immune response; the BB strain, used in studies of diabetes; the FHH strain, also used for cardiovascular studies; the DA strain, used for studies of arthritis and cancer; and the SHR strain, used in studies of hypertension.

"The overriding goal of sequencing the genomes of a diverse set of organisms is to understand the biological processes at work in human health and illness," said NHGRI Director Francis S. Collins, M.D., Ph.D. "It is also gratifying to know that these tools, freely available to the entire biomedical research community, can be used in other scientific fields to further improve animal and human welfare."

Another set of 11 non-mammalian organisms was strategically chosen, each representing a position on the evolutionary timeline marked by important innovations in animal anatomy, physiology, development or behavior. The organisms are: a skate (Raja erinacea); a sea slug (Aplysia californica); a disease-carrying insect (Rhodnius prolixus); a pea aphid (Acyrthosiphon pisum); a wasp (Nasonia vitripennis) and two related insect species (Nasonia giraulti and Nasonia longicornis); a free-living soil amoeba (Acanthamoeba castellanii); and three fungi (Schizosaccharomyces octosporus, Schizosaccharomyces japonicus and Batrachochytrium dendrobatidis).

It has been shown that most sequences of the human genome originated long before humans themselves. Consequently, scientists will use the genome sequences of the 11 non-mammalian organisms to learn more about how, when and why the human genome came to be composed of certain DNA sequences, as well as to gain new insights into the organization of genomes. In addition, many of the organisms can shed light on human disease.

For instance, the skate (related to many species of shark and cartilaginous fish) was chosen because it belongs to the first group of primitive vertebrates that developed jaws, an important step in vertebrate evolution. Other innovations in this group of animals include an adaptive immune system similar to that of humans, a closed and pressurized circulatory system, and myelination of the nervous system. Understanding these systems of the skate at a genetic level will help scientists identify the minimum set of genes that create a nervous system or develop a jaw, possibly illustrating how these systems have evolved in humans, and how they sometimes go wrong.

Aplysia (Aplysia californica) is a sea slug that has been a very useful model in studying learning and memory in humans. Aplysia have very large neurons which can be manipulated and studied easily by researchers. In 2000, Eric Kandel, M.D., of Columbia University in New York, shared the Nobel Prize in Physiology or Medicine for his work elucidating how memories are formed in the human brain using Aplysia as a model.

The disease-carrying insect, Rhodnius prolixus, spreads Chagas' disease, caused by the parasite Trypanosoma cruzi,which is carried by the insect. Chagas' disease is prominent in Latin America, affecting about 20 million people in South America alone and killing 50,000 of them a year. Having the genome sequence of Rhodnius prolixus presents an opportunity for experts from the United States, Canada and Latin America to collaborate on understanding this widespread infectious disease.

The pea aphid (Acyrthosiphon pisum) is an insect that causes hundreds of millions of dollars of crop damage each year. The pea aphid is a model for studying rapid adaptation because this species is exceptionally adept at adapting to and resisting many pesticides. Understanding this resistance at a molecular level can lead to safer and more effective pesticides and improve human nutrition. The genome of the pea aphid, used extensively as an experimental model, will be a valuable comparison with other insects, such as the closely related Rhodnius prolixus.

Another insect, the parasitoid wasp Nasonia vitripennis, is a natural enemy of houseflies, and its relatives are natural enemies of ticks, mites, roaches and other arthropods. It is the genetic model for parasitoids, which lay their eggs on and kill arthropods, thus controlling pest populations. In the United States, the use of parasitoid wasps in agriculture as a biological control of crop-damaging insects saves approximately $20 billion annually. The wasp will serve as a good comparison for the honey bee genome, which has already been sequenced. Two related wasp species, Nasonia giraulti and Nasonia longicornis, will be sequenced at less dense coverage to aid in the comparative studies.

Sequencing efforts will be carried out by the five centers in the NHGRI-supported Large-Scale Sequencing Research Network: Agencourt Bioscience Corp., Beverly, Mass.; Baylor College of Medicine, Houston; the Broad Institute of MIT and Harvard, Cambridge, Mass.; The J. Craig Venter Science Institute, Rockville, Md.; and Washington University School of Medicine, St. Louis. Assignment of each organism to a specific center or centers will be determined at a later date.

NHGRI's selection process begins with two working groups comprised of experts from across the research community. Each of the working groups is responsible for developing a proposal for a set of genomes to sequence that would advance knowledge in one of two important scientific areas: understanding the human genome and understanding the evolutionary biology of genomes. A coordinating committee then reviews the working groups' proposals, helping to fine-tune the suggestions and integrate them into an overarching set of scientific priorities. The recommendations of the coordinating committee are then reviewed and approved by NHGRI's advisory council, which in turn forwards its recommendations regarding sequencing strategy to NHGRI leadership.

The genomes of a number of organisms have been or are being sequenced by the large-scale sequencing capacity developed by the Human Genome Project. These include the dog, the mouse, the rat, the chicken, the honey bee, two fruit flies, the sea urchin, two puffer fish, two sea squirts, two roundworms, several fungi, baker's yeast and many prokaryotes (bacteria and archaea) including Escherichia coli. Additional organisms already in the NHGRI sequencing pipeline are: the macaque, the orangutan, the kangaroo, the cow, the gray short-tailed opossum, the platypus, the red flour beetle, the domestic cat, the flatworm Schmidtea mediterranea, more species of fruit fly and several species of fungi.