Information

How to search for a DNA sequence in a genome on ENA (European Nucleotide Archive)?


I need to search for short DNA sequences in the genome of an organism in ENA. I have the accession number and project ID. However, I can't download the whole genome because of its size and would prefer to search online if possible. I have a rough idea of which part of the genome I want to scan, though I don't know whether that's relevant.


While it is much better/easier to download and do the exact search on your own, it is possible to do something similar, online. You can use NCBI-BLAST and increase the word length and mismatch/gap penalty. It also allows restricting the subject space; you can even align two sequences. The BLAST on ENA does not have much flexibility (subject restriction/word length). Perhaps you can look for a duplicate sequence on NCBI.

If this genome is available on UCSC genome browser then you can use BLAT too. However, it only accepts sequences larger than 19nt.
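If you can get hold of just the region you care about (for example, by copying it as plain sequence from a browser view), the exact search itself is small enough to do in a few lines. The following is a minimal Python sketch under that assumption; the function names, the made-up `region` string and the `region_start` coordinate are all hypothetical, and nothing here talks to ENA or NCBI. It reports exact matches on both strands, which BLAST-style tools do not guarantee for very short queries.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T/N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_exact(region: str, query: str, region_start: int = 1):
    """List (position, strand) for every exact occurrence of query.

    Positions are 1-based genome coordinates of the leftmost base of the
    match, assuming `region` begins at genome coordinate `region_start`.
    """
    region, query = region.upper(), query.upper()
    targets = {"+": query}
    rc = reverse_complement(query)
    if rc != query:  # a palindromic site would otherwise be reported twice
        targets["-"] = rc
    hits = []
    for strand, pattern in targets.items():
        i = region.find(pattern)
        while i != -1:
            hits.append((region_start + i, strand))
            i = region.find(pattern, i + 1)  # step by 1 to keep overlapping hits
    return sorted(hits)


# Made-up 20-base region that is assumed to start at genome position 1001.
region = "TTGACAGGATCCATTGTCAA"
print(find_exact(region, "TGACA", region_start=1001))
```

A "-" hit means the query matches the reverse strand; the reported position is still the leftmost base on the forward strand.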



Bioinformatics tools for genomic and evolutionary analysis of infectious agents

Vivek Dhar Dwivedi 1 , Shiv Bharadwaj 2 , Partha Sarathi Mohanty 1 , Umesh Datta Gupta 3
1 Department of Epidemiology, ICMR-National JALMA Institute for Leprosy and Other Mycobacterial Diseases, Agra, Uttar Pradesh, India
2 Nanotechnology Research and Application Center, Sabanci University, Istanbul, Turkey
3 Department of Animal Experimentation, ICMR-National JALMA Institute for Leprosy and Other Mycobacterial Diseases, Agra, Uttar Pradesh, India

Date of Web Publication: 6-Sep-2018

Correspondence Address:
Dr. Umesh Datta Gupta
Department of Animal Experimentation, ICMR-National Jalma Institute for Leprosy and Other Mycobacterial Diseases, Agra, Uttar Pradesh
India

Source of Support: None, Conflict of Interest: None

DOI: 10.4103/bbrj.bbrj_74_18

Genome sequence analysis of infectious agents (IAs) reveals many secrets about their life processes and evolutionary history. The rapidly growing volume of genomic sequence data for various IAs, produced by different sequencing projects and deposited in biological sequence databases, continuously motivates genome researchers to unlock the mysteries of the life of IAs. Furthermore, this information may be helpful for treating the serious illnesses caused by IAs. However, genome analysis requires a good knowledge of the bioinformatics tools that allow researchers to extract meaningful and accurate information from the genome sequence data of IAs. In this article, the most recent bioinformatics tools for the genomic and evolutionary analysis of IAs are discussed and compared in detail, which will help genome researchers select the most appropriate tool for their analysis.

Keywords: Bioinformatics tools, evolution, genome, infectious agents


How to cite this article:
Dwivedi VD, Bharadwaj S, Mohanty PS, Gupta UD. Bioinformatics tools for genomic and evolutionary analysis of infectious agents. Biomed Biotechnol Res J 2018;2:163-7

How to cite this URL:
Dwivedi VD, Bharadwaj S, Mohanty PS, Gupta UD. Bioinformatics tools for genomic and evolutionary analysis of infectious agents. Biomed Biotechnol Res J [serial online] 2018 [cited 2021 Jun 27];2:163-7. Available from: https://www.bmbtrj.org/text.asp?2018/2/3/163/240709

Infectious agents (IAs) such as bacteria, fungi, protozoans, helminths, and viruses cause very serious health problems in human beings. The genomes of all IAs consist of DNA or RNA as genetic material, which has a specific order of nucleotides. This order distinguishes each IA from the others, and the mystery of the origin, growth, survival, virulence, and evolution of IAs is hidden in it. [1],[2] Hence, it is very important to analyze the genomes of IAs to understand their identity and molecular mechanism of infection and to develop new effective drugs against them. Genome sequence data of IAs, which are produced by sequencing projects around the world and deposited in various nucleotide sequence databases, require in silico tools for their interpretation. The huge amount of experimentally produced nucleotide sequence data is collected, organized, and distributed by the International Nucleotide Sequence Database Collaboration, [3] which is a joint effort of EMBL-EBI (European Bioinformatics Institute, http://www.ebi.ac.uk), DDBJ (DNA Data Bank of Japan, http://www.ddbj.nig.ac.jp), and GenBank (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov). [4],[5],[6] In silico tools have become an integral part of biological research; they are designed to extract meaningful information from biological data within a very short time. Although in silico tools cannot produce results as reliable as in vitro or in vivo investigations, which are costly and time-consuming, bioinformatics analyses can still support an informed decision before expensive research is conducted. [7],[8] In many cases, however, only in silico tools are capable of answering the questions of biological research. Development of these tools is an important part of the bioinformatics and computational biology field. A large number of in silico tools have been developed for genomic and evolutionary analysis of IAs, but selecting the appropriate tool for the analysis of genomic data requires a strong knowledge of statistics and computational algorithms. Hence, it is very difficult for researchers without a computational biology background to choose the appropriate one.

In light of the above facts, it is very necessary to explore the importance and accuracy of various bioinformatics tools for different types of genomic and evolutionary analysis of IAs. In this review, various bioinformatics tools for the genomic and evolutionary analysis of IAs have been discussed which will be helpful to the IA researchers of nonbioinformatics background for selecting the appropriate tool for their work.

Sequence identification or similarity searching

DNA sequence identification or similarity searching tools (SSTs) are among the first and most important programs in biological research. They help scientists decide on species identity and classification by reporting the organisms most closely related to a query. These tools search databases for DNA sequences similar to a given query DNA sequence. Each nucleotide database provides its own SST for sequence similarity searching. BLAST, FASTA, and ENA search are the most popular SSTs. [9],[10] Among these three tools, BLAST is a very efficient program that offers many options for sequence similarity searching. Here, only the nucleotide sequence similarity searching programs of BLAST are discussed. BLAST stands for Basic Local Alignment Search Tool and is a group of tools for nucleotide and protein sequence similarity searching. Nucleotide BLAST (BLASTn) is one of those tools; it takes a nucleotide sequence (for example, a genome sequence) as a query and searches for similar DNA sequences in the NCBI database. [11] Researchers can choose among the optimization programs megablast, discontiguous megablast, and BLASTn. Megablast searches for highly similar sequences, which is very useful for species identification and intraspecies comparison. Discontiguous megablast searches for more dissimilar sequences and can be used for cross-species comparison. The BLASTn option is used to search for somewhat similar sequences in the NCBI database. The BLASTx program of the BLAST package is used to identify the potential protein products encoded by a nucleotide query. [12] The tBLASTx program can be used to identify nucleotide sequences similar to the query based on their coding potential. [12]

Identification of genes in the genomes of IAs is the main goal of their sequencing projects. Reliable prediction of genes and their positions can be useful for understanding the molecular mechanisms of IA growth, survival, and virulence. Furthermore, this information can be used to develop molecular diagnostic kits for the in vitro identification of IAs and potential drugs for treating their infections. An open reading frame (ORF) is the best hypothesis for a protein-coding region in the genome sequence data of an organism. It is the region of genome sequence between a start codon and the next in-frame stop codon. [13] Various tools for ORF prediction have been developed, but according to Wikipedia, ORF Finder, ORF Investigator, and ORF Predictor are the most powerful tools for efficient prediction of ORFs. [14],[15],[16] The Open Reading Frame Finder (ORF Finder) predicts all possible ORFs in a given nucleotide sequence. [17] ORF Investigator is a graphical user interface program that finds all ORFs in a given DNA sequence and converts them into their corresponding protein sequences, reporting their respective positions in the sequence. [17]
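The ORF definition given above (a start codon followed by the next in-frame stop codon) can be illustrated with a short Python sketch. This is not the algorithm of ORF Finder or ORF Investigator; it is a simplified scan that assumes ATG as the only start codon and the three standard stop codons, and the function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=2):
    """Yield (strand, frame, start, end, orf_sequence) for each ORF.

    start/end are 0-based, end-exclusive offsets on the strand that was
    scanned; an ORF runs from the first ATG in a frame through the next
    in-frame stop codon (the stop codon is included in the output).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open an ORF at the first start codon
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None  # close the ORF and look for the next ATG


for orf in find_orfs("CCATGAAATGATAGCC"):
    print(orf)
```

Real gene finders additionally consider alternative start codons, minimum lengths tuned to the organism, and codon usage, none of which is modelled here.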

Alignment of genomes or gene sequences of IAs provides useful knowledge about their degree of relatedness and the variation between two or more species. Alignment of two sequences (pairwise sequence alignment) identifies the conserved and variable regions and also provides the percentage similarity, while alignment of more than two sequences (multiple sequence alignment [MSA]) not only provides information about the conserved and variable regions but also generates data for phylogenetic analysis. EMBOSS is a powerful package for pairwise alignment (global and local) of small DNA sequences and is available at http://www.ebi.ac.uk/Tools/emboss/. wgVISTA is a software package used for comparing the genome data (up to 10 megabases long) of two microbial organisms [18],[19] and is available at http://genome.lbl.gov/cgi-bin/WGVistaInput. Similarly, the software package mVISTA is used for the comparison of two or more nucleotide sequences from two or more organisms and is available at http://genome.lbl.gov/cgi-bin/GenomeVista. [18],[19] mVISTA is an online program that provides clear alignments of genome sequences and allows the results to be displayed at different levels of resolution. It offers access to global pairwise, multiple, and glocal (global with rearrangements) alignment tools. The AVID (for global alignment of DNA sequences of arbitrary length), [21] LAGAN (for pairwise alignment and MSA), [22] and Shuffle-LAGAN (to find rearrangements in a global alignment framework) programs have been incorporated into mVISTA for better results. [20] DNASTAR (https://www.dnastar.com/t-sub-solutions-molecular-biology-sequence-alignment.aspx) is software that aligns DNA sequences using different alignment algorithms, including MUSCLE, Mauve, MAFFT, Clustal Omega, and many other programs.
The European Bioinformatics Institute (EBI) offers a number of programs for MSA, such as Clustal Omega, Kalign, MAFFT, MUSCLE, MView, T-Coffee, and WebPRANK, available at http://www.ebi.ac.uk/Tools/msa/. Clustal Omega is an MSA tool that performs medium to large alignments of up to 4000 sequences or a sequence data file of up to 4 MB. [23] Kalign is a very fast MSA tool that can align up to 2000 sequences or a sequence data file of up to 2 MB. [24] MAFFT is a tool for medium to large alignments that can align up to 500 sequences or a file of at most 1 MB. [25] MUSCLE is suitable for medium alignments and aligns up to 4000 sequences or a sequence data file of up to 4 MB. [26] MUSCLE is considered best suited to protein sequence alignments. MView transforms a sequence similarity search result into an MSA or reformats an existing MSA. [27] It can handle up to 4000 sequences or a sequence data file of up to 4 MB. For small alignments, T-Coffee is very suitable; it can align up to 500 sequences or a file of at most 1 MB. [28] WebPRANK is a newer phylogeny-aware MSA tool that makes use of evolutionary information to help place insertions and deletions. [29] All of the MSA tools described above are popular and can be chosen according to the requirements of the analysis.
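To make the idea of a global pairwise alignment and its percentage identity concrete, the following Python sketch implements the classic Needleman-Wunsch dynamic programme with a linear gap penalty. It is an illustration only, not the implementation used by EMBOSS or any of the tools named above; the scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # substitution score for a[i-1] aligned with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


top, bottom, s = global_align("ACGT", "AGT")
print(top, bottom, s, percent_identity(top, bottom))
```

The table has (n+1) x (m+1) cells, so this approach is practical only for short sequences; whole-genome aligners rely on heuristics instead.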

DNA motif discovery and analysis

DNA sequence motifs are short segments of DNA that carry valuable information about the functional attributes of IAs and have been preserved during evolution. Identification of DNA sequence motifs of IAs may give scientists significant information for designing and developing new effective drugs against various IA infections. [30],[31],[32],[33] The Multiple EM for Motif Elicitation (MEME) Suite web portal (available at: http://meme-suite.org/) is a collection of motif identification and analysis tools. [34],[35],[36] MEME, Gapped Local Alignment of Motifs (GLAM2), Discriminative Regular Expression Motif Elicitation (DREME), and MEME-ChIP are the tools for motif discovery. [37],[38],[39],[40],[41] MEME is a very powerful tool for the identification of novel, ungapped motifs in a set of related DNA sequences. By default, it searches for a minimum of three motifs of about 6–50 nucleotides in width, while the user can define their own parameters for motif discovery. GLAM2 finds gapped motifs in a group of input DNA sequences and attempts to discover the best potential motif several times through replicate analyses. Hence, GLAM2 may be preferable to MEME when gapped motifs are expected. DREME searches for motifs in large DNA sequence data sets derived from ChIP-seq experiments. The MEME-ChIP tool discovers and analyzes motifs in large nucleotide data sets derived from ChIP-seq and CLIP-seq experiments. [42] FIMO, GLAM2SCAN, and MAST (Motif Alignment and Search Tool) find the possible occurrences of a motif in a sequence database and are hence called motif searching tools. [41],[43],[44] SpaMo and CentriMo are tools for motif enrichment analysis. [45],[46],[47] MCAST (Motif Cluster Alignment and Search Tool) is a motif cluster analysis tool that searches a sequence database for statistically significant clusters of nonoverlapping occurrences of a set of motifs. [48] The TOMTOM tool compares a DNA motif against a database of known DNA sequence motifs. [34],[35],[49] The GOMO (Gene Ontology for Motifs) program was designed for the functional analysis of DNA-binding motifs. [50]
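Motif searching of the kind described above is commonly based on scoring sequence windows against a position weight matrix (PWM). The Python sketch below shows that general idea under stated assumptions: a uniform background of 0.25 per base, a pseudocount of 1, and a hypothetical set of aligned example sites. It is not the scoring or statistics used by FIMO, MAST or any other MEME Suite tool, and the function names are made up for the illustration.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length, ungapped sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold=0.0):
    """Return (offset, window, score) for windows scoring at least threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if set(window) <= set("ACGT"):  # skip windows with ambiguous bases
            score = sum(pwm[k][base] for k, base in enumerate(window))
            if score >= threshold:
                hits.append((i, window, round(score, 2)))
    return hits


# Hypothetical aligned sites, used only to build an example matrix.
example_sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm = build_pwm(example_sites)
print(scan("GGCGTATAATGCGC", pwm, threshold=5.0))
```

Dedicated tools also scan the reverse strand and convert scores into p-values, which this sketch leaves out.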

The central dogma relates three molecular processes of the cell: transcription (DNA to RNA), translation (RNA to protein), and reverse transcription (RNA to DNA). Hence, computational tools that convert DNA into RNA, RNA into protein, and RNA into DNA can be called central dogma tools. Many programs have been designed for this purpose; the biological data analysis program is one of them and can perform central dogma-related calculations. [51]
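The three conversions named above are simple string operations, as the following Python sketch shows. It assumes the input DNA is the coding (sense) strand, uses the standard genetic code, and stops translation at the first stop codon; the function names are hypothetical and unrelated to the program cited in [51].

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """DNA coding strand -> mRNA (T replaced by U)."""
    return dna.upper().replace("T", "U")


def reverse_transcribe(rna):
    """RNA -> DNA (U replaced by T)."""
    return rna.upper().replace("U", "T")


def translate(rna):
    """mRNA -> protein in one-letter code, reading frame 0 until a stop codon."""
    coding = reverse_transcribe(rna)
    protein = []
    for i in range(0, len(coding) - 2, 3):
        aa = CODON_TABLE.get(coding[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna), reverse_transcribe(mrna))
```

Organisms with alternative genetic codes (for example, some mitochondria) would need a different `AMINO_ACIDS` string.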

Mutation and recombination analysis

Mutational analysis of IAs makes it possible to check for changes in their genomes, which is very useful for understanding their origin, virulence, and evolution. It also helps to determine the genetic diversity among a group of IAs. Among the mutation and recombination analysis tools, Molecular Evolutionary Genetics Analysis (MEGA5) and DnaSP are very popular for mutation- and recombination-related calculations, respectively. [52],[53],[54],[55],[56],[57],[58],[59]

Evolutionary analysis

Genomic sequences of IAs contain rich information about their origin and the functional constraints on macromolecules such as proteins/enzymes. [2] Evolution in the genomic sequences of IAs can give rise to new strains/species, which may be more virulent than their parent strains/species. [1],[60],[61],[62] Hence, the phylogenetic analysis of IAs is important for understanding their origin and evolutionary history. This requires a good knowledge of phylogenetic analysis tools; therefore, the most popular tools are discussed here. Phylogeny Inference Package (PHYLIP) is the most widely used software package for evolutionary analysis and was developed by scientists of the Department of Genome Sciences and the Department of Biology, University of Washington, Seattle. It is a freely available software package that analyzes molecular sequences using parsimony, distance matrix, and likelihood methods, together with bootstrapping and consensus trees. [63],[64] Hypothesis testing using phylogenies (HyPhy) is a freely distributed software package for phylogenetic analysis of biological sequences, in particular for inferring the strength of selection from sequence data. In addition, HyPhy features a flexible batch language for implementing and customizing discrete-state Markov models in a phylogenetic framework. [65] MEGA is a very popular software package for evolutionary analysis of organisms at the molecular level. Different versions of this package are freely available to academic users. It implements several methods and programs for evolutionary analysis, covering most of the commonly used algorithms in evolutionary biology. [59]
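Distance matrix methods, mentioned above for PHYLIP, start from pairwise evolutionary distances between aligned sequences. As a minimal sketch of that first step, the Python code below computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). It assumes the sequences are already aligned and of equal length; the function names and the toy alignment are hypothetical, and the code does not reproduce the output format or options of PHYLIP or MEGA.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap ('-') are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; infinite when p >= 0.75."""
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """Map every (name_a, name_b) pair to its Jukes-Cantor distance."""
    names = list(alignment)
    return {
        (x, y): jukes_cantor(p_distance(alignment[x], alignment[y]))
        for x in names for y in names
    }


# Toy alignment with made-up sequence names.
toy = {
    "strain_A": "ACGTACGTAC",
    "strain_B": "ACGTACGTAT",
    "strain_C": "ACGAACGTTT",
}
matrix = distance_matrix(toy)
print(round(matrix[("strain_A", "strain_B")], 4))
```

A matrix like this is the input to tree-building methods such as neighbor joining, which are not shown here.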

Genomic and evolutionary analysis of infectious agents reveals much meaningful information for understanding their origin, growth, survival, and virulence. It also provides important knowledge for choosing potential therapeutic targets and for discovering novel drugs to treat their infections. The bioinformatics tools discussed in this article will help genome researchers select the most appropriate tool for the genomic and evolutionary analysis of IAs.

The authors of this article acknowledge National JALMA Institute for Leprosy and Other Mycobacterial Diseases (ICMR), Agra, India.


Abstract

Denisovans, a sister group of Neandertals, have been described on the basis of a nuclear genome sequence from a finger phalanx (Denisova 3) found in Denisova Cave in the Altai Mountains. The only other Denisovan specimen described to date is a molar (Denisova 4) found at the same site. This tooth carries a mtDNA sequence similar to that of Denisova 3. Here we present nuclear DNA sequences from Denisova 4 and a morphological description, as well as mitochondrial and nuclear DNA sequence data, from another molar (Denisova 8) found in Denisova Cave in 2010. This new molar is similar to Denisova 4 in being very large and lacking traits typical of Neandertals and modern humans. Nuclear DNA sequences from the two molars form a clade with Denisova 3. The mtDNA of Denisova 8 is more diverged and has accumulated fewer substitutions than the mtDNAs of the other two specimens, suggesting Denisovans were present in the region over an extended period. The nuclear DNA sequence diversity among the three Denisovans is comparable to that among six Neandertals, but lower than that among present-day humans.

In 2008, a finger phalanx from a child (Denisova 3) was found in Denisova Cave in the Altai Mountains in southern Siberia. The mitochondrial genome shared a common ancestor with present-day human and Neandertal mtDNAs about 1 million years ago (1), or about twice as long ago as the shared ancestor of present-day human and Neandertal mtDNAs. However, the nuclear genome revealed that this individual belonged to a sister group of Neandertals. This group was named Denisovans after the site where the bone was discovered (2, 3). Analysis of the Denisovan genome showed that Denisovans have contributed on the order of 5% of the DNA to the genomes of present-day people in Oceania (2–4), and about 0.2% to the genomes of Native Americans and mainland Asians (5).

In 2010, continued archaeological work in Denisova Cave resulted in the discovery of a toe phalanx (Denisova 5), identified on the basis of its genome sequence as Neandertal. The genome sequence allowed detailed analyses of the relationship of Denisovans and Neandertals to each other and to present-day humans. Although divergence times in terms of calendar years are unsure because of uncertainty about the human mutation rate (6), the bone showed that Denisovan and Neandertal populations split from each other on the order of four times further back in time than the deepest divergence among present-day human populations occurred; the ancestors of the two archaic groups split from the ancestors of present-day humans on the order of six times as long ago as present-day populations (5). In addition, a minimum of 0.5% of the genome of the Denisova 3 individual was derived from a Neandertal population more closely related to the Neandertal from Denisova Cave than to Neandertals from more western locations (5).

Although Denisovan remains have, to date, only been recognized in Denisova Cave, the fact that Denisovans contributed DNA to the ancestors of present-day populations across Asia and Oceania suggests that in addition to the Altai Mountains, they may have lived in other parts of Asia. In addition to the finger phalanx, a molar (Denisova 4) was found in the cave in 2000. Although less than 0.2% of the DNA in the tooth derives from a hominin source, the mtDNA was sequenced and differed from the finger phalanx mtDNA at only two positions, suggesting it too may be from a Denisovan (2, 3). This molar has several primitive morphological traits different from both late Neandertals and modern humans. In 2010, another molar (Denisova 8) was found in Denisova Cave. Here we describe the morphology and mtDNA of Denisova 8 and present nuclear DNA sequences from both molars.


Computational Genomics with R

100 animal genomes sequenced as of 2016. On top of these, there are many research projects from either individual labs or consortia that produce petabytes of auxiliary genomics data, such as ChIP-seq, RNA-seq, etc.

There are two requirements to be able to visualize genomes and their associated data: 1) you need to be able to work with a species that has a sequenced genome and 2) you want to have annotation on that genome, meaning, at the very least, you want to know where the genes are. Most genomes after sequencing are quickly annotated with gene-predictions or known gene sequences are mapped on to them, and you can also have conservation to other species to filter functional elements. If you are working with a model organism or human, you will also have a lot of auxiliary information to help demarcate the functional regions such as regulatory regions, ncRNAs, and SNPs that are common in the population. Or you might have disease- or tissue-specific data available. The more the organism is worked on, the more auxiliary data you will have.

1.5.0.1 Accessing genome sequences and annotations via genome browsers

As someone who intends to work with genomics, you will need to visualize a large amount of data to make biological inferences or simply check regions of interest in the genome visually. Looking at the genome case by case with all the additional datasets is a necessary step to develop a hypothesis and understand the data.

Many genomes and their associated data are available through genome browsers. A genome browser is a website or an application that helps you visualize the genome and all the available data associated with it. Via genome browsers, you will be able to see where genes are in relation to each other and other functional elements. You will be able to see gene structure. You will be able to see auxiliary data such as conservation, repeat content and SNPs. Here we review some of the popular browsers.

UCSC genome browser: This is an online browser hosted by University of California, Santa Cruz at http://genome.ucsc.edu/. This is an interactive website that contains genomes and annotations for many species. You can search for genes or genome coordinates for the species of your interest. It is usually very responsive and allows you to visualize large amounts of data. In addition, it has multiple other tools that can be used in connection with the browser. One of the most useful tools is the UCSC Table Browser, which lets you download all the data you see on the browser, including sequence data, in multiple formats. Users can upload data or provide links to the data to visualize user-specific data.

Ensembl: This is another online browser maintained by the European Bioinformatics Institute and the Wellcome Trust Sanger Institute in the UK, http://www.ensembl.org. Similar to the UCSC browser, users can visualize genes or genomic coordinates from multiple species and it also comes with auxiliary data. Ensembl is associated with the Biomart tool which is similar to UCSC Table browser, and can download genome data including all the auxiliary data set in multiple formats.

IGV: Integrative Genomics Viewer (IGV) is a desktop application developed by the Broad Institute (https://www.broadinstitute.org/igv/). It is developed to deal with large amounts of high-throughput sequencing data, which is harder to view in online browsers. IGV can integrate your local sequencing results with online annotation on your desktop machine. This is useful when viewing sequencing data, especially alignments. The other browsers mentioned above have similar features; however, you will need to make your large sequencing data available online somewhere before it can be viewed by those browsers.

1.5.0.2 Data repositories for high-throughput assays

Genome browsers contain lots of auxiliary high-throughput data. However, there are many more public high-throughput data sets available and they are certainly not available through genome browsers. Normally, every high-throughput dataset associated with a publication should be deposited in public archives. There are two major public archives we use to deposit data. One of them is the Gene Expression Omnibus (GEO) hosted at http://www.ncbi.nlm.nih.gov/geo/, and the other one is the European Nucleotide Archive (ENA) hosted at http://www.ebi.ac.uk/ena. These repositories accept high-throughput datasets, and users can freely download and use these public data sets for their own research. Many data sets in these repositories are in their raw format, that is, mostly the format the sequencer provides. Some data sets will also have processed data, but that is not the norm.

Apart from these repositories, there are multiple multi-national consortia dedicated to certain genome biology or disease-related problems and they maintain their own databases and provide access to processed and raw data. Some of these consortia are mentioned below.


The international nucleotide sequence database collaboration

The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) has been the core infrastructure for collecting and providing nucleotide sequence data and metadata for >30 years. Three partner organizations, the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan; the European Nucleotide Archive (ENA) at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in Hinxton, UK; and GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health in Bethesda, Maryland, USA, have been collaboratively maintaining the INSDC for the benefit of not only science but all types of community worldwide.

© The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research.

Figure: Cumulative 10-year growth of raw next generation sequence data: total bytes (dashed) and…


Scientific introduction

DNA sequencing determines the nucleic acid sequence of an organism’s unique hereditary information. As output, it generates hundreds or thousands of short, linear pieces of microbial DNA, which are fragments of the full DNA genome. The next step after DNA sequencing, therefore, involves combining (assembling) those fragments into contiguous fragments of DNA (contigs) using computational approaches.
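To give a feel for what "assembling fragments into contigs" means computationally, here is a toy Python sketch that repeatedly merges the two fragments with the longest suffix-to-prefix overlap. It is a deliberately simplified greedy approach, not how production assemblers work: it assumes error-free reads from a single strand, and the function names, the `min_len` parameter and the example reads are all made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by best overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCGGA"]
print(greedy_assemble(reads))
```

Real data add sequencing errors, repeats and reads from both strands, which is why dedicated assemblers use more robust graph-based methods.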

Genome reconstruction of a single bacterium

The genome of a single bacterium is generally:

  • circular
  • double-stranded
  • of variable length, but generally in the order of million base pairs in size

Commonly used DNA sequencing technologies (applying so-called 2nd generation sequencing) generate pieces of DNA that are:

Therefore, you can think of the task of genome reconstruction as a somewhat “hard” puzzle problem: we need to rebuild a whole image from its pieces.

How do we do that exactly when aiming to reconstruct the genome sequence of a single bacterium? In the most straightforward case, our organism has already been sequenced and its genome sequence has been deposited in a public repository (such as the EMBL-EBI’s European Nucleotide Archive, ENA). In this case, we can use this sequence to help us rebuild the “puzzle”, similarly as you would do by looking at the image on the cover of the puzzle box. This approach is called "mapping" – identifying where a specific piece of DNA comes from by comparing it to a known reference.

Remember that this is obviously a simplistic approach: because bacteria mutate rapidly, the genome of a sequenced bacterium is hardly ever identical to the reference genome. We must therefore be ready to accept that the mapping will not be perfect, and that the mismatches themselves, if sufficiently proven, might be the most interesting spots in the genome.
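The mapping idea, including the tolerance for mismatches, can be sketched in a few lines of Python. This is a brute-force illustration under simple assumptions (substitutions only, no insertions or deletions, a single strand); real read mappers use indexed references and far more elaborate scoring. The function name `map_read`, the `max_mismatches` parameter and the toy sequences are hypothetical.

```python
def map_read(reference, read, max_mismatches=1):
    """Place a read on the reference, allowing a few substitutions.

    Returns (position, mismatch_positions) for the placement with the
    fewest mismatches, or None if no placement is within max_mismatches.
    Positions are 0-based offsets into the reference.
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        mismatches = [
            start + k
            for k, base in enumerate(read)
            if reference[start + k] != base
        ]
        if len(mismatches) <= max_mismatches and (
            best is None or len(mismatches) < len(best[1])
        ):
            best = (start, mismatches)
    return best


reference = "ACGTTGCATGCCGATA"   # made-up reference sequence
read = "GCATGTCG"                # differs from the reference at one base
print(map_read(reference, read))
```

The returned mismatch positions are exactly the "interesting spots" mentioned above: places where the sequenced bacterium differs from the reference, provided enough reads support the difference.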

Genome reconstruction of complex microbial communities

What are the added challenges when reconstructing the genomes in a complex microbial community such as your gut microbiome?

  1. There are multiple genomes mixed together.
  2. We do not know which sequence belongs to which genome.
  3. We do not have a reference genome to help us rebuild the genome of every single bacterium in the community.
  4. Even if the sequences have a certain “depth” (i.e. we collect many pieces of the puzzle), we probably have not collected all sequences (i.e. we might remain with missing pieces of the whole image).

Environmental Shotgun Sequencing (ESS). (A) Sampling from habitat; (B) filtering particles, typically by size; (C) DNA extraction and lysis; (D) cloning and library; (E) sequence the clones; (F) sequence assembly.
John C. Wooley, Adam Godzik, Iddo Friedberg, CC BY 2.5, via Wikimedia Commons

To solve this more complex problem, there are several strategies that, once more, resemble what you would instinctively do with a puzzle:

  • If there is some piece of the puzzle we have a reference for, we build that first.
  • If any of the pieces look “alike” (i.e. they are genetically similar, or, in the puzzle metaphor, they maybe have the same colour or pattern), we group them together.
  • If any of the pieces fit together very well (in the bioinformatics jargon, they can be “assembled” together), we assume they belong together.
  • If any of the pieces has a known function (in the puzzle metaphor, the corner or border pieces), we try to infer where they belong.

The problem of missing data

As mentioned, however, we might not have all the pieces we need to fully reconstruct the image. Since this image is the starting point to then investigate the bacterial composition in the sample (who is there) and subsequently their possible function (what they might be doing), take a second to think about the impact of the missing part of the data: aside from hampering a complete understanding of the microbial community, we must also understand that we can describe what we see, but we cannot claim any meaning from what we don’t see. Simply put, if I grab a few socks from my drawer and none of them is red, I cannot conclude that I have no red socks. Why? The overall complexity of the microbial community is too high for our sampling capacity; therefore, we will end up with missing data.


ENAbling easy access to DNA sequence information

The European Nucleotide Archive (ENA) has been launched, consolidating three major sequence resources to become Europe's primary access point to globally comprehensive DNA and RNA sequence information. The ENA is freely available from the European Bioinformatics Institute (EMBL-EBI), part of the European Molecular Biology Laboratory.

Faster and cheaper DNA sequencing has led to previously unimaginable amounts of data being deposited in the public nucleotide sequence databases: today, ENA holds over 20 terabases of nucleotide sequence which, combined with associated information (annotation), occupies 230 terabytes of disk space. Carefully annotated and crosslinked sequence records from the EMBL Nucleotide Sequence Database (EMBL-Bank) form the backbone of the ENA. But importantly, ENA now also provides direct access to raw sequence data: the European Trace Archive contains raw data from electrophoresis-based sequencing machines and was previously maintained at the Wellcome Trust Sanger Institute; the Sequence Read Archive (SRA) is a newly established repository for raw data from next-generation (array-based) sequencing platforms. Improved submission and data-access tools make it easier for ENA's users to share their sequence data.

"Large-scale DNA sequencing was previously the domain of a small number of specialist labs, but next-generation sequencing has made it accessible to the majority of molecular life scientists," explains Graham Cameron, the EMBL-EBI's Associate Director. "The launch of ENA reflects our continuing commitment to promoting scientific progress by providing global access to nucleotide sequence information. This has been central to EMBL's mission since the 1980s when we launched the EMBL Data Library."

Guy Cochrane, who leads the ENA team, stated that "ENA has been designed to provide our users with improved access both to annotated and to raw sequence data through the same user-friendly interface. It provides graphical browsing, web services, text search and a new rapid sequence similarity search. ENA also provides access to related information, with over 190 million cross references to external records, many of which are in other EMBL-EBI data resources."

The ENA team plans to launch many new features for the resource over the next twelve months, including enhancements to the user-friendly browser, improved interactive submissions tools and organism- and project-centred portals into ENA data.

Tim Hubbard, Head of Informatics at the Wellcome Trust Sanger Institute, said: "As major generators of DNA sequence data, it is important to us that the research community has ready access not only to annotated sequence information, but also to raw data. It's great to see the launch of ENA with new interfaces for users to this vast and rapidly growing body of information." Funding for the ENA is provided by EMBL, the Wellcome Trust and SLING, a Framework Programme 7 project coordinated by the EMBL-EBI and funded by the European Commission.


Conclusions

We have shown that there is extensive variation in the level of nucleotide diversity across the genome of an avian species. This variation is seen in autosomal sequence and is thus unrelated to the well-known effects of sex linkage on genetic diversity (Hedrick 2007; Frankham 2012). Linked selection is likely to play a strong role in governing within-genome heterogeneity in diversity levels, with (variation in) recombination rate and density of targets of selection being primary determinants of the extent of linked selection. As far as we are aware, this study is the first to characterize genomewide nucleotide diversity through whole-genome resequencing of a large population sample and then use these real data to simulate how well genetic diversity would be captured by the use of genetic markers. We find that diversity estimation by sequencing a small number of amplicons is bound to be associated with large confidence intervals. Given the heterogeneity in diversity levels across the genome, gathering sequence data from many loci will increase the precision in diversity estimation. Naturally, one could ask whether molecular ecological studies will continue to be based on sequence data from a limited number of loci when genotyping-by-sequencing and whole-genome resequencing become increasingly feasible in many projects. However, even with the use of next-generation sequencing technologies, target capture approaches are cost-effective and can be used for a wide range of applications (Jones & Good 2016).
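
The relationship between the number of loci and the precision of a diversity estimate can be illustrated with a small resampling sketch. It does not reproduce the analysis reported here: the per-window diversity values are synthetic (drawn from a gamma distribution chosen arbitrarily for illustration), and the function names and parameters are hypothetical.

```python
import random
import statistics


def simulate_windows(n_windows: int = 5000, seed: int = 1) -> list:
    """Synthetic per-window nucleotide diversity values (ASSUMPTION: gamma
    distributed with mean 0.004; not taken from any real genome)."""
    rng = random.Random(seed)
    return [rng.gammavariate(2.0, 0.002) for _ in range(n_windows)]


def interval_of_mean(values, n_loci: int, n_reps: int = 2000, seed: int = 2):
    """Approximate 95% interval of the mean diversity obtained when only
    `n_loci` randomly chosen windows are sequenced, from `n_reps` resamples."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.sample(values, n_loci)) for _ in range(n_reps)
    )
    lower = means[int(0.025 * n_reps)]
    upper = means[int(0.975 * n_reps) - 1]
    return lower, upper


windows = simulate_windows()
for n_loci in (5, 50, 500):
    lower, upper = interval_of_mean(windows, n_loci)
    print(n_loci, round(lower, 5), round(upper, 5), "width:", round(upper - lower, 5))
```

With heterogeneous diversity across windows, the interval obtained from a handful of loci is much wider than the one obtained from several hundred, which is the qualitative pattern described above.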

The ability to reliably estimate genetic diversity of different populations is critical for making conclusions about evolutionary processes. To end with an example, Gohli et al. (2013) recently reported an association between genetic diversity and female promiscuity in 18 passerine bird species based on sequence data from five introns (mean length ≈400 bp). One possible explanation for this would be that species with strong sexual selection for compatible genes (i.e. negative frequency-dependent selection for rare or dissimilar alleles) have relatively high levels of genetic diversity. The validity of this result was criticized by Spurgin (2013) on several methodological grounds, including the precision of diversity estimates and the inference of species-level diversity from sampling of individual populations (see also the response to the criticism by Lifjeld et al. 2013). Based on experience from the present study, we note that firmer conclusions could have been reached with more extensive sampling of genomic data, either confirming or disproving the idea of a relationship between genetic diversity and female promiscuity.


Discussion

BlobToolKit is a significant extension of the approach launched in BlobTools. In particular, by permitting user interaction with the rich data associated with each contig in the Viewer mode, BlobToolKit can enhance discovery of novel biology. The addition of real-time interaction addresses a criticism of the approach, relative to cluster-based methods such as Anvi’o (Eren et al. 2015), that it limits the amount of supporting data that can be included (Delmont and Eren 2016). We envisage three main uses for BlobToolKit. The first is in the research laboratory aiming to sequence for the first time the genome of a new species. BlobToolKit can be used during the assembly process, to filter contaminants and cobionts, and to explore issues such as haploid vs. diploid contigs, and patterns of coverage in different sequence read datasets (for example, comparing male and female read sets in heterogametic organisms). As part of an assembly workflow, BlobToolKit should ensure better quality assemblies with higher biological credibility.

The second use is in publication and visualization of published assemblies. The BlobToolKit Viewer generates publication quality images that are fully reproducible via the embedding of control parameters in the URL. These images should, we believe, become standard in reporting genome assemblies, and thus enhance the ease of assessment of assembly quality. We have worked to embed BlobToolKit views into the presentation of genome assemblies at the ENA for just this reason and believe that we have demonstrated that collaboration between tools developers and public databases is important in refining best practice in data publication. Journals may generate (or request that authors supply) BlobToolKit assessments of new assemblies submitted for publication, to aid review and speed publication of high quality data.

The third is in comparative and evolutionary genomics. With ongoing improvements in sequencing technologies and assembly software, genome assemblies are improving in quality and contiguity. Among other players, the Earth Biogenome Project (Lewin et al. 2018), 10K Vertebrate Genome Project (Genome 10K Community of Scientists 2009) and Tree of Life project (https://www.sanger.ac.uk/science/programmes/tree-of-life) collectively aim to generate chromosomally-contiguous reference genomes for (in the first instance) all known families of Eukaryota. BlobToolKit protocols can be used to explore these genomes for evidence of past horizontal gene transfer, for the presence of symbionts and parasites, and to explore chromosomal patterns of gene expression.

The difficulty we experienced in associating raw sequence read sets with submitted assemblies has led ENA to include a more apparent and thorough explanation of the benefits of and process for referencing reads during eukaryotic genome assembly submission to the repository. We advocate the practice of assembly submission along with associated reads to INSDC to enable downstream analysis and assembly contamination detection.

We aim to complete analysis of all public genomes in INSDC and post them to the BlobToolKit Viewer website at https://blobtoolkit.genomehubs.org/view in the near future, and then maintain currency with the flow of new genomes. The toolkit is under active development (see https://github.com/blobtoolkit) and we welcome feature requests and collaborations to expand and improve its capabilities.

