Sequencing inaccurate at the primer site

Each time I have sent a sample for sequencing, both the forward and the reverse primer sites show high inaccuracy, while the rest of the gene is sequenced correctly. Because of this, the sequences of my in silico construct and the sequenced sample do not align in these regions, but they align almost 100% over the rest of the gene.

Is there a reason for this? Is it simply a sequencing artifact, or should I trust the sequenced sample and assume that the primer sites have mutated?

The very extremities of sequencing reads obtained by most, if not all, sequencing technologies are usually of lower quality, though more often so in the 5' region. You should disregard this data or, better yet, design additional primers farther away so that the region you need is covered by the higher-quality middle of a read.

Below is a fairly typical output from FastQC analysis of Illumina sequencing data. You can see how the quality peaks in the middle of the read (base index on the x axis). You would likely see a similar pattern for Sanger sequencing, which is presumably what you are using.
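If you want to inspect this yourself without FastQC, the per-base quality curve can be computed directly from the FASTQ quality strings. Below is a minimal pure-Python sketch; the `example` records are invented for illustration, and real data would be read from your .fastq file.

```python
# Minimal sketch: mean Phred quality per base position, as in FastQC's
# per-base quality plot. The FASTQ text below is an invented example;
# real data would come from a .fastq file.

def mean_quality_per_position(fastq_text):
    """Return a list of mean Phred scores, one per base position."""
    lines = fastq_text.strip().split("\n")
    totals, counts = [], []
    for qual in lines[3::4]:              # every 4th line is the quality string
        for i, ch in enumerate(qual):
            phred = ord(ch) - 33          # Sanger / Illumina 1.8+ offset
            if i >= len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += phred
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

example = (
    "@read1\nACGTACGT\n+\n!!IIII!!\n"
    "@read2\nACGTACGT\n+\n##IIII##\n"
)
# Low quality at both extremities, high quality in the middle:
print(mean_quality_per_position(example))  # [1.0, 1.0, 40.0, 40.0, 40.0, 40.0, 1.0, 1.0]
```

Plotting this list of means against position reproduces the shape of the FastQC per-base quality curve.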

How to Design a Primer

Oligonucleotide primers are necessary for running a PCR reaction, and they must be designed to be complementary to the template region of DNA. Primers are synthesized chemically by joining nucleotides one at a time, which requires repeatedly blocking and unblocking the reactive groups on each nucleotide as it is added. The key property of a primer is that it must correspond to sequences on the template molecule (i.e. be complementary to the template strand). A primer does not need to match the template strand completely; it is essential, however, that the 3’ end of the primer pairs perfectly with the template DNA strand so that elongation can proceed. Usually a guanine or cytosine is placed at the 3’ end, while the 5’ end can tolerate stretches of several mismatched nucleotides. Also, the 3’ ends of the two hybridized primers must point toward one another.

The size of the primer is very important as well. Short primers are mainly used for amplifying small, simple fragments of DNA, whereas longer primers are used to amplify eukaryotic genomic DNA samples. However, a primer should be neither too long (>30-mer) nor too short: short primers produce inaccurate, nonspecific amplification products, and overly long primers hybridize more slowly. On average, the DNA fragment to be amplified should be within 1–10 kb in size.

The structure of the primer should be relatively simple and contain no internal secondary structure, to avoid internal folding. One also needs to avoid primer–primer annealing, which creates primer dimers and disrupts the amplification process. When designing a primer, if unsure about which nucleotide to place at a certain position, one can include more than one nucleotide at that position (termed a mixed site). One can also use a universal base analogue such as inosine instead of a regular nucleotide for broader pairing capability.
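As a rough illustration of these rules, the sketch below screens a candidate primer for length, GC content, a G/C "clamp" at the 3' end, and self-complementarity (primer-dimer risk). The thresholds are assumptions for demonstration, not canonical values; real design tools also check melting temperature and hairpin structures.

```python
# Illustrative primer sanity checks: length, GC content, a 3' G/C
# "clamp" and self-complementarity. Thresholds are assumptions for
# demonstration only.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def check_primer(seq, min_len=18, max_len=30, max_self_match=4):
    seq = seq.upper()
    issues = []
    if not min_len <= len(seq) <= max_len:
        issues.append("length outside %d-%d nt" % (min_len, max_len))
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if not 0.4 <= gc <= 0.6:
        issues.append("GC content %.0f%% outside 40-60%%" % (gc * 100))
    if seq[-1] not in "GC":
        issues.append("no G/C clamp at the 3' end")
    # Crude dimer screen: longest word shared between the primer and
    # its own reverse complement.
    rc = revcomp(seq)
    for k in range(len(seq), max_self_match, -1):
        if any(seq[i:i + k] in rc for i in range(len(seq) - k + 1)):
            issues.append("self-complementary stretch of %d nt" % k)
            break
    return issues

print(check_primer("ATGCGTACGTTAGCATGCAC"))   # flags a 6 nt self-match
```

An empty returned list means the primer passed all of these crude screens; anything flagged deserves a closer look in a dedicated design tool.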


Bacterial 16S ribosomal DNA (rDNA) amplicons have been widely used in the classification of uncultured bacteria inhabiting environmental niches. Primers targeting conservative regions of the rDNAs are used to generate amplicons of variant regions that are informative in taxonomic assignment. One problem is that the percentage coverage and application scope of the primers used in previous studies are largely unknown. In this study, conservative fragments of available rDNA sequences were first mined and then used to search for candidate primers within the fragments by measuring the coverage rate defined as the percentage of bacterial sequences containing the target. Thirty predicted primers with a high coverage rate (>90%) were identified, which were basically located in the same conservative regions as known primers in previous reports, whereas 30% of the known primers were associated with a coverage rate of <90%. The application scope of the primers was also examined by calculating the percentages of failed detections in bacterial phyla. Primers A519–539, E969–983, E1063–1081, U515 and E517, are highly recommended because of their high coverage in almost all phyla. As expected, the three predominant phyla, Firmicutes, Gemmatimonadetes and Proteobacteria, are best covered by the predicted primers. The primers recommended in this report shall facilitate a comprehensive and reliable survey of bacterial diversity in metagenomic studies.

Citation: Wang Y, Qian P-Y (2009) Conservative Fragments in Bacterial 16S rRNA Genes and Primer Design for 16S Ribosomal DNA Amplicons in Metagenomic Studies. PLoS ONE 4(10): e7401.

Editor: Dawn Field, NERC Centre for Ecology and Hydrology, United Kingdom

Received: June 23, 2009 Accepted: September 13, 2009 Published: October 9, 2009

Copyright: © 2009 Wang, Qian. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: KAUST Global Partnership. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.
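The coverage-rate statistic used in the study above, i.e. the percentage of reference sequences containing a primer's target site, can be sketched as follows. The sequences and primer here are toy strings, not real 16S rRNA data, and the mismatch tolerance is an assumption.

```python
# Sketch of the coverage-rate statistic: the fraction of reference
# sequences that contain the primer's target site, allowing a small
# number of mismatches. Toy data, not real 16S sequences.

def coverage_rate(primer, sequences, max_mismatches=1):
    """Fraction of sequences matching the primer within max_mismatches."""
    def matches(seq):
        for i in range(len(seq) - len(primer) + 1):
            window = seq[i:i + len(primer)]
            if sum(a != b for a, b in zip(primer, window)) <= max_mismatches:
                return True
        return False
    return sum(matches(s) for s in sequences) / len(sequences)

refs = ["GGATTAGATACCCTGGTAGTCC",   # perfect target site
        "GGATTAGAAACCCTGGTAGTCC",   # one mismatch in the site
        "TTTTTTTTTTTTTTTTTTTTTT"]   # no target site
print(coverage_rate("GATTAGATACCC", refs))  # 2 of 3 -> 0.666...
```

A primer would be "highly recommended" in the paper's sense when this fraction exceeds 0.9 across the reference set.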


Description of barcoding assay

Our approach to quantify DNA polymerase fidelity is illustrated in Figure 1. A pool of templates with identical sequences undergoes one round of extension with the polymerase of interest (Figure 1A). The primers (Figure 1B) contain a randomized 12 bp barcode sequence to tag each product with a unique ‘product barcode’. The primers also contain a ‘condition barcode’ unique to each reaction condition, allowing multiple reactions to be pooled and sequenced simultaneously. After extension by the polymerase of interest, the complementary strand is synthesized by a high-fidelity polymerase using a primer of the same structure. This complementary strand is then PCR amplified using primers complementary to the partial Illumina adapters on both ends of the product, generating a library with multiple barcoded copies of each original product. After paired-end sequencing, the reads are grouped according to the product barcodes on both ends (Figure 1C). Errors generated in the initial extension by the polymerase of interest should be present in all copies of the product. Sequencing errors can be recognized because they are most likely present in only a fraction of copies and can therefore be eliminated. After sequencing errors are filtered out, the DNA polymerase errors are obtained for each product. This approach is not subject to PCR quantification biases because error rates are quantified using the number of unique products and not their final amplified amount.
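A minimal sketch of this consensus logic, with invented reads: errors shared by every copy of a barcoded product are kept as polymerase errors, while errors seen in only some copies are discarded as sequencing errors.

```python
# Sketch of barcode-based consensus error calling. Reads sharing a
# product barcode are copies of one original product, so a true
# polymerase error appears in all copies; a sequencing error appears
# in only some. The reference and reads are invented.

from collections import defaultdict

REFERENCE = "ACGTACGT"

def call_polymerase_errors(reads, min_copies=2):
    """reads: list of (product_barcode, sequence). Returns
    {barcode: [(position, ref_base, read_base), ...]} keeping only
    errors shared by ALL copies of a product."""
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)
    calls = {}
    for barcode, seqs in groups.items():
        if len(seqs) < min_copies:
            continue  # single copies cannot separate sequencing errors
        shared = None
        for seq in seqs:
            errs = {(i, REFERENCE[i], b)
                    for i, b in enumerate(seq) if b != REFERENCE[i]}
            shared = errs if shared is None else shared & errs
        calls[barcode] = sorted(shared)
    return calls

reads = [("BC1", "ACGTACGA"),   # polymerase error at position 7
         ("BC1", "ACGTACGA"),
         ("BC1", "ACTTACGA"),   # plus a sequencing error at position 2
         ("BC2", "ACGTACGT")]   # single copy: discarded
print(call_polymerase_errors(reads))
```

The sequencing error at position 2 is present in only one of three BC1 copies and is therefore dropped, while the position-7 error survives the intersection.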

Schematic of barcoding strategy. (A) Workflow to generate products for paired-end sequencing. The pool of templates is replicated using the polymerase of interest. The complementary strands are then synthesized using a high-fidelity polymerase. In both cases, a special primer (green and orange) containing a partial Illumina adapter, a random product barcode and a condition barcode is used. Primers complementary to the partial Illumina adapter are used to PCR amplify the complementary strands, forming the sequencing library. Each amplification product is tagged with a unique set of product barcodes that indicates its origin. (B) The special primer contains a part of the Illumina sequencing adapter, a ‘condition barcode’ that is unique to each reaction, a 12 bp randomized ‘product barcode’ that uniquely tags each product and the priming sequence for the region of interest. (C) After sequencing, reads are grouped according to condition barcode and product barcode. Sequences are aligned to the correct sequence and errors are called. Errors are kept only if they are present in all copies; otherwise they are discarded as sequencing error.

In addition to removing sequencing errors, we took measures to minimize other sources of false positives throughout the protocol. We generated the starting templates by clonally amplifying a plasmid containing the template sequence in E. coli. E. coli replication has a low error rate of about 1 × 10 −9 errors per base pair per replication ( 27) and generates a homogeneous starting template pool. We also minimized errors during synthesis and PCR amplification of the complementary strand by using Q5 DNA polymerase (Q5), the highest-fidelity DNA polymerase available.

Sequencing error removal and DNA polymerase error rate quantification

To test if the barcoding strategy could reduce sequencing errors, we determined the error rate of Q5 DNA polymerase (Q5) when different numbers of product copies were used to filter sequencing errors. To do this, we grouped products according to the number of copies captured by sequencing and determined the error rate as a function of copy number. Since Q5 was used for both initial extension and complementary strand synthesis, the true error rate was calculated as half the recorded value. These error rates were averaged over two template sequences: a 188 base sequence within the chloramphenicol acetyltransferase (Cm R ) gene of the pBeloBac11 plasmid vector and a 281 base sequence within the LacZα gene of the pOPINP plasmid vector (Supplementary Tables S2 and S3). One replicate over the LacZα locus was excluded because the template showed evidence of DNA damage (Supplementary Figure S1).

For products with only one copy, sequencing errors and polymerase errors could not be separated and the recorded error rate was 1.3 × 10 −4 substitutions/bp (Figure 2). For products with 2 copies, the error rate decreased to 5.6 × 10 −6 substitutions/bp because sequencing errors were removed. When more copies were present, the substitution error rate decreased further to 4.4 × 10 −6 substitutions/bp for 5 copies. Sequencing generated deletions and insertions were also removed, with these error rates decreasing from 0.99 × 10 −5 deletions/bp and 2.2 ×10 −7 insertions/bp at 1 copy to no detected deletions or insertions at 5 copies (Figure 2). This shows that our barcoding method successfully allows the separation and removal of sequencing errors.
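The copy-number analysis can be sketched as below. The product counts are invented; the halving reflects the fact that Q5 performed both the initial extension and the complementary-strand synthesis, so roughly half of the recorded errors are attributable to each step.

```python
# Sketch of the copy-number analysis: bin products by how many
# sequenced copies they have and compute an error rate per bin. The
# recorded rate is halved because Q5 performed both the initial
# extension and the complementary-strand synthesis. Demo numbers are
# invented.

from collections import defaultdict

def error_rate_by_copies(products, template_len, q5_both_steps=True):
    """products: list of (copy_count, n_consensus_errors) per product.
    Returns {copy_count: errors per base}."""
    errors, bases = defaultdict(int), defaultdict(int)
    for copies, n_err in products:
        errors[copies] += n_err
        bases[copies] += template_len
    factor = 0.5 if q5_both_steps else 1.0
    return {c: factor * errors[c] / bases[c] for c in errors}

# 1-copy products cannot be error-filtered, so their apparent rate is
# inflated by sequencing errors; multi-copy rates approach the true rate.
demo = [(1, 5), (1, 4), (2, 1), (2, 0), (5, 0)]
print(error_rate_by_copies(demo, template_len=188))
```

The 188-base template length mirrors the Cm R locus used in the paper; the decreasing rate with copy number reproduces the trend in Figure 2 qualitatively, not the actual values.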

Q5 error rate as a function of product copy number. As the number of product copies used for comparison is increased, sequencing errors are increasingly eliminated. Error bars indicate standard error.

We measured the average error rates of 3′→5′ exonuclease deficient Klenow Fragment (Klenow (exo-)), Taq, E. coli Y-family DNA Polymerase IV (Pol IV) and Q5 over the Cm R , LacZα (-) strand, and LacZα (+) strand loci (Supplementary Tables S4 and S5) for comparison with published values. To minimize sequencing errors while maximizing product number, we analyzed products with 3 or more copies. The Q5 error rate was subtracted from our measurements to account for errors made during complementary strand synthesis. We measured the error rates over two technical replicates, where replicates were done by taking aliquots from the template pool and conducting the assay on each in parallel. We compared our results with those previously collected using denaturing gradient gel electrophoresis (DGGE) ( 23), sequencing of mutant bacterial colonies with LacZα forward mutation selection ( 11, 28–31), or direct sequencing of bacterial colonies without phenotypic selection ( 10, 32) (Supplementary Table S6). Our error rates for Klenow (exo-) and Taq were similar to both DGGE and direct sequencing values (Figure 3). In contrast, our error rates for Klenow (exo-), Taq, Pol IV and Q5 were on average 7 times higher for substitutions and 3 times higher for deletions than the LacZα forward mutation assay values. To investigate the cause of this difference, we considered in greater detail the case of Pol IV replication over the LacZα (+)-strand. We first modified the buffer conditions for Pol IV replication over the LacZα (+)-strand to match that in the corresponding forward mutation assay study (see Materials and Methods for buffer composition). This reduced the error rates from 1.06 × 10 −3 sub/bp and 1.3 × 10 −3 del/bp to 4.6 × 10 −4 sub/bp and 6.3 × 10 −4 del/bp. We then removed the errors which are not phenotypically detectable ( 11), further reducing the error rate to 3.2 × 10 −4 sub/bp and 4.6 × 10 −4 del/bp (Supplementary Methods). 
This overall 3-fold reduction shows that extension conditions and phenotypic detectability have a significant influence on error rates and may partially account for the discrepancy. However, since some LacZα forward mutation assay results were up to 20 times lower than our measurements, a significant deviation still remains. Thus, our measurements corresponded with results from other non-phenotypic techniques but appeared to deviate from those of the mutation assay.

Comparison of error rates from our barcoded sequencing assay (red circle) with results from denaturing gradient gel electrophoresis (DGGE) (black triangles), sequencing of mutant bacterial colonies with LacZα forward mutation selection (black circles), or direct sequencing of bacterial colonies without phenotypic selection (black squares) (see Supplementary Table S6). For our assay values, the red points are means and the red bars indicate the standard error.

Variations in error rate and identification of error hotspots

DNA polymerase fidelity varies across a template. To investigate this variation and its reproducibility, we mapped the frequency of single-base substitutions for Pol IV (39 641 substitutions in 163 949 products) and Klenow (exo-) (4046 substitutions in 122 846 products) replication over the LacZα locus (−) strand for two technical replicates each (Supplementary Table S4). The error spectra (Figure 4A and Supplementary Figure S2) illustrate that error rate varied substantially from base to base along the template. The variation ranged over two orders of magnitude, and the error rates at each base were reproducible between replicates for Pol IV (Pearson ρ = 0.97, P < 0.01) and Klenow (exo-) (Pearson ρ = 0.74, P < 0.01) (Figure 4B), indicating that the variation was not due to sampling noise. Pol IV and Klenow (exo-) were strikingly different in their error spectra (Spearman ρ = 0.13, P < 0.01) (Figure 4C), suggesting that these variations were polymerase specific. To demonstrate how sampling noise would make these variations difficult to characterize accurately, we randomly under-sampled the error spectra to between 50 and 2000 errors and compared them to the original. We repeated this 100 times to obtain the average similarity at each error sampling number. The error rates averaged over the entire template remained similar (Supplementary Figure S3), but the correlation between error rates at each base improved from ρ ∼0.30 at 50 errors to ρ ∼0.95 at 2000 errors (Figure 4D). Next, to identify error hotspots, we fit the distribution of substitution errors per base position to a combination of Poisson distributions using the computer-assisted analysis of mixture distributions (C.A.MAN) package ( 33) (Figure 5A and Supplementary Figure S4). Hotspots were first defined as those positions that deviated significantly from the fitted distribution (P < 0.05 with Benjamini–Hochberg correction assuming independence).
This yielded 1 error hotspot in the Klenow (exo-) spectrum at position 260 which was 13 times more error prone than average (Supplementary Figure S2). Under-sampling made this hotspot difficult to distinguish, as even with 500 errors the hotspot was only called in 50% of under-sampling replicates (Figure 5B). If the definition of ‘hotspot’ was relaxed to include all positions that belong to the Poisson distribution with the highest mean error parameter, we could identify 2 hotspots at positions 106 and 132 in the Pol IV spectrum that were 5 times more error prone than average, and 2 hotspots at positions 80 and 260 in the Klenow (exo-) spectra that were 5 and 13 times more error prone than average (Supplementary Figure S2). To rule out the possibility that the error spectra of Pol IV and Klenow (exo-) were different because their extension buffers were different, we repeated the analysis after using Pol IV buffer for both polymerases. As before, there was a poor correlation (Spearman ρ = 0.31, P < 0.01) and no overlap in hotspots. In summary, we were able to identify variations in error rate that are both reproducible and polymerase specific.
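The under-sampling experiment can be mimicked on a synthetic spectrum: draw a subset of errors, rebuild the per-position counts, and correlate them with the original. The spectrum below is randomly generated for illustration, not the paper's data.

```python
# Sketch of the under-sampling analysis: correlation between an error
# spectrum and an under-sampled copy improves as more errors are drawn.
# The spectrum is synthetic.

import random

random.seed(0)

# Synthetic per-position error counts over a 50-base template.
spectrum = [random.randint(1, 50) for _ in range(50)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def undersample(spectrum, n_errors):
    """Draw n_errors errors, position-weighted by the original counts."""
    drawn = random.choices(range(len(spectrum)), weights=spectrum, k=n_errors)
    counts = [0] * len(spectrum)
    for p in drawn:
        counts[p] += 1
    return counts

avg_corr = {}
for n in (50, 2000):
    reps = [pearson(spectrum, undersample(spectrum, n)) for _ in range(20)]
    avg_corr[n] = sum(reps) / len(reps)
    print(n, round(avg_corr[n], 2))  # correlation improves with more errors
```

This reproduces the qualitative trend of Figure 4D: with only ∼50 errors the per-position spectrum is dominated by sampling noise, while thousands of errors recover it faithfully.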

Variations in substitution error rate when replicating the LacZα (−) strand template. (A) A snapshot of the first 30 bases of the replication product for Pol IV and Klenow (exo-) illustrates variations in substitution error rate across the template. The length of the bar indicates the error rate. (B) Correlation plots of error rates at each base position between technical replicates for Pol IV replicates (21 212 and 18 429 mutations) and Klenow (exo-) (1973 and 2073 mutations) show that the variations were reproducible and hence not due to sampling noise. R1 and R2 designate the first and second replicates respectively. An outlier for Klenow (exo-) at 1.66 × 10 −3 (replicate 1) and 2.04 × 10 −3 errors/bp (replicate 2) was excluded from analysis. (C) Correlation plots of error ranks for Pol IV versus Klenow (exo-) show that error spectra are strikingly different depending on the polymerase. (D) Pearson correlation coefficient between the original error spectrum and an under-sampled copy as the number of errors per under-sampled copy is changed. Average of 100 repeats at each error number.

Identification of substitution error hotspots. (A) Histograms showing the distribution of substitution errors per base position for Pol IV and Klenow (exo-) replication across the LacZα (−) strand template. The histograms are fit to a combination of Poisson distributions (red) using the C.A.MAN package. Hotspots and their positions are indicated. The red-lettered hotspot was identified as being exceptional to the fitted distribution (α < 0.05, Benjamini–Hochberg corrected). The black-lettered hotspots were identified under a more relaxed definition, which included all positions that belonged to the Poisson distribution with highest mean error parameter. (B) Frequency with which the Klenow (exo-) position 260 hotspot is identified when the original spectrum is under-sampled. Average of 100 repeats at each error sampling number.

To demonstrate the advantages of our assay in characterizing error spectra, we mapped the single-base substitution spectrum of Pol IV replicating over the LacZα locus (+) strand (837 substitutions in 6578 products) and compared our results to those mapped using the LacZα forward mutation assay (66 substitutions) ( 31) (Supplementary Figure S5). Our extension was performed in the same buffer as that reported in the forward mutation assay. The correlation between the error profiles was quite poor even after limiting our spectrum to phenotypically detectable errors (Spearman ρ = 0.30, P < 0.01), probably due to the small sample size of the forward mutation assay spectrum. Furthermore, whereas three error hotspots (under the relaxed definition) could be identified at positions 23, 95 and 270 from our data, only 1 error hotspot at base position 95 could be identified from the mutation assay data (Supplementary Figure S4) because the other two hotspots are not phenotypically detectable. This comparison illustrates the importance of a high-throughput and non-phenotypic method in error profiling and hotspot identification.

Impact of a DNA lesion on fidelity

DNA lesions compromise the fidelity and replication kinetics of DNA polymerases, but cells contain translesion synthesis polymerases that are specially adapted to synthesize across lesions. To test if our assay can correctly measure the effects of a lesion on DNA polymerase fidelity, we investigated the error rate of lesion bypass by Pol IV, a translesion synthesis polymerase, over an N 2 -furfuryl-dG lesion. The N 2 -furfuryl-dG lesion is a structural analogue of the main lesion formed in cells treated with nitrofurazone, and Pol IV has been shown to bypass this lesion effectively ( 34). We compared the error spectrum of Pol IV replicating across the N 2 -furfuryl-dG lesion to that of Pol I, an accurate 3′→5′ exonuclease-proficient replicative polymerase, as well as Klenow (exo-). The lesion-containing (‘damaged’) substrate and the lesion-free (control) substrate were made by ligating 20-base oligonucleotides into the M13mp7(L2) plasmid (see Materials and Methods). To account for false positives due to oligonucleotide synthesis errors, we subtracted the control substrate error spectra from the damaged substrate error spectra.
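The background subtraction can be sketched as below: the error spectrum measured on the lesion-free control substrate is subtracted from the damaged-substrate spectrum, so that oligonucleotide synthesis errors common to both are not counted as lesion-induced. The rates and positions are invented for illustration.

```python
# Sketch of control-spectrum subtraction. Both spectra map
# (position, mismatch) -> error rate; rates below are invented.

def subtract_control(damaged, control):
    """Return the background-corrected spectrum, floored at zero."""
    keys = set(damaged) | set(control)
    return {k: max(0.0, damaged.get(k, 0.0) - control.get(k, 0.0))
            for k in keys}

damaged = {(10, "G>T"): 1.2e-2, (11, "A>G"): 3.0e-4}
control = {(11, "A>G"): 2.8e-4}          # synthesis artifact, not lesion
corrected = subtract_control(damaged, control)
print(corrected[(10, "G>T")])            # lesion-specific error survives
```

Flooring at zero prevents positions where the control happens to have more errors than the damaged substrate from producing negative rates.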

All three polymerases had increased substitution error rate when synthesizing DNA across the lesion. As expected, Pol IV had the lowest error rate of 1.27 × 10 −2 substitutions/bp (Table 1), while Pol I and Klenow (exo-) were much more error prone and made errors at rates of 1.25 × 10 −1 substitutions/bp and 1.93 × 10 −1 substitutions/bp, respectively. The relatively low error rate of Pol IV is consistent with previous kinetic measurements for DNA replication across the N 2 -furfuryl-dG lesion, which showed that Pol IV incorporates the correct base at a greater kinetic rate than Pol I ( 34). The dominant error type was different among the DNA polymerases. The G*·dTTP mismatch (C→T transition) was most common for Pol IV at 1.13 × 10 −2 occurrences/bp and also for Pol I at 1.17 × 10 −1 occurrences/bp. This result matches with previous kinetic measurements for Pol IV, which report that thymine was incorporated opposite the lesion at the greatest rate ( 34) (Supplementary Table S7). In contrast, the Klenow (exo-) substitution spectrum was dominated by the G*·dGTP mismatch, which occurred at a rate of 1.42 × 10 −1 errors/bp. We also characterized the fidelity of replication at bases adjacent to the lesion site (Supplementary Figure S6). Although it appears that replication fidelity is reduced in the proximity of the lesion, there are error hotspots in both the lesion and control spectra that seem to be caused by inconsistencies in oligonucleotide synthesis or damage from template preparation. Overall, the error rate at the lesion can be measured with confidence, but characterization of fidelity around the lesion requires additional investigation.

Error rates for each DNA polymerase when replicating across the N 2 -furfuryl-dG lesion

Error rate at the N 2 -furfuryl-dG lesion (errors/bp):

Type of mutation   DNA Polymerase IV   Klenow (exo-)   DNA Polymerase I
G*·dTTP            1.13 × 10 −2        4.47 × 10 −2    1.17 × 10 −1
G*·dATP            1.30 × 10 −3        6.17 × 10 −3    1.11 × 10 −3
G*·dGTP            8.58 × 10 −5        1.42 × 10 −1    7.33 × 10 −3
Total              1.27 × 10 −2        1.93 × 10 −1    1.25 × 10 −1


We showed that multiplexing of amplicon libraries for studying diversity in metagenomic samples is prone to intractable cross-contamination events due to the mistagging phenomenon. Single-tagging as well as saturated double-tagging strategies are flawed by numerous, undetectable critical mistags. We showed that non-combinatorial designs minimize the occurrence of critical mistags by forcing mistagged reads into unexpected tag combinations, where they can be detected, but offer poor multiplexing capacity. The LSD represents the optimal trade-off between the error-minimisation ability of non-combinatorial designs ( 7) and the multiplexing capacity of the SAD ( 32). The LSD enforces non-critical mistags in designs defined by a number of possible combinations and a number of samples, as shown by the comparison with the SAD at identical saturation levels. For example, with an LSD involving 30 forward and 30 reverse primers used 10 times each, one could multiplex 300 samples (or 100 samples in triplicate) at a saturation level of only 33%. This provides a perfect framework for our filtering method based on non-critical mistag information, as well as a considerable saving of time and money. Moreover, limiting the number of deployed tagged primers reduces tag-misidentification problems through the design of highly variable tag sequences ( 9), as well as the risk of cross-contamination during handling.
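The multiplexing arithmetic above can be checked with a short sketch. The Latin-square-style tag assignment below is a generic construction illustrating the idea of reusing each tag without ever reusing a pair; it is not necessarily the exact layout used in the study.

```python
# Worked check of the 30 x 30 example: 300 samples over 900 possible
# tag pairs is a saturation of 1/3. The assignment function is a
# generic Latin-square-style construction for illustration.

def saturation(n_forward, n_reverse, n_samples):
    """Fraction of possible tag pairs actually assigned to samples."""
    possible = n_forward * n_reverse
    assert n_samples <= possible, "more samples than tag combinations"
    return n_samples / possible

def lsd_assignment(n_tags, n_blocks):
    """Assign (forward, reverse) tag pairs so each tag is reused
    n_blocks times while no pair is ever reused."""
    pairs = [(f, (f + b) % n_tags)
             for b in range(n_blocks) for f in range(n_tags)]
    assert len(pairs) == len(set(pairs))   # all pairs are unique
    return pairs

pairs = lsd_assignment(30, 10)             # 30 tags, each used 10 times
print(len(pairs), saturation(30, 30, len(pairs)))  # 300 0.3333333333333333
```

Because two thirds of the tag pairs are never assigned to a sample, most mistagged reads fall into unused combinations where they can be detected and counted.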

The magnitude of the mistagging phenomenon far exceeds the expectations of previous studies relying on tagged primer constructs ( 6, 9). It has been proposed that one of the major sources of mistagging is primer cross-contamination ( 6, 8). However, we obtained no positives in 60-cycle PCR tests involving only one of two tagged primers in the mix, and in an additional library containing more than 300 000 reads, we found only 0.096% of them labelled with at least one of nine tagged primers left untouched out of 40 ordered primers (data not shown). Therefore, the impact of primer cross-contamination seems negligible and is only visible because of the sequencing depth of HTS. To some extent, specially purified primers can alleviate this source of mistagging, but at high expense for numerous samples, and even then without removing it entirely ( 9).

Our study shows clearly that mistagging events mainly occur during the PCR performed on the pool of labelled amplicons. This is demonstrated by the fact that the clones contaminating a sample originate from the other samples multiplexed within the same library. PCR-free library-preparation methods are promising but necessitate high amounts of input DNA. This could be achieved by multiplexing more samples or non-homologous material such as a species' genome or transcriptome ( 33). It has been demonstrated that the frequency of chimera formation is inversely related to the complexity of the sequence sample subjected to PCR ( 34). Therefore, multiplexing non-homologous PCR products prior to library-preparation PCR increases sequence diversity and reduces the impact of chimeras, which are probably responsible for the recombination of fragment ends, where the tags are located. Hence, with more complex environmental DNA samples it could be predicted that chimera-driven mistags would be less prominent. However, they might still occur at rates comparable to the chimeras usually observed in environmental samples. This may explain what we observed in the LSD library, where both foraminiferal and eukaryotic PCR products were multiplexed. Alternatively, pooling high quantities of PCR products can increase the amount of input DNA. However, this second solution is risky because PCR products are stable laboratory contaminants that are readily detected under HTS conditions ( 35). Although appealing indexing library-preparation methods are flourishing, it is wise to label PCR products during the first amplification in order to be able to trace potential contaminants.

This first PCR enriches a specific diversity from complex samples, but also creates biases responsible for the inflation of diversity estimates ( 36, 37) and the introduction of artefactual variability across samples ( 38). To correct such biases, internal controls such as co-sequenced mock community samples can be employed ( 33), but their suitability depends on their complexity ( 39). Instead, our filtering method does not require the addition of a supplementary sample and rather relies directly on the properties of the data itself. Moreover, it is particularly suited to HTS, as its statistical power increases with the amount of sequence data. Indeed, a higher amount of non-critical mistags provides a finer resolution in the detection and removal of critical mistags.

Theoretically, each species genome template should produce exactly one ISU, including the polymorphic copies of a gene. Our filter operates at such a resolution because it is ISU-centred, i.e. it computes the read abundance distribution across samples of each ISU independently. Hence, it does not rely on a unique abundance threshold applied over all samples but computes a different threshold for each ISU in each sample, accounting for differential sample sequencing depths ( 40, 41). Moreover, our filter requires no tuning of subjective parameters that would result in different sets of arbitrary thresholds ( 19, 20). Being completely parameter-free, our approach has the considerable advantage of allowing the establishment of synoptic models and thus more comparable diversity analyses.
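The ISU-centred thresholding described above can be sketched in a few lines (an illustrative simplification, not the published implementation; the quantile rule and the data layout are assumptions made for this example):

```python
# Sketch: ISU-centred mistag filtering (illustrative, simplified).
# counts[isu][sample] = read count of that ISU in that sample;
# mistag_counts[isu] = read counts of that ISU observed in unused
# (non-critical) tag combinations, which estimate the mistag noise.

def isu_thresholds(mistag_counts, quantile=0.99):
    """One abundance threshold per ISU, taken from its own mistag noise."""
    thresholds = {}
    for isu, noise in mistag_counts.items():
        ranked = sorted(noise)
        # keep only occurrences above the chosen quantile of the noise counts
        idx = min(int(len(ranked) * quantile), len(ranked) - 1)
        thresholds[isu] = ranked[idx] if ranked else 0
    return thresholds

def filter_counts(counts, thresholds):
    """Drop occurrences at or below the ISU-specific threshold."""
    filtered = {}
    for isu, per_sample in counts.items():
        t = thresholds.get(isu, 0)
        filtered[isu] = {s: c for s, c in per_sample.items() if c > t}
    return filtered
```

Because each ISU's threshold comes only from its own mistag counts, a deeply sequenced ISU is not penalized by a global cut-off, which is the point of the ISU-centred design.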

The robustness of our approach is greatly reinforced by the incorporation of PCR replicates. As shown by our study, replicates labelled using non-combinatorial tag pairs are less prone to cross-contamination by identical mistags or to the accumulation of random errors. Indeed, the probability of such co-occurring events corresponds to the product of the probabilities associated with each replicate. The importance of technical replicates has been emphasized for the filtering of erroneous sequences ( 23, 42). One approach is to focus on the union of replicates, assuming that the full sample complexity is missed by individual PCRs ( 43) and because arbitrary abundance-based filtering can remove many rare genuine species ( 44, 45). Another approach is to analyse the diversity at the intersection of replicates, assuming that genuine species are detected in every PCR. Even with as little as 17% of diversity shared among replicates ( 46), this conservative assumption has been corroborated previously ( 23) and by our own results, although we cannot exclude false-positive artefacts due to the size of our mock community samples. The incorporation of PCR replicates in a multiplexing design is not trivial under this assumption, since (i) more replicates may result in more mistags and (ii) the same chimeras are likely to arise across replicates because the initial sequence diversity is similar in the replicates of a sample ( 34, 47). A trade-off between the number of samples and the amount of replication must be considered to ensure that rare species sequences are not filtered out of the replicates. Finally, the same caution towards mistagging, and the same alleviating measures, should also be applied to non-environmental studies pooling tagged specimens ( 48).
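The replicate argument can be made concrete with a short sketch (a toy illustration only; the error probabilities are invented numbers, and a real pipeline would operate on read counts rather than bare presence/absence):

```python
# Sketch: keeping only ISUs detected in every PCR replicate of a sample,
# and the probability argument for why the intersection is robust.

def intersect_replicates(replicates):
    """replicates: list of sets of ISU identifiers, one set per PCR replicate.
    Genuine species are assumed to be detected in every replicate."""
    kept = set(replicates[0])
    for rep in replicates[1:]:
        kept &= set(rep)
    return kept

def joint_error_probability(probs):
    """If an identical error ISU arises independently in each replicate with
    probability p_i, the chance it survives the intersection is the product."""
    joint = 1.0
    for p in probs:
        joint *= p
    return joint
```

For example, an artefact appearing in each of two replicates with (hypothetical) probability 0.01 survives the intersection with probability 0.01 × 0.01 = 10⁻⁴.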

In conclusion, we propose a few recommendations to increase the accuracy of HTS data sets based on multiplexed amplicon libraries:

Proscribe single-tagging and saturated double-tagging designs.

Choose tagged primer combinations according to LSD to maximize the mistagging information.

Minimize the sample saturation to reduce the proportion of critical mistags.

Incorporate at least two PCR replicates to remove error ISUs.

Label PCR replicates with tagged primers used only once to avoid inter-replicate mistags.

Use a parameter-free, data-driven and ISU-centred filtering approach.

Avoid long primer constructs for multi-species samples.

Some of these recommendations can be easily implemented. We provide an LSD generator to assist the design of double-tagging strategies and a filter accounting for mistagging patterns and PCR replicates. Our approach allows accurate HTS data denoising and preserves both the relative abundance and the occurrence of rare, genuine sequence templates. We are confident that associating robust experimental planning with powerful sequence-data filtering is the condicio sine qua non of comprehensive surveys requiring the deployment of numerous samples and replicates.
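To illustrate the Latin Square idea behind such double-tagging designs, the toy generator below lays out forward × reverse tag combinations so that each forward tag, reverse tag and square symbol is used exactly once per sample (a hypothetical sketch, not the LSD generator distributed with this work):

```python
# Sketch: Latin-square layout of forward x reverse tag combinations.
# Each cell holds a symbol; placing samples so that no row, column or
# symbol is reused keeps the design unsaturated and leaves many unused
# tag pairs free to reveal mistagging events.

def latin_square(n):
    """n x n Latin square: symbol (r + c) % n at row r, column c."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]

def sample_layout(n):
    """Assign sample s to diagonal cell (s, s): every forward tag (row),
    reverse tag (column) and symbol is then used exactly once."""
    square = latin_square(n)
    return [(s, s, square[s][s]) for s in range(n)]
```

Any read arriving with a forward/reverse tag pair outside the chosen cells is then, by construction, an identifiable mistag.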

The authors thank Simon Gregory, Sev Kender and Juan Montoya for fruitful comments on the manuscript as well as anonymous referees for helpful comments, and Fasteris SA for sequencing services.


Long-read/third-generation sequencing technologies are causing a new revolution in genomics as they provide a way to study genomes, transcriptomes, and metagenomes at an unprecedented resolution.

SMRT and nanopore sequencing allow for the first time the direct study of different types of DNA base modifications.

Moreover, nanopore technology can sequence RNA directly and identify RNA base modifications.

Owing to the portability of the MinION and the existence of extremely simple library preparation methods, nanopore technology makes high-throughput sequencing possible for the first time in the field and in remote places. This is of tremendous importance for the surveillance of outbreaks in developing countries.

Forty years ago the advent of Sanger sequencing was revolutionary as it allowed complete genome sequences to be deciphered for the first time. A second revolution came when next-generation sequencing (NGS) technologies appeared, which made genome sequencing much cheaper and faster. However, NGS methods have several drawbacks and pitfalls, most notably their short reads. Recently, third-generation/long-read methods appeared, which can produce genome assemblies of unprecedented quality. Moreover, these technologies can directly detect epigenetic modifications on native DNA and allow whole-transcript sequencing without the need for assembly. This marks the third revolution in sequencing technology. Here we review and compare the various long-read methods. We discuss their applications and their respective strengths and weaknesses and provide future perspectives.

3 Main Enzymes of DNA Replication | Cell Biology

A primase is an enzyme which makes the RNA primers required for the initiation of Okazaki fragments on the lagging strand. Primase activity requires the formation of a complex of primase and at least six other proteins. This complex is called the primosome.

The primosome contains pre-priming proteins—arbitrarily called proteins i, n, n′ and n″—as well as the products of genes dnaB and dnaC. The primosome carries out the initial priming for the leading strand, where synthesis takes place continuously in the overall 5′ to 3′ direction.

It also carries out the repeated priming of Okazaki fragment synthesis on the lagging strand, where synthesis occurs discontinuously in the overall 3′ to 5′ direction.

The primase shows a very strong preference for initiating with adenosine followed by guanosine, suggesting that initiation of Okazaki fragments may occur at particular sites on the lagging strand. However, the small phage P4, which needs only about 20 Okazaki fragments per round of replication, shows no preferential initiation sites.

The primase that is tightly associated with the eukaryotic DNA polymerase α is made of two subunits and shows no stringent sequence requirements. But it does not act at random.

Enzyme # 2. DNA Polymerase:

DNA polymerase is an enzyme that makes new DNA on a template strand. Both prokaryotic and eukaryotic cells contain more than one species of DNA polymerase. Only some of these enzymes actually carry out replication, and these are sometimes designated DNA replicases. The others play subsidiary roles in replication and/or participate in repair synthesis of DNA to replace damaged sequences.

DNA polymerase catalyses the formation of a phosphodiester bond between the 3′ hydroxyl group at the growing end of a DNA chain (the primer) and the 5′ phosphate group of the incoming deoxyribonucleoside triphosphate (Fig. 20.8).

Growth is in the 5′ → 3′ direction, and the order in which the deoxyribonucleotides are added is dictated by base pairing to a template DNA chain. Thus, besides four types of deoxyribonucleotides and Mg++ ions, the enzyme requires both primer and template DNA (Figs. 20.9 and 20.10). No DNA polymerase has been found which is able to initiate DNA chains.
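The template-and-primer requirement can be illustrated with a toy model (a sketch only; real polymerases act on chemistry, not strings, and the function name here is invented):

```python
# Toy sketch of template-directed 5'->3' primer extension: the polymerase
# can only extend an existing primer, adding at each step the base
# complementary to the template; it cannot start a chain on its own.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def extend_primer(template_3to5, primer_5to3):
    """template_3to5: template read 3'->5'; the primer must pair with its start."""
    expected_start = "".join(COMPLEMENT[b] for b in template_3to5[:len(primer_5to3)])
    if primer_5to3 != expected_start:
        raise ValueError("primer does not pair with template")
    # extend in the 5'->3' direction, one complementary nucleotide at a time
    new_strand = list(primer_5to3)
    for base in template_3to5[len(primer_5to3):]:
        new_strand.append(COMPLEMENT[base])
    return "".join(new_strand)
```

With no primer (an empty string never pairing as required) or no template, no new strand can be made, mirroring the enzyme's requirements.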

DNA polymerases isolated from prokaryotes and eukaryotes differ from each other in several respects; a brief account of these enzymes is given below:

(i) Prokaryotic DNA Polymerase:

There are three different types of prokaryotic DNA polymerase, called DNA polymerase I, II and III. DNA polymerase I, or the Kornberg enzyme, was the first to be isolated from E. coli, by Arthur Kornberg et al., and was used for DNA synthesis in 1956. Kornberg received (jointly with Severo Ochoa) the Nobel Prize for this work in 1959.

DNA polymerase I is a protein of Mr 109,000 in the form of a single polypeptide chain. It contains only one sulphydryl group and one disulphide group; the residue at the N-terminus is methionine.

Most of the prokaryotic DNA polymerase I exhibits the following activities:

i. 5′ → 3′ polymerase activity.

ii. 3′ → 5′ exonuclease activity.

iii. 5′ → 3′ exonuclease activity.

iv. Excision of the RNA primers used in the initiation DNA synthesis.

DNA polymerase I is mainly responsible for the synthesis of a new strand of DNA. This is the polymerase activity. The direction of synthesis of the new strand is always 5′ → 3′. However, it is estimated that DNA polymerase incorporates wrong bases during DNA replication with a frequency of 10⁻⁵. This is not desirable.

Hence DNA polymerase I also has 3′ → 5′ exonuclease activity (Fig. 20.11), which enables it to proofread, or edit, the newly synthesised DNA strand and thereby correct errors made during DNA replication. An exonuclease is an enzyme that degrades nucleic acids from their free ends.

Therefore, whenever the DNA chain being synthesised has a terminal mismatch, i.e., a wrong base inserted in the new chain, the 3′ → 5′ exonuclease activity of DNA polymerase I clips off the wrong base, and the same enzyme immediately resumes synthesis of the growing chain with the correct base.

Therefore, due to this dual activity of DNA polymerase I, the chance of errors in DNA replication is reduced.

The 5′ → 3′ exonuclease activity of DNA polymerase I is also very important. It functions in the removal of DNA segments damaged by ultraviolet irradiation and other agents. An endonuclease (which degrades nucleic acid by making an internal cut) must cleave the DNA strand close to the site of damage before the 5′ → 3′ exonuclease action of DNA polymerase I can take place.

The 5′ → 3′ exonuclease activity of DNA polymerase I also functions in the removal of RNA primers from DNA. The ribonucleotides are immediately replaced by deoxyribonucleotides through the 5′ → 3′ polymerase activity of the enzyme.

The prokaryotic DNA polymerase II was discovered in polA− mutants of E. coli. polA is the gene responsible for the synthesis of polymerase I; polA− mutants are therefore deficient in DNA polymerase I, the Kornberg enzyme. Yet even in the absence of DNA polymerase I, DNA replication still takes place in such mutants.

It is therefore evident that DNA polymerase II plays a role in DNA replication in such mutants. DNA polymerase II has 5′ → 3′ polymerase activity but uses a gapped DNA template. The enzyme also has the 3′ → 5′, but not the 5′ → 3′, exonuclease activity. The function of E. coli DNA polymerase II in vivo is unknown.

Prokaryotic DNA polymerase III was also discovered in polA− mutants. There is strong evidence that, unlike DNA polymerases I and II, polymerase III is essential for DNA synthesis. The best template for DNA polymerase III is double-stranded DNA with very small gaps bearing 3′-OH priming ends. In DNA polymerase III, the core enzyme is tightly associated with two small subunits.

The core enzyme has both 3′ → 5′ exonuclease (which could be involved in proofreading) and 5′ → 3′ exonuclease activities, although the latter is only manifest in vitro on duplex DNA with a single-stranded 5′ tail.

This enzyme has a higher affinity for nucleoside triphosphates than DNA polymerases I and II and catalyses the synthesis of DNA chains at very high rates, i.e., 10–15 times the rate of polymerase I. The major properties of the three DNA polymerases are summarised in Table 22.3.

A DNA polymerase molecule has four functional sites which are involved in polymerase activity.

These sites are:

(i) Template site,

(ii) Primer site,

(iii) Primer terminus site, and

(iv) Triphosphate site.

The template site binds the DNA strand functioning as template during DNA replication and holds it in the correct orientation. The primer site is where the primer chain, to which the nucleotides will be added, is attached.

The primer terminus site ensures that the primer binding to the primer site has a free 3′- OH. A primer without a free 3′-OH is not able to bind to this site.

The triphosphate site binds the deoxyribonucleoside 5′-triphosphate complementary to the corresponding nucleotide of the template and catalyses the formation of a phosphodiester bond between the 5′ phosphate of this nucleotide and the 3′-OH of the terminal primer nucleotide. In addition, DNA polymerase I has a 3′ → 5′ exonuclease site and a 5′ → 3′ exonuclease site.

(ii) Eukaryotic DNA Polymerase:

In higher eukaryotes, there are at least four DNA polymerases, known as α, β, γ and δ, and a fifth (ε) has recently been described. In yeast, DNA polymerase I corresponds to polymerase α, polymerase II to ε and polymerase III to δ, and they have been renamed accordingly.

Polymerase α is present in the nuclei of the cell. DNA polymerase α shows optimal activity with a gapped DNA template but shows a remarkable ability to use single-stranded DNA by forming transient hairpins. It will not bind to duplex DNA.

The native, undegraded enzyme consists of a 180 kDa polymerase together with three subunits of about 70, 60 and 50 kDa. Association of the 180 kDa polymerase with the 70 kDa protein masks the 3′ → 5′ exonuclease activity of the larger subunit, while the 60 and 50 kDa subunits comprise a primase activity which allows the enzyme to initiate replication on unprimed single-stranded cyclic DNAs.

Therefore, polymerase α has dual activity, i.e., both polymerase and primase activities. The association of primase with DNA polymerase α is restricted to the DNA synthetic phase.

Polymerase β is also present in the nucleus. It shows optimal activity with native DNA activated by limited treatment with DNase I to create single-stranded nicks and short gaps bearing 3′-OH priming termini, and shows negligible activity with denatured DNA. DNA polymerase β is believed to play a role in the repair of DNA.

Polymerase δ is present in dividing cells and has properties similar to polymerase α, but possesses 3′ → 5′ exonuclease activity. The activity of polymerase δ depends on two auxiliary proteins: cyclin and activator I.

Because of the approximately equal activities of DNA polymerases α and δ, it has been proposed that they act as a dimer at the replication fork, with the highly processive polymerase δ acting on the leading strand and the primase-associated polymerase α acting on the lagging strand.

The cyclin- or PCNA- (proliferating cell nuclear antigen-) independent form of DNA polymerase δ is known as polymerase ε, which has two active polymerase subunits of 220 and 145 kDa. DNA polymerase ε is also probably involved in replication, and it has been proposed that it takes over from DNA polymerase α in the synthesis of Okazaki fragments.

Polymerase γ is found in small amounts in animal cells. It is also found in mitochondria and chloroplasts and is believed to be responsible for the replication of the chromosomes of these organelles. DNA polymerase γ isolated from chick embryos is a tetramer of four identical subunits. It also has a proofreading exonuclease activity.

Enzyme # 3. DNA Ligases:

DNA ligase is an important enzyme involved in DNA replication. DNA ligases catalyse the formation of a phosphodiester bond between the free 5′ phosphate end of one oligo- or polynucleotide and the 3′-OH group of a second oligo- or polynucleotide next to it.

A ligase-AMP complex seems to be an obligatory intermediate and is formed by reaction with NAD in case of E.coli and B. subtilis and with ATP in mam­malian and phage-infected cells.

The adenyl group is then transferred from the enzyme to the 5′ phosphoryl terminus of the DNA. The activated phosphoryl group is then attacked by the 3′-hydroxyl terminus of the DNA to form a phosphodiester bond. DNA ligases join successive Okazaki fragments produced during discontinuous DNA replication and seal the nicks left behind by DNA polymerase.

Reverse Transcriptase:

The enzymes so far discussed are required for the synthesis of DNA on a parental DNA template strand. But in certain RNA viruses, or retroviruses, there is an enzyme—called RNA-dependent DNA polymerase, or reverse transcriptase—which uses a parental RNA strand as the template for the synthesis of DNA.

The immediate product of this enzyme's activity is a double-stranded RNA-DNA hybrid, the result of the synthesis of a complementary strand of DNA using the single-stranded viral RNA as template.

Uncovering the Complexity of Transcriptomes with RNA-Seq

In recent years, the introduction of massively parallel sequencing platforms for Next Generation Sequencing (NGS) protocols, able to simultaneously sequence hundreds of thousands of DNA fragments, dramatically changed the landscape of genetics studies. RNA-Seq for transcriptome studies, ChIP-Seq for DNA-protein interactions and CNV-Seq for large genome nucleotide variations are only some of the intriguing new applications supported by these innovative platforms. Among them, RNA-Seq is perhaps the most complex NGS application. Expression levels of specific genes, differential splicing and allele-specific expression of transcripts can be accurately determined by RNA-Seq experiments to address many biology-related issues. None of these attributes is readily achievable from the previously widespread hybridization-based or tag sequence-based approaches. However, the unprecedented level of sensitivity and the large amount of data produced by NGS platforms provide clear advantages as well as new challenges and issues. While this technology brings great power to make new biological observations and discoveries, it also requires a considerable effort in the development of new bioinformatics tools to deal with these massive data files. This paper aims to give a survey of the RNA-Seq methodology, particularly focusing on the challenges that this application presents from both a biological and a bioinformatics point of view.

1. Introduction

It is commonly known that genetic information is conveyed from DNA to proteins via messenger RNA (mRNA) through a finely regulated process. To achieve such regulation, the concerted action of multiple trans-acting proteins that bind cis-acting elements in gene flanking regions—“core” and “auxiliary” regions—is necessary [1]. In particular, core elements, located at the exons' boundaries, are strictly required for initiating the pre-mRNA processing events, whereas auxiliary elements, variable in number and location, are crucial for their ability to enhance or inhibit the basal splicing activity of a gene.

Until recently—less than 10 years ago—the central dogma of genetics indicated with the term “gene” a DNA portion whose corresponding mRNA encodes a protein. According to this view, RNA was considered a “bridge” in the transfer of biological information between DNA and proteins, whereas the identity of each expressed gene, and its transcriptional levels, were commonly indicated as the “transcriptome” [2]. It was considered to consist mainly of ribosomal RNA (80–90%, rRNA), transfer RNA (5–15%, tRNA), mRNA (2–4%) and a small fraction of intragenic (i.e., intronic) and intergenic noncoding RNA (1%, ncRNA) with undefined regulatory functions [3]. In particular, both intragenic and intergenic sequences, enriched in repetitive elements, have long been considered genetically inert, mainly composed of “junk” or “selfish” DNA [4]. More recently it has been shown that the amount of noncoding DNA (ncDNA) increases with organism complexity, ranging from 0.25% of the genome in prokaryotes to 98.8% in humans [5]. These observations have strengthened the evidence that ncDNA, rather than being junk DNA, is likely to represent the main driving force accounting for the diversity and biological complexity of living organisms.

Since the dawn of genetics, the relationship between DNA content and biological complexity of living organisms has been a fruitful field of speculation and debate [6]. To date, several studies, including recent analyses performed during the ENCODE project, have shown the pervasive nature of eukaryotic transcription with almost the full length of nonrepeat regions of the genome being transcribed [7].

The unexpected level of complexity emerging with the discovery of endogenous small interfering RNA (siRNA) and microRNA (miRNA) was only the tip of the iceberg [8]. Long interspersed noncoding RNA (lincRNA), promoter- and terminator-associated small RNA (PASR and TASR, resp.), transcription start site-associated RNA (TSSa-RNA), transcription initiation RNA (tiRNA) and many others [8] represent part of the interspersed and crosslinking pieces of a complicated transcription puzzle. Moreover, adding further difficulty, there is evidence that most of the pervasive transcripts identified thus far have been found only in specific cell lines (in most cases mutant cell lines) grown under particular conditions, and/or in particular tissues. In light of this, discovering and interpreting the complexity of a transcriptome is a crucial aim for understanding the functional elements of a genome. Revealing the complexity of the genetic code of living organisms by analyzing the molecular constituents of cells and tissues will drive us towards a more complete knowledge of many biological issues, such as disease onset and progression.

The main goal of whole transcriptome analyses is to identify, characterize and catalogue all the transcripts expressed within a specific cell/tissue—at a particular stage—with the great potential to determine the correct splicing patterns and structure of genes, and to quantify the differential expression of transcripts under both physiological and pathological conditions [9].

In the last 15 years, the development of hybridization technology, together with tag sequence-based approaches, allowed researchers to gain a first deep insight into this field; but, beyond a shadow of doubt, the arrival on the marketplace of the NGS platforms, with all their “Seq” applications, has completely revolutionized the way of thinking about molecular biology.

The aim of this paper is to give an overview of the RNA-Seq methodology, trying to highlight all the challenges that this application presents from both the biological and bioinformatics point of view.

2. Next Generation Sequencing Technologies

Since the first complete nucleotide sequence of a gene, published in 1964 by Holley [10], and the initial developments of Maxam and Gilbert [11] and Sanger et al. [12] in the 1970s (see Figure 1), the world of nucleic acid sequencing was an RNA world, and the history of nucleic acid sequencing technology was largely contained within the history of RNA sequencing.

In the last 30 years, molecular biology has undergone great advances, and 2004 will be remembered as the year that revolutionized the field: with the introduction of massively parallel sequencing platforms, the Next Generation Sequencing era [13–15] started. The pioneer among these instruments was the Roche (454) Genome Sequencer (GS) in 2004, able to simultaneously sequence several hundred thousand DNA fragments with a read length greater than 100 base pairs (bp). The current GS FLX Titanium produces more than 1 million reads in excess of 400 bp. It was followed in 2006 by the Illumina Genome Analyzer (GA), capable of generating tens of millions of 32-bp reads. Today, the Illumina GAIIx produces 200 million 75–100 bp reads. The last to arrive on the marketplace were the Applied Biosystems platform based on Sequencing by Oligo Ligation and Detection (SOLiD), capable of producing 400 million 50-bp reads, and the Helicos BioScience HeliScope, the first single-molecule sequencer, which produces 400 million 25–35 bp reads.
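The per-run yields implied by these figures can be compared with a little arithmetic (illustrative only; the numbers are the ones quoted above, and real yields vary with chemistry and run mode):

```python
# Approximate per-run yield = number of reads x read length.
# Read lengths below take the upper figure of each quoted range.
platforms = {
    "454 GS FLX Titanium": (1_000_000, 400),   # reads, read length (bp)
    "Illumina GAIIx":      (200_000_000, 100),
    "SOLiD":               (400_000_000, 50),
    "HeliScope":           (400_000_000, 35),
}

for name, (reads, length) in platforms.items():
    gigabases = reads * length / 1e9
    print(f"{name}: ~{gigabases:.1f} Gb per run")
```

The comparison makes the trade-off explicit: the short-read platforms win on total bases, while 454 trades throughput for read length.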

While the individual approaches vary considerably in their technical details, the essence of these systems is the miniaturization of individual sequencing reactions. Each of these miniaturized reactions is seeded with DNA molecules at limiting dilutions, such that there is a single DNA molecule in each, which is first amplified and then sequenced. To be more precise, the genomic DNA is randomly broken into smaller pieces, from which either fragment templates or mate-pair templates are created. A common theme among NGS technologies is that the template is attached to a solid surface or support (immobilization by primer or template) or indirectly immobilized (by linking a polymerase to the support). The immobilization of spatially separated templates allows thousands to billions of sequencing reactions to be performed simultaneously. The physical design of these instruments allows an optimal spatial arrangement of each reaction, enabling efficient readout by laser scanning (or other methods) of millions of individual sequencing reactions on a standard glass slide. While the immense volume of data generated is attractive, it is arguable that the elimination of the cloning step for the DNA fragments to be sequenced is the greatest benefit of these new technologies. All current methods allow the direct use of small DNA/RNA fragments without requiring their insertion into a plasmid or other vector, thereby removing a costly and time-consuming step of traditional Sanger sequencing.

It is beyond a shadow of doubt that the arrival of NGS technologies on the marketplace has changed the way we think about scientific approaches in basic, applied and clinical research. The broadest application of NGS may be the resequencing of different genomes, in particular human genomes, to enhance our understanding of how genetic differences affect health and disease. Indeed, these platforms have been quickly applied to many genomic contexts, giving rise to the following “Seq” protocols: RNA-Seq for transcriptomics, ChIP-Seq for DNA-protein interactions, DNase-Seq for the identification of the most active regulatory regions, CNV-Seq for copy number variation, and methyl-Seq for genome-wide profiling of epigenetic marks.

3. RNA-Seq

RNA-Seq is perhaps one of the most complex next-generation applications. Expression levels, differential splicing, allele-specific expression, RNA editing and fusion transcripts constitute important information when comparing samples for disease-related studies. These attributes, not readily available through hybridization-based or tag sequence-based approaches, can now be obtained far more easily and precisely if sufficient sequence coverage is achieved. However, many other essential subtleties in RNA-Seq data remain to be addressed and understood.

Hybridization-based approaches typically refer to microarray platforms. Until recently, these platforms have offered the scientific community a very useful tool to simultaneously investigate thousands of features within a single experiment, providing a reliable, rapid and cost-effective technology for analyzing gene expression patterns. By their nature, however, they suffer from background and cross-hybridization issues and allow researchers to measure only the relative abundance of the RNA transcripts included in the array design [16]. This technology, which measures gene expression by simply quantifying—via an indirect method—the hybridized and labeled cDNA, does not allow the detection of RNA transcripts from repeated sequences, and it offers a limited dynamic range, unable to detect the very subtle changes in gene expression levels that are critical for understanding biological responses to exogenous stimuli and/or environmental changes [9, 17, 18].

Other methods, such as Serial Analysis of Gene Expression (SAGE), Cap Analysis of Gene Expression (CAGE) and Polony Multiplex Analysis of Gene Expression (PMAGE)—tag-based sequencing methods—measure the absolute abundance of transcripts in a cell/tissue/organ and do not require prior knowledge of any gene sequence, as microarrays do [19]. These analyses consist of the generation of sequence tags from fragmented cDNA and their concatenation prior to cloning and sequencing [20]. SAGE is a powerful technique that can therefore be viewed as an unbiased digital microarray assay. However, although SAGE sequencing has been successfully used to explore the transcriptional landscape of various genetic disorders, such as diabetes [21, 22], cardiovascular diseases [23] and Down syndrome [24, 25], the cloning and sequencing steps make it quite laborious, which has thus far limited its use.

In contrast, RNA-Seq on NGS platforms has clear advantages over the existing approaches [9, 26]. First, unlike hybridization-based technologies, RNA-Seq is not limited to the detection of known transcripts, thus allowing the identification, characterization and quantification of new splice isoforms. In addition, it allows researchers to determine the correct gene annotation, also defining—at single nucleotide resolution—the transcriptional boundaries of genes and the expressed Single Nucleotide Polymorphisms (SNPs). Other advantages of RNA-Seq compared to microarrays are the low “background signal,” the absence of an upper limit for quantification and consequently, the larger dynamic range of expression levels over which transcripts can be detected. RNA-Seq data also show high levels of reproducibility for both technical and biological replicates.

Recent studies have clearly demonstrated the advantages of using RNA-Seq [27–50]. Table 1 provides a short description of recent and more relevant papers on RNA-Seq in mammals.

Many research groups have been able to precisely quantify known transcripts, to discover new transcribed regions within intronic or intergenic regions, to characterize the antisense transcription, to identify alternative splicing with new combinations of known exon sequences or new transcribed exons, to evaluate the expression of repeat elements and to analyze a wide number of known and possible new candidate expressed SNPs, as well as to identify fusion transcripts and other new RNA categories.

3.1. Sample Isolation and Library Preparation

The first step in RNA-Seq experiments is the isolation of RNA samples; further RNA processing strictly depends on the kind of analysis to be performed. Indeed, as the “transcriptome” is defined as the complete collection of transcribed elements in a genome (see [2]), it consists of a wide variety of transcripts, both mRNA and non-mRNA, and a large amount (90–95%) of rRNA species. To perform a whole transcriptome analysis, not limited to annotated mRNAs, the selective depletion of abundant rRNA molecules (5S, 5.8S, 18S and 28S) is a key step. Hybridization with rRNA sequence-specific 5′-biotin-labeled oligonucleotide probes, followed by removal with streptavidin-coated magnetic beads, is the main procedure used to selectively deplete large rRNA molecules from total isolated RNA. Moreover, since rRNA—but not capped mRNA—is characterized by the presence of a 5′ phosphate, a useful approach for selective ribo-depletion is based on the use of an exonuclease able to specifically degrade RNA molecules bearing a 5′ phosphate (mRNA-ONLY kit, Epicentre). Compared to the polyadenylated (polyA+) mRNA fraction, the ribo-depleted RNA is enriched in non-polyA mRNA, preprocessed RNA, tRNA, regulatory molecules such as miRNA, siRNA and small ncRNA, and other RNA transcripts of yet unknown function (see review [8]).

How closely the RNA sequencing reflects the original RNA population is mainly determined in the library preparation step, which is crucial in whole transcriptome protocols. Although NGS protocols were first developed for the analysis of genomic DNA, these technical procedures have been rapidly and effectively adapted to the sequencing of double-stranded (ds) cDNA for transcriptome studies [51].

A double-stranded cDNA library is usually prepared from either: (1) fragmented double-stranded (ds) cDNA or (2) hydrolyzed or fragmented RNA.

The goal of the first approach is to generate high-quality, full-length cDNAs from the RNA samples of interest, to be fragmented and then ligated to adapters for further amplification and sequencing. However, since the primer adaptor is ligated to fragmented ds cDNA, any information on the transcriptional direction is completely lost. Preserving strandedness is fundamental for data analysis: it allows the directionality of transcription and gene orientation to be determined, and it facilitates detection of opposing and overlapping transcripts. To address this biologically relevant issue, many approaches have been developed, such as pretreating the RNA with sodium bisulphite to convert cytidine into uridine [52]. Other alternative protocols, differing in how the adaptors are inserted into the ds cDNA, have recently been published: direct ligation of RNA adaptors to the RNA sample before or during reverse transcription [30, 31, 53], or incorporation of dUTP during second-strand synthesis followed by digestion with uracil-N-glycosylase [45]. For instance, the SOLiD Whole Transcriptome Kit contains two different sets of oligonucleotides with a single-stranded degenerate sequence at one end and a defined sequence required for sequencing at the other end, constraining the orientation of RNA in the ligation reaction. The generation of ds cDNA from RNA involves a number of steps. First, RNA is converted into first-strand cDNA using reverse transcriptase with either random hexamers or oligo(dT) as primers. The resulting first-strand cDNA is then converted into double-stranded cDNA, further fragmented with DNase I, and then ligated to adapters for amplification and sequencing [54]. The advantage of using oligo(dT) is that the majority of the cDNA produced should derive from polyadenylated mRNA, and hence more of the sequence obtained should be informative (nonribosomal).
The significant disadvantage is that the reverse transcriptase falls off the template at a characteristic rate, resulting in a bias towards the 3′ end of transcripts. For long mRNAs this bias can be pronounced, resulting in an under-representation (or, worse, the absence) of the 5′ end of the transcript in the data. The use of random primers is therefore the preferred method to avoid this problem and to allow a better representation of the 5′ end of long ORFs. However, when oligo(dT) primers are used for priming, the slope formed by the diminishing frequency of reads towards the 5′ end of the ORF can, in some cases, be useful for determining the strand of origin of new transcripts if strand information has not been retained [28, 37].
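The slope-based strand inference mentioned above can be illustrated with a toy sketch (not from the paper; the threshold-free sign test and the coverage vectors are illustrative assumptions): with oligo(dT) priming, coverage rises towards a transcript's 3′ end, so the sign of a fitted slope over genomic coordinates hints at the orientation.

```python
# Toy sketch: infer strand of origin from the oligo(dT) coverage slope.
# Assumption: coverage increases towards the 3' end of the transcript.

def coverage_slope(coverage):
    """Least-squares slope of per-base coverage values."""
    n = len(coverage)
    mean_x = (n - 1) / 2
    mean_y = sum(coverage) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(coverage))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def guess_strand(coverage):
    # Positive slope: coverage rises with genomic coordinate, so the 3'
    # end lies to the right -> plus strand (and vice versa).
    return "+" if coverage_slope(coverage) > 0 else "-"

print(guess_strand([1, 2, 4, 7, 9, 12]))   # rising coverage -> "+"
print(guess_strand([12, 9, 7, 4, 2, 1]))   # falling coverage -> "-"
```

In practice the slope would be fitted on smoothed coverage over a whole gene model, and only a clearly non-zero slope would be trusted.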

Fragmenting RNA, rather than DNA, has the clear advantage of reducing possible secondary structures, particularly for tRNA and miRNA, resulting in more homogeneous coverage, and it can also lead to a more comprehensive transcriptome analysis (Figure 2). In this case, the RNA sample is first fragmented using controlled temperature or chemical/enzymatic hydrolysis, ligated to adapters, and reverse transcribed with complementary primers. Different protocols have been developed so far. Indeed, the adaptor sequences may be directly ligated to the previously fragmented RNA molecules using T4 RNA ligase, and the resulting library can be reverse transcribed with primers designed against the adaptor sequences and then sequenced. Another approach, recently described in [55], consists in the in vitro polyadenylation of RNA fragments in order to provide a template for the subsequent reverse transcription using poly(dT) primers containing both adaptor sequences (linkers), separated back-to-back by an endonuclease site. The resulting cDNAs are circularized and then cleaved at the endonuclease site in the adaptors, thus leaving ss cDNA with the adaptors at both ends [55]. A third protocol, described in [33] and named the double random priming method, uses biotinylated random primers (a sequencing primer P1 at one end and a random octamer at the other). After a first random priming reaction, the products are isolated using streptavidin beads, and a second random priming reaction is performed on a solid phase with a random octamer carrying the sequencing primer P2. Afterwards, the second random priming products are released from the streptavidin beads by heat, PCR-amplified, gel-purified, and finally sequenced from the P1 primer. Moreover, as already mentioned, in [45] the authors used dUTP—a surrogate for dTTP—during second-strand synthesis to allow selective degradation of the second cDNA strand after adaptor ligation using uracil-N-glycosylase.
The use of engineered DNA adaptors, combined with the dUTP protocol, ensures that only the cDNA strand corresponding to the "real" transcript is used for library amplification and sequencing, preserving the strandedness of gene transcription [45].

Library preparation and clonal amplification. Schematic representation of a workflow for library preparation in RNA-Seq experiments on the SOLiD platform. The figure depicts a total RNA sample after depletion of rRNA, containing polyA and non-polyA mRNA, tRNAs, miRNAs and small noncoding RNAs. Ribo-depleted total RNA is fragmented (1), then ligated to specific adaptor sequences (2) and reverse transcribed (3). The resulting cDNA is size-selected by gel electrophoresis (4) and PCR-amplified (5). The size distribution is then evaluated (6). Emulsion PCR, with one cDNA fragment per bead, is used for the clonal amplification of the cDNA libraries (7). Purified and enriched beads are finally deposited onto glass slides (8), ready to be sequenced by ligation.

However, independently of the library construction procedure, particular care should be taken to avoid complete degradation during RNA fragmentation.

The next step of the sequencing protocols is the clonal amplification of the cDNA fragments.

Illumina, 454 and SOLiD use clonally amplified templates. In particular, the last two platforms use an innovative procedure, emulsion PCR (emPCR), to prepare sequencing templates in a cell-free system. cDNA fragments from a fragment or paired-end library are separated into single strands and captured onto beads under conditions that favour one DNA molecule per bead. After emPCR and bead enrichment, millions of beads are chemically crosslinked to an amino-coated glass surface (SOLiD) or deposited into individual PicoTiterPlate (PTP) wells (454), in which the NGS chemistry can be performed. Solid-phase amplification (Illumina) can also be used to produce randomly distributed, clonally amplified clusters from fragment or mate-pair templates on a glass slide. High-density forward and reverse primers are covalently attached to the slide, and the ratio of primers to template defines the surface density of the clusters. This procedure can produce up to 200 million spatially separated template clusters, providing free ends for primer hybridization, needed to initiate the NGS reaction. A different approach is the use of single-molecule templates (Helicos BioSciences), usually immobilized on solid supports, for which PCR amplification is no longer required, thus avoiding the introduction of possible confounding mutations into the templates; moreover, AT- and GC-rich sequences present amplification issues, with over- or under-representation biases in genome alignments and assemblies. Specific adaptors are bound to the fragmented templates, which are then hybridized to spatially distributed primers covalently attached to the solid support [56].

3.2. Sequencing and Imaging

NGS platforms use different sequencing chemistries and methodological procedures.

Illumina and HeliScope use Cyclic Reversible Termination (CRT), which employs reversible terminators (modified nucleotides) in a cyclic method. A DNA polymerase, bound to the primed template, adds one fluorescently modified nucleotide per cycle; the remaining unincorporated nucleotides are then washed away and image capture is performed. A cleavage step precedes the next incorporation cycle to remove the terminating/inhibiting group and the fluorescent dye, followed by an additional wash. Although these two platforms use the same methodology, Illumina employs the four-colour CRT method, simultaneously incorporating all four nucleotides labelled with different dyes, whereas HeliScope uses the one-colour (Cy5 dye) CRT method.

Substitutions are the most common error type, with a higher proportion of errors occurring when the previously incorporated nucleotide is a G [57]. Under-representation of AT-rich and GC-rich regions, probably due to amplification bias during template preparation [57–59], is a common drawback.

In contrast, the SOLiD system uses Sequencing by Ligation (SBL) with 1,2-nucleotide probes, based on colour space, a unique feature of SOLiD. Its main advantage is improved accuracy in colour calling and in single nucleotide variant (SNV) calling, the latter of which requires an adjacent valid colour change. In particular, a universal primer is hybridized to the template beads, and a library of 1,2-nucleotide probes is added. Following four-colour imaging, the ligated probes are chemically cleaved to generate a 5′-phosphate group. Probe hybridization and ligation, imaging, and probe cleavage are repeated ten times to yield ten colour calls spaced at five-base intervals. The extended primer is then stripped from the solid-phase-bound templates. A second ligation round is performed with an n-1 primer, which resets the interrogation bases and the corresponding ten colour calls one position to the left. Ten ligation cycles ensue, followed by three more rounds of ligation cycles. Colour calls from the five ligation rounds are then ordered into a linear sequence (the csfasta colour space) and aligned to a reference genome to decode the sequence. The most common error type observed with this platform is substitution and, similarly to Illumina, SOLiD data have also revealed an under-representation of AT- and GC-rich regions [58].
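The di-base encoding can be sketched with the commonly described XOR scheme (assumed here; A=0, C=1, G=2, T=3, colour = code of first base XOR code of second): given the known first base of the primer, each colour call determines the next base.

```python
# Sketch of SOLiD-style di-base (colour-space) encoding/decoding,
# assuming the common XOR convention between adjacent base codes.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = {v: k for k, v in CODE.items()}

def encode(seq):
    """Base sequence -> leading base plus colour calls."""
    colours = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colours

def decode(first_base, colours):
    """Leading base plus colour calls -> base sequence."""
    bases = [first_base]
    for c in colours:
        bases.append(BASE[CODE[bases[-1]] ^ c])
    return "".join(bases)

first, cols = encode("TACGGT")
assert decode(first, cols) == "TACGGT"
```

This scheme makes the "adjacent valid colour change" property visible: a true SNV changes two adjacent colours, whereas a single miscalled colour corrupts every base decoded after it, which is why alignment is done in colour space against an encoded reference.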

Another approach is pyrosequencing (used by 454), a non-electrophoretic bioluminescence method that, unlike the above-mentioned sequencing approaches, measures the release of pyrophosphate by proportionally converting it into visible light through a cascade of enzymatic reactions. Upon incorporation of the complementary dNTP, DNA polymerase extends the primer and pauses. DNA synthesis is reinitiated following the addition of the next complementary dNTP in the dispensing cycle. The enzymatic cascade generates light that is recorded as a flowgram, with a series of peaks corresponding to a particular DNA sequence. Insertions and deletions are the most common error types.
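A toy flowgram decoder makes the homopolymer-error mechanism concrete (the flow order and signal values are illustrative assumptions): nucleotides are dispensed in a fixed cycle, each peak is roughly proportional to the homopolymer length, and rounding the intensity gives the base count.

```python
# Toy flowgram decoding for pyrosequencing (hypothetical signal values).

FLOW_ORDER = "TACG"  # repeated dispensing cycle (assumed order)

def decode_flowgram(intensities):
    seq = []
    for i, signal in enumerate(intensities):
        n = round(signal)             # homopolymer length estimate
        seq.append(FLOW_ORDER[i % 4] * n)
    return "".join(seq)

# signals near 1.0 call one base, near 2.1 call two, near 0 call none
print(decode_flowgram([1.02, 0.05, 2.1, 0.9]))  # -> "TCCG"
```

Because long homopolymers give signals that are hard to round correctly, misjudged peak heights produce exactly the insertion/deletion errors the text attributes to this platform.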

An excellent and detailed review about the biotechnological aspects of NGS platforms can be found in [15].

3.3. From Biology to Bioinformatics

The unprecedented level of sensitivity in the data produced by NGS platforms brings with it the power to make many new biological observations, at the cost of a considerable effort in the development of new bioinformatics tools to deal with these massive data files.

First of all, the raw image files from one run of some next-generation sequencers can require terabytes of storage, meaning that simply moving the data off the machine can represent a technical challenge for the computer networks of many research centers. Moreover, even once the data are transferred from the machine for subsequent processing, a common desktop computer will be hopelessly outmatched by the volume of data from a single run. As a result, the use of a small cluster of computers is extremely beneficial for reducing the computational bottleneck.

Another issue is the availability of software required to perform downstream analysis. Indeed, after image and signal processing, the output of an RNA-Seq experiment consists of 10–400 million short reads (together with their base-call quality values), typically of 30–400 bp, depending on the DNA sequencing technology used, its version and the total cost of the experiment.

NGS data analysis relies heavily on the proper mapping of sequencing reads to the corresponding reference genome, or on their efficient de novo assembly. Mapping NGS reads with high efficiency and reliability currently faces several challenges. As noted in [60], the differences between the sequencing platforms in sample preparation, chemistry, type and volume of raw data, and data formats are very large, implying that each platform produces data affected by a characteristic error profile. For example, the 454 system can produce reads with insertion or deletion errors during homopolymer runs and generates fewer, but longer, sequences in a fasta-like format, allowing classical alignment algorithms to be adapted; Illumina has an increased likelihood of accumulating sequence errors toward the end of the read and produces fasta reads that are shorter, hence requiring specific alignment algorithms; SOLiD also tends to accumulate bias at the end of the reads, but uses a di-base encoding strategy, with each sequence output encoded in a colour-space csfasta format. Hence, some sequencing errors are correctable, providing better discrimination between sequencing error and polymorphism, at the cost of requiring analysis tools explicitly built to handle this aspect of the data. It is not surprising that there is no off-the-shelf software available to end-users; hence the implementation of individualized data processing pipelines, combining third-party packages and new computational methods, is the only advisable approach. While some existing packages already address general aspects of RNA-Seq analysis, they also require a time-consuming effort due to the lack of clear documentation for most of the algorithms and the variety of formats. Indeed, much clearer documentation of the algorithms is needed to ensure a full understanding of the processed data. Community adoption of input/output data formats for reference alignments, assemblies and detected variants is also essential to ease the data management problem. Solving these issues may simply shift the software gap from sequence processing (base-calling, alignment or assembly, positional counting and variant detection) to sequence analysis (annotation and functional impact).

3.4. Genome Alignment and Reads Assembly

The first step of any NGS data analysis consists of mapping the sequence reads to a reference genome (and/or to known annotated transcribed sequences), if available, or of de novo assembly to produce a genome-scale transcriptional map (see Figure 3 for an illustration of a classical RNA-Seq computational pipeline). The decision to use one of the two strategies is mainly based on the specific application. However, independently of the approach followed, there is a useful preliminary step: applying a quality filter to remove poor-quality reads, which reduces the computational time and effort of further analysis.
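The quality-filtering step can be as simple as a mean-quality cutoff. A minimal sketch, assuming Sanger/Illumina 1.8+ ASCII-encoded Phred scores (offset 33) and an illustrative threshold:

```python
# Minimal read quality filter: discard reads whose mean Phred score,
# decoded from the FASTQ quality string (ASCII offset 33), is too low.

def mean_phred(quality_string, offset=33):
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)

def passes_filter(quality_string, min_mean=20):
    # min_mean=20 (99% per-base accuracy) is an illustrative cutoff
    return mean_phred(quality_string) >= min_mean

print(passes_filter("IIIIIIII"))  # 'I' = Q40 -> True
print(passes_filter("########"))  # '#' = Q2  -> False
```

Real pipelines typically also trim low-quality ends rather than only discarding whole reads, since quality degrades toward the read extremities.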

Analyzing the transcriptome of organisms without a reference genome requires de novo assembly (or a guided assembly with the help of closely related organisms) of expressed sequence tags (ESTs) using short-read assembly programs such as [61, 62]. A reasonable strategy for improving the quality of the assembly is to increase the read coverage and to mix different read types. However, RNA-Seq experiments without a reference genome pose specific features and challenges that are outside the scope of the present paper; we refer the reader to [63, 64] for further details.

In most cases, the reference genome is available and the mapping can be carried out using either the whole genome or known transcribed sequences (see, e.g., [28–30, 32, 34, 37, 40, 46, 47]). In both cases, this preliminary but crucial step is the most computationally intensive of the entire process and strongly depends on the type of available sequences (read length, error profile, amount of data and data format). It is not surprising that this nodal point still constitutes a very prominent area of research (see, e.g., [65–67] for a review) and has produced a great number of different algorithms in the last couple of years (e.g., [68–78]). Clearly, not all of them fully support the available platforms or scale to every level of throughput or genome size. Nevertheless, the sequencing technologies are still in a development phase, with throughput, read lengths and data formats changing every few months. Consequently, the available mapping/assembly software is continuously evolving in order to adapt to new data formats, to scale with the amount of data and to reduce its computational demand. New software packages also continuously complement the panorama. Moreover, the alignment phase of reads from RNA-Seq experiments presents many other subtleties to be considered: standard mapping algorithms cannot fully exploit the complexity of the transcriptome, and need to be modified or adapted to account for splicing events in eukaryotes.

The easiest way to handle this difficulty is to map the reads directly to known transcribed sequences, with the obvious drawback of missing new transcripts. Alternatively, the reads can be mapped to the genome, with the added opportunity of mapping reads that cross splice junctions. In this case, the algorithms differ in whether or not they require a model of the junctions. Algorithms such as ERANGE [37] or RNA-mate [79] require a library of junctions constructed from known splice junctions extracted from databases, possibly supplemented with putative splice junctions obtained, for instance, using a combinatorial approach on gene models or EST sequences. Clearly, such approaches cannot map junctions not previously assembled into the junction library. On the other hand, algorithms like WT [69], QPALMA [80], TopHat [81], G.Mo.R-Se [63] and PASS [78] can potentially detect new splice isoforms, since they use a more sophisticated mapping strategy. For instance, WT [69] splits the reads into left and right pieces, aligns each part to the genome, and then attempts to extend each alignment on the other side to detect the junction. TopHat [81], instead, first maps the reads against the whole reference genome using [77], then aggregates the mapped reads into islands of candidate exons on which a consensus measure is computed, generates potential donor/acceptor splice sites from neighboring exons, and finally tries to align the reads left unmapped to the genome against these splice junction sequences.
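The split-read idea used by WT can be illustrated with a toy sketch (exact string matching only; real aligners tolerate mismatches, anchor on multiple positions and check splice-site signals such as GT-AG): split an unmapped read in half, anchor each half on the genome, and report a candidate junction when the halves map apart.

```python
# Toy split-read junction detection (exact-match only, illustrative).

def find_junction(read, genome):
    mid = len(read) // 2
    left, right = read[:mid], read[mid:]
    lpos, rpos = genome.find(left), genome.find(right)
    if lpos < 0 or rpos < 0:
        return None
    donor = lpos + len(left)            # end of the left (exonic) anchor
    acceptor = rpos                     # start of the right anchor
    if acceptor > donor:                # a gap suggests a spliced-out intron
        return donor, acceptor
    return None

#          exon1        intron          exon2
genome = "ACGTACGT" + "TTTTTTTTTT" + "GGCATCGA"
print(find_junction("TACGTGGCAT", genome))  # halves flank the intron -> (8, 18)
```

The returned pair is the candidate donor/acceptor coordinate pair; a real implementation would then realign the full read across the spliced reference to validate it.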

Most RNA-Seq packages are built on top of optimized short-read core mappers [68, 69, 72, 77], and the mapping strategy is carried out by performing multiple runs or cycles. At the end of each cycle, the unmatched reads are trimmed from one extreme and another alignment step is attempted (see, e.g., [79]). Specific tolerances can be set for each alignment in order to increase the amount of mappable data. Obviously, the simplest core approach is to map the sequence reads across the genome allowing the user to specify only the number of tolerated mismatches, although other methods also allow gapped alignment. Such flexibility can be beneficial for the rest of the analysis, since both sequencing errors, which usually increase with the length of the sequence, and SNPs may cause substitutions and insertions/deletions of nucleotides in the reads. On the other hand, increasing the mapping flexibility also introduces a higher level of noise into the data. Finding the compromise between the number of mapped reads and the quality of the resulting mapping is a very time-consuming process without an optimal solution.
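The iterative trim-and-remap cycle can be sketched as follows (a toy exact-match "aligner" stands in for the real short-read mapper; trim size and minimum length are illustrative assumptions):

```python
# Sketch of iterative trim-and-remap: unmatched reads are trimmed from
# the (typically lower-quality) 3' extreme and realigned each cycle.

def align(read, genome):
    pos = genome.find(read)             # toy exact-match aligner
    return pos if pos >= 0 else None

def iterative_map(read, genome, trim=2, min_len=6):
    while len(read) >= min_len:
        pos = align(read, genome)
        if pos is not None:
            return pos, len(read)       # mapping position and mapped length
        read = read[:-trim]             # trim the 3' end and retry
    return None

genome = "ACGTACGTTTGCAGGCTA"
# a read with two erroneous trailing bases still maps after one trim cycle
print(iterative_map("TTGCAGGC" + "XX", genome))  # -> (8, 8)
```

Each cycle trades read length (and thus mapping specificity) for sensitivity, which is why real pipelines bound the number of cycles and the minimum retained length.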

At the end of the mapping step one can distinguish three types of reads: reads that map uniquely to the genome or to the splice junctions (Uniquely Mappable Reads, UMRs), reads with multiple (equally or similarly likely) locations either in the genome or in the splice junctions (Multilocation Mappable Reads, MMRs), and reads without a specific mapping location. MMRs arise predominantly from conserved domains of paralogous gene families and from repeats. The fraction of mappable reads that are MMRs depends on the length of the read, the genome under investigation and the expression in the individual sample; however, it is typically between 10% and 40% for mammalian-derived libraries [30, 37]. Most studies [28, 34] simply discarded MMRs from further analysis, limiting attention to UMRs only. Clearly, this omission introduces experimental bias, decreases the coverage and reduces the possibility of investigating expressed regions such as active retrotransposons and gene families. An alternative to removing the MMRs is to probabilistically assign them to each genomic location they map to. The simplest assignment considers equal probabilities. However, far better results have been obtained using a guilt-by-association strategy that calculates the probability of an MMR originating from a particular locus. In [82], the authors proposed proportionally assigning MMRs to each of their mapping locations based on unique coincidences with both UMRs and other MMRs. Such a technique was later adopted in [79]. By contrast, in [83], the authors computed the probability as the ratio between the number of UMRs occurring in a nominal window surrounding each locus occupied by the considered MMR and the total number of UMRs proximal to all loci associated with that MMR. Similarly, in [37] the MMRs were fractionally assigned to their different possible locations considering the expression levels of their respective gene models.
All these rescue strategies lead to substantially higher transcriptome coverage and give expression estimates in better agreement with microarrays than those using only UMRs (see [37, 83]). Very recently, a more sophisticated approach was proposed in [84]. The authors introduced latent random variables representing the true mappings, with the parameters of the graphical model corresponding to isoform expression levels, read distributions across transcripts, and sequencing error. They allocated MMRs by maximizing the likelihood of the expression levels using an Expectation-Maximization (EM) algorithm. Additionally, they showed that the previous rescue methods introduced in [37, 82] are roughly equivalent to one iteration of EM. Independently of the specific proposal, we observe that all the above-mentioned techniques work much better with data that preserve RNA strandedness. Alternatively, the use of paired-end protocols should help alleviate the MMR problem. Indeed, when one of the paired reads maps to a highly repetitive element in the genome but the second does not, both reads can be unambiguously mapped to the reference genome. This is accomplished by first matching the non-repeat read uniquely to a genomic position and then looking within a size window, based on the known size range of the library fragments, for a match for the second read. This approach was demonstrated to improve read matching from 85% (single reads) to 93% (paired reads) [70], allowing a significant improvement in genome coverage, particularly in repeat regions. Currently, all of the next-generation sequencing technologies are capable of generating data from paired-end reads but, unfortunately, so far only a few RNA-Seq software packages support the use of paired-end reads in conjunction with splice junction mapping.
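The proportional, guilt-by-association rescue described above can be sketched in a few lines (a simplified version of the strategies in [37, 82, 83]; the locus names are hypothetical): each multiread is fractionally allocated to its candidate loci in proportion to the unique-read support at those loci.

```python
# Sketch of proportional multiread (MMR) rescue: allocate one MMR across
# its candidate loci in proportion to unique-read (UMR) support.

def rescue_mmr(umr_counts, mmr_loci):
    """umr_counts: UMRs per locus; mmr_loci: candidate loci of one MMR."""
    support = {locus: umr_counts.get(locus, 0) for locus in mmr_loci}
    total = sum(support.values())
    if total == 0:                      # no unique evidence: split equally
        return {locus: 1 / len(mmr_loci) for locus in mmr_loci}
    return {locus: s / total for locus, s in support.items()}

umr = {"geneA": 90, "geneB": 10}
print(rescue_mmr(umr, ["geneA", "geneB"]))  # {'geneA': 0.9, 'geneB': 0.1}
```

The EM approach of [84] iterates this idea: the fractional counts update the expression estimates, which in turn update the allocation, so a single pass of the sketch above corresponds to one EM iteration.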

One possible reason for reads not mapping to the genome or to the splice junctions is a higher rate of sequencing errors in the read. Other reasons include polymorphisms, insertions/deletions, complex exon-exon junctions, miRNA and small ncRNA: such situations could potentially be recovered by more sophisticated or combined alignment strategies.

Once mapping is completed, the user can display and explore the alignment on a genome browser (see Figure 4 for a screenshot example) such as the UCSC Genome Browser [85] or the Integrative Genomics Viewer (IGV), or on specifically devoted browsers such as EagleView [86], MapView [87] or Tablet [88], which can provide highly informative views of the results at different levels of aggregation. Such tools make it possible to integrate the obtained alignment with database annotations and other sources of information, to distinguish specific polymorphisms from sequencing errors, to identify well-documented artifacts due to DNA amplification, and to detect other sources of problems such as non-uniformity of the read coverage across the transcript. Unfortunately, in many cases the direct visualization of the data is hampered by the lack of a common output format for the alignment algorithms, causing a tremendous amount of extra work in format conversion for visualization purposes, feature extraction and other downstream analysis. Only recently has the SAM (Sequence Alignment/Map) format [89] been proposed as a possible standard for storing read alignments against reference sequences.

Strand-specific read distribution in the UCSC Genome Browser and IGV. (a) UCSC Genome Browser showing an example of stranded sequences generated by an RNA-Seq experiment on an NGS platform. In particular, the screenshot—of a characteristic "tail to tail" orientation of two human genes—clearly shows the specific expression on both strands where these two genes overlap, indicating that the strandedness of reads is preserved. (b) The same genomic location in the IGV browser, showing the read (coloured blocks) distribution along the TMED1 gene. The grey arrows indicate the sense of transcription. (c) A greater magnification of the reads mapping to the same region at nucleotide level, useful for SNP analysis. The chromosome positions are shown at the top and the genomic loci of the genes at the bottom of each panel.
3.5. Quantifying Gene Expression and Isoforms' Abundance

Browser-driven analyses are very important for visualizing the quality of the data and for interpreting specific events on the basis of the available annotations and mapped reads. However, they only provide a qualitative picture of the phenomenon under investigation, and the enormous amount of data does not make it easy to focus on the most relevant details. Hence, the second phase of most RNA-Seq pipelines consists of the automatic quantification of transcriptional events across the entire genome (see Figure 4). From this point of view, the interest lies both in quantifying known elements (i.e., genes or exons already annotated) and in detecting new transcribed regions, defined as transcribed segments of DNA not yet annotated as exons in databases. The ability to detect these unannotated, yet biologically relevant, regions is one of the main advantages of RNA-Seq over microarray technology. Usually, the quantification step is preliminary to any differential expression approach; see Figure 5.

Mapping and quantification of the signal. RNA-Seq experiments produce short reads sequenced from processed mRNAs. When a reference genome is available, the reads can be mapped onto it using efficient alignment software. Classical alignment tools will accurately map reads that fall within an exon, but they will fail to map spliced reads. To handle this problem, suitable mappers, based either on a junction library or on more sophisticated approaches, need to be used. After the mapping step, annotated features can be quantified.

To derive a quantitative expression value for annotated elements (such as exons or genes) within a genome, the simplest approach is to report the total number of reads mapping to the coordinates of each annotated element. In its classical form, this method weights all reads equally, even though they map to the genome with different stringency. Alternatively, gene expression can be calculated as the sum of the number of reads covering each base position of the annotated element; in this way the expression is provided in terms of base coverage. In both cases, the results depend on the accuracy of the gene models used, and the quantitative measures are a function of the number of mapped reads, the length of the region of interest and the molar concentration of the specific transcript. A straightforward solution to account for the sample size effect is to normalize the observed counts by the length of the element and the number of mapped reads. In [37], the authors proposed Reads Per Kilobase per Million mapped reads (RPKM) as a normalized quantitative measure for comparing both different genes within the same sample and differences of expression across biological conditions. In [84], the authors considered two alternative measures of relative expression: the fraction of transcripts and the fraction of nucleotides of the transcriptome made up by a given gene or isoform.
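The RPKM normalization of [37] is a simple formula: reads mapped to a feature, divided by the feature length in kilobases and by the total mapped reads in millions.

```python
# RPKM: reads per kilobase of feature per million mapped reads [37].

def rpkm(feature_reads, feature_length_bp, total_mapped_reads):
    return feature_reads / ((feature_length_bp / 1e3) *
                            (total_mapped_reads / 1e6))

# 500 reads on a 2 kb transcript in a library of 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))  # -> 25.0
```

Dividing by length makes genes of different sizes comparable within a sample, while dividing by library size makes the same gene comparable across samples of different sequencing depth.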

Although apparently easy to obtain, RPKM values can differ between software packages in ways hidden at first sight, due to the lack of clear documentation of the analysis algorithms used. For example, ERANGE [37] uses a union of known and new exon models to aggregate reads and determines a value for each region that includes spliced reads and assigned multireads too, whereas [30, 40, 81, 90] are restricted to known or prespecified exon/gene models. Moreover, as noted in [91], several experimental issues influence the RPKM quantification, including the integrity of the input RNA, the extent of ribosomal RNA remaining in the sample, the size selection steps and the accuracy of the gene models used.

In principle, RPKMs should reflect the true RNA concentration; this holds when samples have relatively uniform sequence coverage across the entire gene model. The problem is that all current protocols fall short of providing the desired uniformity; see for example [37], where the Kolmogorov-Smirnov statistic is used to compare the observed read distribution on each selected exon model with the theoretical uniform one. Similar conclusions are also illustrated in [57, 58], among others.
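The uniformity check of [37] can be sketched with a stdlib-only one-sample Kolmogorov-Smirnov statistic (the positions and exon length below are illustrative): read-start positions within an exon are rescaled to [0, 1] and compared against the uniform CDF.

```python
# One-sample KS statistic of read-start positions vs. the uniform CDF.

def ks_uniform(read_starts, exon_length):
    xs = sorted(p / exon_length for p in read_starts)
    n = len(xs)
    # D = max distance between the empirical CDF and the uniform CDF F(x) = x
    return max(max((i + 1) / n - x, x - i / n)
               for i, x in enumerate(xs))

uniform_like = [100, 250, 400, 550, 700, 850]   # spread over a 1000 bp exon
biased = [900, 920, 940, 960, 980, 990]          # strong 3' bias
print(ks_uniform(uniform_like, 1000) < ks_uniform(biased, 1000))  # True
```

A large statistic flags exons whose coverage departs from uniformity, exactly the situation in which an RPKM value stops reflecting the true RNA concentration.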

Additionally, it should be noted that the RPKM measure should not be considered a panacea for all RNA-Seq experiments. Despite the importance of the issue, expression quantification has not received the necessary attention from the community, and in most cases the choice of measure has been made without regard to the fact that the main question is usually the detection of differentially expressed elements. Regarding this point, in [92] the authors illustrate the inherent transcript-length bias that affects RNA-Seq experiments. In fact, the total number of reads for a given transcript is roughly proportional to both the expression level and the length of the transcript. In other words, a long transcript will have more reads mapping to it than a short gene of similar expression. Since the power of an experiment is proportional to the sampling size, there is more statistical power to detect differential expression for longer genes. Therefore, short transcripts will always be at a statistical disadvantage relative to long transcripts in the same sample. RPKM-type measures provide an expression level normalized by the length of the gene, but this only apparently solves the problem: it gives an unbiased measure of the expression level, but it also changes the variance of the data in a length-dependent manner, resulting in the same bias in differential expression estimation. In order to account for this inherent bias, the authors of [92] proposed a fixed-length window approach, with a window size smaller than the smallest gene. This method calculates aggregated tag counts for each window and then assesses them for differential expression. However, since the analysis is performed at the window level, some proportion of the data is discarded; moreover, such an approach suffers from reduced power, and highly expressed genes are more likely to be detected because the sample variance decreases with the expression level.
Indeed, it should be noted that the sample variance depends on both the transcript length and the expression level.
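The RPKM index itself is a simple normalization, RPKM = 10^9 C / (N L), with C the reads mapped to the gene, N the total mapped reads, and L the gene length in bp. The short sketch below also illustrates the point above: the normalization equalizes the level, but the underlying counts (and hence the power of a count-based test) still grow with length.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """RPKM = 10^9 * C / (N * L), with C reads mapped to the gene,
    N total mapped reads, and L the gene length in base pairs."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)

# Two genes with the same per-base expression: the 10x longer gene
# collects 10x the reads, and RPKM correctly equalizes the level...
short_gene = rpkm(100, 1_000, 10_000_000)
long_gene = rpkm(1_000, 10_000, 10_000_000)
print(short_gene, long_gene)  # 10.0 10.0

# ...but a count-based test still compares 100 versus 1000 reads, so
# the longer gene retains more statistical power: the length bias
# moves into the variance rather than disappearing.
```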

Finally, we observe that annotation files are often inaccurate: boundaries are not always mapped precisely, and ambiguities and overlaps among transcripts often occur and are not yet completely resolved. Concerning this issue, in [93] the authors proposed a method based on the definition of “union-intersection genes” to define the genomic region of interest, together with normalized absolute and relative expression measures within it. Also in this case, we observe that all strategies work much better with data that preserve RNA strandedness, which is extremely valuable information for transcriptome annotation, especially for regions with overlapping transcription from opposite directions.

The quantification methods described above do not account for new transcribed regions. Although several studies have already demonstrated that RNA-Seq experiments, with their high resolution and sensitivity, have great potential in revealing many new transcribed regions undetectable by microarrays, the detection of new transcribed regions is mainly carried out by means of sliding-window and heuristic approaches. In [94], stretches of contiguous expression in intergenic regions are identified after removing all UTRs from the intergenic search space, using a combination of information from tiling-chip and sequence data together with visual inspection and manual curation. The complexity of this procedure is mainly due to the lack of strandedness information in their experiment; by contrast, the hybridization data are less affected by these issues because they distinguish transcriptional direction and do not show any 5′ bias (see [94] for further details). New transcribed regions are then required to have a length of at least 70 bp and an average sequence coverage of 5 reads per bp. A similar approach, with different choices of threshold and window, was proposed in [40], where the authors investigated both intergenic and intronic regions; the parameter choices are assessed by estimating noise levels by means of a Poisson model of the noncoding part of the genome. In [45], the whole genome is split into non-overlapping 50 bp windows. A genomic region is called a new transcribed region if it results from the union of at least two consecutive windows, with at least two sequence reads mapped per window. Additionally, the gap between new transcribed regions should be at least 50 bp, and the gap between a new transcribed region and an annotated gene (on the same strand) at least 100 bp. A slightly more sophisticated approach is used in ERANGE [37]: reads that do not fall within known exons are aggregated into candidate exons by requiring regions with at least 15 reads whose starts are separated by no more than 30 bp. Most of the candidate exons are then assigned to neighboring gene models when they are within a specifiable distance of the model.
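The 50 bp window heuristic of [45] can be sketched roughly as follows; this is an illustrative reimplementation of the stated thresholds, not the authors' code, and the function name and defaults are ours:

```python
def novel_regions(read_starts, genome_length, window=50, min_reads=2,
                  min_windows=2, merge_gap=50):
    """Sliding-window heuristic: call a candidate new transcribed
    region wherever at least `min_windows` consecutive windows each
    collect at least `min_reads` reads, then merge candidate regions
    separated by less than `merge_gap` bp."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    # Collect runs of consecutive qualifying windows.
    regions, run_start = [], None
    for i, c in enumerate(counts + [0]):       # sentinel closes last run
        if c >= min_reads and run_start is None:
            run_start = i
        elif c < min_reads and run_start is not None:
            if i - run_start >= min_windows:
                regions.append((run_start * window, i * window))
            run_start = None
    # Merge regions whose gap is below the threshold.
    merged = []
    for start, end in regions:
        if merged and start - merged[-1][1] < merge_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# Six reads clustered in 100-170 fill two consecutive windows.
print(novel_regions([100, 110, 120, 150, 160, 170], 1000))
```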

These studies, among others, have revealed many such new transcribed regions. Unfortunately, most of them do not seem to encode any protein, and hence their functions often remain to be determined. In any case, these new transcribed regions, combined with many undiscovered splicing variants, suggest that there is considerably more transcript complexity than previously appreciated, which further RNA-Seq experiments and more sophisticated analysis methods may disclose.

The complexity of mammalian transcriptomes is also compounded by alternative splicing, which allows one gene to produce multiple transcript isoforms. Alternative splicing includes events such as exon skipping, alternative 5′ or 3′ splice sites, mutually exclusive exons, intron retention, and “cryptic” splice sites (see Figure 6). The frequency of occurrence of alternative splicing events is still underestimated. However, it is well known that the multiple transcript isoforms produced from a single gene can lead to protein isoforms with distinct functions, and that alternative splicing is widely involved in different physiological and pathological processes. One of the most important advantages of RNA-Seq experiments is the possibility of understanding and comparing the transcriptome at the isoform level (see [95, 96]). In this context, two computational problems need to be solved: the detection of the different isoforms and their quantification in terms of transcript abundance.

Figure 6: Alternative splicing. Schematic representation of the possible patterns of alternative splicing of a gene. Boxes are discrete exons that can be independently included or excluded from the mRNA transcript. Light blue boxes represent constitutive exons; violet and red boxes are alternatively spliced exons. Dashed lines represent alternative splicing events. (a) Canonical exon skipping; (b) alternative 5′ splicing; (c) alternative 3′ splicing; (d) mutually exclusive splicing event involving the selection of only one from two or more exon variants; (e) intra-exonic “cryptic” splice site causing the exclusion of a portion of the exon from the transcript; (f) usage of new alternative 5′ splice sites.

Initial proposals for solving these problems were essentially based on gene-by-gene manual inspection, usually focusing on detecting the presence of alternative splicing forms rather than on their quantification. For example, knowledge of exon-exon junction reads and of junctions falling into isoform-specific regions can provide useful information for identifying different isoforms. The reliability of a splicing junction is usually assessed, in a somewhat heuristic fashion, by counting features such as the number of reads mapping to the junction, the number of mismatches on each mapped read, the mapping position on the junction, and the location of the mismatches. Unfortunately, these techniques cannot be scaled to the genome level, and they suffer from high false positive and false negative rates.

Following the above-mentioned ideas, in [40] the authors detected junctions by computing the probability that a read of a given length hits a splice junction of a given length by chance with at most a certain number of mismatches. In [95], the authors used several features similar to those described above to train classifiers based on logistic regression for splicing-junction detection. In [97], the authors introduced a new metric to measure the quality of each junction read; they then estimated the distribution of this metric with respect to both known exon splice junctions and random splice junctions, and implemented an empirical statistical model to detect exon junctions by evaluating the probability that an observed alignment distribution comes from a true junction.
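The random-hit probability underlying such filters can be illustrated with a simple binomial computation; this is a hedged sketch, not the model of [40], assuming each base of a read matches a random position independently with probability 1/4:

```python
from math import comb

def random_hit_prob(read_len, max_mismatch, p_match=0.25):
    """Probability that a read matches a fixed position purely by
    chance with at most `max_mismatch` mismatches, assuming each base
    matches independently with probability `p_match`."""
    return sum(comb(read_len, k) * (1 - p_match) ** k
               * p_match ** (read_len - k)
               for k in range(max_mismatch + 1))

def expected_random_junction_hits(read_len, max_mismatch,
                                  junction_len, n_junctions):
    """Expected number of chance alignments of one read across a
    library of junction sequences."""
    positions = n_junctions * max(junction_len - read_len + 1, 1)
    return positions * random_hit_prob(read_len, max_mismatch)

# A 32 bp read with up to 2 mismatches almost never hits by chance,
# even across a large junction library.
print(expected_random_junction_hits(32, 2, 60, 100_000))
```

Allowing more mismatches or shortening the reads raises this expectation quickly, which is why junction filters trade read length against the mismatch budget.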

The simple detection of specific isoforms does not provide useful information about their quantitative abundance. In principle, the quantification methods described above are equally applicable to isoform expression. In practice, however, it is difficult to compute isoform-specific expression because most reads mapped to a gene are shared by more than one isoform, so it is difficult to assign each read to a single isoform. As a consequence, the assignment should rely on inferential methods that consider all the data mapping to a certain region.

Several proposed methods for inferring isoform abundance are based on preliminary knowledge of a precise isoform annotation, on the assumption of a uniform distribution of reads across the transcript, on a Poisson model for the read counts, and on equal weight for each read regardless of the quality of the match. The methods are often limited to cases where there is a relatively small number of isoforms without confounding effects due to overlap between genes. In particular, in [98] the authors showed that the complexity of some isoform sets may still render the estimation problem nonidentifiable with current RNA-Seq protocols, and derived a mathematical characterization of identifiable isoform sets. The main reason for this effect is that current protocols with short single-end reads can only assess local properties of a transcript. It is possible that combining short-read data with longer reads or paired-end reads will go further in addressing such challenges.

Recently, in [90] the authors proposed a statistical method where, similarly to [34], the count of reads falling into an annotated gene with multiple isoforms is modeled as a Poisson variable. They inferred the expression of each individual isoform using a maximum likelihood approach, whose solution was obtained by solving a convex optimization problem. In order to quantify the degree of uncertainty of the estimates, they carried out statistical inference on the parameters from the posterior distribution by importance sampling. Interestingly, they showed that their method can be viewed as an extension of the RPKM concept and reduces to the RPKM index when there is only one isoform. An attempt to relax the assumption of uniform read sampling is proposed in [84]. In this paper, the authors unified the notion of reads that map to multiple locations, that is, reads that could potentially be assigned to several genes, with that of reads that map to multiple isoforms, through the introduction of latent random variables representing the true mappings. They then estimated isoform abundances as the maximum likelihood expression levels using the EM algorithm. The Poisson distribution is also the main assumption in [99], where a comprehensive approach to the problem of alternative isoform prediction is presented. In particular, the presence of alternative splicing events within the same sample is assessed using Pearson’s chi-square test on the parameters of a multinomial distribution, and the EM algorithm is used to estimate the abundance of each isoform.
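The EM idea for resolving reads shared among isoforms can be sketched as follows; this is a deliberately minimal toy version (effective transcript lengths and mapping qualities are ignored), not the estimator of [84]:

```python
import numpy as np

def em_isoform_abundance(compat, n_iter=200):
    """EM sketch for isoform abundance: `compat[r][i]` is 1 when read
    r is compatible with isoform i. The latent variables are the true
    read-to-isoform assignments; `theta` holds relative abundances."""
    compat = np.asarray(compat, dtype=float)
    n_reads, n_iso = compat.shape
    theta = np.full(n_iso, 1.0 / n_iso)
    for _ in range(n_iter):
        # E-step: expected assignment of each read to each isoform,
        # proportional to the current abundance estimates.
        weights = compat * theta
        weights /= weights.sum(axis=1, keepdims=True)
        # M-step: re-estimate abundances from the expected counts.
        theta = weights.sum(axis=0) / n_reads
    return theta

# Three reads specific to isoform A plus one read shared with B: EM
# progressively reassigns the shared read to the dominant isoform.
theta = em_isoform_abundance([[1, 0], [1, 0], [1, 0], [1, 1]])
print(theta)
```

Each iteration increases the likelihood of the observed read-compatibility data, and the shared read ends up credited almost entirely to the isoform supported by the specific reads.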

3.6. Differential Expression

The final goal in the majority of transcriptome studies is to quantify expression differences across multiple samples, in order to capture differential gene expression and to identify sample-specific alternative splicing isoforms and their differential abundance.

Mimicking the methods used for microarray analysis, researchers started to approach this crucial question using statistical hypothesis tests, combined with multiple-comparison error procedures, on the observed counts (or on the RPKM values) at the gene, isoform, or exon level. For example, in [30] the authors applied the empirical Bayes moderated t-test proposed in [100] to the normalized RPKM. However, in microarray experiments the abundance of a particular transcript is measured as a fluorescence intensity that can be effectively modeled as a continuous response, whereas for RNA-Seq data the abundance is usually a count. Therefore, procedures that are successful for microarrays do not seem appropriate for this type of data.

One of the pioneering works to handle this difference is [34], where the authors modeled the aggregated read count for each gene using a Poisson distribution. Under the assumption that RNA-Seq reads follow a random sampling process, in which each read is sampled independently and uniformly from every possible nucleotide in the sample, one can prove that the number of reads observed from a gene (or transcript isoform) follows a binomial distribution, which can be approximated by a Poisson distribution. In this setup, the authors of [34] used a likelihood ratio test to test for significant differences between the two conditions. The Poisson model was also employed in [40], where the authors used the method proposed in [101] to determine the significance of differential expression. By contrast, in [83] the authors simply estimated the difference in expression of a gene between two conditions through the difference of the count proportions, computed using a classical z-test statistic. In [18], the authors employed Fisher's exact test to better weigh genes with relatively small counts. Similarly, in [99] the authors used a Poisson model and Fisher's exact test to detect alternative exon usage between conditions.
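As an illustration of these count-based tests, Fisher's exact test on a 2x2 table of gene reads versus remaining reads can be run directly with scipy; this is a minimal sketch of the general idea, not the exact procedure of [18]:

```python
from scipy import stats

def fisher_de_test(gene_a, total_a, gene_b, total_b):
    """Fisher's exact test on a 2x2 table: reads mapping to the gene
    versus all other mapped reads, in each of two conditions."""
    table = [[gene_a, total_a - gene_a],
             [gene_b, total_b - gene_b]]
    odds_ratio, pvalue = stats.fisher_exact(table)
    return pvalue

# A gene with 50 reads out of 1e6 in condition A versus 120 out of
# 1e6 in condition B is called differentially expressed...
print(fisher_de_test(50, 1_000_000, 120, 1_000_000))
# ...while 50 versus 52 reads is not.
print(fisher_de_test(50, 1_000_000, 52, 1_000_000))
```

Being exact, the test remains valid for genes with very small counts, where the normal approximation behind a z-test on proportions breaks down.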

Recently, more sophisticated approaches have been proposed in [102, 103]. In [102], the authors proposed an empirical Bayesian approach based on the negative binomial distribution; it is very flexible and reduces to the Poisson model for a particular choice of the hyperparameter. They carried out differential expression testing using a moderated Bayes approach similar in spirit to the one described in [100], but adapted to count data. We note that the method is designed for finding changes between two or more groups when at least one of the groups has replicated measurements. In [103], the observed count of reads mapped to a specific gene in a given sample is modeled using a binomial distribution. Under this assumption, it can be proved that the log ratio between the two samples, conditioned on the intensity signal (i.e., the average of the two log counts), follows an approximately normal distribution, which is used to assess the significance of the test. All the above-mentioned methods assume that the quantification of the features of interest under the experimental conditions has already been carried out and that each read has been assigned to only one element; hence the methods are directly applicable to detecting gene or exon differences, provided that overlapping elements are properly filtered out. By contrast, the above methods are not directly suited to detecting isoform differences unless the quantification of isoform abundance has been carried out using specific approaches. To handle these difficulties, in [104] the authors proposed a hierarchical Bayesian model to directly infer the differential expression level of each transcript isoform between two conditions. The difference in expression of each isoform is modeled by means of an inverse gamma model, and a latent variable is introduced to guide isoform selection. The model can handle the heteroskedasticity of the sequence read coverage, and inference is carried out using a Gibbs sampler.

It should be noted that although these techniques already provide interesting biological insights, they have been neither sufficiently validated on real data sets where different types of replicates are available, nor sufficiently compared with each other in terms of advantages and disadvantages. As with any new biotechnology, it is important to carefully study the different sources of variation that can affect measurement of the biological effects of interest and to statistically assess the reproducibility of the biological findings in a rigorous way; to date this has often been omitted. Indeed, a variety of experimental effects could increase the variability or the bias of sequencing-based measures, or be confounded with them, causing misinterpretation of the results. Unfortunately, such problems have received little attention until now. In order to fill this gap, in [93] the authors presented a statistical inference framework for transcriptome analysis using RNA-Seq mapped read data. In particular, they proposed a new statistical method based on log-linear regression for investigating relationships between read counts and the biological and experimental variables describing the input samples, as well as the genomic regions of interest. The main advantage of the log-linear regression approach is that it accounts for both biological effects and a variety of experimental effects. Their paper represents one of the few attempts to look at the analysis of RNA-Seq data from a general point of view.
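The log-linear modelling idea can be illustrated with a small Poisson regression fitted by iteratively reweighted least squares; this is a self-contained numerical sketch, not the method of [93], with library sizes entering as an offset:

```python
import numpy as np

def poisson_loglinear(X, y, offset=None, n_iter=25):
    """Fit a log-linear (Poisson) regression by iteratively
    reweighted least squares: log E[y] = X @ beta + offset.
    A minimal sketch (no standard errors, no convergence checks)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    offset = np.zeros(len(y)) if offset is None else np.asarray(offset)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)      # current fitted means
        z = X @ beta + (y - mu) / mu        # working response
        W = mu                              # Poisson IRLS weights
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# Counts for one gene in 6 samples, 3 per condition; the offset is
# the log library size, and exp(beta[1]) estimates the fold change.
X = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]])
y = np.array([50, 60, 55, 110, 120, 100])
beta = poisson_loglinear(X, y, offset=np.log(np.full(6, 1e6)))
print(np.exp(beta[1]))  # close to 2: the condition doubles the count
```

Extra covariates (lane, batch, GC content of the region) can simply be added as columns of X, which is what makes the log-linear formulation a convenient way to separate biological from experimental effects.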

4. Challenges and Perspective for NGS

From the development of the Sanger method to the completion of the HGP, genetics has made significant advances toward the understanding of gene content and function. Even though significant achievements were reached by the Human Genome, HapMap, and ENCODE Projects [7, 105, 106], we are far from an exhaustive comprehension of the genomic diversity among humans and across species, and from understanding gene expression variation and its regulation in both physiological and pathological conditions. When the first NGS platforms appeared in 2004, it was clear that understanding this diversity at a cost of around $5–10 million per genome sequence [107] placed it beyond the means of most research laboratories, and very far from the economic possibilities of single individuals. Today we are in the “$1,000 genome” era, and, although this important barrier has not yet been broken, it is commonly assumed that this target will be reached by the end of 2010. It is likely that the rapidly evolving DNA sequencing technology, which gives researchers the ability to generate data about genetic variation and patterns of gene expression at an unprecedented scale, will become a routine tool for researchers and clinicians within just a few years.

As we can see, the number of applications and the great number of biological questions that can be addressed by “Seq” experiments on NGS platforms are leading a revolution in the landscape of molecular biology, but the imbalance between the pace at which technological innovations are introduced in the platforms and the biological discoveries derivable from them is growing. The risk is the creation of a glut of “under-used” information that becomes useless within a few months as new data are produced. To maximize the potential benefit of the current generation of sequencing instruments, it is necessary to invest in an equivalent development of new computational strategies and expertise to deal with the volumes of data they create.

These platforms are opening a new world to explore, not only in the definition of the experimental and technical procedures of large-scale analyses, but also in the downstream computational analysis and in the bioinformatics infrastructure required for high-quality data generation and correct biological interpretation. In practice, they have shifted the bottleneck from the generation of experimental data to their management and to their statistical and computational analysis. There are a few key points to consider. The first is data management: downstream computational analysis becomes difficult without an appropriate Information Technology (IT) infrastructure. The terabytes of data produced by each sequencing run require considerable storage and backup capacity, which increases the experimental costs considerably. The second regards the protocols used for the production of raw data: each platform has its peculiarities in both sample preparation and the type and volume of raw data produced, so each requires individualized laboratory expertise and data processing pipelines. Third, besides vendor-specific and commercial software, several other open-source analysis tools are continuously appearing. Unfortunately, documentation is often incomplete, and it is easy to spend more time evaluating software suites than analyzing the output data. Whichever software is used, the most important point is to understand its limitations and assumptions. Community adoption of input/output data standards is also essential to efficiently handle the data management problem. Until now, effort has been devoted mainly to technological development rather than to its methodological counterpart, and the choice of a careful experimental design has not always been adequately considered.

As regards RNA-Seq, we still have to face several critical issues, from both a biological and a computational point of view. RNA-Seq protocols are extremely sensitive and need very careful quality control at each wet-laboratory step. For instance, the contamination of reagents with RNase and the degradation of RNA, even partial, must be avoided during all the technical procedures. The quality of the total isolated RNA is the first, and probably the most crucial, point for an RNA-Seq experiment. Poor yield of polyA enrichment or low efficiency of total RNA ribodepletion are also critical issues in preparing high-quality RNA for library construction. It is clear that, independently of the library construction procedure, particular care should be taken to avoid complete degradation of RNA during the controlled RNA fragmentation step. Furthermore, in order to correctly determine the directionality of gene transcription and to facilitate the detection of opposing and overlapping transcripts within gene-dense genomic regions, particular care should be taken to preserve the strandedness of RNA fragments during library preparation. In addition, to provide more uniform coverage throughout the transcript length, random priming for reverse transcription, rather than oligo dT priming (with its bias of low coverage at the 5′ ends), should be performed after removal of rRNA. Finally, it should be considered that for the platforms based on CRT and SBL, substitutions and underrepresentation of AT-rich and GC-rich regions, probably due to amplification bias during template preparation, are the most common error types; by contrast, for pyrosequencing platforms, insertions and deletions are a common drawback.

Concerning data analysis, in addition to the above-mentioned points, we note that most of the available software for read alignment is designed for genomic mapping and hence is not fully capable of discovering exon junctions. The classical extension for handling RNA-Seq data involves the preconstruction of junction libraries, which reduces the possibility of discovering new junctions. It would be desirable to develop new methods that allow both new junction detection and the use of paired-end reads, which are particularly promising for more accurate studies. Additionally, further developments are required to assess the significance of new transcribed regions, the construction of new putative genes, and the precise quantification of each isoform, for which there is still a lack of statistical methodology. Concerning the detection of differential expression, existing techniques have not been sufficiently validated on biological data or compared in terms of specificity and sensitivity. Moreover, of potentially great impact is the frequent lack of biological replicates, which precludes gauging the magnitude of individual effects in relation to technical effects. Biological replicates are essential in an RNA-Seq experiment to draw generalized conclusions about the “real” differences observed between two or more biological groups.

Facing such multidisciplinary challenges will be the key to a fruitful transfer from laboratory studies to clinical applications. Indeed, the availability of low-cost, efficient, and accurate technologies for gene expression profiling and genome sequencing will be useful in providing pathological gene expression profiles in a wide number of common genetic disorders, including type II diabetes, cardiovascular disease, Parkinson's disease, and Down syndrome. Moreover, the application of NGS to the emerging disciplines of pharmacogenomics and nutrigenomics will make it possible to understand drug response and nutrient-gene interactions on the basis of each patient's genetic make-up, leading in turn to the development of targeted therapies or tailored nutrient supplementation for many human diseases [108].


Acknowledgments

We are grateful to the anonymous referees whose valuable comments helped to substantially improve the paper. This work was supported by the CNR-Bioinformatics Project.


References

  1. D. D. Licatalosi and R. B. Darnell, “RNA processing and its regulation: global insights into biological networks,” Nature Reviews Genetics, vol. 11, no. 1, pp. 75–87, 2010.
  2. V. E. Velculescu, L. Zhang, W. Zhou et al., “Characterization of the yeast transcriptome,” Cell, vol. 88, no. 2, pp. 243–251, 1997.
  3. J. Lindberg and J. Lundeberg, “The plasticity of the mammalian transcriptome,” Genomics, vol. 95, no. 1, pp. 1–6, 2010.
  4. W. F. Doolittle and C. Sapienza, “Selfish genes, the phenotype paradigm and genome evolution,” Nature, vol. 284, no. 5757, pp. 601–603, 1980.
  5. R. J. Taft, M. Pheasant, and J. S. Mattick, “The relationship between non-protein-coding DNA and eukaryotic complexity,” BioEssays, vol. 29, no. 3, pp. 288–299, 2007.
  6. T. Cavalier-Smith, “Cell volume and the evolution of eukaryote genome size,” in The Evolution of Genome Size, T. Cavalier-Smith, Ed., pp. 105–184, John Wiley & Sons, Chichester, UK, 1985.
  7. E. Birney, J. A. Stamatoyannopoulos, A. Dutta et al., “Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project,” Nature, vol. 447, no. 7146, pp. 799–816, 2007.
  8. A. Jacquier, “The complex eukaryotic transcriptome: unexpected pervasive transcription and novel small RNAs,” Nature Reviews Genetics, vol. 10, no. 12, pp. 833–844, 2009.
  9. Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolutionary tool for transcriptomics,” Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
  10. R. W. Holley, “Alanine transfer RNA,” in Nobel Lectures in Molecular Biology 1933–1975, pp. 285–300, Elsevier North Holland, New York, NY, USA, 1977.
  11. A. M. Maxam and W. Gilbert, “A new method for sequencing DNA,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 2, pp. 560–564, 1977.
  12. F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequencing with chain-terminating inhibitors,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 12, pp. 5463–5467, 1977.
  13. E. R. Mardis, “Next-generation DNA sequencing methods,” Annual Review of Genomics and Human Genetics, vol. 9, pp. 387–402, 2008.
  14. J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
  15. M. L. Metzker, “Sequencing technologies: the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
  16. R. A. Irizarry, D. Warren, F. Spencer et al., “Multiple-laboratory comparison of microarray platforms,” Nature Methods, vol. 2, no. 5, pp. 345–349, 2005.
  17. P. A. C. 't Hoen, Y. Ariyurek, H. H. Thygesen et al., “Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms,” Nucleic Acids Research, vol. 36, no. 21, article e141, 2008.
  18. J. S. Bloom, Z. Khan, L. Kruglyak, M. Singh, and A. A. Caudy, “Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays,” BMC Genomics, vol. 10, article 221, 2009.
  19. M. Harbers and P. Carninci, “Tag-based approaches for transcriptome research and genome annotation,” Nature Methods, vol. 2, no. 7, pp. 495–502, 2005.
  20. M. P. Horan, “Application of serial analysis of gene expression to the study of human genetic disease,” Human Genetics, vol. 126, no. 5, pp. 605–614, 2009.
  21. H. Misu, T. Takamura, N. Matsuzawa et al., “Genes involved in oxidative phosphorylation are coordinately upregulated with fasting hyperglycaemia in livers of patients with type 2 diabetes,” Diabetologia, vol. 50, no. 2, pp. 268–277, 2007.
  22. T. Takamura, H. Misu, T. Yamashita, and S. Kaneko, “SAGE application in the study of diabetes,” Current Pharmaceutical Biotechnology, vol. 9, no. 5, pp. 392–399, 2008.
  23. D. V. Gnatenko, J. J. Dunn, S. R. McCorkle, D. Weissmann, P. L. Perrotta, and W. F. Bahou, “Transcript profiling of human platelets using microarray and serial analysis of gene expression,” Blood, vol. 101, no. 6, pp. 2285–2293, 2003.
  24. C. A. Sommer, E. C. Pavarino-Bertelli, E. M. Goloni-Bertollo, and F. Henrique-Silva, “Identification of dysregulated genes in lymphocytes from children with Down syndrome,” Genome, vol. 51, no. 1, pp. 19–29, 2008.
  25. W. Malagó Jr., C. A. Sommer, C. Del Cistia Andrade et al., “Gene expression profile of human Down syndrome leukocytes,” Croatian Medical Journal, vol. 46, no. 4, pp. 647–656, 2005.
  26. B. T. Wilhelm and J.-R. Landry, “RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing,” Methods, vol. 48, no. 3, pp. 249–257, 2009.
  27. M. N. Bainbridge, R. L. Warren, M. Hirst et al., “Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach,” BMC Genomics, vol. 7, article 246, 2006.
  28. U. Nagalakshmi, Z. Wang, K. Waern et al., “The transcriptional landscape of the yeast genome defined by RNA sequencing,” Science, vol. 320, no. 5881, pp. 1344–1349, 2008.
  29. T. T. Torres, M. Metta, B. Ottenwälder, and C. Schlötterer, “Gene expression profiling by massively parallel sequencing,” Genome Research, vol. 18, no. 1, pp. 172–177, 2008.
  30. N. Cloonan, A. R. R. Forrest, G. Kolle et al., “Stem cell transcriptome profiling via massive-scale mRNA sequencing,” Nature Methods, vol. 5, no. 7, pp. 613–619, 2008.
  31. L. J. Core, J. J. Waterfall, and J. T. Lis, “Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters,” Science, vol. 322, no. 5909, pp. 1845–1848, 2008.
  32. S.-I. Hashimoto, W. Qu, B. Ahsan et al., “High-resolution analysis of the 5′-end transcriptome using a next generation DNA sequencer,” PLoS ONE, vol. 4, no. 1, article e4108, 2009.
  33. H. Li, M. T. Lovci, Y.-S. Kwon, M. G. Rosenfeld, X.-D. Fu, and G. W. Yeo, “Determination of tag density required for digital transcriptome analysis: application to an androgen-sensitive prostate cancer model,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20179–20184, 2008.
  34. J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad, “RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays,” Genome Research, vol. 18, no. 9, pp. 1509–1517, 2008.
  35. R. D. Morin, M. D. O'Connor, M. Griffith et al., “Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells,” Genome Research, vol. 18, no. 4, pp. 610–621, 2008.
  36. R. D. Morin, M. Bainbridge, A. Fejes et al., “Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing,” BioTechniques, vol. 45, no. 1, pp. 81–94, 2008.
  37. A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, “Mapping and quantifying mammalian transcriptomes by RNA-Seq,” Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
  38. R. Rosenkranz, T. Borodina, H. Lehrach, and H. Himmelbauer, “Characterizing the mouse ES cell transcriptome with Illumina sequencing,” Genomics, vol. 92, no. 4, pp. 187–194, 2008.
  39. D. J. Sugarbaker, W. G. Richards, G. J. Gordon et al., “Transcriptome sequencing of malignant pleural mesothelioma tumors,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 9, pp. 3521–3526, 2008.
  40. M. Sultan, M. H. Schulz, H. Richard et al., “A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome,” Science, vol. 321, no. 5891, pp. 956–960, 2008.
  41. Y. W. Asmann, E. W. Klee, E. A. Thompson et al., “3′ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer,” BMC Genomics, vol. 10, article 531, 2009.
  42. I. Chepelev, G. Wei, Q. Tang, and K. Zhao, “Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq,” Nucleic Acids Research, vol. 37, no. 16, article e106, 2009.
  43. J. Z. Levin, M. F. Berger, X. Adiconis et al., “Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts,” Genome Biology, vol. 10, no. 10, article R115, 2009.
  44. C. A. Maher, N. Palanisamy, J. C. Brenner et al., “Chimeric transcript discovery by paired-end transcriptome sequencing,” Proceedings of the National Academy of Sciences of the United States of America, vol. 106, no. 30, pp. 12353–12358, 2009.
  45. D. Parkhomchuk, T. Borodina, V. Amstislavskiy et al., “Transcriptome analysis by strand-specific sequencing of complementary DNA,” Nucleic Acids Research, vol. 37, no. 18, article e123, 2009.
  46. T. E. Reddy, F. Pauli, R. O. Sprouse et al., “Genomic determination of the glucocorticoid response reveals unexpected mechanisms of gene regulation,” Genome Research, vol. 19, no. 12, pp. 2163–2171, 2009. View at: Publisher Site | Google Scholar
  47. F. Tang, C. Barbacioru, Y. Wang et al., “mRNA-Seq whole-transcriptome analysis of a single cell,” Nature Methods, vol. 6, no. 5, pp. 377–382, 2009. View at: Publisher Site | Google Scholar
  48. R. Blekhman, J. C. Marioni, P. Zumbo, M. Stephens, and Y. Gilad, “Sex-specific and lineage-specific alternative splicing in primates,” Genome Research, vol. 20, no. 2, pp. 180–189, 2010. View at: Publisher Site | Google Scholar
  49. G. A. Heap, J. H. M. Yang, K. Downes et al., “Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing,” Human Molecular Genetics, vol. 19, no. 1, pp. 122–134, 2010. View at: Publisher Site | Google Scholar
  50. D. Raha, Z. Wang, Z. Moqtaderi et al., “Close association of RNA polymerase II and many transcription factors with Pol III genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 8, pp. 3639–3644, 2010. View at: Publisher Site | Google Scholar
  51. S. Marguerat and J. Bahler, “RNA-Seq: from technology to biology,” Cellular and Molecular Life Sciences, vol. 67, no. 4, pp. 569–579, 2010. View at: Publisher Site | Google Scholar
  52. Y. He, B. Vogelstein, V. E. Velculescu, N. Papadopoulos, and K. W. Kinzler, “The antisense transcriptomes of human cells,” Science, vol. 322, no. 5909, pp. 1855–1857, 2008. View at: Publisher Site | Google Scholar
  53. R. Lister, R. C. O'Malley, J. Tonti-Filippini et al., “Highly integrated single-base resolution maps of the epigenome in Arabidopsis,” Cell, vol. 133, no. 3, pp. 523–536, 2008. View at: Publisher Site | Google Scholar
  54. B. T. Wilhelm, S. Marguerat, I. Goodhead, and J. Bahler, “Defining transcribed regions using RNA-Seq,” Nature Protocols, vol. 5, no. 2, pp. 255–266, 2010. View at: Publisher Site | Google Scholar
  55. N. T. Ingolia, S. Ghaemmaghami, J. R. S. Newman, and J. S. Weissman, “Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling,” Science, vol. 324, no. 5924, pp. 218–223, 2009. View at: Publisher Site | Google Scholar
  56. T. D. Harris, P. R. Buzby, H. Babcock et al., “Single-molecule DNA sequencing of a viral genome,” Science, vol. 320, no. 5872, pp. 106–109, 2008. View at: Publisher Site | Google Scholar
  57. J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer, “Substantial biases in ultra-short read data sets from high-throughput DNA sequencing,” Nucleic Acids Research, vol. 36, no. 16, article e105, 2008. View at: Publisher Site | Google Scholar
  58. O. Harismendy, P. C. Ng, R. L. Strausberg et al., “Evaluation of next generation sequencing platforms for population targeted sequencing studies,” Genome Biology, vol. 10, no. 3, article R32, 2009. View at: Publisher Site | Google Scholar
  59. L. W. Hillier, G. T. Marth, A. R. Quinlan et al., “Whole-genome sequencing and variant discovery in C. elegans,” Nature Methods, vol. 5, no. 2, pp. 183–188, 2008. View at: Publisher Site | Google Scholar
  60. J. D. McPherson, “Next-generation gap,” Nature Methods, vol. 6, no. 11S, pp. S2–S5, 2009. View at: Google Scholar
  61. D. R. Zerbino and E. Birney, “Velvet: algorithms for de novo short read assembly using de Bruijn graphs,” Genome Research, vol. 18, no. 5, pp. 821–829, 2008. View at: Publisher Site | Google Scholar
  62. I. Birol, S. D. Jackman, C. B. Nielsen et al., “De novo transcriptome assembly with ABySS,” Bioinformatics, vol. 25, no. 21, pp. 2872–2877, 2009. View at: Publisher Site | Google Scholar
  63. F. Denoeud, J.-M. Aury, C. Da Silva et al., “Annotating genomes with massive-scale RNA sequencing,” Genome Biology, vol. 9, no. 12, article R175, 2008. View at: Publisher Site | Google Scholar
  64. M. Yassoura, T. Kaplana, H. B. Fraser et al., “Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing,” Proceedings of the National Academy of Sciences of the United States of America, vol. 106, no. 9, pp. 3264–3269, 2009. View at: Publisher Site | Google Scholar
  65. C. Trapnell and S. L. Salzberg, “How to map billions of short reads onto genomes,” Nature Biotechnology, vol. 27, no. 5, pp. 455–457, 2009. View at: Publisher Site | Google Scholar
  66. P. Flicek and E. Birney, “Sense from sequence reads: methods for alignment and assembly,” Nature Methods, vol. 6, supplement 11, pp. S6–S12, 2009. View at: Google Scholar
  67. D. S. Horner, G. Pavesi, T. Castrignanò et al., “Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 2, pp. 181–197, 2009. View at: Publisher Site | Google Scholar
  68. A. Cox, “ELAND: efficient local alignment of nucleotide data,” unpublished, View at: Google Scholar
  69. “Applied Biosystems mappread and whole transcriptome software tools,” View at: Google Scholar
  70. H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008. View at: Publisher Site | Google Scholar
  71. A. D. Smith, Z. Xuan, and M. Q. Zhang, “Using quality scores and longer reads improves accuracy of Solexa read mapping,” BMC Bioinformatics, vol. 9, article 128, 2008. View at: Publisher Site | Google Scholar
  72. R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008. View at: Publisher Site | Google Scholar
  73. R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009. View at: Publisher Site | Google Scholar
  74. B. D. Ondov, A. Varadarajan, K. D. Passalacqua, and N. H. Bergman, “Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications,” Bioinformatics, vol. 24, no. 23, pp. 2776–2777, 2008. View at: Publisher Site | Google Scholar
  75. H. Jiang and W. H. Wong, “SeqMap: mapping massive amount of oligonucleotides to the genome,” Bioinformatics, vol. 24, no. 20, pp. 2395–2396, 2008. View at: Publisher Site | Google Scholar
  76. H. Lin, Z. Zhang, M. Q. Zhang, B. Ma, and M. Li, “ZOOM! Zillions of oligos mapped,” Bioinformatics, vol. 24, no. 21, pp. 2431–2437, 2008. View at: Publisher Site | Google Scholar
  77. B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009. View at: Publisher Site | Google Scholar
  78. D. Campagna, A. Albiero, A. Bilardi et al., “PASS: a program to align short sequences,” Bioinformatics, vol. 25, no. 7, pp. 967–968, 2009. View at: Publisher Site | Google Scholar
  79. N. Cloonan, Q. Xu, G. J. Faulkner et al., “RNA-MATE: a recursive mapping strategy for high-throughput RNA-sequencing data,” Bioinformatics, vol. 25, no. 19, pp. 2615–2616, 2009. View at: Publisher Site | Google Scholar
  80. F. De Bona, S. Ossowski, K. Schneeberger, and G. Rätsch, “Optimal spliced alignments of short sequence reads,” Bioinformatics, vol. 24, no. 16, pp. i174–i180, 2008. View at: Publisher Site | Google Scholar
  81. C. Trapnell, L. Pachter, and S. L. Salzberg, “TopHat: discovering splice junctions with RNA-Seq,” Bioinformatics, vol. 25, no. 9, pp. 1105–1111, 2009. View at: Publisher Site | Google Scholar
  82. G. J. Faulkner, A. R. R. Forrest, A. M. Chalk et al., “A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE,” Genomics, vol. 91, no. 3, pp. 281–288, 2008. View at: Publisher Site | Google Scholar
  83. T. Hashimoto, M. J. L. de Hoon, S. M. Grimmond, C. O. Daub, Y. Hayashizaki, and G. J. Faulkner, “Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite,” Bioinformatics, vol. 25, no. 19, pp. 2613–2614, 2009. View at: Publisher Site | Google Scholar
  84. B. Li, V. Ruotti, R. M. Stewart, J. A. Thomson, and C. N. Dewey, “RNA-Seq gene expression estimation with read mapping uncertainty,” Bioinformatics, vol. 26, no. 4, pp. 493–500, 2009. View at: Publisher Site | Google Scholar
  85. W. J. Kent, C. W. Sugnet, T. S. Furey et al., “The human genome browser at UCSC,” Genome Research, vol. 12, no. 6, pp. 996–1006, 2002. View at: Publisher Site | Google Scholar
  86. W. Huang and G. Marth, “EagleView: a genome assembly viewer for next-generation sequencing technologies,” Genome Research, vol. 18, no. 9, pp. 1538–1543, 2008. View at: Publisher Site | Google Scholar
  87. H. Bao, H. Guo, J. Wang, R. Zhou, X. Lu, and S. Shi, “MapView: visualization of short reads alignment on a desktop computer,” Bioinformatics, vol. 25, no. 12, pp. 1554–1555, 2009. View at: Publisher Site | Google Scholar
  88. I. Milne, M. Bayer, L. Cardle et al., “Tablet-next generation sequence assembly visualization,” Bioinformatics, vol. 26, no. 3, pp. 401–402, 2010. View at: Publisher Site | Google Scholar
  89. H. Li, B. Handsaker, A. Wysoker et al., “The sequence alignment/map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009. View at: Publisher Site | Google Scholar
  90. H. Jiang and W. H. Wong, “Statistical inferences for isoform expression in RNA-Seq,” Bioinformatics, vol. 25, no. 8, pp. 1026–1032, 2009. View at: Publisher Site | Google Scholar
  91. S. Pepke, B. Wold, and A. Mortazavi, “Computation for ChIP-Seq and RNA-Seq studies,” Nature Methods, vol. 6, no. 11S, pp. S22–S32, 2009. View at: Google Scholar
  92. A. Oshlack and M. J. Wakefield, “Transcript length bias in RNA-Seq data confounds systems biology,” Biology Direct, vol. 4, article 14, 2009. View at: Publisher Site | Google Scholar
  93. J. H. Bullard, E. A. Purdom, K. D. Hansen, S. Durinck, and S. Dudoit, “Statistical inference in mRNA-Seq: exploratory data analysis and differential expression,” Tech. Rep. 247/2009, University of California, Berkeley, 2009. View at: Google Scholar
  94. B. T. Wilhelm, S. Marguerat, S. Watt et al., “Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution,” Nature, vol. 453, no. 7199, pp. 1239–1243, 2008. View at: Publisher Site | Google Scholar
  95. Q. Pan, O. Shai, L. J. Lee, B. J. Frey, and B. J. Blencowe, “Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing,” Nature Genetics, vol. 40, no. 12, pp. 1413–1415, 2008. View at: Publisher Site | Google Scholar
  96. E. T. Wang, R. Sandberg, S. Luo et al., “Alternative isoform regulation in human tissue transcriptomes,” Nature, vol. 456, no. 7221, pp. 470–476, 2008. View at: Publisher Site | Google Scholar
  97. L. Wang, Y. Xi, J. Yu, L. Dong, L. Yen, and W. Li, “A statistical method for the detection of alternative splicing using RNA-Seq,” PLoS ONE, vol. 5, no. 1, article e8529, 2010. View at: Publisher Site | Google Scholar
  98. D. Hiller, H. Jiang, W. Xu, and W. H. Wong, “Identifiability of isoform deconvolution from junction arrays and RNA-Seq,” Bioinformatics, vol. 25, no. 23, pp. 3056–3059, 2009. View at: Publisher Site | Google Scholar
  99. H. Richard, M. H. Schulz, M. Sultan et al., “Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments,” Nucleic Acids Research, vol. 38, no. 10, p. e112, 2010. View at: Publisher Site | Google Scholar
  100. G. K. Smyth, “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments,” Statistical Applications in Genetics and Molecular Biology, vol. 3, no. 1, article 3, 2004. View at: Google Scholar
  101. S. Audic and J.-M. Claverie, “The significance of digital gene expression profiles,” Genome Research, vol. 7, no. 10, pp. 986–995, 1997. View at: Google Scholar
  102. M. D. Robinson, D. J. McCarthy, and G. K. Smyth, “edgeR: a bioconductor package for differential expression analysis of digital gene expression data,” Bioinformatics, vol. 26, no. 1, pp. 139–140, 2010. View at: Publisher Site | Google Scholar
  103. L. Wang, Z. Feng, X. Wang, X. Wang, and X. Zhang, “DEGseq: an R package for identifying differentially expressed genes from RNA-Seq data,” Bioinformatics, vol. 26, no. 1, pp. 136–138, 2009. View at: Publisher Site | Google Scholar
  104. S. Zheng and L. Chen, “A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level,” Nucleic Acids Research, vol. 37, no. 10, article e75, 2009. View at: Publisher Site | Google Scholar
  105. F. S. Collins, E. S. Lander, J. Rogers, and R. H. Waterson, “Finishing the euchromatic sequence of the human genome,” Nature, vol. 431, no. 7011, pp. 931–945, 2004. View at: Publisher Site | Google Scholar
  106. International Human Genome Sequencing Consortium, “A haplotype map of the human genome,” Nature, vol. 437, no. 7063, pp. 1299–1320, 2005. View at: Google Scholar
  107. E. R. Mardis, “Anticipating the 1,000 dollar genome,” Genome Biology, vol. 7, no. 7, article 112, 2006. View at: Google Scholar
  108. V. Costa, A. Casamassimi, and A. Ciccodicola, “Nutritional genomics era: opportunities toward a genome-tailored nutritional regimen,” The Journal of Nutritional Biochemistry, vol. 21, no. 6, pp. 457–467, 2010. View at: Publisher Site | Google Scholar


Copyright © 2010 Valerio Costa et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


A History of Genome Sequencing:

The sequencing of the human genome, along with that of related organisms, represents one of the largest scientific endeavors in history. The information garnered from sequencing provides the raw data for the exploding field of bioinformatics, where computer science and biology live in symbiotic harmony. The large-scale sequencing proposed by the Human Genome Project in 1990 could never have become a reality without modern computing facilities. Merely twenty years ago, computers would have been powerless in the face of such a daunting amount and array of data. Homologue identification and genome comparison between organisms, across millions of nucleotides, was unimaginable until the rapid advancement of microchips and processors over the past two decades. In addition, the first sequenced genome of a free-living organism, Haemophilus influenzae, would have been impossible without the computational methods developed at The Institute for Genomic Research (TIGR). While this is not technically an aspect of bioinformatics, the development would have been impossible with the computers of yesterday. The short history of genome sequencing began with Frederick Sanger's invention of sequencing almost twenty-five years ago.

The art of determining the sequence of DNA is known as Sanger sequencing after its brilliant pioneer. The technique involves the separation of fluorescently labeled DNA fragments according to their length on a polyacrylamide gel (PAGE). The base at the end of each fragment can then be visualized and identified by the dye with which it is labeled. The time- and labor-intensive nature of gel preparation and running, as well as the large amount of sample required, increase the time and cost of genomic sequencing. These conditions drastically reduce the efficiency of sequencing projects, ultimately limiting researchers in their sequencing attempts.
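The read-off step can be pictured as a small sorting exercise: each chain-terminated fragment ends in a dye-labeled base, and ordering the fragments by length recovers the sequence. The snippet below is a toy illustration only (the fragment data are hypothetical, and real basecalling is far more involved):

```python
def read_gel(fragments):
    """Reconstruct a sequence from (length, terminal_base) pairs.

    Sorting by fragment length mimics reading a Sanger gel lane
    from the shortest band (first base) to the longest (last base).
    """
    return "".join(base for _, base in sorted(fragments))

# Hypothetical bands as they might come off a reaction, in no particular order.
bands = [(3, "T"), (1, "A"), (4, "G"), (2, "C"), (5, "A")]
print(read_gel(bands))  # -> ACTGA
```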

Bacteriophage φX174, a viral genome of only 5,386 base pairs (bp), was the first genome to be sequenced. Frederick Sanger, in another revolutionary development, invented the method of "shotgun" sequencing, a strategy based on breaking the target genome into random fragments, sequencing them, and assembling them by their overlapping regions into contiguous sequences (otherwise known as contigs). The final step involves the use of custom primers to close the gaps between the contigs, yielding the completely sequenced genome. Sanger first used "shotgun" sequencing five years later to complete the significantly larger (48,502 bp) sequence of bacteriophage λ. This method allowed sequencing projects to proceed at a much faster rate, expanding the scope of realistic sequencing ventures. Since then, several other viral and organellar genomes have been sequenced using similar techniques, including the 229 kb genome of cytomegalovirus (CMV), the 192 kb genome of vaccinia, the 187 kb mitochondrial and 121 kb chloroplast genomes of Marchantia polymorpha, and the 186 kb genome of smallpox.
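The overlap-assembly idea at the heart of shotgun sequencing can be sketched as a greedy merge: repeatedly find the pair of fragments with the longest suffix-prefix overlap and join them. This is a minimal sketch on tiny made-up fragments, not a real assembler (real tools such as the TIGR Assembler must cope with sequencing errors, repeats, and millions of reads):

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b (>= min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap into contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for a in reads:
            for b in reads:
                if a is not b:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, a, b)
        n, a, b = best
        if n == 0:  # no overlaps left: remaining reads stay as separate contigs
            break
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])  # join, keeping the overlap only once
    return reads

# Three overlapping fragments of the (made-up) sequence ATGGCGTGCAATG.
fragments = ["ATGGCGT", "GCGTGCA", "TGCAATG"]
print(greedy_assemble(fragments))  # -> ['ATGGCGTGCAATG']
```

The gap-closing step the text describes corresponds to the case where `greedy_assemble` returns more than one contig: custom primers are then designed to sequence across the junctions.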

The success with viral genome sequencing stemmed from the relatively small size of their genomes. In 1989, André Goffeau set up a European consortium to sequence the genome of the budding yeast Saccharomyces cerevisiae (12.5 Mb). Goffeau's European collaboration involved 74 different laboratories drawn to the project in hopes of sequencing the homologs of their favorite genes. Most laboratories utilized Sanger's "shotgun" method, which had become the accepted standard for genome sequencing. S. cerevisiae had a genome approximately 60 times larger than any sequence previously attempted, which is why Goffeau felt compelled to invite the cooperation of a group of laboratories. At the time, the sequencing of model organisms such as S. cerevisiae appeared to be the logical step toward the eventual characterization of the human genome, a task that seemed beyond the scope of technology due to its tremendous size of 3,000 Mb. Sequencing smaller genomes would expose the problems with sequencing techniques, eventually refining the technology to be used on large-scale projects like H. sapiens. In addition, valuable insight concerning these organisms would be gained with the elucidation of their genetic makeup.

The following year saw the initiation of a plethora of ambitious sequencing proposals, the foremost being the introduction of the Human Genome Project in 1990. The U.S. Human Genome Project (HGP) is a joint effort of the Department of Energy and the National Institutes of Health, designed as a three-step program to produce genetic maps, physical maps, and finally the complete nucleotide sequence of the human chromosomes. The first two aims of the project are practically fulfilled, and the majority of work is now concentrated on the exact nucleotide sequence of the human genome. In the wake of this pronouncement came the start of three projects aimed at elucidating the sequences of smaller model organisms, similar to S. cerevisiae in their academic utility: Escherichia coli, Mycoplasma capricolum, and Caenorhabditis elegans. It was hoped that these projects would increase the efficiency of sequencing, but unfortunately they fell short of this task. Many anticipated that E. coli would be the first genome to be sequenced entirely, but to the shock of the scientific community, an outsider won the race for the first complete genome sequence of a free-living organism: Haemophilus influenzae.

A team headed by J. Craig Venter of The Institute for Genomic Research (TIGR) and Nobel laureate Hamilton Smith of Johns Hopkins University sequenced the 1.8 Mb bacterium with new computational methods developed at TIGR's facility in Gaithersburg, Maryland. Previous sequencing projects had been limited by the lack of adequate computational approaches to assemble the large number of random sequences produced by "shotgun" sequencing. In conventional sequencing, the genome is laboriously broken down into ordered, overlapping segments, each containing up to 40 kb of DNA. These segments are "shotgunned" into smaller pieces and then sequenced to reconstruct the genome. Venter's team took a more comprehensive approach by "shotgunning" the entire 1.8 Mb H. influenzae genome. Previously, such an approach would have failed because no software existed to assemble such a massive amount of information accurately. Software developed by TIGR, called the TIGR Assembler, was up to the task, reassembling the approximately 24,000 DNA fragments into the whole genome. After the H. influenzae genome was "shotgunned" and the clones sufficiently purified, the TIGR Assembler required approximately 30 hours of central processing unit time on a SPARCenter 2000 with half a gigabyte of RAM, testifying to the enormous complexity of the computation.

Venter's H. influenzae project had failed to win funding from the National Institutes of Health, indicating the serious doubts surrounding his ambitious proposal. It simply was not believed that such an approach could accurately sequence the large 1.8 Mb genome of the bacterium. Venter proved everyone wrong and succeeded in sequencing the genome in 13 months at a cost of 50 cents per base, half the cost of, and drastically faster than, conventional sequencing. This new method of sequencing led to a multitude of completed sequences by TIGR over the ensuing years. Mycoplasma genitalium, a bacterium associated with reproductive-tract infections and renowned for having the shortest genome of all free-living organisms, was sequenced by TIGR in a period of eight months between January and August of 1995, an extraordinary example of the efficiency of TIGR's new sequencing method. TIGR subsequently published the first genome sequence of a representative of the Archaea, Methanococcus jannaschii; the first genome sequence of a sulfur-metabolizing organism, Archaeoglobus fulgidus; the genome sequence of the pathogen involved in peptic ulcer disease, Helicobacter pylori; and the genome sequence of the Lyme disease spirochaete, Borrelia burgdorferi.

TIGR's dramatic leadership in the field of genome sequencing was paralleled by the final completion, in 1997, of two of the largest genomic sequences: the bacterium E. coli K-12 and the yeast S. cerevisiae. These projects were the culmination of over seven years of intensive work. The yeast genome was the final result of a tremendous international collaboration of more than 600 scientists from over 100 laboratories, the largest decentralized experiment in modern molecular biology. The final work represented the efforts of scientists from Japan, Europe, Canada, and the United States, producing the largest full-length sequence (12 Mb) ever completed. In an incredible display of organizational mastery, only 3.4% of the total sequencing effort was duplicated among laboratories. The E. coli sequence was considerably smaller (4.6 Mb) but equally important in terms of experimental utility. E. coli is the preferred model in biochemical genetics, molecular biology, and biotechnology, and its genomic characterization will undoubtedly further research toward a more complete understanding of this important experimental, medical, and industrial organism.

At the close of 1997, we are halfway through the time allotted for completing the Human Genome Project, projected to finish on September 30, 2005, approximately fifty years after the landmark paper of Watson and Crick. Currently, the major groups have sequenced approximately 50 Mb of human DNA, representing less than 1.5% of the 3,000 Mb genome. The estimated finish of the human genome by the year 2000 appears quite optimistic considering that the world's large-scale sequencing capacity is approximately 100 Mb per year. To complete the genome, average production must increase to 400 Mb per year. Several factors, including the slow rate of Sanger sequencing and the high accuracy goal of the HGP, which allows for only one error in 10,000 bases, limit the ability of researchers to proceed more quickly. Advancements in Sanger sequencing, or possible replacements for this time-intensive process, will be necessary to meet the HGP's goal of completion by 2005.
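The required-rate claim above can be checked on the back of an envelope using the figures quoted in the text; the 7.75-year horizon (end of 1997 to the September 30, 2005 deadline) is my own reading of the dates, and all numbers are approximate:

```python
# Figures from the text (approximate).
genome_mb = 3000   # total human genome size
done_mb = 50       # sequenced by the close of 1997
years_left = 7.75  # assumed: end of 1997 to the September 30, 2005 deadline

required_rate = (genome_mb - done_mb) / years_left
print(f"required: ~{required_rate:.0f} Mb/year")  # -> required: ~381 Mb/year
```

That is roughly a fourfold increase over the stated ~100 Mb/year world capacity, consistent with the 400 Mb/year figure in the text.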

As of September 1997, thirteen genome sequences of free-living organisms had been completed, including the two largest, E. coli and yeast, and eleven other microbial genomes under 4.2 Mb in length. Four other large-scale projects are in progress: the nematode C. elegans (71% complete), the fruit fly Drosophila melanogaster (6% complete), the mouse (less than 1% complete), and the human (only 1.5% complete). These statistics are impressive considering that only four years ago no completed sequences existed.

The rapid proliferation of biological information in the form of genome sequences has been the major factor in the creation of the field of bioinformatics, which focuses on the acquisition, storage, access, analysis, modeling, and distribution of the many types of information embedded in DNA sequences. The field will be challenged as growing volumes of data place heightened demands on the algorithms currently used for sequence manipulation. The growing sequence knowledge of the human genome has been likened to the establishment of the periodic table in the 19th century. Just as chemists of the past systematically organized the elements in an array that captured their differences and similarities, the Human Genome Project will allow modern scientists to construct a biological periodic table of genes. This table will contain not 100 elements but 100,000 genes, reflecting not similarity in electronic configuration but evolutionary and functional relationships. Bioinformatics will be the tool with which the modern scientist interprets this periodic table of biological information.

For any comments on the paper email the author: Edmund Pillsbury.
Cool Genome Sequencing Sites

To see the pioneers of genomic sequencing, check out The Institute for Genomic Research
and their private affiliate, Human Genome Sciences. Sequences for Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii, Archaeoglobus fulgidus, Helicobacter pylori, and Borrelia burgdorferi can be found at the TIGR site, along with links to the corresponding papers.

Non-commercial sequencing projects exceeding 1 Mb of production (listed from largest to smallest):

Government sequence databases :

1) The National Center for Biotechnology Information (NCBI)---a great resource that includes GenBank, the federal sequence repository where everyone submits sequences.
2) Genome Sequence Database (GSDB) at the National Center for Genome Resources (Santa Fe, New Mexico).
3) The Genome Data Base (GDB)---the worldwide repository for mapping information.

Sequencing projects in progress:

List compiled by Edmund Pillsbury, any problems or questions please email.
