Common Metrics of Assessing DNA Sequence Quality

What are commonly accepted metrics for assessing DNA sequence quality (platform-specific answers are fine)? I am relatively new to this topic, and I want to either find or code (in Python or C++) algorithms for checking sequencing data quality before getting into analysis.


As mentioned in the comments, Phred scores are the quality scores used by most sequencing platforms. This value expresses the probability that a base has been called incorrectly. You can find more information here. These values are stored in a FASTQ file, encoded as ASCII characters.

For example, a Phred score of 30 means a probability of 1 wrong base call in every 1,000 (that is, 99.9% accuracy). In general, values around 30 are considered good enough, but whether you should be more or less restrictive depends on your analysis.

Biopython has some examples of how to plot and filter Phred scores. Lots of other software exists for QC checks, for example this one.

You can also check length and distribution of the reads, duplication levels, etc.

In case you need to trim Illumina sequences, there are several stand-alone tools like this one that you can use.
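If you want to code the check yourself rather than use an existing tool, here is a minimal sketch of a mean-quality read filter, assuming Biopython is installed; the file names are hypothetical:

```python
from Bio import SeqIO  # assumes Biopython is installed

def filter_by_mean_quality(in_fastq, out_fastq, min_mean_q=30):
    """Keep only reads whose mean Phred score is at least min_mean_q."""
    good = (
        rec
        for rec in SeqIO.parse(in_fastq, "fastq")
        if len(rec) > 0
        and sum(rec.letter_annotations["phred_quality"]) / len(rec) >= min_mean_q
    )
    # SeqIO.write consumes the generator and returns the number of records written.
    count = SeqIO.write(good, out_fastq, "fastq")
    print(f"Wrote {count} reads with mean quality >= Q{min_mean_q}")

filter_by_mean_quality("reads.fastq", "filtered.fastq")  # hypothetical file names
```

Note that a whole-read mean-quality cutoff is only one possible policy; trimming low-quality tails (as tools like Trimmomatic do) often salvages more data than discarding whole reads.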


Sequencing Quality Scores

Sequencing quality scores measure the probability that a base is called incorrectly. With sequencing by synthesis (SBS) technology, each base in a read is assigned a quality score by a phred-like algorithm [1, 2], similar to that originally developed for Sanger sequencing experiments.


Q Score Definition

The sequencing quality score of a given base, Q, is defined by the following equation:

Q = -10 log10(e)

where e is the estimated probability of the base call being wrong.

  • Higher Q scores indicate a smaller probability of error.
  • Lower Q scores can result in a significant portion of the reads being unusable. They may also lead to increased false-positive variant calls, resulting in inaccurate conclusions.

As shown below, a quality score of 20 represents an error rate of 1 in 100, with a corresponding call accuracy of 99%.


Relationship Between Sequencing Quality Score and Base Call Accuracy

Quality Score | Probability of Incorrect Base Call | Inferred Base Call Accuracy
10 (Q10)      | 1 in 10                            | 90%
20 (Q20)      | 1 in 100                           | 99%
30 (Q30)      | 1 in 1000                          | 99.9%
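These conversions follow directly from the defining equation; the short sketch below checks the table's values (plain Python, no external dependencies):

```python
import math

def q_to_error(q):
    """Phred quality score -> error probability: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_q(e):
    """Error probability -> Phred quality score: Q = -10 * log10(e)."""
    return -10 * math.log10(e)

for q in (10, 20, 30):
    e = q_to_error(q)
    # Q10 -> 0.1 (90% accuracy), Q20 -> 0.01 (99%), Q30 -> 0.001 (99.9%)
    print(f"Q{q}: error rate {e}, accuracy {100 * (1 - e):.1f}%")
```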

Illumina Sequencing Quality Scores

Illumina sequencing chemistry delivers high accuracy, with a vast majority of bases scoring Q30 and above. This level of accuracy is ideal for a range of sequencing applications, including clinical research.

Learn how PhiX can be used as an in-run control for run quality monitoring in Illumina NGS.



References
  1. Ewing B, Hillier L, Wendl MC, Green P. (1998): Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8(3):175-185
  2. Ewing B, Green P. (1998): Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8(3):186-194



Background

DNA methylation is the most well-characterised epigenetic mark in humans. It is defined as the addition of a methyl (CH3) group to DNA and in mammalian cells occurs primarily at the cytosine of cytosine-guanine dinucleotides (CpG). DNA methylation can modify the function of regulatory elements and gene expression and is therefore integral to normal human development and biological functioning. Perturbations to normal DNA methylation patterns can lead to dysregulation of cellular processes and are linked with disease. Widespread aberrations in DNA methylation are a well-established hallmark of many cancers [1] and a growing body of literature shows a role for DNA methylation in the aetiology of other complex human diseases including chronic kidney disease [2], type 2 diabetes [3] and neuropsychiatric disease [4].

A full understanding of the role of DNA methylation in health and disease requires the development of tools that can simultaneously measure DNA methylation across large portions of the genome. The current ‘gold standard’ technique for fine mapping of methylated cytosines is whole-genome bisulphite sequencing (WGBS) [5]. This is based on the treatment of genomic DNA with sodium bisulphite, which converts unmethylated cytosines to uracils while leaving methylated cytosines unchanged, followed by whole-genome sequencing [6]. WGBS has been successfully applied to a range of biological tissues and cell lines to provide a complete map of the 28 million CpG sites in the human genome [7]. However, the high cost of this approach and the significant technical expertise currently required to generate and process WGBS data mean that it is not always the most feasible method to interrogate DNA methylation in large cohort studies.

In recent years, the Illumina Infinium BeadChips have provided a popular, user-friendly alternative. Like WGBS, this technology is based on sodium bisulphite conversion of DNA, but with subsequent single base resolution genotyping of targeted CpG sites using probes on a microarray. The advantage of the Infinium platforms is that they are easy to use, time-efficient and cost-effective and show good agreement with DNA methylation measurements from other platforms [8]. For a full comparison of the strengths and weaknesses of different DNA methylation profiling methods, including Infinium methylation arrays, MBDcap-Seq and reduced representation bisulphite sequencing (RRBS), see the recent review by Stirzaker and colleagues [5].

The Infinium methylation technology was first introduced with the HumanMethylation27K BeadChip (HM27) in 2008, which featured 25,578 probes predominantly targeting CpG sites within the proximal promoter region of 14,475 consensus coding sequence (CCDS) genes and well-described cancer genes [8]. Probes were preferentially designed to target CpG islands due to the established relationship between DNA methylation at promoter CpG islands and gene expression [8]. The 12-sample per array format and genome-wide span of HM27 represented a significant advance over previous methods, which were low-throughput and restricted to a small number of genomic loci. HM27 allowed researchers to explore the role of DNA methylation in carcinogenesis and identify cancer biomarkers [9] and for the first time perform large-scale ‘epigenome-wide association studies’ (EWAS), which revealed the associations between DNA methylation patterns and tobacco smoking [10], ageing [11] and other complex human phenotypes.

In 2011, the HM450 BeadChip superseded the HM27 BeadChip. The HM450 retained the 12-sample per array design and featured 485,577 probes, including probes targeting 94 % of the CpG sites on the HM27 [12]. The new content was selected after consultation with a consortium of DNA methylation researchers and comprised a more diverse set of genomic categories, including: CpG islands, shores and shelves, the 5′UTR, 3′UTR and bodies of RefSeq genes, FANTOM4 promoters, the MHC region and some enhancer regions [12]. The improved coverage, together with the high sample throughput, of the HM450 made it a popular tool for EWAS studies and for the generation of reference epigenomes, including the International Cancer Genome Consortium (ICGC) and the International Human Epigenome Consortium (IHEC). Notably, The Cancer Genome Atlas (TCGA) consortium used the HM450 platform to profile more than 7500 samples from over 200 different cancer types [5] and it is the platform of choice for large-scale epidemiological studies such as the ARIES study, which is analysing 1000 mother-child pairs at serial time points across their lifetime [13].

Although the HM450 has been widely embraced by the epigenetics research community, the technology initially presented some technical challenges. Foremost among these were the two probe types on the HM450. In order to assay the new genomic regions included on the HM450, probes with a different chemistry were added. However, the two probe types have different dynamic ranges, introducing potential bias into the DNA methylation measurements. Extensive discussion within the field led to the development of bioinformatics methods that now allow us to address the technical impact of the two probe designs, as comprehensively reviewed by Morris and Beck [14]. Additionally, both the HM27 and HM450 featured a proportion of probes that either hybridised to multiple regions of the genome or targeted genetically polymorphic CpGs [15–17]. However, the thorough identification and annotation of these probes means that we can now easily account for misleading measurements during processing. Finally, DNA methylation changes rarely occur in isolation and are more likely to affect contiguous genomic regions. It was therefore necessary to develop methods to accurately identify these differentially methylated regions (DMRs) from HM450 data. Today, a range of analytical packages is available to researchers for regional methylation analysis, for example [18–20]. In summary, methods for processing and analysis of Infinium methylation BeadChips have matured considerably over recent years and we as a community are now extremely proficient at handling this type of data.

The remaining concern with the HM450 platform was that the probe design missed important regulatory regions. Recent studies using other platforms such as WGBS have demonstrated that DNA methylation at regulatory enhancers can determine transcription and phenotypic variation, through modulation of transcription factor binding. Thus accurate quantification of DNA methylation at more regulatory regions is essential for our understanding of the role of DNA methylation in human development and disease. To meet this need, Illumina have recently released the Infinium MethylationEPIC (EPIC) BeadChip, with new content specifically designed to target enhancer regions [21]. The EPIC BeadChip contains over 850,000 probes, which cover more than 90 % of the sites on the HM450, plus more than 350,000 CpGs at regions identified as potential enhancers by FANTOM5 [22] and the ENCODE project [23]. The EPIC array promises to be an essential tool to further our understanding of DNA methylation mechanisms in human development and disease, in particular the DNA methylation landscape of distal regulatory elements. In this paper we perform a comprehensive evaluation of the new EPIC platform.


NGS workflow

A typical NGS experiment shares similar steps regardless of the instrument technology used (Figure 1).

1. Construct library

A sequencing “library” must be created from the sample. The DNA (or cDNA) sample is processed into relatively short double-stranded fragments (100–800 bp). Depending on the specific application, DNA fragmentation can be performed in a variety of ways, including physical shearing, enzyme digestion, and PCR-based amplification of specific genetic regions. The resulting DNA fragments are then ligated to technology-specific adaptor sequences, forming a fragment library. These adaptors may also contain a unique molecular “barcode”, so that each sample can be tagged with a unique DNA sequence. This allows multiple samples to be mixed together and sequenced at the same time. For example, barcodes 1-20 can be used to individually label 20 samples, which can then be analyzed in a single sequencing run. This approach, called “pooling” or “multiplexing”, saves time and money during sequencing experiments and controls for workflow variation, as pooled samples are processed together; a sketch of barcode-based demultiplexing follows below.
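After sequencing, reads are assigned back to their source samples by matching barcodes. The following is a minimal sketch of that demultiplexing step, with a hypothetical barcode-to-sample map; real pipelines read the assignments from a sample sheet and tolerate barcode sequencing errors:

```python
from collections import defaultdict

# Hypothetical barcode-to-sample assignments, for illustration only.
BARCODES = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}

def demultiplex(reads, barcode_len=6):
    """Assign each (read_id, sequence) pair to a sample by its leading barcode.

    Reads whose barcode matches no known sample are binned as 'undetermined'.
    """
    bins = defaultdict(list)
    for read_id, seq in reads:
        sample = BARCODES.get(seq[:barcode_len], "undetermined")
        # Trim the barcode off before storing the read.
        bins[sample].append((read_id, seq[barcode_len:]))
    return bins

reads = [("read1", "ACGTACTTGACC"), ("read2", "TGCATGGGATCA"), ("read3", "NNNNNNTTAGCA")]
print({sample: len(rs) for sample, rs in demultiplex(reads).items()})
```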

In addition to fragment libraries, there are two other specialized methods of library preparation: paired-end libraries and mate-pair libraries. Paired-end libraries allow users to sequence the DNA fragment from both ends, instead of typical sequencing which occurs only in a single direction. Paired-end libraries are created like regular fragment libraries, but they have adaptor tags on both ends of the DNA insert that enable sequencing from two directions. This methodology makes it easier to map reads and can be used to improve detection of genomic rearrangements, repetitive sequence elements, and RNA gene fusions or splice variants. However, improvements in modern library prep methods and analysis tools have made it possible to detect these features with single direction sequencing as well.

Mate-pair libraries are more complex to create than fragment or paired-end libraries and involve much larger-sized DNA inserts (over 2 kb and up to 30 kb). Sequencing of mate-pair libraries generates two reads that are distal to each other and in the opposite orientation. Using the physical information associated between the two sequencing reads, mate pair sequencing is useful for de novo assembly, large structural variant detection, and identification of complex genomic rearrangements.

2. Clonal amplification

Prior to sequencing, the DNA library must be attached to a solid surface and clonally amplified to increase the signal that can be detected from each target during sequencing. During this process, each unique DNA molecule in the library is bound to the surface of a bead or a flow-cell and PCR amplified to create a set of identical clones. In the case of Ion Torrent technology, a process called “templating” is used to add library molecules to beads. To learn more, see How Ion Torrent Technology Works.

3. Sequence library

All of the DNA in the library is sequenced at the same time using a sequencing instrument. Although each NGS technology is unique, they all utilize a version of the "sequencing by synthesis" method, reading individual bases as they are added to a growing, polymerized strand. This is a cyclical process with common steps: DNA base synthesis on single-stranded DNA, followed by detection of the incorporated base, and then removal of reactants to restart the cycle.

Most sequencing instruments use optical detection to determine nucleotide incorporation during DNA synthesis. Ion Torrent instruments use electrical detection to sense the release of hydrogen ions, which naturally occurs when nucleotides are incorporated during DNA synthesis. To learn more, see How Ion Torrent Technology Works.

4. Analyze data

Each NGS experiment generates large quantities of complex data consisting of short DNA reads. Although each technology platform has its own algorithms and data analysis tools, they share a similar analysis ‘pipeline’ and use common metrics to evaluate the quality of NGS data sets.

Analysis can be divided into three steps: primary, secondary, and tertiary analysis (Figure 2). Primary analysis is the processing of raw signals from instrument detectors into digitized data or base calls. These raw data are collected during each sequencing cycle. The output of primary analysis is files containing base calls assembled into sequencing reads (FASTQ files) and their associated quality scores (Phred quality score). Secondary analysis involves read filtering and trimming based on quality, followed by alignment of reads to a reference genome or assembly of reads for novel genomes, and finally by variant calling. The main output is a BAM file containing aligned reads. Tertiary analysis is the most challenging step, as it involves interpreting results and extracting meaningful information from the data.
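As a concrete example of a secondary-analysis quality metric, the sketch below computes the fraction of reads in a BAM file that aligned to the reference. It assumes the pysam library is installed and that an alignment file named aligned.bam exists (both are assumptions for illustration, not part of the workflow described above):

```python
import pysam  # assumes the pysam library is installed

def mapping_stats(bam_path):
    """Count total and mapped reads in a BAM file."""
    total = mapped = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # until_eof=True lets us iterate without requiring a BAM index.
        for read in bam.fetch(until_eof=True):
            total += 1
            if not read.is_unmapped:
                mapped += 1
    return mapped, total

mapped, total = mapping_stats("aligned.bam")  # hypothetical file name
print(f"{mapped}/{total} reads mapped ({100 * mapped / max(total, 1):.1f}%)")
```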

Due to the complexity of NGS data and associated algorithms, NGS analysis is typically performed by bioinformatics specialists. To empower users who don’t have specialized bioinformatics training, platforms like Ion Torrent have created user-friendly, intuitive software that simplifies analysis and doesn’t require programming skills to get results. For a primer on the basic metrics used to analyze NGS data please see the article The Importance of Throughput and Coverage.


Parts of a standard FastQC report


Basic Statistics

Simple information about the input FASTQ file: its name, the type of quality score encoding, the total number of reads, the read length, and the GC content.

Per base sequence quality

A box-and-whisker plot showing aggregated quality score statistics at each position along all reads in the file. Note that the X-axis is not uniform: it starts out with bases 1-10 reported individually, and after that it bins bases across windows a certain number of positions wide. The number of base positions binned together depends on the length of the read; for example, with 150 bp reads the latter part of the plot will report aggregate statistics for 5 bp windows. Shorter reads will have smaller windows and longer reads larger ones. The blue line is the mean quality score at each base position/window. The red line within each yellow box represents the median quality score at that position/window. The yellow box is the interquartile range (25th to 75th percentiles), and the upper and lower whiskers represent the 90th and 10th percentile scores. A primer on sequencing quality scores has been prepared by Illumina.

What to look for: It is normal with all Illumina sequencers for the median quality score to start out lower over the first 5-7 bases and to then rise. The average quality score will steadily drop over the length of the read. With paired end reads the average quality scores for read 1 will almost always be higher than for read 2.

A good per base quality graph.

A bad per base quality graph.
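FastQC derives this plot directly from the per-base quality strings in the FASTQ file. Below is a minimal sketch of the underlying aggregation (mean quality per position, without the windowed binning), assuming Biopython is installed and a hypothetical file reads.fastq:

```python
from collections import defaultdict

from Bio import SeqIO  # assumes Biopython is installed

def per_position_mean_quality(fastq_path):
    """Mean Phred score at each base position, aggregated over all reads."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for record in SeqIO.parse(fastq_path, "fastq"):
        for pos, q in enumerate(record.letter_annotations["phred_quality"]):
            totals[pos] += q
            counts[pos] += 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(totals)}

print(per_position_mean_quality("reads.fastq"))  # hypothetical file name
```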

Per sequence quality scores

A plot of the total number of reads vs. the average quality score over the full length of each read.

What to look for: The distribution of average read quality should be fairly tight in the upper range of the plot.

A good per sequence quality graph.

A bad per sequence quality graph.

Per base sequence content

This plot reports the percent of bases called for each of the four nucleotides at each position across all reads in the file. Again, the X-axis is non-uniform as described for Per base sequence quality.

What to look for: For whole genome shotgun DNA sequencing the proportion of each of the four bases should remain relatively constant over the length of the read, with %A=%T and %G=%C. With most RNA-Seq library preparation protocols there is a clear non-uniform distribution of bases for the first 10-15 nucleotides; this is normal and expected, depending on the type of library kit used (e.g. TruSeq RNA Library Preparation). RNA-Seq data showing this non-uniform base composition will always be classified as Failed by FastQC for this module, even though the sequence is perfectly good.

DNA library per base content.

RNA library per base content.

Per sequence GC content

Plot of the number of reads vs. GC% per read. The displayed Theoretical Distribution assumes a uniform GC content for all reads.

What to look for: For whole genome shotgun sequencing the expectation is that the GC content of all reads should form a normal distribution with the peak of the curve at the mean GC content for the organism sequenced. If the observed distribution deviates too far from the theoretical, FastQC will call a Fail. There are many situations in which this may occur that are entirely expected, so the assignment can be ignored. For example, in RNA sequencing there may be a greater or lesser spread of mean GC content among transcripts, causing the observed plot to be wider or narrower than an idealized normal distribution. The plot below is from some very high quality RNA-Seq data, yet FastQC still assigned a Warn flag to it because the observed distribution was narrower than the theoretical.
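Computing this distribution yourself is straightforward; here is a minimal sketch (assuming Biopython is installed and a hypothetical reads.fastq) that tallies reads per GC% bin:

```python
from Bio import SeqIO  # assumes Biopython is installed

def gc_histogram(fastq_path, bins=101):
    """Count reads per rounded GC% value (0-100), one bin per percentage point."""
    hist = [0] * bins
    for record in SeqIO.parse(fastq_path, "fastq"):
        seq = str(record.seq).upper()
        if not seq:
            continue
        gc = 100 * (seq.count("G") + seq.count("C")) / len(seq)
        hist[round(gc)] += 1
    return hist

print(gc_histogram("reads.fastq"))  # hypothetical file name
```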

Per base N content

Percent of bases at each position or bin with no base call, i.e. 'N'.

What to expect: You should never see any point where this curve rises noticeably above zero. If it does, this indicates a problem occurred during the sequencing run. The example below is a case where an error caused the instrument to be unable to call a base for approximately 20% of the reads at position 29.

A bad per base N content graph.

Sequence Duplication Levels

The percentage of reads in the file that are present a given number of times (this is the blue line; the red line is more difficult to interpret). There are generally two sources of duplicate reads: PCR duplication, in which library fragments have been overrepresented due to biased PCR enrichment, or truly overrepresented sequences such as very abundant transcripts in an RNA-Seq library. The former is a concern because PCR duplicates misrepresent the true proportion of sequences in your starting material. The latter is an expected case and not of concern because it faithfully represents your input.

What to expect: For whole genome shotgun data it is expected that nearly 100% of your reads will be unique (appearing only 1 time in the sequence data). This indicates a highly diverse library that was not over-sequenced. If the sequencing output is extremely deep (e.g. > 100X the size of your genome) you will start to see some sequence duplication; this is inevitable, as there are in theory only a finite number of completely unique sequence reads which can be obtained from any given input DNA sample.

When sequencing RNA there will be some very highly abundant transcripts and some lowly abundant. It is expected that duplicate reads will be observed for high abundance transcripts. The RNA-Seq data below was flagged as Failed by FastQC even though the duplication is expected in this case.

DNA library sequence duplication.

RNA library sequence duplication.
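The underlying computation is a two-level tally: count how often each distinct sequence occurs, then count how many sequences occur at each level. A minimal sketch, assuming Biopython and a hypothetical reads.fastq (FastQC additionally samples only a subset of reads and truncates long ones, which is omitted here):

```python
from collections import Counter

from Bio import SeqIO  # assumes Biopython is installed

def duplication_levels(fastq_path):
    """Map duplication level -> number of distinct sequences seen that many times."""
    seq_counts = Counter(str(r.seq) for r in SeqIO.parse(fastq_path, "fastq"))
    return dict(sorted(Counter(seq_counts.values()).items()))

print(duplication_levels("reads.fastq"))  # hypothetical file name
```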

Overrepresented Sequences

List of sequences which appear more often than expected in the file. Only the first 50bp are considered. A sequence is considered overrepresented if it accounts for ≥ 0.1% of the total reads. Each overrepresented sequence is compared to a list of common contaminants to try to identify it.

What to expect: In DNA-Seq data no single sequence should be present at a high enough frequency to be listed, though it is not unusual to see a small percentage of adapter reads. For RNA-Seq data it is possible that there may be some transcripts that are so abundant that they register as overrepresented sequence.
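A minimal sketch of this check (assuming Biopython and a hypothetical reads.fastq), without the contaminant lookup:

```python
from collections import Counter

from Bio import SeqIO  # assumes Biopython is installed

def overrepresented(fastq_path, prefix_len=50, threshold=0.001):
    """Return (sequence, fraction) pairs for 50 bp prefixes making up >= 0.1% of reads."""
    counts = Counter(str(r.seq)[:prefix_len] for r in SeqIO.parse(fastq_path, "fastq"))
    total = sum(counts.values())
    if not total:
        return []
    return [(seq, n / total) for seq, n in counts.most_common() if n / total >= threshold]

print(overrepresented("reads.fastq"))  # hypothetical file name
```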

Adapter Content

Cumulative plot of the fraction of reads where the sequence library adapter sequence is identified at the indicated base position. Only adapters specific to the library type are searched.

What to expect: Ideally Illumina sequence data should not have any adapter sequence present; however, when using long read lengths it is possible that some of the library inserts are shorter than the read length, resulting in read-through to the adapter at the 3′ end of the read. This is more likely to occur with RNA-Seq libraries, where the distribution of library insert sizes is more varied and likely to include some short inserts. The example below is for a high quality RNA-Seq library with a small percentage of the library having inserts smaller than 150bp.
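A simple way to detect read-through is to search each read for the known adapter sequence and record where it starts. A minimal sketch, assuming Biopython and a hypothetical reads.fastq; the adapter shown is illustrative, and a real check should use the adapters specific to your library kit:

```python
from Bio import SeqIO  # assumes Biopython is installed

ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter prefix; substitute your kit's sequence

def adapter_start_positions(fastq_path):
    """Return, for each read containing the adapter, the position where it begins."""
    positions = []
    for record in SeqIO.parse(fastq_path, "fastq"):
        hit = str(record.seq).find(ADAPTER)
        if hit != -1:
            positions.append(hit)
    return positions

hits = adapter_start_positions("reads.fastq")  # hypothetical file name
print(f"{len(hits)} reads contain the adapter")
```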

Kmer Content

Measures the count of each short sequence of length k (default k = 7) starting at each position along the read. Any given kmer should be evenly represented across the length of the read. A list of kmers which appear at specific positions with greater than expected frequency is reported. The positions of the six most biased kmers are plotted. This module can be very difficult to interpret. As with the sequence duplication module described above, RNA-Seq libraries may have highly represented kmers that are derived from highly expressed sequences. If you wish to learn more about this module please see the FastQC Kmer Content documentation. The example Kmer content graph below is from a high quality DNA-Seq library. The biased kmers near the start of the read are likely due to slight, sequence-dependent efficiency of DNA shearing or a result of random priming.
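The raw data behind this module is a positional k-mer tally. A minimal sketch, assuming Biopython and a hypothetical reads.fastq (FastQC additionally models the expected frequency of each k-mer to flag bias, which is omitted here):

```python
from collections import Counter

from Bio import SeqIO  # assumes Biopython is installed

def positional_kmer_counts(fastq_path, k=7):
    """Count occurrences of every k-mer at every start position along the reads."""
    counts = Counter()
    for record in SeqIO.parse(fastq_path, "fastq"):
        seq = str(record.seq)
        for pos in range(len(seq) - k + 1):
            counts[(pos, seq[pos:pos + k])] += 1
    return counts

counts = positional_kmer_counts("reads.fastq")  # hypothetical file name
print(counts.most_common(5))  # the five most frequent (position, kmer) pairs
```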


Overview

The cost-accounting data presented here are summarized relative to two metrics: (1) "Cost per Megabase of DNA Sequence" - the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality [see below]; and (2) "Cost per Genome" - the cost of sequencing a human-sized genome. For each, a graph is provided showing the data since 2001; in addition, the actual numbers reflected by the graphs are provided in a summary table.

NHGRI welcomes people to download these graphs and use them in their presentations and teaching materials. NHGRI plans to update these data on a regular basis. You can view the data in Excel by downloading the Sequencing Costs 2020.

To illustrate the nature of the reductions in DNA sequencing costs, each graph also shows hypothetical data reflecting Moore's Law, which describes a long-term trend in the computer hardware industry that involves the doubling of 'compute power' every two years (See: Moore's Law [wikipedia.org]). Technology improvements that 'keep up' with Moore's Law are widely regarded to be doing exceedingly well, making it useful for comparison.

In both graphs, note: (1) the use of a logarithmic scale on the Y axis and (2) the sudden and profound out-pacing of Moore's Law beginning in January 2008. The latter represents the time when the sequencing centers transitioned from Sanger-based (dideoxy chain termination) sequencing to 'second-generation' (or 'next-generation') DNA sequencing technologies. Additional details about these graphs are provided below.

These data, however, do not capture all of the costs associated with the NHGRI Large-Scale Genome Sequencing Program. The sequencing centers perform a number of additional activities whose costs are not appropriate to include when calculating costs for production-oriented DNA sequencing. In other words, NHGRI makes a distinction between 'production' activities and 'non-production' activities. Production activities are essential to the routine generation of large amounts of quality DNA sequence data that are made available in public databases; the costs associated with production DNA sequencing are summarized here and depicted in the two graphs. Additional information about the other activities performed by the sequencing centers is provided below.



DNA Sequencing

Illumina next-generation sequencing (NGS) technology uses clonal amplification and sequencing by synthesis (SBS) chemistry to enable rapid, accurate sequencing. The process simultaneously identifies DNA bases while incorporating them into a nucleic acid chain. Each base emits a unique fluorescent signal as it is added to the growing strand, which is used to determine the order of the DNA sequence.

NGS technology can be used to sequence the DNA from any organism, providing valuable information in response to almost any biological question. A highly scalable technology, DNA sequencing can be applied to small, targeted regions or the entire genome through a variety of methods, enabling researchers to investigate and better understand health and disease.


Benefits of DNA Sequencing With NGS

  • Sequences large stretches of DNA in a massively parallel fashion, offering advantages in throughput and scale compared to capillary electrophoresis–based Sanger sequencing
  • Provides high resolution to obtain a base-by-base view of a gene, exome, or genome
  • Delivers quantitative measurements based on signal intensity
  • Detects virtually all types of genomic DNA alterations, including single nucleotide variants, insertions and deletions, copy number changes, and chromosomal aberrations
  • Offers high throughput and flexibility to scale studies and sequence multiple samples simultaneously

Common DNA Sequencing Methods

Whole-Genome Sequencing

Whole-genome sequencing is the most comprehensive method for analyzing the genome. Rapidly dropping sequencing costs and the ability to obtain valuable information about the entire genetic code make this method a powerful research tool.

Targeted Resequencing

With targeted resequencing, a subset of genes or regions of the genome are isolated and sequenced, allowing researchers to focus time, expenses, and analysis on specific areas of interest.

ChIP Sequencing

By combining chromatin immunoprecipitation (ChIP) assays and sequencing, ChIP sequencing (ChIP-Seq) is a powerful method to identify genome-wide DNA binding sites for transcription factors and other proteins.

Library Prep for DNA Sequencing

Our versatile library prep portfolio allows you to examine small, targeted regions or the entire genome. We've innovated in PCR-free and on-bead fragmentation technology, offering time savings, flexibility, and increased sequencing data performance.

DNA Sequencing Methods Review

This collection of peer-reviewed publications contains pros and cons, schematic protocol diagrams, and related references for various DNA sequencing methods.

Related Solutions

Cancer DNA Sequencing

NGS-based sequencing methods allow cancer researchers to detect rare somatic variants, perform tumor-normal comparisons, and analyze circulating DNA fragments. Learn more about cancer sequencing.

Genotyping Solutions

Sequencing- and array-based genotyping technologies can provide insight into the functional consequences of genetic variation. Learn more about genotyping.

Cell-Free DNA Technology

Cell-free DNA (cfDNA) consists of short DNA fragments released into the bloodstream. cfDNA from a maternal blood sample can be used to screen for common chromosomal conditions in the fetus. Learn more about cell-free DNA technology.

Microbial Sequencing

Analysis of microbial species using DNA sequencing can inform environmental metagenomics studies, infectious disease surveillance, molecular epidemiology, and more. Learn more about microbial sequencing methods.





Two common diagnostic tests

Both diagnostic tests, chorionic villus sampling and amniocentesis, are invasive and involve extracting cells from the fetus and analyzing them under a microscope. Geneticists can then determine whether the fetus has too few or too many chromosomes present, or whether the chromosomes are damaged in a way that could result in a genetic problem.

Chorionic Villus Sampling (CVS)

Done during the first trimester of pregnancy, usually at 10 to 12 weeks, this diagnostic test involves taking a small sample of cells from the placenta. Placental tissue contains the same genetic material as the fetus and can be checked for chromosomal abnormalities and other genetic disorders. However, CVS cannot identify neural tube defects, such as spina bifida, which can be detected by amniocentesis.

How it's done: Depending upon where the placenta is located and using ultrasound for guidance, a small tube is inserted through either the mother's abdomen or her vagina and a small tissue sample is withdrawn from the placenta.

Possible risks: CVS has a slightly higher risk of miscarriage than amniocentesis, about 1 percent, according to the Mayo Clinic.

Amniocentesis

"Amniocentesis is considered the gold standard for prenatal genetic testing," Greiner said.

How it's done: A long, thin needle is inserted into the mother's abdomen to obtain a sample of the amniotic fluid surrounding the fetus. The procedure is usually done between the 15th and 20th weeks of pregnancy, and the amniotic fluid contains cells from the fetus with genetic information about the unborn child.

Possible risks: Amniocentesis carries a lower risk of miscarriage than CVS, about 1 in 400, Greiner said.


Methods in Student Assessment

Below are a few common methods of assessment identified by Brown and Knight that can be implemented in the classroom.[1] It should be noted that these methods work best when learning objectives have been identified, shared, and clearly articulated to students.

Self-Assessment

The goal of implementing self-assessment in a course is to enable students to develop their own judgement. In self-assessment students are expected to assess both process and product of their learning. While the assessment of the product is often the task of the instructor, implementing student assessment in the classroom encourages students to evaluate their own work as well as the process that led them to the final outcome. Moreover, self-assessment facilitates a sense of ownership of one’s learning and can lead to greater investment by the student. It enables students to develop transferable skills in other areas of learning that involve group projects and teamwork, critical thinking and problem-solving, as well as leadership roles in the teaching and learning process.

Things to Keep in Mind about Self-Assessment

  1. Self-assessment is different from self-grading. According to Brown and Knight, “Self-assessment involves the use of evaluative processes in which judgement is involved, where self-grading is the marking of one’s own work against a set of criteria and potential outcomes provided by a third person, usually the [instructor].” (Pg. 52)
  2. Students may initially resist attempts to involve them in the assessment process. This is usually due to insecurities or lack of confidence in their ability to objectively evaluate their own work. Brown and Knight note, however, that when students are asked to evaluate their work, frequently student-determined outcomes are very similar to those of instructors, particularly when the criteria and expectations have been made explicit in advance.
  3. Methods of self-assessment vary widely and can be as eclectic as the instructor. Common forms of self-assessment include the portfolio, reflection logs, instructor-student interviews, learner diaries and dialog journals, and the like.

Peer Assessment

Peer assessment is a type of collaborative learning technique where students evaluate the work of their peers and have their own evaluated by peers. This dimension of assessment is significantly grounded in theoretical approaches to active learning and adult learning. Like self-assessment, peer assessment gives learners ownership of learning and focuses on the process of learning as students are able to “share with one another the experiences that they have undertaken.” (Brown and Knight, 1994, pg. 52)

Things to Keep in Mind about Peer Assessment

  1. Students can use peer assessment as a tactic of antagonism or conflict with other students by giving unmerited low evaluations. Conversely, students can also provide overly favorable evaluations of their friends.
  2. Students can occasionally apply unsophisticated judgements to their peers. For example, students who are boisterous and loquacious may receive higher grades than those who are quieter, reserved, and shy.
  3. Instructors should implement systems of evaluation in order to ensure valid peer assessment is based on evidence and identifiable criteria.

Essays

According to Euan S. Henderson, essays make two important contributions to learning and assessment: the development of skills and the cultivation of a learning style. (Henderson, 1980) Essays are a common form of writing assignment in courses and can be either a summative or formative form of assessment depending on how the instructor utilizes them in the classroom.

Things to Keep in Mind about Essays

  1. A common challenge of the essay is that students can use them simply to regurgitate rather than analyze and synthesize information to make arguments.
  2. Instructors commonly assume that students know how to write essays and can encounter disappointment or frustration when they discover that this is not the case for some students. For this reason, it is important for instructors to make their expectations clear and be prepared to assist or expose students to resources that will enhance their writing skills.

Exams and time-constrained, individual assessment

Examinations have traditionally been viewed as a gold standard of assessment in education, particularly in university settings. Like essays they can be summative or formative forms of assessment.

Things to Keep in Mind about Exams

  1. Exams can make significant demands on students’ factual knowledge and can have the side-effect of encouraging cramming and surface learning. On the other hand, they can also facilitate student demonstration of deep learning if essay questions or topics are appropriately selected. Different formats include in-class tests, open-book, take-home exams and the like.
  2. In the process of designing an exam, instructors should consider the following questions. What are the learning objectives that the exam seeks to evaluate? Have students been adequately prepared to meet exam expectations? What are the skills and abilities that students need to do well? How will this exam be utilized to enhance the student learning process?

As Brown and Knight assert, utilizing multiple methods of assessment, including more than one assessor, improves the reliability of data. However, a primary challenge to the multiple-methods approach is how to weigh the scores produced by the different methods. When particular methods produce a higher range of marks than others, instructors can potentially misinterpret their assessment of overall student performance. When multiple methods produce different messages about the same student, instructors should be mindful that the methods are likely assessing different forms of achievement. (Brown and Knight, 1994)

For additional methods of assessment not listed here, see “Assessment on the Page” and “Assessment Off the Page” in Assessing Learners in Higher Education.

In addition to the various methods of assessment listed above, classroom assessment techniques also provide a useful way to evaluate student understanding of course material in the teaching and learning process. For more on these, see our Classroom Assessment Techniques teaching guide.

