What's the origin of junk DNA?

What's the origin of junk DNA?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

Most eukaryotes posses a certain amount of junk DNA in their cell nuclei. What is (are) the origin(s) of this junk DNA, And is it realy junk (superfluous)?

"Junk DNA" is more aptly named noncoding DNA. This is defined as any DNA region that does not encode for a gene or more precisely is not within an open reading frame. In the human genome over 98% consists of noncoding DNA. However the more we learn about molecular biology the more we understand the biological function and importance of noncoding DNA. Examples for important functions are:

  1. Regulatory regions that control the expression of a gene
  2. Regions coding for regulatory RNA
  3. Regions where epigenetic regulation takes place

However, there are also regions which likely do not have beneficial biological function, which may rightfully be called junk:

  1. Transposons are genetic regions that can copy themselves (either by an enzymatically active RNA or by encoding for the protein transposase). They are believed to have evolved as "selfish genes" and several known defense mechanisms against rogue transposons exist (siRNA, RNAi). Transposons and the defense mechanisms have now become powerful tools in molecular biology research.
  2. Endogenous retrovirus sequences which are remainders of retroviruses which have inserted themselves into the germ line and become inactive through mutation.

However, even these "junk" regions are believed to have important evolutionary functions such as protection from mutation through retroviruses: Because there are large DNA regions where the precise order and function is not important, a retrovirus that inserts itself at random positions of the genome is less likely to cause permanent damage.

Briefly, we know of many mechanisms by which genomes can get larger. Tetrapods had at least two complete genome doublings in their history; transposons expand; retroviruses insert; partial duplications lead to pseudogenes. And these expansion mechanisms can be fast -- full genome duplications double size in a single generation.

But we know of very few mechanisms by which genomes can get smaller, and most of those are very slow, and very few are targeted.

From an mechanistic viewpoint, it's very difficult to imagine a targeted way to remove useless but harmless DNA quickly, and with 100% accuracy. If accuracy is not 100%, then the pathway would be more harmful than the DNA it seeks to remove.

The key is that if extra DNA is either harmless, or nearly harmless, there's no reason to eliminate it, and there are reasons (errors in removal) to not try to remove it.

So the short and simple answer is that genomes can accumulate useless DNA much more readily than they can get rid of it. It's just common sense, which matches 30 years of experimentation.

'Junk DNA' uncovers the nature of our ancient ancestors

The key to solving one of the great puzzles in evolutionary biology, the origin of vertebrates -- animals with an internal skeleton made of bone -- has been revealed in new research from Dartmouth College and the University of Bristol.

Vertebrates are the most anatomically and genetically complex of all organisms, but explaining how they achieved this complexity has vexed scientists. The study, published today [20 October] in Proceedings of the National Academy of Sciences claims to have solved this scientific riddle by analysing the genomics of primitive living fishes such as sharks and lampreys, and their spineless relatives such as sea squirts. 

Alysha Heimberg of Dartmouth College and colleagues studied the family relationships of primitive vertebrates. The team used microRNAs, a class of tiny molecules only recently discovered residing within what has usually been considered 'junk DNA', to show that lampreys and slime eels are distant relatives of jawed vertebrates.

Alysha said: “We learn from our results that lamprey and hagfish are equally related to jawed vertebrates and that hagfish are not representative of a more primitive vertebrate, which suggests that the ancestral vertebrate was more complex than anyone had previously thought.

“Vertebrates have been evolving for hundreds of millions of years but still express the same microRNA genes in the same organs as when they both first appeared.”

The team went on to test the idea that it was these same ‘junk DNA’ genes, microRNAs, which were responsible for the evolution origin of vertebrate anatomical features. They found that the same suite of microRNAs were expressed in the same organs and tissues, in lampreys and mice.

Co-author, Professor Philip Donoghue of the University of Bristol’s School of Earth Sciences, said: “The origin of vertebrates and the origin of these genes is no coincidence.”

Professor Kevin Peterson of Dartmouth College said: “This study not only points the way to understanding the evolutionary origin of our own lineage, but it also helps us to understand how our own genome was assembled in deep time.”


  1. ^ Pennisi E (September 2012). "Genomics. ENCODE project writes eulogy for junk DNA". Science. 337 (6099): 1159–1161. doi:10.1126/science.337.6099.1159. PMID22955811.
  2. ^
  3. The ENCODE Project Consortium (September 2012). "An integrated encyclopedia of DNA elements in the human genome". Nature. 489 (7414): 57–74. Bibcode:2012Natur.489. 57T. doi:10.1038/nature11247. PMC3439153 . PMID22955616. .
  4. ^ Cite error: The named reference Costa non-coding was invoked but never defined (see the help page).
  5. ^ ab
  6. Carey M (2015). Junk DNA: A Journey Through the Dark Matter of the Genome. Columbia University Press. ISBN9780231170840 .
  7. ^
  8. McKie R (24 February 2013). "Scientists attacked over claim that 'junk DNA' is vital to life". The Observer.
  9. ^
  10. Eddy SR (November 2012). "The C-value paradox, junk DNA and ENCODE". Current Biology. 22 (21): R898–9. doi: 10.1016/j.cub.2012.10.002 . PMID23137679. S2CID28289437.
  11. ^
  12. Doolittle WF (April 2013). "Is junk DNA bunk? A critique of ENCODE". Proceedings of the National Academy of Sciences of the United States of America. 110 (14): 5294–300. Bibcode:2013PNAS..110.5294D. doi:10.1073/pnas.1221376110. PMC3619371 . PMID23479647.
  13. ^
  14. Palazzo AF, Gregory TR (May 2014). "The case for junk DNA". PLOS Genetics. 10 (5): e1004351. doi:10.1371/journal.pgen.1004351. PMC4014423 . PMID24809441.
  15. ^
  16. Graur D, Zheng Y, Price N, Azevedo RB, Zufall RA, Elhaik E (2013). "On the immortality of television sets: "function" in the human genome according to the evolution-free gospel of ENCODE". Genome Biology and Evolution. 5 (3): 578–90. doi:10.1093/gbe/evt028. PMC3622293 . PMID23431001.
  17. ^
  18. Ponting CP, Hardison RC (November 2011). "What fraction of the human genome is functional?". Genome Research. 21 (11): 1769–76. doi:10.1101/gr.116814.110. PMC3205562 . PMID21875934.
  19. ^ ab
  20. Kellis M, Wold B, Snyder MP, Bernstein BE, Kundaje A, Marinov GK, et al. (April 2014). "Defining functional DNA elements in the human genome". Proceedings of the National Academy of Sciences of the United States of America. 111 (17): 6131–8. Bibcode:2014PNAS..111.6131K. doi:10.1073/pnas.1318948111. PMC4035993 . PMID24753594.
  21. ^
  22. Rands CM, Meader S, Ponting CP, Lunter G (July 2014). "8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage". PLOS Genetics. 10 (7): e1004525. doi:10.1371/journal.pgen.1004525. PMC4109858 . PMID25057982.
  23. ^
  24. Mattick JS (2013). "The extent of functionality in the human genome". The HUGO Journal. 7 (1): 2. doi:10.1186/1877-6566-7-2. PMC4685169 .
  25. ^
  26. Morris K, ed. (2012). Non-Coding RNAs and Epigenetic Regulation of Gene Expression: Drivers of Natural Selection. Norfolk, UK: Caister Academic Press. ISBN978-1904455943 .

The amount of total genomic DNA varies widely between organisms, and the proportion of coding and non-coding DNA within these genomes varies greatly as well. For example, it was originally suggested that over 98% of the human genome does not encode protein sequences, including most sequences within introns and most intergenic DNA, [2] while 20% of a typical prokaryote genome is non-coding. [3]

In eukaryotes, genome size, and by extension the amount of non-coding DNA, is not correlated to organism complexity, an observation known as the C-value enigma. [4] For example, the genome of the unicellular Polychaos dubium (formerly known as Amoeba dubia) has been reported to contain more than 200 times the amount of DNA in humans. [5] The pufferfish Takifugu rubripes genome is only about one eighth the size of the human genome, yet seems to have a comparable number of genes approximately 90% of the Takifugu genome is non-coding DNA. [2] Therefore, most of the difference in genome size is not due to variation in amount of coding DNA, rather, it is due to a difference in the amount of non-coding DNA. [6]

In 2013, a new "record" for the most efficient eukaryotic genome was discovered with Utricularia gibba, a bladderwort plant that has only 3% non-coding DNA and 97% of coding DNA. Parts of the non-coding DNA were being deleted by the plant and this suggested that non-coding DNA may not be as critical for plants, even though non-coding DNA is useful for humans. [1] Other studies on plants have discovered crucial functions in portions of non-coding DNA that were previously thought to be negligible and have added a new layer to the understanding of gene regulation. [7]

Cis- and trans-regulatory elements Edit

Cis-regulatory elements are sequences that control the transcription of a nearby gene. Many such elements are involved in the evolution and control of development. [8] Cis-elements may be located in 5' or 3' untranslated regions or within introns. Trans-regulatory elements control the transcription of a distant gene.

Promoters facilitate the transcription of a particular gene and are typically upstream of the coding region. Enhancer sequences may also exert very distant effects on the transcription levels of genes. [9]

Introns Edit

Introns are non-coding sections of a gene, transcribed into the precursor mRNA sequence, but ultimately removed by RNA splicing during the processing to mature messenger RNA. Many introns appear to be mobile genetic elements. [10]

Studies of group I introns from Tetrahymena protozoans indicate that some introns appear to be selfish genetic elements, neutral to the host because they remove themselves from flanking exons during RNA processing and do not produce an expression bias between alleles with and without the intron. [10] Some introns appear to have significant biological function, possibly through ribozyme functionality that may regulate tRNA and rRNA activity as well as protein-coding gene expression, evident in hosts that have become dependent on such introns over long periods of time for example, the trnL-intron is found in all green plants and appears to have been vertically inherited for several billions of years, including more than a billion years within chloroplasts and an additional 2–3 billion years prior in the cyanobacterial ancestors of chloroplasts. [10]

Pseudogenes Edit

Pseudogenes are DNA sequences, related to known genes, that have lost their protein-coding ability or are otherwise no longer expressed in the cell. Pseudogenes arise from retrotransposition or genomic duplication of functional genes, and become "genomic fossils" that are nonfunctional due to mutations that prevent the transcription of the gene, such as within the gene promoter region, or fatally alter the translation of the gene, such as premature stop codons or frameshifts. [11] Pseudogenes resulting from the retrotransposition of an RNA intermediate are known as processed pseudogenes pseudogenes that arise from the genomic remains of duplicated genes or residues of inactivated genes are nonprocessed pseudogenes. [11] Transpositions of once functional mitochondrial genes from the cytoplasm to the nucleus, also known as NUMTs, also qualify as one type of common pseudogene. [12] Numts occur in many eukaryotic taxa.

While Dollo's Law suggests that the loss of function in pseudogenes is likely permanent, silenced genes may actually retain function for several million years and can be "reactivated" into protein-coding sequences [13] and a substantial number of pseudogenes are actively transcribed. [11] [14] Because pseudogenes are presumed to change without evolutionary constraint, they can serve as a useful model of the type and frequencies of various spontaneous genetic mutations. [15]

Repeat sequences, transposons and viral elements Edit

Transposons and retrotransposons are mobile genetic elements. Retrotransposon repeated sequences, which include long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs), account for a large proportion of the genomic sequences in many species. Alu sequences, classified as a short interspersed nuclear element, are the most abundant mobile elements in the human genome. Some examples have been found of SINEs exerting transcriptional control of some protein-encoding genes. [16] [17] [18]

Endogenous retrovirus sequences are the product of reverse transcription of retrovirus genomes into the genomes of germ cells. Mutation within these retro-transcribed sequences can inactivate the viral genome. [19]

Over 8% of the human genome is made up of (mostly decayed) endogenous retrovirus sequences, as part of the over 42% fraction that is recognizably derived of retrotransposons, while another 3% can be identified to be the remains of DNA transposons. Much of the remaining half of the genome that is currently without an explained origin is expected to have found its origin in transposable elements that were active so long ago (> 200 million years) that random mutations have rendered them unrecognizable. [20] Genome size variation in at least two kinds of plants is mostly the result of retrotransposon sequences. [21] [22]

Telomeres Edit

Telomeres are regions of repetitive DNA at the end of a chromosome, which provide protection from chromosomal deterioration during DNA replication. Recent studies have shown that telomeres function to aid in its own stability. Telomeric repeat-containing RNA (TERRA) are transcripts derived from telomeres. TERRA has been shown to maintain telomerase activity and lengthen the ends of chromosomes. [23]

The term "junk DNA" became popular in the 1960s. [24] [25] According to T. Ryan Gregory, the nature of junk DNA was first discussed explicitly in 1972 by a genomic biologist, David Comings, who applied the term to all non-coding DNA. [26] The term was formalized that same year by Susumu Ohno, [6] who noted that the mutational load from deleterious mutations placed an upper limit on the number of functional loci that could be expected given a typical mutation rate. Ohno hypothesized that mammal genomes could not have more than 30,000 loci under selection before the "cost" from the mutational load would cause an inescapable decline in fitness, and eventually extinction. This prediction remains robust, with the human genome containing approximately (protein-coding) 20,000 genes. Another source for Ohno's theory was the observation that even closely related species can have widely (orders-of-magnitude) different genome sizes, which had been dubbed the C-value paradox in 1971. [27]

The term "junk DNA" has been questioned on the grounds that it provokes a strong a priori assumption of total non-functionality and some have recommended using more neutral terminology such as "non-coding DNA" instead. [26] Yet "junk DNA" remains a label for the portions of a genome sequence for which no discernible function has been identified and that through comparative genomics analysis appear under no functional constraint suggesting that the sequence itself has provided no adaptive advantage.

Since the late 70s it has become apparent that the majority of non-coding DNA in large genomes finds its origin in the selfish amplification of transposable elements, of which W. Ford Doolittle and Carmen Sapienza in 1980 wrote in the journal Nature: "When a given DNA, or class of DNAs, of unproven phenotypic function can be shown to have evolved a strategy (such as transposition) which ensures its genomic survival, then no other explanation for its existence is necessary." [28] The amount of junk DNA can be expected to depend on the rate of amplification of these elements and the rate at which non-functional DNA is lost. [29] In the same issue of Nature, Leslie Orgel and Francis Crick wrote that junk DNA has "little specificity and conveys little or no selective advantage to the organism". [30] The term occurs mainly in popular science and in a colloquial way in scientific publications, and it has been suggested that its connotations may have delayed interest in the biological functions of non-coding DNA. [31]

Some evidence indicate that some "junk DNA" sequences are sources for (future) functional activity in evolution through exaptation of originally selfish or non-functional DNA. [32]

ENCODE Project Edit

In 2012, the ENCODE project, a research program supported by the National Human Genome Research Institute, reported that 76% of the human genome's non-coding DNA sequences were transcribed and that nearly half of the genome was in some way accessible to genetic regulatory proteins such as transcription factors. [33] However, the suggestion by ENCODE that over 80% of the human genome is biochemically functional has been criticized by other scientists, [34] who argue that neither accessibility of segments of the genome to transcription factors nor their transcription guarantees that those segments have biochemical function and that their transcription is selectively advantageous. After all, non-functional sections of the genome can be transcribed, given that transcription factors typically bind to short sequences that are found (randomly) all over the whole genome. [35]

Furthermore, the much lower estimates of functionality prior to ENCODE were based on genomic conservation estimates across mammalian lineages. [27] [36] [37] [38] Widespread transcription and splicing in the human genome has been discussed as another indicator of genetic function in addition to genomic conservation which may miss poorly conserved functional sequences. [39] Furthermore, much of the apparent junk DNA is involved in epigenetic regulation and appears to be necessary for the development of complex organisms. [40] [41] [42] Genetic approaches may miss functional elements that do not manifest physically on the organism, evolutionary approaches have difficulties using accurate multispecies sequence alignments since genomes of even closely related species vary considerably, and with biochemical approaches, though having high reproducibility, the biochemical signatures do not always automatically signify a function. [39] Kellis et al. noted that 70% of the transcription coverage was less than 1 transcript per cell (and may thus be based on spurious background transcription). On the other hand, they argued that 12–15% fraction of human DNA may be under functional constraint, and may still be an underestimate when lineage-specific constraints are included. Ultimately genetic, evolutionary, and biochemical approaches can all be used in a complementary way to identify regions that may be functional in human biology and disease. [39] Some critics have argued that functionality can only be assessed in reference to an appropriate null hypothesis. In this case, the null hypothesis would be that these parts of the genome are non-functional and have properties, be it on the basis of conservation or biochemical activity, that would be expected of such regions based on our general understanding of molecular evolution and biochemistry. According to these critics, until a region in question has been shown to have additional features, beyond what is expected of the null hypothesis, it should provisionally be labelled as non-functional. [43]

Some non-coding DNA sequences must have some important biological function. This is indicated by comparative genomics studies that report highly conserved regions of non-coding DNA, sometimes on time-scales of hundreds of millions of years. This implies that these non-coding regions are under strong evolutionary pressure and positive selection. [44] For example, in the genomes of humans and mice, which diverged from a common ancestor 65–75 million years ago, protein-coding DNA sequences account for only about 20% of conserved DNA, with the remaining 80% of conserved DNA represented in non-coding regions. [45] Linkage mapping often identifies chromosomal regions associated with a disease with no evidence of functional coding variants of genes within the region, suggesting that disease-causing genetic variants lie in the non-coding DNA. [45] The significance of non-coding DNA mutations in cancer was explored in April 2013. [46]

Non-coding genetic polymorphisms play a role in infectious disease susceptibility, such as hepatitis C. [47] Moreover, non-coding genetic polymorphisms contribute to susceptibility to Ewing sarcoma, an aggressive pediatric bone cancer. [48]

Some specific sequences of non-coding DNA may be features essential to chromosome structure, centromere function and recognition of homologous chromosomes during meiosis. [49]

According to a comparative study of over 300 prokaryotic and over 30 eukaryotic genomes, [50] eukaryotes appear to require a minimum amount of non-coding DNA. The amount can be predicted using a growth model for regulatory genetic networks, implying that it is required for regulatory purposes. In humans the predicted minimum is about 5% of the total genome.

Over 10% of 32 mammalian genomes may function through the formation of specific RNA secondary structures. [51] The study used comparative genomics to identify compensatory DNA mutations that maintain RNA base-pairings, a distinctive feature of RNA molecules. Over 80% of the genomic regions presenting evolutionary evidence of RNA structure conservation do not present strong DNA sequence conservation.

Non-coding DNA may perhaps serve to decrease the probability of gene disruption during chromosomal crossover. [52]

Evidence from Polygenic Scores and GWAS Edit

Genome-wide association studies (GWAS) and machine learning analysis of large genomic datasets has led to the construction of polygenic predictors for human traits such as height, bone density, and many disease risks. Similar predictors exist for plant and animal species and are used in agricultural breeding. [54] The detailed genetic architecture of human predictors has been analyzed and significant effects used in prediction are associated with DNA regions far outside coding regions. The fraction of variance accounted for (i.e., fraction of predictive power captured by the predictor) in coding vs. non-coding regions varies widely for different complex traits. For example, atrial fibrillation and coronary artery disease risk are mostly controlled by variants in non-coding regions (non-coding variance fraction over 70 percent), whereas diabetes and high cholesterol display the opposite pattern (non-coding variance roughly 20-30 percent). [53] Individual differences between humans are clearly affected in a significant way by non-coding genetic loci, which is strong evidence for functional effects. Whole exome genotypes (i.e., which contain information restricted to coding regions only) do not contain enough information to build or even evaluate polygenic predictors for many well-studied complex traits and disease risks.

In 2013, it was estimated that, in general, up to 85% of GWAS loci have non-coding variants as the likely causal association. The variants are often common in populations and were predicted to affect disease risks through small phenotypic effects, as opposed to the large effects of Mendelian variants. [55]

Some non-coding DNA sequences determine the expression levels of various genes, both those that are transcribed to proteins and those that themselves are involved in gene regulation. [56] [57] [58]

Transcription factors Edit

Some non-coding DNA sequences determine where transcription factors attach. [56] A transcription factor is a protein that binds to specific non-coding DNA sequences, thereby controlling the flow (or transcription) of genetic information from DNA to mRNA. [59] [60]

Operators Edit

An operator is a segment of DNA to which a repressor binds. A repressor is a DNA-binding protein that regulates the expression of one or more genes by binding to the operator and blocking the attachment of RNA polymerase to the promoter, thus preventing transcription of the genes. This blocking of expression is called repression. [61]

Enhancers Edit

An enhancer is a short region of DNA that can be bound with proteins (trans-acting factors), much like a set of transcription factors, to enhance transcription levels of genes in a gene cluster. [62]

Silencers Edit

A silencer is a region of DNA that inactivates gene expression when bound by a regulatory protein. It functions in a very similar way as enhancers, only differing in the inactivation of genes. [63]

Promoters Edit

A promoter is a region of DNA that facilitates transcription of a particular gene when a transcription factor binds to it. Promoters are typically located near the genes they regulate and upstream of them. [64]

Insulators Edit

A genetic insulator is a boundary element that plays two distinct roles in gene expression, either as an enhancer-blocking code, or rarely as a barrier against condensed chromatin. An insulator in a DNA sequence is comparable to a linguistic word divider such as a comma in a sentence, because the insulator indicates where an enhanced or repressed sequence ends. [65]

Evolution Edit

Shared sequences of apparently non-functional DNA are a major line of evidence of common descent. [66]

Pseudogene sequences appear to accumulate mutations more rapidly than coding sequences due to a loss of selective pressure. [15] This allows for the creation of mutant alleles that incorporate new functions that may be favored by natural selection thus, pseudogenes can serve as raw material for evolution and can be considered "protogenes". [67]

A study published in 2019 shows that new genes (termed de novo gene birth) can be fashioned from non-coding regions. [68] Some studies suggest at least one-tenth of genes could be made in this way. [68]

Long range correlations Edit

A statistical distinction between coding and non-coding DNA sequences has been found. It has been observed that nucleotides in non-coding DNA sequences display long range power law correlations while coding sequences do not. [69] [70] [71]

Forensic anthropology Edit

Police sometimes gather DNA as evidence for purposes of forensic identification. As described in Maryland v. King, a 2013 U.S. Supreme Court decision: [72]

The current standard for forensic DNA testing relies on an analysis of the chromosomes located within the nucleus of all human cells. 'The DNA material in chromosomes is composed of "coding" and "non-coding" regions. The coding regions are known as genes and contain the information necessary for a cell to make proteins. . . . Non-protein coding regions . . . are not related directly to making proteins, [and] have been referred to as "junk" DNA.' The adjective "junk" may mislead the lay person, for in fact this is the DNA region used with near certainty to identify a person. [72]

The Case for Junk DNA

Genomes are like books of life. But until recently, their covers were locked. Finally we can now open the books and page through them. But we only have a modest understanding of what we’re actually seeing. We are still not sure how much our genome encodes information that is important to our survival, and how much is just garbled padding.

Today is a good day to dip into the debate over what the genome is made of, thanks to the publication of an interesting commentary from Alex Palazzo and Ryan Gregory in PLOS Genetics. It’s called “The Case for Junk DNA.”

The debate over the genome can get dizzying. I find the best antidote to the vertigo is a little history. This history starts in the early 1900s.

At the time, geneticists knew that we carry genes–factors passed down from parents to offspring that influence our bodies–but they didn’t know what genes were made of.

That changed starting in the 1950s. Scientists recognized that genes were made of DNA, and then figured out how the genes shape our biology.

Our DNA is a string of units called bases. Our cells read the bases in a stretch of DNA–a gene–and build a molecule called RNA with a corresponding sequence. The cells then use the RNA as a guide to build a protein. Our bodies contain many different proteins, which give them structure and carry out jobs like digesting food.

But in the 1950s, scientists also began to discover bits of DNA outside the protein-coding regions that were important too. These so-called regulatory elements acted as switches for protein-coding genes. A protein latching onto one of those switches could prompt a cell to make lots of proteins from a given gene. Or it could shut down the gene completely.

Meanwhile, scientists were also finding pieces of DNA in the genome that appeared to be neither protein-coding genes nor regulatory elements. In the 1960s, for example, Roy Britten and David Kohne found hundreds of thousands of repeating segments of DNA, each of which turned out to be just a few hundred bases long. Many of these repeating sequences were the product of virus-like stretches of DNA. These pieces of “selfish DNA” made copies of themselves that were inserted back in the genome. Mutations then reduced them into inert fragments.

Other scientists found extra copies of genes that had mutations preventing them from making proteins–what came to be known as pseudogenes.

The human genome, we now know, contains about 20,000 protein-coding genes. That may sound like a lot of genetic material. But it only makes up about 2 percent of the genome. Some plants are even more extreme. While we have about 3.2 billion bases in our genomes, onions have 16 billion, mostly consisting of repeating sequences and virus-like DNA.

The rest of the genome became a mysterious wilderness for geneticists. They would go on expeditions to map the non-coding regions and try to figure out what they were made of.

Some segments of DNA turned out to have functions, even if they didn’t encode proteins or served as switches. For example, sometimes our cells make RNA molecules that don’t simply serve as templates for proteins. Instead, they have jobs of their own, such as sensing chemicals in the cell. So those stretches of DNA are considered genes, too–just not protein-coding genes.

With the exploration of the genome came a bloom of labels, some of which came to be used in confusing–and sometimes careless–ways. “Non-coding DNA” came to be a shorthand for DNA that didn’t encode proteins. But non-coding DNA could still have a function, such as switching off genes or producing useful RNA molecules.

Scientists also started referring to “junk DNA.” Different scientists used the term to refer to different things. The Japanese geneticist Susumu Ohno used the term when developing a theory for how DNA mutates. Ohno envisioned protein-coding genes being accidentally duplicated. Later, mutations would hit the new copies of those genes. In a few cases, the mutations would give the new gene copies a new function. In most, however, they just killed the gene. He referred to the extra useless copies of genes as junk DNA. Other people used the term to refer broadly to any piece of DNA that didn’t have a function.

And then–like crossing the streams in Ghostbusters–junk DNA and non-coding DNA got mixed up. Sometimes scientists discovered a stretch of non-coding DNA that had a function. They might clip out the segment from the DNA in an egg and find it couldn’t develop properly. BAM!–there was a press release declaring that non-coding DNA had long been dismissed as junk, but lo and behold, non-coding DNA can do something after all.

Given that regulatory elements were discovered in the 1950s (the discovery was recognized with Nobel Prizes), this is just illogical.

Nevertheless, a worthwhile questioned remained: how of the genome had a function? How much was junk?

To Britten and Kohne, the idea that repeating DNA was useless was “repugnant.” Seemingly on aesthetic grounds, they preferred the idea that it had a function that hadn’t been discovered yet.

Others, however, argued that repeating DNA (and pseudogenes and so on) were just junk–vast vestiges of disabled genetic material that we carry down through the generations. If the genome was mostly functional, then it was hard to see why it takes five times more functional DNA to make an onion than a human–or to explain the huge range of genome sizes:

In recent years, a consortium of scientists carried out a project called the Encyclopedia of DNA Elements (ENCODE for short) to classify all the parts of the genome. To see if non-coding DNA was functional, they checked for proteins that were attached to them–possibly switching on regulatory elements. They found a lot of them.

“These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions,” they reported.

Science translated that conclusion into a headline, “ENCODE Project writes eulogy for junk DNA.”

A lot of defenders of junk have attacked this conclusion–or, to be more specific, how the research got translated into press releases and then into news articles. In their new review, Palazzo and Gregory present some of the main objections.

Just because proteins grab onto a piece of DNA, for example, doesn’t actually mean that there’s a gene nearby that is going to make something useful. It could just happen to have the right sequence to make the proteins stick to it.

And even if a segment of DNA does give rise to RNA, that RNA may not have a function. The cell may accidentally make RNA molecules, which they then chop up.

If I had to guess why Britten and Kohne found junk DNA repugnant, it probably had to do with evolution. Darwin, after all, had shown how natural selection can transform a population, and how, over millions of years, it could produce adaptations. In the 1900s, geneticists turned his idea into a modern theory. Genes that boosted reproduction could become more common, while ones that didn’t could be eliminated from a population. You’d expect that natural selection would have left the genome mostly full of functional stuff.

Palazzo and Gregory, on the other hand, argue that evolution should produce junk. The reason has to do with the fact that natural selection can be quite weak in some situations. The smaller a population gets, the less effective natural selection is at favoring beneficial mutations. In small populations, a mutation can spread even if it’s not beneficial. And compared to bacteria, the population of humans is very small. (Technically speaking, it’s the “effective population size” that’s small–follow the link for an explanation of the difference.) When non-functional DNA builds up in our genome, it’s harder for natural selection to strip it out than if we were bacteria.

While junk is expected, a junk-free genome is not. Palazzo and Gregory based this claim on a concept with an awesome name: mutational meltdown.

Here’s how it works. A population of, say, frogs is reproducing. Every time they produce a new tadpole, that tadpole gains a certain number of mutations. A few of those mutations may be beneficial. The rest will be neutral or harmful. If harmful mutations emerge at a rate that’s too fast for natural selection to weed them out, they’ll start to pile up in the genome. Overall, the population will get sicker, producing fewer offspring. Eventually the mutations will drive the whole population to extinction.

Mutational meltdown puts an upper limit on how many genes an organism can have. If a frog has 10,000 genes, those are 10,000 potential targets for a harmful mutation. If the frog has 100,000 genes, it has ten times more targets.

Estimates of the human mutation rate suggest that somewhere between 70 to 150 new mutations strike the genome of every baby. Based on the risk of mutational meltdown, Palazzo and Gregory estimate that only ten percent of the human genome can be functional.* The other ninety percent must be junk DNA. If a mutation alters junk DNA, it doesn’t do any harm because the junk isn’t doing us any good to begin with. If our genome was 80 percent functional–the figure batted around when the ENCODE project results first came out–then we should be extinct.

It may sound wishy-washy for me to say this, but the junk DNA debates will probably settle somewhere in between the two extremes. Is the entire genome functional? No. Is everything aside from protein-coding genes junk? No–we’ve already known that non-coding DNA can be functional for over 50 years. Even if “only” ten percent of the genome turns out to be functional, that’s a huge collection of DNA. It’s six times bigger than the DNA found in all our protein-coding genes. There could be thousands of RNA molecules scientists have yet to understand.

Even if ninety percent of the genome does prove to be junk, that doesn’t mean the junk hasn’t played a role in our evolution. As I wrote last week in the New York Times, it’s from these non-coding regions that many new protein-coding genes evolve. What’s more, much of our genome is made up of viruses, and every now and then evolution has, in effect, harnessed those viral genes to carry out a job for our own bodies. The junk is a part of us, and it, too, helps to make us what we are.

*I mean functional in terms of its sequence. The DNA might still do something important structurally–helping the molecule bend in a particular way, for example.

[Update: Fixed caption. Tweaked the last paragraph to clarify that it’s not a case of teleology.]


DNA: Deoxyribonucleic acid is the chemical that stores genetic information in our cells. Shaped like a double helix, DNA passes down from one generation to the next.

RNA: Ribonucleic acid is a type of molecule used in making proteins in the body.

Genome: The complete genetic makeup of an organism, which contains all the biological information to build and keep it alive.

Gene: A stretch of DNA that tells a cell how to make specific proteins or RNA molecules.

Enzyme: A molecule that promotes a chemical reaction inside a living organism.

Stem cell: A biological master cell that can multiply and become many different types of tissue. They can also replicate to make more stem cells.

Functions for the Useless

Nearly a decade after the completion of the Human Genome Project, which gave us the first full read of our genetic script at the start of the century, a team of over 400 scientists released what they called the Encyclopedia of DNA Elements , or ENCODE for short. The international collaboration explored the function of every letter in the genome. The results of the massive undertaking called for a reassessment of junk DNA. Though less than two percent of the genome makes proteins, around 80 percent carries out some sort of function.

What fell into ENCODE’s definition of functionality was pretty broad, however. Any “biochemical activity” was fair game — getting transcribed into RNA, even if chopped later in the process, qualified sequences as functional. But many of the “junk” sections do have important roles, including regulating how DNA is transcribed and translated from there into proteins. If protein-coding sequences are the notes of a symphony, then some of the non-coding sequences act like the conductor, influencing the pace and repetitions of the masterpiece.

But not every bit of junk DNA might have a functional use. In a study published in Molecular Biology of the Cell in 2008, scientists cleaned junk DNA from yeast’s genome. For particular genes, they got rid of introns — the sections that get chopped away after DNA transcription. They reported the intron removal had no significant consequences for the cells under laboratory conditions, supporting the notion that they don’t have any function.

But studies published in Nature this year argued otherwise. When food is scarce, researchers found these sequences are essential for yeast survival. The usefulness of these introns might depend on the context, these studies argue — still a far cry from being junk.

Research team finds important role for junk DNA

Scientists have called it "junk DNA." They have long been perplexed by these extensive strands of genetic material that dominate the genome but seem to lack specific functions. Why would nature force the genome to carry so much excess baggage?

Now researchers from Princeton University and Indiana University who have been studying the genome of a pond organism have found that junk DNA may not be so junky after all. They have discovered that DNA sequences from regions of what had been viewed as the "dispensable genome" are actually performing functions that are central for the organism. They have concluded that the genes spur an almost acrobatic rearrangement of the entire genome that is necessary for the organism to grow.

It all happens very quickly. Genes called transposons in the single-celled pond-dwelling organism Oxytricha produce cell proteins known as transposases. During development, the transposons appear to first influence hundreds of thousands of DNA pieces to regroup. Then, when no longer needed, the organism cleverly erases the transposases from its genetic material, paring its genome to a slim 5 percent of its original load.

Laura Landweber (Photo: Denise Applewhite)

"The transposons actually perform a central role for the cell," said Laura Landweber, a professor of ecology and evolutionary biology at Princeton and an author of the study. "They stitch together the genes in working form." The work appeared in the May 15 edition of Science.

In order to prove that the transposons have this reassembly function, the scientists disabled several thousand of these genes in some Oxytricha. The organisms with the altered DNA, they found, failed to develop properly.

Other authors from Princeton's Department of Ecology and Evolutionary Biology include: postdoctoral fellows Mariusz Nowacki and Brian Higgins 2006 alumna Genevieve Maquilan and graduate student Estienne Swart. Former Princeton postdoctoral fellow Thomas Doak, now of Indiana University, also contributed to the study.

Landweber and other members of her team are researching the origin and evolution of genes and genome rearrangement, with particular focus on Oxytricha because it undergoes massive genome reorganization during development.

In her lab, Landweber studies the evolutionary origin of novel genetic systems such as Oxytricha's. By combining molecular, evolutionary, theoretical and synthetic biology, Landweber and colleagues last year discovered an RNA (ribonucleic acid)-guided mechanism underlying its complex genome rearrangements.

"Last year, we found the instruction book for how to put this genome back together again -- the instruction set comes in the form of RNA that is passed briefly from parent to offspring and these maternal RNAs provide templates for the rearrangement process," Landweber said. "Now we've been studying the actual machinery involved in the process of cutting and splicing tremendous amounts of DNA. Transposons are very good at that."

The term "junk DNA" was originally coined to refer to a region of DNA that contained no genetic information. Scientists are beginning to find, however, that much of this so-called junk plays important roles in the regulation of gene activity. No one yet knows how extensive that role may be.

Instead, scientists sometimes refer to these regions as "selfish DNA" if they make no specific contribution to the reproductive success of the host organism. Like a computer virus that copies itself ad nauseum, selfish DNA replicates and passes from parent to offspring for the sole benefit of the DNA itself. The present study suggests that some selfish DNA transposons can instead confer an important role to their hosts, thereby establishing themselves as long-term residents of the genome.

Is 75% of the Human Genome Junk DNA?

By the rude bridge that arched the flood,
Their flag to April’s breeze unfurled,
Here once the embattled farmers stood,
And fired the shot heard round the world.

–Ralph Waldo Emerson, Concord Hymn

Emerson referred to the Battles of Lexington and Concord, the first skirmishes of the Revolutionary War, as the “shot heard round the world.”

While not as loud as the gunfire that triggered the Revolutionary War, a recent article published in Genome Biology and Evolution by evolutionary biologist Dan Graur has garnered a lot of attention, 1 serving as the latest salvo in the junk DNA wars—a conflict between genomics scientists and evolutionary biologists about the amount of functional DNA sequences in the human genome.

Clearly, this conflict has important scientific ramifications, as researchers strive to understand the human genome and seek to identify the genetic basis for diseases. The functional content of the human genome also has significant implications for creation-evolution skirmishes. If most of the human genome turns out to be junk after all, then the case for a Creator potentially suffers collateral damage.

According to Graur, no more than 25% of the human genome is functional—a much lower percentage than reported by the ENCODE Consortium. Released in September 2012, phase II results of the ENCODE project indicated that 80% of the human genome is functional, with the expectation that the percentage of functional DNA in the genome would rise toward 100% when phase III of the project reached completion.

If true, Graur’s claim would represent a serious blow to the validity of the ENCODE project conclusions and devastate the RTB human origins creation model. Intelligent design proponents and creationists (like me) have heralded the results of the ENCODE project as critical in our response to the junk DNA challenge.

Junk DNA and the Creation vs. Evolution Battle

Evolutionary biologists have long considered the presence of junk DNA in genomes as one of the most potent pieces of evidence for biological evolution. Skeptics ask, “Why would a Creator purposely introduce identical nonfunctional DNA sequences at the same locations in the genomes of different, though seemingly related, organisms?”

When the draft sequence was first published in 2000, researchers thought only around 2–5% of the human genome consisted of functional sequences, with the rest being junk. Numerous skeptics and evolutionary biologists claim that such a vast amount of junk DNA in the human genome is compelling evidence for evolution and the most potent challenge against intelligent design/creationism.

But these arguments evaporate in the wake of the ENCODE project. If valid, the ENCODE results would radically alter our view of the human genome. No longer could the human genome be regarded as a wasteland of junk rather, the human genome would have to be recognized as an elegantly designed system that displays sophistication far beyond what most evolutionary biologists ever imagined.

ENCODE Skeptics

The findings of the ENCODE project have been criticized by some evolutionary biologists who have cited several technical problems with the study design and the interpretation of the results. (See articles listed under “Resources to Go Deeper” for a detailed description of these complaints and my responses.) But ultimately, their criticisms appear to be motivated by an overarching concern: if the ENCODE results stand, then it means key features of the evolutionary paradigm can’t be correct.

Calculating the Percentage of Functional DNA in the Human Genome

Graur (perhaps the foremost critic of the ENCODE project) has tried to discredit the ENCODE findings by demonstrating that they are incompatible with evolutionary theory. Toward this end, he has developed a mathematical model to calculate the percentage of functional DNA in the human genome based on mutational load—the amount of deleterious mutations harbored by the human genome.

Graur argues that junk DNA functions as a “ sponge ” absorbing deleterious mutations, thereby protecting functional regions of the genome. Considering this buffering effect, Graur wanted to know how much junk DNA must exist in the human genome to buffer against the loss of fitness—which would result from deleterious mutations in functional DNA—so that a constant population size can be maintained.

Historically, the replacement level fertility rates for human beings have been two to three children per couple. Based on Graur’s modeling, this fertility rate requires 85–90% of the human genome to be composed of junk DNA in order to absorb deleterious mutations—ensuring a constant population size, with the upper limit of functional DNA capped at 25%.

Graur also calculated a fertility rate of 15 children per couple, at minimum, to maintain a constant population size, assuming 80% of the human genome is functional. According to Graur’s calculations, if 100% of the human genome displayed function, the minimum replacement level fertility rate would have to be 24 children per couple.

He argues that both conclusions are unreasonable. On this basis, therefore, he concludes that the ENCODE results cannot be correct.

Response to Graur

So, has Graur’s work invalidated the ENCODE project results? Hardly. Here are four reasons why I’m skeptical.

1. Graur’s estimate of the functional content of the human genome is based on mathematical modeling, not experimental results.

An adage I heard repeatedly in graduate school applies: “Theories guide, experiments decide.” Though the ENCODE project results theoretically don’t make sense in light of the evolutionary paradigm, that is not a reason to consider them invalid. A growing number of studies provide independent experimental validation of the ENCODE conclusions. (Go here and here for two recent examples.)

To question experimental results because they don’t align with a theory’s predictions is a “ Bizarro World ” approach to science. Experimental results and observations determine a theory’s validity, not the other way around. Yet when it comes to the ENCODE project, its conclusions seem to be weighed based on their conformity to evolutionary theory. Simply put, ENCODE skeptics are doing science backwards.

While Graur and other evolutionary biologists argue that the ENCODE results don’t make sense from an evolutionary standpoint, I would argue as a biochemist that the high percentage of functional regions in the human genome makes perfect sense. The ENCODE project determined that a significant fraction of the human genome is transcribed. They also measured high levels of protein binding.

ENCODE skeptics argue that this biochemical activity is merely biochemical noise. But this assertion does not make sense because (1) biochemical noise costs energy and (2) random interactions between proteins and the genome would be harmful to the organism.

Transcription is an energy- and resource-intensive process. To believe that most transcripts are merely biochemical noise would be untenable. Such a view ignores cellular energetics. Transcribing a large percentage of the genome when most of the transcripts serve no useful function would routinely waste a significant amount of the organism’s energy and material stores. If such an inefficient practice existed, surely natural selection would eliminate it and streamline transcription to produce transcripts that contribute to the organism’s fitness.

Apart from energetics considerations, this argument ignores the fact that random protein binding would make a dire mess of genome operations. Without minimizing these disruptive interactions, biochemical processes in the cell would grind to a halt. It is reasonable to think that the same considerations would apply to transcription factor binding with DNA.

2. Graur’s model employs some questionable assumptions.

Graur uses an unrealistically high rate for deleterious mutations in his calculations.

Graur determined the deleterious mutation rate using protein-coding genes. These DNA sequences are highly sensitive to mutations. In contrast, other regions of the genome that display function—such as those that (1) dictate the three-dimensional structure of chromosomes, (2) serve as transcription factors, and (3) aid as histone binding sites—are much more tolerant to mutations. Ignoring these sequences in the modeling work artificially increases the amount of required junk DNA to maintain a constant population size.

3. The way Graur determines if DNA sequence elements are functional is questionable.

Graur uses the selected-effect definition of function. According to this definition, a DNA sequence is only functional if it is undergoing negative selection. In other words, sequences in genomes can be deemed functional only if they evolved under evolutionary processes to perform a particular function. Once evolved, these sequences, if they are functional, will resist evolutionary change (due to natural selection) because any alteration would compromise the function of the sequence and endanger the organism. If deleterious, the sequence variations would be eliminated from the population due to the reduced survivability and reproductive success of organisms possessing those variants. Hence, functional sequences are those under the effects of selection.

In contrast, the ENCODE project employed a causal definition of function. Accordingly, function is ascribed to sequences that play some observationally or experimentally determined role in genome structure and/or function.

The ENCODE project focused on experimentally determining which sequences in the human genome displayed biochemical activity using assays that measured

  • transcription,
  • binding of transcription factors to DNA,
  • histone binding to DNA,
  • DNA binding by modified histones,
  • DNA methylation, and
  • three-dimensional interactions between enhancer sequences and genes.

In other words, if a sequence is involved in any of these processes—all of which play well-established roles in gene regulation—then the sequences must have functional utility. That is, if sequence Q performs function G, then sequence Q is functional.

So why does Graur insist on a selected-effect definition of function? For no other reason than a causal definition ignores the evolutionary framework when determining function. He insists that function be defined exclusively within the context of the evolutionary paradigm. In other words, his preference for defining function has more to do with philosophical concerns than scientific ones—and with a deep-seated commitment to the evolutionary paradigm.

As a biochemist, I am troubled by the selected-effect definition of function because it is theory-dependent. In science, cause-and-effect relationships (which include biological and biochemical function) need to be established experimentally and observationally, independent of any particular theory. Once these relationships are determined, they can then be used to evaluate the theories at hand. Do the theories predict (or at least accommodate) the established cause-and-effect relationships, or not?

Using a theory-dependent approach poses the very real danger that experimentally determined cause-and-effect relationships (or, in this case, biological functions) will be discarded if they don’t fit the theory. And, again, it should be the other way around. A theory should be discarded, or at least reevaluated, if its predictions don’t match these relationships.

What difference does it make which definition of function Graur uses in his model? A big difference. The selected-effect definition is more restrictive than the causal-role definition. This restrictiveness translates into overlooked function and increases the replacement level fertility rate.

4. Buffering against deleterious mutations is a function.

As part of his model, Graur argues that junk DNA is necessary in the human genome to buffer against deleterious mutations. By adopting this view, Graur has inadvertently identified function for junk DNA. In fact, he is not the first to argue along these lines. Biologist Claudiu Bandea has posited that high levels of junk DNA can make genomes resistant to the deleterious effects of transposon insertion events in the genome. If insertion events are random, then the offending DNA is much more likely to insert itself into “junk DNA” regions instead of coding and regulatory sequences, thus protecting information-harboring regions of the genome.

If the last decade of work in genomics has taught us anything, it is this: we are in our infancy when it comes to understanding the human genome. The more we learn about this amazingly complex biochemical system, the more elegant and sophisticated it becomes. Through this process of discovery, we continue to identify functional regions of the genome—DNA sequences long thought to be “ junk. ”

In short, the criticisms of the ENCODE project reflect a deep-seated commitment to the evolutionary paradigm and, bluntly, are at war with the experimental facts.

Bottom line: if the ENCODE results stand, it means that key aspects of the evolutionary paradigm can’t be correct.

Perennial Problem of C-Value

Information and Structure.

The junk idea long predates genomics and since its early decades has been grounded in the “C-value paradox,” the observation that DNA amounts (C-value denotes haploid nuclear DNA content) and complexities correlate very poorly with organismal complexity or evolutionary “advancement” (10 ⇓ ⇓ ⇓ –14). Humans do have a thousand times as much DNA as simple bacteria, but lungfish have at least 30 times more than humans, as do many flowering plants and some unicellular protists (14). Moreover, as is often noted, the disconnection between C-value and organismal complexity is also found within more restricted groups comprising organisms of seemingly similar lifestyle and comparable organismal or behavioral complexity. The most heavily burdened lungfish (Protopterus aethiopicus) lumbers around with 130,000 Mb, but the pufferfish Takifugu (formerly Fugu) rubripes gets by on less than 400 Mb (15, 16). A less familiar but better (because monophyletic) animal example might be amphibians, showing a 120-fold range from frogs to salamanders (17). Among angiosperms, there is a thousandfold variation (14). Additionally, even within a single genus, there can be substantial differences. Salamander species belonging to Plethodon boast a fourfold range, to cite a comparative study popular from the 1970s (18). Sometimes, such within-genus genome size differences reflect large-scale or whole-genome duplications and sometimes rampant selfish DNA or transposable element (TE) multiplication. Schnable et al. (19) figure that the maize genome has more than doubled in size in the last 3 million y, overwhelmingly through the replication and accumulation of TEs for example. If we do not think of this additional or “excess” DNA, so manifest through comparisons between and within biological groups, as junk (irrelevant if not frankly detrimental to the survival and reproduction of the organism bearing it), how then are we to think of it?

Of course, DNA inevitably does have a basic structural role to play, unlinked to specific biochemical activities or the encoding of information relevant to genes and their expression. Centromeres and telomeres exemplify noncoding chromosomal components with specific functions. More generally, DNA as a macromolecule bulks up and gives shape to chromosomes and thus, as many studies show, determines important nuclear and cellular parameters such as division time and size, themselves coupled to organismal development (11 ⇓ –13, 17). The “selfish DNA” scenarios of 1980 (20 ⇓ –22), in which C-value represents only the outcome of conflicts between upward pressure from reproductively competing TEs and downward-directed energetic restraints, have thus, in subsequent decades, yielded to more nuanced understandings. Cavalier-Smith (13, 20) called DNA’s structural and cell biological roles “nucleoskeletal,” considering C-value to be optimized by organism-level natural selection (13, 20). Gregory, now the principal C-value theorist, embraces a more “pluralistic, hierarchical approach” to what he calls “nucleotypic” function (11, 12, 17). A balance between organism-level selection on nuclear structure and cell size, cell division times and developmental rate, selfish genome-level selection favoring replicative expansion, and (as discussed below) supraorganismal (clade-level) selective processes—as well as drift—must all be taken into account.

These forces will play out differently in different taxa. González and Petrov (23) point out, for instance, that Drosophila and humans are at opposite extremes in terms of the balance of processes, with the minimalist genomes of the former containing few (but mostly young and quite active) TEs, whereas at least one-half of our own much larger genome comprises the moribund remains of older TEs, principally SINEs and LINEs (short and long interspersed nuclear elements). Such difference may in part reflect population size. As Lynch notes, small population size (characteristic of our species) will have limited the effectiveness of natural selection in preventing a deleterious accumulation of TEs (24, 25).

Zuckerkandl (26) once mused that all genomic DNA must be to some degree “polite,” in that it must not lethally interfere with gene expression. Indeed, some might suggest, as I will below, that true junk might better be defined as DNA not currently held to account by selection for any sort of role operating at any level of the biological hierarchy (27). However, junk advocates have to date generally considered that even DNA fulfilling bulk structural roles remains, in terms of encoded information, just junk. Cell biology may require a certain C-value, but most of the stretches of noncoding DNA that go to satisfying that requirement are junk (or worse, selfish).

In any case, structural roles or multilevel selection theorizing are not what ENCODE commentators are endorsing when they proclaim the end of junk, touting the existence of 4 million gene switches or myriad elements that determine gene expression and assigning biochemical functions for 80% of the genome. Indeed, there would be no excitement in either the press or the scientific literature if all the ENCODE team had done was acknowledge an established theory concerning DNA’s structural importance. Rather, the excitement comes from interpreting ENCODE’s data to mean that a much larger fraction of our DNA than until very recently thought contributes to our survival and reproduction as organisms, because it encodes information transcribed or expressed phenotypically in one tissue or another, or specifically regulates such expression.

A Thought Experiment.

ENCODE (5) defines a functional element (FE) as “a discrete genome segment that encodes a defined product (for example, protein or non-coding RNA) or displays a reproducible biochemical signature (for example, protein binding, or a specific chromatin structure).” A simple thought experiment involving FEs so-defined is at the heart of my argument.

Suppose that there had been (and probably, some day, there will be) ENCODE projects aimed at enumerating, by transcriptional and chromatin mapping, factor footprinting, and so forth, all of the FEs in the genomes of Takifugu and a lungfish, some small and large genomed amphibians (including several species of Plethodon), plants, and various protists. There are, I think, two possible general outcomes of this thought experiment, neither of which would give us clear license to abandon junk.

The first outcome would be that FEs (estimated to be in the millions in our genome) turn out to be more or less constant in number, regardless of C-value—at least among similarly complex organisms. If larger C-value by itself does not imply more FEs, then there will, of course, be great differences in what we might call functional density (FEs per kilobase) (26) among species. FEs spaced by kilobases in Arabidopsis would be megabases apart in maize on average. Averages obscure details: the extra DNA in the larger genomes might be sequestered in a few giant silent regions rather than uniformly stretching out the space between FEs or lengthening intragenic introns. However, in either case, this DNA could be seen as a sort of polite functionless filler or diluent. At best, such DNA might have functions only of the structural or nucleoskeletal/nucleotypic sort. Indeed, even this sort of functional attribution is not necessary. There is room within an expanded, pluralistic and hierarchical theory of C-value (see below) (12, 27) for much DNA that makes no contribution whatever to survival and reproduction at the organismal level and thus is junk at that level, although it may be under selection at the sub- or supraorganismal levels (TEs and clade selection).

If the human genome is junk-free, then it must be very luckily poised at some sort of minimal size for organisms of human complexity. We may no longer think that mankind is at the center of the universe, but we still consider our species’ genome to be unique, first among many in having made such full and efficient use of all of its millions of SINES and LINES (retrotransposable elements) and introns to encode the multitudes of lncRNAs and house the millions of enhancers necessary to make us the uniquely complex creatures that we believe ourselves to be. However, were this extraordinary coincidence the case, a corollary would be that junk would not be defunct for many other larger genomes: the term would not need to be expunged from the genomicist’s lexicon more generally. As well, if, as is commonly believed, much of the functional complexity of the human genome is to be explained by evolution of our extraordinary cognitive capacities, then many other mammals of lesser acumen but similar C-value must truly have junk in their DNA.

The second likely general outcome of my thought experiment would be that FEs as defined by ENCODE increase in number with C-value, regardless of apparent organismal complexity. If they increase roughly proportionately, FE numbers will vary over a many-hundredfold range among organisms normally thought to be similarly complex. Defining or measuring complexity is, of course, problematic if not impossible. Still, it would be hard to convince ourselves that lungfish are 300 times more complex than Takifugu or 40 times more complex than us, whatever complexity might be. More likely, if indeed FE numbers turn out to increase with C-value, we will decide that we need to think again about what function is, how it becomes embedded in macromolecular structures, and what FEs as defined by ENCODE have to tell us about it.

What's the origin of junk DNA? - Biology

NIST-led Research De-Mystifies Origins Of 'Junk' DNA

One man's junk, is another's treasure
Washington - Mar 26, 2004
A debate over the origins of what is sometimes called "junk" DNA has been settled by research involving scientists at the Center for Advanced Research in Biotechnology (CARB) and a collaborator, who developed rigorous proof that these mysterious sections were added to DNA "late" in the evolution of life on earth--after the formation of modern-sized genes, which contain instructions for making proteins.

A biologist with the Commerce Department's National Institute of Standards and Technology (NIST) led the research team, which reported its findings in the March 10 online edition of Molecular Biology and Evolution.

The results are based on a systematic, statistically rigorous analysis of publicly available genetic data carried out with bioinformatics software developed at CARB.

In humans, there is so much apparent "junk" DNA (sections of the genome with no known function) that it takes up more space than the functional parts. Much of this junk consists of "introns," which appear as interruptions plopped down in the middle of genes.

Discovered in the 1970s, introns mystify scientists but are readily accounted for by cells: when the cellular machinery transcribes a gene in preparation for making a protein, introns are simply spliced out of the transcript.

Research from the CARB group appears to resolve a debate over the "early versus late" timing of the appearance of introns. Since introns were discovered in 1978, scientists have debated whether genes were born split (the "introns-early" view), or whether they became split after eukaryotic cells (the ones that gave rise to animals and their relatives) diverged from bacteria roughly 2 billion years ago (the "introns-late" view).

Bacterial genomes lack introns. Although the study did not attempt to propose a function for introns, or determine whether they are beneficial or harmful, the results appear to rule out the "introns-early" view.

The CARB analysis shows that the probability of a modern intron's presence in an ancestral gene common to the genes studied is roughly 1 percent, indicating that the vast majority of today's introns appeared subsequent to the origin of the genes.

This conclusion is supported by the findings regarding placement patterns for introns within genes. It long has been observed that, in the sequences of nitrogen-containing compounds that make up our DNA genomes, introns prefer some sites more than others. The CARB study indicates that these preferences are side effects of late-stage intron gain, rather than side effects of intron-mediated gene formation.

The CARB results are based on an analysis of carefully processed data for 10 families of protein-coding genes in animals, plants, fungi and their relatives (see sidebar for details of the method used). A variety of statistical modeling, theoretical, and automated analytical approaches were used while most were conventional, their combined application to the study of introns was novel.

The CARB study also is unique in using an evolutionary model as the basis for inferring the presence of ancestral introns. The research was made possible in part by the increasing availability, over the past decade, of massive amounts of genetic sequence data.

The lead researcher is Arlin B. Stoltzfus of NIST collaborators include Wei-Gang Qiu, formerly of CARB and the University of Mayland and now at Hunter College in New York City, and Nick Schisler, currently at Furman University, Greenville, S.C.

CARB is a cooperative venture of NIST and the University of Maryland Biotechnology Institute.

CARB's Approach to Understanding the Origins of 'Junk' DNA

Scientists long have compared the sequences of chemical compounds in different proteins, genes and entire genomes to derive clues about structure and function.

The most sophisticated comparative methods are evolutionary and rely on matching similar sequences from different organisms, inferring family trees to determine relationships, and reconstructing changes that must have occurred to create biologically relevant differences.

This type of analysis is usually done with one sequence family at a time. The Center for Advanced Research in Biotechnology (CARB), a cooperative venture of the Commerce Department's National Institute of Standards and Technology (NIST) and the University of Maryland Biotechnology Institute, developed software to automate the analysis of dozens--and perhaps hundreds, eventually--of sequence families at a time.

The automated methods also assess the reliability of all the information, so that conclusions are based on the most reliable parts of the analysis.

The CARB method has two parts. The first part consists of a combination of manual and automated processing of gene data from public databases. The data are clustered into families through matching of similar sequences, first in pairs and then in groups.

Then family trees are developed indicating how the genes are related to each other. A file is developed for each family that includes data on sequence matches, intron locations, family trees and reliability measures.

These datasets then are loaded into the second part of the system, which is fully automated. It consist of a relational database combined with software that computes probabilities for introns being present in ancestral genes using a method developed at CARB.

Each gene is assigned to a kingdom (plants, animals, fungi and others), and a matrix of intron presence/absence data is determined for each family based on the sequence alignments. This matrix, along with the family tree, is used to estimate ancestral states of introns, as well as rates of intron loss and gain. Additional software is used for analysis and visualization of results.

The CARB study analyzed data for 10 families of protein-coding genes in multi-celled organisms, encompassing 1,868 introns at 488 different positions.

Life-Seeking Chip Will Join Space Probes
Pasadena (UPI) Mar 23, 2004
U.S. scientists said Tuesday they have developed a miniature laboratory that can spot a tell-tale chemical signature of life.

With the rise of Ad Blockers, and Facebook - our traditional revenue sources via quality network advertising continues to decline. And unlike so many other news sites, we don't have a paywall - with those annoying usernames and passwords.

Watch the video: What is Junk DNA? Why is it important? (October 2022).