Big Data Solves Mystery: Why Humans Have No More Genes Than Worms

How can Man the Great have no more genes than a nematode, and fewer than a grape? A Haifa University analysis figures it out: We don't know what a 'gene' is.

The human genome has been sequenced, and to our mortification, we seem to have no more genes than a worm, and far fewer than a tomato. Surely Man the Great and Powerful should have a lot more genes than a lowly invertebrate or fruit? Now a Haifa University team of statisticians and geneticists has found evidence backing a theory of why Man and Nematode could have about 20,000 genes each, yet be so vastly disparate in complexity: the existence of genetic networks, whose effects are so weak as to be almost undetectable by the usual statistical means.

The team proposes a new approach to identifying genetic networks, meaning groups of genes that act jointly and affect each other's expression in different ways under different circumstances.

Until recently, genetics dealt mainly with the identification and effects of single genes that express themselves powerfully enough to be noticed. The methods devised by statistician Pavel Goldstein, under the guidance of Dr. Anat Reiner Ben-Naim in collaboration with Prof. Abraham Korol of the Evolutionary & Environmental Biology Department, can identify these networks, and may also help solve one of the big mysteries in genetics – what all that "junk DNA" in our genomes is really doing.

Pavel Goldstein. Photo by Avigail Tsuper

Have letters, can't read book

How many genes humans have is still unknown, though we believe we know most of them. More importantly, we don't know what our genes are doing, explains Goldstein.

It's like knowing the words comprising a language, but not knowing the language. What progress has been made in deciphering the "book of our DNA" has indicated that humans may have far fewer genes than assumed: somewhere between 20,000 and 25,000.

That number sounds embarrassingly small considering that nematode worms have about 20,000 too, grapes have about 30,000 and tomatoes have nearly 32,000.

But our obsession with "counting genes" may have misled us, if a "gene" is not a stand-alone thing that is either expressed (as a protein), partially expressed or not expressed. If genes interact and affect each other's expression, a given sequence of DNA may behave one way when "nudged" by another gene, and behave differently when "nudged" by a different gene, Goldstein explains to Haaretz.

And this may explain how a being (in our case, a human) can have so few "identified genes" and be so biologically complex.

The phenomenon of "cooperative" or interdependent genes is called epistasis. An epistatic gene's effect depends on activity by one or more "modifier genes."
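To make the definition concrete, here is a minimal toy sketch in Python. It is not the Haifa team's model: the two-locus setup, the 0/1 coding and the function names `phenotype` and `effect_of_a` are all assumptions made purely for illustration. Gene A does nothing on its own; its effect shows up only when modifier gene B is active.

```python
def phenotype(gene_a: int, modifier_b: int, interaction: float = 2.0) -> float:
    """Toy trait value for two loci, each coded 0 (inactive) or 1 (active).

    Hypothetical model: neither gene has an effect by itself (main effects
    are zero); the trait changes only when both are active together.
    """
    main_a = 0.0  # assumed: gene A alone does nothing
    main_b = 0.0  # assumed: modifier B alone does nothing
    return main_a * gene_a + main_b * modifier_b + interaction * gene_a * modifier_b


def effect_of_a(modifier_b: int) -> float:
    """Change in the trait caused by switching gene A on, for a given state of B."""
    return phenotype(1, modifier_b) - phenotype(0, modifier_b)


if __name__ == "__main__":
    for b in (0, 1):
        print(f"modifier B = {b}: effect of gene A = {effect_of_a(b)}")
```

A study that looks at gene A in isolation, averaging over individuals in whom B is mostly inactive, would see little or no effect and could easily miss it.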

That was the theory. Now, using big data techniques, the Haifa team of statisticians and geneticists managed to nail down significant evidence of epistasis in action, using the genome of the humble rockcress, a cousin of cabbage.

This isn't about how many angels can dance on the head of a pin. A great many diseases, not least auto-immune conditions, could boil down to epistatic relations between genes that have yet to be identified, Dr. Reiner points out.

Rockcress, we thank thee for thy contribution to genetic research. Photo by Brona, Wikimedia

One man's junk

So at this point, definitions such as "man has 20,000 genes and so does the cat" are pretty pointless. The same applies to baffling genomic sequencing results indicating that about 98% of man's DNA is junk. (At least some of that junk turned out to be regulatory sequences - turning genes "on" or "off" or moderating their expression according to biological signals. Other bits are apparently genes we just haven't identified.)

"The Genome Project began in 1990 with the ambitious goal of sequencing the human genome within 15 years. They tried to take each gene and map it, but reached the conclusion that they had 98% junk because they didn't factor in the assumption of possible interactions between sequences, and that a given sequence could behave differently under different circumstances," Goldstein explains. "They only identified the 2% that had a strong enough effect to be identified in isolation."

So we have great hulking molecules of DNA with an unknown number of genes that interact with one another in unknown ways under circumstances we don't know. Tricky, analyzing that: the number of ways genes could "combine" and interact is astronomically large. "It had been very hard to look at that effectively," Goldstein remarks.
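A back-of-the-envelope count shows the scale of the problem. The sketch below simply counts how many distinct groups of two, three or four genes can be drawn from roughly 20,000, the gene count quoted above. Using that figure here is an assumption for illustration only; it is not the number of markers the team actually analyzed.

```python
import math

N_GENES = 20_000  # rough gene count quoted in the article, used only for scale


def n_combinations(n_genes: int, order: int) -> int:
    """Number of distinct unordered groups of `order` genes out of `n_genes`."""
    return math.comb(n_genes, order)


if __name__ == "__main__":
    for order in (2, 3, 4):
        print(f"groups of {order}: {n_combinations(N_GENES, order):,}")
```

Pairs alone come to roughly 200 million and triples to more than a trillion, and each candidate group would need its own statistical test.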

This is where statistical theory comes into play.

Using our friend the rockcress genome, the team illustrated how their approach works with real data, and developed an algorithm to detect epistasis (collaboration between genes), while minimizing the chance of false identifications – by ignoring single, isolated known traits (genes). Instead, they looked at clusters (groups) of similar traits, using data mining techniques, he explains.

They also exploited the dependence between neighboring DNA regions, which arises from the tight linkage of molecular markers, and then applied a hierarchical search technique. In other words, they began with an initial global screen for epistatic regions, followed by a more focused search only in the areas with a high probability of epistasis.

That two-step approach enabled identification of very tiny epistatic effects that cannot be identified by other methods, Goldstein elaborates. The approach also dramatically reduces the number of statistical tests needed, considerably decreasing computation time.
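The general idea of a coarse screen followed by a focused search can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the team's published algorithm: the simulated data, the -1/+1 genotype coding, the correlation-based `interaction_score`, the use of one representative marker per region and the `keep` cutoff are all hypothetical choices. It also only considers pairs of markers from different regions.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def interaction_score(col_i, col_j, trait):
    """|corr(x_i * x_j, trait)|: a crude stand-in for a real epistasis test."""
    product = [a * b for a, b in zip(col_i, col_j)]
    return abs(pearson(product, trait))


def simulate(n_ind=400, n_regions=10, per_region=5, true_pair=(8, 33), seed=1):
    """Simulated genotypes: markers in the same region are tightly linked."""
    rng = random.Random(seed)
    cols = [[] for _ in range(n_regions * per_region)]
    trait = []
    for _ in range(n_ind):
        for r in range(n_regions):
            base = rng.choice((-1, 1))
            for k in range(per_region):
                # Each marker copies the region's allele 95% of the time.
                allele = base if rng.random() > 0.05 else -base
                cols[r * per_region + k].append(allele)
        a, b = true_pair
        # The trait depends only on the interaction of the two true markers.
        trait.append(cols[a][-1] * cols[b][-1] + rng.gauss(0, 0.3))
    return cols, trait


def two_stage_search(cols, trait, n_regions, per_region, keep=3):
    """Stage 1 screens region pairs; stage 2 tests markers in the best ones."""
    tests = 0
    # Stage 1: one representative (middle) marker stands in for each region.
    reps = [r * per_region + per_region // 2 for r in range(n_regions)]
    region_scores = []
    for r1, r2 in itertools.combinations(range(n_regions), 2):
        score = interaction_score(cols[reps[r1]], cols[reps[r2]], trait)
        region_scores.append((score, (r1, r2)))
        tests += 1
    region_scores.sort(reverse=True)
    candidates = [pair for _, pair in region_scores[:keep]]
    # Stage 2: every marker pair, but only inside the retained region pairs.
    hits = []
    for r1, r2 in candidates:
        for i in range(r1 * per_region, (r1 + 1) * per_region):
            for j in range(r2 * per_region, (r2 + 1) * per_region):
                hits.append((interaction_score(cols[i], cols[j], trait), (i, j)))
                tests += 1
    hits.sort(reverse=True)
    return hits, tests


if __name__ == "__main__":
    cols, trait = simulate()
    hits, tests = two_stage_search(cols, trait, n_regions=10, per_region=5)
    print("tests run:", tests, "versus exhaustive:", math.comb(len(cols), 2))
    print("top marker pairs:", [pair for _, pair in hits[:3]])
```

With 50 simulated markers, the sketch runs 45 region-level tests plus 75 marker-level tests, 120 in total, against 1,225 for an exhaustive scan of all marker pairs. Fewer tests also means a lighter multiple-testing burden, which is one plausible reason a staged search can pick up weaker effects.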

The result: the identified effects provide a broader picture of related networks of genes. So it seems science just hasn't figured out yet what all those "junk" sequences do when provoked the right way. And unbelievably complex interactions between genes could explain just why Man might, after all, have a third fewer genes than a grape, and still be a lot more complex.