Information Content of a Human Being
---Warren D. Smith May 2001---------

The human genome is 3*10^9 base pairs (bp), i.e. 6 gigabits at 2 bits per bp. However, most of this is "junk." There turned out to be fewer human genes than expected, because the number of genes had initially been estimated from a segment of DNA that was anomalously gene-rich. Hence there is more junk than expected. Craig Venter now (April 2001) estimates that there are 26400 genes with at least 2 supporting lines of evidence, plus about 12000 additional gene candidates, most of which (he suspects) will turn out not to be real. The human genome is discussed in Science 291,5507 (16 Feb 2001) and Nature 409,6822 (15 Feb 2001), both special issues.

creature                         #genes    #bp
--------
Human                            32000?    3*10^9
Fly (Drosophila melanogaster)    13338
Worm (C. elegans)                18266
Yeast                            6144
Mustard weed                     25706
H. influenzae                    1743      1830137

Two random humans differ in 1 base pair out of 1250. That implies that the amount of information needed to describe how you genetically DIFFER from me is only about 50 megabits at most. Of course this is probably an overestimate, since most of those differences have no discernible effect.

It is suspected that if all 6 billion human genomes were sequenced, then almost all possible point mutations would be seen. It has been estimated that 1-100 point mutations occur per human generation going back 5000 generations. (I personally have estimated: 60.)

The average span of a "typical" human gene is 27894 bp and it has 7 exons. The largest known human gene is for titin, with 234 exons and about 89 kbp of genuine coding DNA. Chromosome 19 has 23 genes per Mbp (max); Chr. 3 has only 5 (min).

The % of human bp spanned by genes is 25.5-37.8%. The % of bp that are in exons (i.e. code for protein) is 1.1-1.4%. The % that is introns (i.e. transcribed to RNA but discarded) is 24.4-36.4%. The % that is intergenic is 63.6-74.5%. The longest known intergenic length is 3038416 bp.
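The "6 gigabits" and "about 50 megabits at most" figures above can be sanity-checked with a few lines of Python. This is only a sketch: the text does not say how the 50 megabit bound was obtained, so the entropy-based estimate below (each position independently differs with probability 1/1250, and a differing position takes one of 3 other bases) is my assumption, and the function names are made up for illustration.

```python
import math

GENOME_BP = 3e9        # genome length in base pairs, from the text
BITS_PER_BP = 2        # 4 possible bases -> log2(4) = 2 bits
DIFF_RATE = 1 / 1250   # two random humans differ at 1 bp in 1250, from the text


def genome_bits(bp=GENOME_BP):
    """Raw size of the genome in bits."""
    return bp * BITS_PER_BP


def diff_bits(bp=GENOME_BP, p=DIFF_RATE):
    """Assumed model: bits needed to say which positions differ
    (binary entropy per position) plus which of the 3 other bases
    appears at each differing position."""
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return bp * (h + p * math.log2(3))


if __name__ == "__main__":
    print(f"raw genome: {genome_bits() / 1e9:.1f} gigabits")
    print(f"differing positions: {GENOME_BP * DIFF_RATE:.2e}")
    print(f"pairwise difference: {diff_bits() / 1e6:.0f} megabits")
```

Under these assumptions the pairwise difference comes out at a few tens of megabits, which is consistent with the "50 megabits at most" figure; a different encoding model would give a somewhat different number.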
Four classes of "parasitic elements" make up 45% of your DNA; most of this arose from reverse transcription of RNA. 3% of your DNA is short repeats. 5% is duplicated large pieces.

The true number of genes is not known, but it won't change by a heck of a lot. It is for sure between 26000 and 39000. There are also promoters, repressors, and introns, and I don't think they should count as "junk," at least not 100% junk. So maybe human DNA is 98-99% junk.

Now, even of the 1-2% non-junk, it could be argued that a good deal of it is not really transmitting information, since changing a random non-junk bp will usually have no effect on your life. I asked Ned Wingreen once about that & he estimated that a non-junk DNA bp (actually a random bacterial bp) would, if changed, cause 2-3% probability of harm. In that case the "true info content" of human DNA is more like 2-3% of 1%, i.e. a fraction of about .0003, which for humans is 9*10^5 bp, or 1.8*10^6 bits. That is comparable to the amount of info in a typical paperback book, I guess.

On the other hand, these estimates of harm caused by changing just ONE bp in isolation may tell us little about true info content - I suspect randomizing half of your gene bp's would kill you, even if you got to select the least harmful half before I flipped my coins! Suppose instead we guess that maybe 1/6 of the bits in gene DNA actually do matter (6 bits = 1 triplet specify an amino; 1 bit specifies hydrophobic vs hydrophilic); that would be 16.6%, not 2-3%. In that case the "true info content" of human DNA is more like .0016 to .0032 fractionally, i.e. 5*10^6 to 10^7 bp, i.e. 10-20 megabits.

Another estimate of something similar: [AD Keefe & JW Szostak: Functional proteins from a random sequence library, Nature 410 (5 Apr 2001) 715] estimates (based on an experiment) that among 80-amino proteins with random sequences, about 1 in 10^11 of them will "work," i.e. have a chemical functionality you specify.
(Their experiment with 6*10^12 such proteins found 4 that worked, i.e. had ATP-binding functionality.) That suggests that the true "entropy" of a gene is about log2(10^11) = 36.5 bits per gene. If we want the thing not only to work, but also to work well, perhaps that brings the entropy up to 70 bits per gene (twice as much; this is a guess). Anyway, this is an interesting number. If humans have 30000 genes, that means our entropy, in some sense, is only 2.1 Mbits. This is close to the previous estimate based on Wingreen's guess.

---------------------------------

Let's contrast this with the amount of mental information you learn and store in your brain over your lifetime. Psychological experiments have estimated [T.K.Landauer: How much do people remember, Cognitive Sci. 10,4 (1986) 477-493; 12,2 (1988) 293-297] that you remember about 2 bits per second and end up holding somewhere between .5 and 3.4 gigabits. This is smaller than the junk-info content of your genome, but a lot bigger than its true-info content.
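The three "true info content" estimates above are simple products, and it may help to see them side by side. The sketch below just redoes the arithmetic from the text in Python; the function names and default arguments are mine, chosen for illustration, and the inputs (1-2% non-junk, 2-3% harm probability, 1/6 of bits mattering, 1 in 10^11 functional proteins, 70 bits per gene, 30000 genes) are the guesses stated above, not measured values.

```python
import math

GENOME_BP = 3e9    # genome length in base pairs, from the text
BITS_PER_BP = 2    # 2 bits per base pair


def wingreen_bits(nonjunk=0.01, harm=0.03):
    """Estimate 1: only the fraction of non-junk bp whose change
    causes harm is counted as information."""
    return GENOME_BP * nonjunk * harm * BITS_PER_BP


def one_sixth_bits(nonjunk):
    """Estimate 2: guess that 1/6 of the bits in non-junk DNA matter."""
    return GENOME_BP * nonjunk * (1 / 6) * BITS_PER_BP


def bits_per_gene(functional_fraction=1e-11):
    """Estimate 3, per gene: bits needed to pick out a working
    protein when only this fraction of random sequences works."""
    return -math.log2(functional_fraction)


def gene_entropy_bits(genes=30000, per_gene=70):
    """Estimate 3, whole genome, using the guessed 70 bits per gene."""
    return genes * per_gene


if __name__ == "__main__":
    print(f"Wingreen-based:  {wingreen_bits() / 1e6:.1f} Mbit")
    print(f"1/6 of bits:     {one_sixth_bits(0.01) / 1e6:.0f}"
          f"-{one_sixth_bits(0.02) / 1e6:.0f} Mbit")
    print(f"bits per gene:   {bits_per_gene():.1f}")
    print(f"30000 genes:     {gene_entropy_bits() / 1e6:.1f} Mbit")
    print("Landauer memory: 500-3400 Mbit")
```

All three genome estimates land between about 2 and 20 megabits, so even the low end of the Landauer range is larger by a factor of 25 or more.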