Seeing the structure and the surprises
Counting for something
We will give a few examples of finding useful patterns in data and ways of validating them. Here is an example with DNA. One of the main discoveries that led to understanding the structure of DNA was made by Erwin Chargaff. Prior to his work, researchers had thought that DNA was made up of equal amounts of the four bases adenine, guanine, cytosine, and thymine (Phoebus Levene’s tetranucleotide hypothesis), but Chargaff showed in the late 1940s that there was a definite departure from this pattern.
Box: Chargaff’s experiment was on salmon sperm. He measured the base composition using paper chromatography and an ultraviolet spectrophotometer.
The most useful way of formulating these results is to think of the four categories of nucleotides (A, T, G, C) as four boxes into which we drop balls. What Levene postulated is what we call a multinomial distribution with equal probabilities \((p_1=0.25, p_2=0.25, p_3=0.25, p_4=0.25)\), which can be thought of as making the boxes all the same width when we drop the balls into them. In fact, experiments show the following percentages of each of the four types:
Relative Proportions (%) of Bases in DNA
Organism | A | T | G | C |
Human | 30.9 | 29.4 | 19.9 | 19.8 |
Mycobacterium tub. | 15.1 | 14.6 | 34.9 | 35.4 |
Chicken | 28.8 | 29.2 | 20.5 | 21.5 |
Grasshopper | 29.3 | 29.3 | 20.5 | 20.7 |
Sea Urchin | 32.8 | 32.1 | 17.7 | 17.3 |
Wheat | 27.3 | 27.1 | 22.7 | 22.8 |
Yeast | 31.3 | 32.9 | 18.7 | 17.1 |
E. coli | 24.7 | 23.6 | 26.0 | 25.7 |
We’ll enter the data as a vector to begin with:
> Chargaff=c(30.9, 29.4, 19.9, 19.8, 15.1, 14.6, 34.9, 35.4, 28.8, 29.2,
+ 20.5, 21.5, 29.3, 29.3, 20.5, 20.7, 32.8, 32.1, 17.7, 17.3, 27.3,
+ 27.1, 22.7, 22.8, 31.3, 32.9, 18.7, 17.1, 24.7, 23.6, 26, 25.7)
> Chargafftable=matrix(Chargaff,byrow=T,ncol=4)
> Chargafftable
     [,1] [,2] [,3] [,4]
[1,] 30.9 29.4 19.9 19.8
[2,] 15.1 14.6 34.9 35.4
[3,] 28.8 29.2 20.5 21.5
[4,] 29.3 29.3 20.5 20.7
[5,] 32.8 32.1 17.7 17.3
[6,] 27.3 27.1 22.7 22.8
[7,] 31.3 32.9 18.7 17.1
[8,] 24.7 23.6 26.0 25.7
Here is the code.
R code for cut and paste
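#####
# Base percentages (A, T, G, C) for the eight organisms, entered row by row
Chargaff=c(30.9, 29.4, 19.9, 19.8, 15.1, 14.6, 34.9, 35.4, 28.8, 29.2,
20.5, 21.5, 29.3, 29.3, 20.5, 20.7, 32.8, 32.1, 17.7, 17.3, 27.3,
27.1, 22.7, 22.8, 31.3, 32.9, 18.7, 17.1, 24.7, 23.6, 26, 25.7)
# Arrange as an 8 x 4 matrix: one row per organism, one column per base
Chargafftable=matrix(Chargaff,byrow=TRUE,ncol=4)
Chargafftable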
Do you agree with the tetranucleotide hypothesis of uniformity after seeing these data?
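To get a feel for what the equal-probability model would produce, one can simulate it. The sketch below is only an illustration: the number of bases per sample (1000) and the random seed are arbitrary assumptions, not values from Chargaff’s experiments.

```r
# Simulate base percentages under the tetranucleotide hypothesis.
# ASSUMPTION: 1000 bases per simulated organism, chosen only for illustration.
set.seed(1)                 # arbitrary seed, for reproducibility
n_bases <- 1000
p_equal <- rep(0.25, 4)     # equal probabilities for A, T, G, C

# rmultinom() returns one column per simulated sample; transpose so that
# rows are samples, as in the table of observed percentages.
sim_counts <- t(rmultinom(8, size = n_bases, prob = p_equal))
colnames(sim_counts) <- c("A", "T", "G", "C")

# Convert counts to percentages for comparison with the observed table.
sim_pct <- 100 * sim_counts / n_bases
sim_pct
```

With 1000 bases per sample, each simulated percentage has a standard deviation of about 1.4 points around 25, so departures as large as those in several rows of the observed table would be very unusual under this model.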
Chargaff found a definite pattern, later explained by base pairing: the amount of adenine (A) in the DNA of an organism is closely matched by the amount of thymine (T) (this is called Chargaff’s rule). Similarly, whatever the amount of guanine (G), the amount of cytosine (C) is nearly the same.
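The rule can be checked directly on the table. The following is a minimal sketch; the row and column labels are added here for readability, and the names `AT_diff` and `GC_diff` are illustrative choices.

```r
# Base percentages from the table above, one row per organism.
Chargaff <- c(30.9, 29.4, 19.9, 19.8, 15.1, 14.6, 34.9, 35.4, 28.8, 29.2,
              20.5, 21.5, 29.3, 29.3, 20.5, 20.7, 32.8, 32.1, 17.7, 17.3,
              27.3, 27.1, 22.7, 22.8, 31.3, 32.9, 18.7, 17.1, 24.7, 23.6,
              26, 25.7)
organisms <- c("Human", "Mycobacterium tub.", "Chicken", "Grasshopper",
               "Sea Urchin", "Wheat", "Yeast", "E. coli")
Chargafftable <- matrix(Chargaff, byrow = TRUE, ncol = 4,
                        dimnames = list(organisms, c("A", "T", "G", "C")))

# Differences between the paired bases, in percentage points.
AT_diff <- Chargafftable[, "A"] - Chargafftable[, "T"]
GC_diff <- Chargafftable[, "G"] - Chargafftable[, "C"]
round(cbind(AT_diff, GC_diff), 1)

# Largest absolute discrepancy across all organisms and both pairs.
max(abs(c(AT_diff, GC_diff)))
```

In this table the paired percentages never differ by more than about 1.6 points, while A and G can differ by almost 20.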
The departure from the balanced multinomial distribution is not subtle, and we can quantify it with a “score” that measures how far the observed proportions are from the expected ones. Computing such a score is a common situation in statistics; a score computed from data in this way is called a statistic.
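One possible score is a chi-square-type sum of squared deviations from the expected 25%, scaled by the expected value. This particular choice is an assumption made for illustration. Because the table holds percentages rather than counts, the resulting values are descriptive only and should not be compared with a chi-square reference distribution.

```r
# Base percentages from the table above, one row per organism.
Chargaff <- c(30.9, 29.4, 19.9, 19.8, 15.1, 14.6, 34.9, 35.4, 28.8, 29.2,
              20.5, 21.5, 29.3, 29.3, 20.5, 20.7, 32.8, 32.1, 17.7, 17.3,
              27.3, 27.1, 22.7, 22.8, 31.3, 32.9, 18.7, 17.1, 24.7, 23.6,
              26, 25.7)
organisms <- c("Human", "Mycobacterium tub.", "Chicken", "Grasshopper",
               "Sea Urchin", "Wheat", "Yeast", "E. coli")
Chargafftable <- matrix(Chargaff, byrow = TRUE, ncol = 4,
                        dimnames = list(organisms, c("A", "T", "G", "C")))

# Expected percentage of each base under equal probabilities.
expected <- 25

# Chi-square-type score per organism: sum of (observed - expected)^2 / expected.
score <- rowSums((Chargafftable - expected)^2 / expected)

# Organisms ordered from the largest to the smallest departure.
round(sort(score, decreasing = TRUE), 2)
```

On this scale Mycobacterium tub. departs most from uniformity and E. coli least.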
The table also demonstrates that the distribution varies from one species to another.
Question: Can we tell the species from the distribution? Given a stretch of DNA, look at the base proportions and guess which species would have been most likely to produce such a distribution.
How sure are we of the answer?
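One way to approach both questions is to compare multinomial likelihoods. The sketch below is a sketch under stated assumptions rather than a definitive method: the base counts in `counts` are hypothetical, the eight organisms in the table are taken to be the only candidates, and each is given equal prior weight.

```r
# Base percentages from the table above, one row per organism.
Chargaff <- c(30.9, 29.4, 19.9, 19.8, 15.1, 14.6, 34.9, 35.4, 28.8, 29.2,
              20.5, 21.5, 29.3, 29.3, 20.5, 20.7, 32.8, 32.1, 17.7, 17.3,
              27.3, 27.1, 22.7, 22.8, 31.3, 32.9, 18.7, 17.1, 24.7, 23.6,
              26, 25.7)
organisms <- c("Human", "Mycobacterium tub.", "Chicken", "Grasshopper",
               "Sea Urchin", "Wheat", "Yeast", "E. coli")
Chargafftable <- matrix(Chargaff, byrow = TRUE, ncol = 4,
                        dimnames = list(organisms, c("A", "T", "G", "C")))

# HYPOTHETICAL data: base counts observed in a stretch of 100 bases.
counts <- c(A = 16, T = 14, G = 36, C = 34)

# Log-likelihood of the counts under each organism's base proportions.
# dmultinom() rescales 'prob' to sum to 1, so percentages can be passed as is.
loglik <- apply(Chargafftable, 1,
                function(p) dmultinom(counts, prob = p, log = TRUE))

# "Which species?": the organism with the highest likelihood.
best <- names(which.max(loglik))
best

# "How sure are we?": posterior probabilities under equal prior weights
# (an assumption), computed stably by subtracting the maximum first.
posterior <- exp(loglik - max(loglik))
posterior <- posterior / sum(posterior)
round(sort(posterior, decreasing = TRUE), 4)
```

For these hypothetical counts the Mycobacterium tub. row has the highest likelihood. Species with similar base compositions, such as Chicken and Grasshopper, would be much harder to tell apart from a stretch this short.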