HW1 Stats 366 - Stats 166¶
Note
Due by 11am, April 13th, 2012 (timestamp used as proof).
Send a complete pdf file to ahead@stanford.edu
Reading and revision¶
- Review slides from this week.
- What are ten words you looked up on wikipedia for clarification?
Do the Lab, make sure to provide a copy of your R commands and the output. Provide comments for both the functions you used and the input. If you can use Sweave, it is highly recommended, in any case, even if you use word, please hand in a pdf file.
Note
Here is a simple example of good comments
GCcontent <- function(myseq)
{ #####A function to compute GC content of the input sequence myseq
mytable <- count(myseq, 1) # make a table with the count of As, Cs, Ts, and Gs
mylength <- length(myseq) # find the length of the whole sequence
myCs <- mytable[[2]] # number of Cs in the sequence
myGs <- mytable[[3]] # number of Gs in the sequence
myCG <- (myCs + myGs)/mylength
return(myCG)
}
About Sweave (for the hardcore programmers who like latex only)¶
Some links to Sweave documentation:
- http://users.stat.umn.edu/~geyer/Sweave/
- Sweave with Rstudio - http://www.math.montana.edu/~jimrc/classes/Rseminar/SweaveIntro.html
- You prefer videos, here’s a 25 minute presentation of R and Sweave http://www.youtube.com/watch?v=jhj_97SNl7Y
LAB component¶
- Look up on the NCBI nucleotide page the genome of Mycobacterium tuberculosis (reference strain H37Rv). Explain the term fasta.
- Using either seqinr or geneR do not download the full genome of Mycobacterium tuberculosis H37Rv (this is large), but just the first contig (positions 1-341957).
- What is the length of the full genome of Mycobacterium tuberculosis H37Rv accession number AL123456 ?
- What are the last 12 nucleotides of the genome?
- From within R write a fasta file called mycontig1.fasta with the first 341957 base pairs.
- Check that you can read in the file with the read.fasta function.
- What is the GC content of the first 341957 base pairs?
- How many CG digrams are there?
- Is this close to the expected number under an independence model? (ie supposing that the occurrence of a G following a C is independent)
- Do a chisquare test to see if the difference is significant.
- Follow the instructions in chapter 2 of the little book of r to make a plot of GC content of a moving window of width 500 along the first contig (positions 1-341957).
- Download the genome with NCBI reference: NC_016893.1, call this seqyale.
- Put the frequencies (counts/total) of a,c,g,t in a vector of length 4 called pest.
- Simulate a new sequence of the same length as seqyale called simuseq by drawing independently from a multinomial distribution with probabilities \((p_a, p_t, p_c, p_g)=\) pest. (do not use a loop)
- Make a comparative plot of the frequencies of AT digrams in the simulated data simuseq and the seqyale data along the genome.