Binary Table/Contingency Table representation

Reading in and forming tables from discrete data

A warm up problem: Plato’s sentence endings


>  load("/Users/susan/sphinx/data/platofreq.RData")
>  dimnames(platofreq)[[1]]=c("UUUUU","-UUUU","U-UUUU","UU-UU","UUU-U","UUUU- 
",
+  "--UUU","-U-UU","-UU-U","-UUU-",
+  "U--UU","U-U-U","U-UUU-","UU--U","UU-U-","UUU--",
+  "---UU","--U-U","--UU-" ,"-U--U","-U-U-","-UU--",
+  "UU---","U-U--","U--U-","U---U","U----","-U---",
+  "--U--","---U-","----U","-----")
>  apply(platofreq,2,sum)
 Rep Laws Crit Phil  Pol Soph  Tim 
3783 3788  150  957  772  919  765 
>  apply(platofreq,1,sum)
 UUUUU  -UUUU U-UUUU  UU-UU  UUU-U  UUUU-  --UUU  -U-UU  -UU-U  -UUU-  U--UU 
   219    316    260    255    313    344    308    249    173    655    419 
 U-U-U U-UUU-  UU--U  UU-U-  UUU--  ---UU  --U-U  --UU-  -U--U  -U-U-  -UU-- 
   185    283    264    370    483    450    246    222    262    418    251 
 UU---  U-U--  U--U-  U---U  U----  -U---  --U--  ---U-  ----U  ----- 
   332    265    650    259    396    423    368    701    285    510 
>  require(vcd)
Loading required package: vcd
Loading required package: MASS
Loading required package: grid
Loading required package: colorspace
>  mosaicplot(platofreq[1:10,],color=TRUE,shade=TRUE,main="SubPlato")
>  require(ade4)
>  res.plato=dudi.coa(platofreq,nf=2,scannf=F)
>  res.plato
Duality diagramm
class: coa dudi
$call: dudi.coa(df = platofreq, scannf = F, nf = 2)

$nf: 2 axis-components saved
$rank: 6
eigen values: 0.09128 0.02121 0.009105 0.006023 0.002836 ...
  vector length mode    content       
1 $cw    7      numeric column weights
2 $lw    32     numeric row weights   
3 $eig   6      numeric eigen values  

  data.frame nrow ncol content             
1 $tab       32   7    modified array      
2 $li        32   2    row coordinates     
3 $l1        32   2    row normed scores   
4 $co        7    2    column coordinates  
5 $c1        7    2    column normed scores
other elements: N 
> 
_images/06db00ad68.png

>  scatter(res.plato)
> 
_images/14c4bf8cb9.png

The output of this correspondence analysis is the same as for a pca, the variance explained is replaced by the chisquare distance from independence explained.

What percentage of the total inertia is represented by this first plane?

> firstplane=round(sum(res.plato$eig[1:2])/sum(res.plato$eig),2)
> firstplane
[1] 0.85
> 

We see that 85% of the total chisquare is explained by the first two axes. Beware if you use the function scatter on data tables with too many rows or too many columns the data become unreadable. We need a way of only projecting only columns or only rows.


>  s.label(res.plato$co)
> 
_images/da3d8e1ffe.png

Some facts about tables and CA

  1. Often need to transform the original matrix of data to get the contingency table.
  2. Correspondence Analysis, homogeneity analysis is appropriate for binary or contingency tables.
  3. Gradients can be detected in the same way as in MDS (in fact CA is a type of MDS).
  4. The eigenvalues show how much of the Chisquare Inertia is explained by each successive eigenvector/axis.
  5. More Details in the Stats206 course notes.

Communities of species in Mice Gut Microbiome

With David Relman and Julie Theriot’s Labs we are studying the variability in mice gut ecological communities, we present here part of an analyses we did with Yana Hoy who kindly allowed us to share a subset of the data with you.

We are going to look at a set of mice at their resting states before intervention, however the mice are not homogeneous in many aspects. We want to study the baseline variability and the quality of the data before going further into the study.

The original data were passed through the usual qiime pipeline and the data were transformed into a read abundance table called py.nz.RData, contingent species (OTU) information is contained in the file otu.idy.RData. We load these files into the R workspace:

> load("/Users/susan/sphinx/data/py.nz.RData")
> summary(apply(py.nz,1,sum))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     1.0      2.0     11.0    725.9    144.0 125200.0 
> 

We will do a correspondence analysis on the data using dudi.coa. We only plot the columns (mice), as there are too many taxa, what do we see?


> res.py=dudi.coa(py.nz,nf=4,scannf=F)
> p1=round(100*sum(res.py$eig[1:2])/sum(res.py$eig))
> s.label(res.py$co,boxes=F,clabel=0.9)
> title(paste("Preinfected, axes 1:2",p1,"% inertia"))
> 
_images/25b3289033.png

So we need to take out a few of the mice that are outliers:


> py.nz.coa=dudi.coa(py.nz[,-(70:71)],scannf=F,nf=4)
> s.label(py.nz.coa$co,boxes=F,clabel=0.5)
> title("pre-inf. without 154.28 and 154.29,cols")
> 
_images/31c94a62d2.png

> load("/Users/susan/sphinx/data/cage96.RData")
> cagelabel=factor(cage96[-(70:71)])
> s.chull(py.nz.coa$co,cagelabel)
> 
_images/7559a5da98.png

R code for cut and paste

rm(X)
 load("/Users/susan/sphinx/data/platofreq.RData")
 dimnames(platofreq)[[1]]=c("UUUUU","-UUUU","U-UUUU","UU-UU","UUU-U","UUUU-",
 "--UUU","-U-UU","-UU-U","-UUU-",
 "U--UU","U-U-U","U-UUU-","UU--U","UU-U-","UUU--",
 "---UU","--U-U","--UU-" ,"-U--U","-U-U-","-UU--",
 "UU---","U-U--","U--U-","U---U","U----","-U---",
 "--U--","---U-","----U","-----")
 apply(platofreq,2,sum)
 apply(platofreq,1,sum)
 #Two dimensional summary
 require(vcd)
 mosaicplot(platofreq[1:10,],color=TRUE,shade=TRUE,main="SubPlato")
 require(ade4)
 res.plato=dudi.coa(platofreq,nf=2,scannf=F)
 res.plato
 scatter(res.plato)
firstplane=round(sum(res.plato$eig[1:2])/sum(res.plato$eig),2)
firstplane
 s.label(res.plato$co)
load("/Users/susan/sphinx/data/py.nz.RData")
summary(apply(py.nz,1,sum))
res.py=dudi.coa(py.nz,nf=4,scannf=F)
p1=round(100*sum(res.py$eig[1:2])/sum(res.py$eig))
s.label(res.py$co,boxes=F,clabel=0.9)
title(paste("Preinfected, axes 1:2",p1,"% inertia"))
py.nz.coa=dudi.coa(py.nz[,-(70:71)],scannf=F,nf=4)
s.label(py.nz.coa$co,boxes=F,clabel=0.5)
title("pre-inf. without 154.28 and 154.29,cols")
######First numbers in the label seem to group##
load("/Users/susan/sphinx/data/cage96.RData")
cagelabel=factor(cage96[-(70:71)])
s.chull(py.nz.coa$co,cagelabel)

Encoding communities as graphs

We want to use the above data to build a community graph, ie a graph where there is an edge between two OTUs/species

they co-occur in many samples at the same time, ie, they seem to be part of the

same community.

> require(vegan)
Loading required package: vegan
Loading required package: permute
This is vegan 2.0-2

Attaching package: ‘vegan’

The following object(s) are masked from ‘package:ade4’:

    cca

> require(igraph)
Loading required package: igraph
> threshold=1
> abund=py.nz[,-(70:71)]
> abundance = abund[rowSums(abund)>threshold,]
> dim(abundance)
[1] 481  94
> abundance[1:10,1:4]
   mLS.161.33 mLS.161.34 mLS.165.34 mLS.165.35
1           0          0          0          0
7           0          0          0          2
9           0          0          0          0
10          0          0          0          0
16          0          0          0          0
17          0          0          1          0
18          5         56         30         42
19          0          0          0          0
24          1         11          1          7
26          0          0          0          0
> 

Binary Data and Graphs

The simplest encoding of a graph is done through its adjacency matrix. The nodes correspond to the rows and columns of the matrix and we put a 1 in the matrix if the the two corresponding nodes are connected.

As an example we do this for the taxa in the abundance matrix above. We replace the abundances by presence (1) or absence (0). We look at the Jaccard dissimilarities/similarities to decide how many taxa co-occcur in at least 40% of the samples.

> presenceAbsence = (abundance >0) - 0
> presenceAbsence[1:4,1:4]
   mLS.161.33 mLS.161.34 mLS.165.34 mLS.165.35
1           0          0          0          0
7           0          0          0          1
9           0          0          0          0
10          0          0          0          0
> jaccpa=vegan::vegdist(presenceAbsence, "jaccard")
> n=nrow(abundance)
> jaacm=as.matrix(jaccpa)
> coinc=matrix(0,n,n)
> ind1=which((jaacm>0 & jaacm<(1-0.4)),arr.ind=TRUE)
> coinc[ind1]=1
> dimnames(coinc)=list(dimnames(abundance)[[1]],
+                dimnames(abundance)[[1]])
> coinc[415:420,430:440]
    828 830 831 832 833 836 838 839 840 841 842
801   0   0   0   0   0   0   0   0   0   0   0
802   0   0   0   0   0   0   0   0   0   0   0
804   0   1   1   0   0   1   0   1   1   1   1
808   0   1   1   0   0   1   0   1   1   1   1
809   0   0   0   0   0   0   0   0   0   0   0
811   0   0   0   0   0   0   0   0   0   0   0
> 
> table(coinc)
  coinc
          0      1
     215147  16214

So we can see that the matrix is rather sparse (less than 10% 1s).

R code for cut and paste

require(vegan)
require(igraph)
threshold=1
abund=py.nz[,-(70:71)]
abundance = abund[rowSums(abund)>threshold,]
dim(abundance)
abundance[1:10,1:4]
presenceAbsence = (abundance >0) - 0
presenceAbsence[1:4,1:4]
jaccpa=vegan::vegdist(presenceAbsence, "jaccard")
n=nrow(abundance)
jaacm=as.matrix(jaccpa)
###We make a co-occurrence matrix
coinc=matrix(0,n,n)
ind1=which((jaacm>0 & jaacm<(1-0.4)),arr.ind=TRUE)
coinc[ind1]=1
dimnames(coinc)=list(dimnames(abundance)[[1]],
               dimnames(abundance)[[1]])
coinc[415:420,430:440]

A first graph/network package: igraph

> require(igraph)
> g1=graph.adjacency(coinc,mode="undirected")
> 

> plot(g1, layout=layout.fruchterman.reingold,vertex.size=1, vertex.label=NA, 
edge.width=2)
> 
_images/90c8eae376.png

We can see a lot of isolated points (not part of a community, we take these out):


> delete.isol <- function(graph) {
+    isolates <- which(degree(graph, mode = 'all') == 0) - 1
+    igraph::delete.vertices(graph, isolates)
+    }
> gno <- delete.isol(g1)
> plot(gno, layout=layout.fruchterman.reingold,vertex.size=1, vertex.label=NA 
,edge.width=2)
> 
_images/e26c5ce3b7.png

We would like to automatically label connected components, here we want to see what happens if we are a little more loose in defining our threshold so the graph is more connected.

> coinc2=matrix(0,n,n)
> ind2=which((jaacm>0 & jaacm<(1-0.35)),arr.ind=TRUE)
> coinc2[ind2]=1
> g2=graph.adjacency(coinc2,mode="undirected")
> gno2 <- delete.isol(g2)
> 

> plot(gno2, layout=layout.fruchterman.reingold, vertex.size=1, vertex.label= 
NA, edge.width=2)
> 
_images/d95ea8d35e.png

R code for cut and paste

require(igraph)
g1=graph.adjacency(coinc,mode="undirected")
plot(g1, layout=layout.fruchterman.reingold,vertex.size=1, vertex.label=NA,edge.width=2)
delete.isol <- function(graph) {
   isolates <- which(degree(graph, mode = 'all') == 0) - 1
   igraph::delete.vertices(graph, isolates)
   }
gno <- delete.isol(g1)
plot(gno, layout=layout.fruchterman.reingold,vertex.size=1, vertex.label=NA,edge.width=2)
coinc2=matrix(0,n,n)
ind2=which((jaacm>0 & jaacm<(1-0.35)),arr.ind=TRUE)
coinc2[ind2]=1
g2=graph.adjacency(coinc2,mode="undirected")
gno2 <- delete.isol(g2)
plot(gno2, layout=layout.fruchterman.reingold, vertex.size=1, vertex.label=NA, edge.width=2)

References

Here is a link to a useful wiki about igraph.