A Tutorial for Clustering with XCluster
Many people have requested additional documentation for using XCluster
(not surprising since there wasn't any). This is a first attempt at a
tutorial, and is based around using the Mac version.
- Create An Input File
The input file must be tab-delimited. You should check that
this is in fact the case for your input file. If it is space
delimited, it simply will not work. The easiest way (unless you can
program in Perl) to generate your input file is by cutting and pasting
your columns of data in Excel. Your file needs to have the first
three columns as follows:
- Name
This column will contain a name you want to
ascribe an entity, such as ORF name, or CLONEID. The column itself
can be named anything, but MUST contain some text on every row.
- Description
This column can contain descriptive
information about the entity, eg process or function, or gene symbol.
It too can be named anything. It can optionally be left blank, but
the column itself must be present.
- GWEIGHT
This column allows you to weight genes with
different weights, for instance if a gene appears on an array twice,
you may want to give them a weight of 0.5 each. For the most part
people leave this column with a value of 1 for every gene. This
column must be present, and each row must have an entry.
In addition the file must begin with the following two rows:
- Row 1
This contains the column headers as described
above for columns 1, 2 and 3, then contains the experiment names for
all the data columns that exist in the file. Each data column must
have a text entry as a name for that column.
- Row 2
This is the EWEIGHT row. The first column
should say EWEIGHT, then for each experiment, there should be an
EWEIGHT value. This will usually be 1, but if the same experiment is
duplicated twice, you may want to give these repeats an EWEIGHT of
0.5.
The remaining cells in the file contain the actual data, such that the
row and column specifies to which gene and which experiment a
particular piece of data corresponds. If you had created your file in
Excel, it would look something like this:
You should then choose Save As... from the File menu,
and elect to the the file as type Text (Tab delimited), as
indicated below:
- The Data
In general, if you are using ratio data, you will want to cluster
the data after it has been log transformed. XCluster can log
transform your data for you (see below). Hence there must be no
negative or zero values in you dataset prior to log transformation. You
should remove them prior to log transforming. If you are using
Affymetrix style data, which are absolute values, one way of
converting such measurements to a ratio style measurement is to
calculate row averages for each gene (this is easy in Excel), then
divide the measurement for each gene by it's row average. These
values can then be log transformed.
- Clustering
Place your input data file into the same folder as the clustering
application :
then double click on the clustering application.
- Enter the filename
- Specify whether the data need log transformation
- Specify whether you want to partition the data
- Specify a distance metric
- Specify whether you want to cluster genes
- If so, and you are using the Pearson Correlation, specify
whether you want to use a centered metric.
- Specify whether you want to cluster experiments
- If so, and you are using the Pearson Correlation, specify
whether you want to use a centered metric.
Clustering will then proceed. An example of how your window may look
is shown below, where no log transformation was needed, no partitioning
was used, pearson correlation was used as a similarity metric, and
only genes were clustered, using a non-centered version of the pearson
correlation:
- The output
For each cluster you make, you will generate a .cdt file, which will
contain the original data, but reordered, to reflect the clustering.
In addition, if you clustered by genes, you will get a .gtr file (gene
tree), and if you clustered by experiments you will get a .atr file
(array tree). These tree files reflect the history of how the cluster
was built, and can be used to contruct how the tree(s) should look.
An example of these three files is shown below, as they would look if
you opened them in Excel:
- The .cdt file
- The .gtr file
- The .atr file
- So how do I view the data?
The best software to do this is Mike Eisen's
TreeView,
or Java TreeView.
Partitioning Methods
XCluster currently supports two different partitioning methods, Self
Organizing Maps, and k-means clustering. The first question is why
would you want to partition the data anyway?
When you simply hierarchically cluster your entire dataset, you are
asking that all things be joined at some level, regardless of whether
they have any similarity to each other. Consequently the closer you
get to the root of the tree, the less it means. In addition
hierachical clustering, when joining two nodes, takes the node as being
equal to the average of the genes it contains. As nodes start to
contain unrelated things, their being joined to something else becomes
meaningless. For these reasons, it may seem a reasonable idea to
split the dataset up into several smaller datasets, where the members
of each dataset show at least some similarity to each other. This
idea is akin to if you were going to build trees of protein
similarity. You would likely first split up your data into protein
families, based on BLAST scores, then you would cluster the contents of
the families separately. XCluster does the equivalent of this for
gene expression data. Using either SOMs or k-means it splits the data
up into smaller subsets, and then applies hierarchical clustering to
each of the subsets.
- Self Organizing Maps
SOMs were devised by Tuevo
Kohonen, and first used by Tamayo
et al to analyze gene expression data. The description here
pertains to my SOM implementation, which may or may not be identical
to Tamayo's one.
- The SOM is initialized
XCluster will ask you what you want to use as the x and y dimensions of
your self organizing map. The x and y dimensions will determine the
number of partitions that you segregate your data into, eg. if you
choose 3 x 3, you will get 9 partitions. Internally XCluster will set
up 9 vectors, one for each partition. The dimensionality of the
vectors is determined by the number of experiments that there are in
your datafile. For instance is you have 10 experiments, then the
vectors will be 10 dimensional, ie have 10 coordinates. Each of these
coordinates is randomly initialized. In addition these partitions
have a relationship to each other, in that they sit in a 2-dimensional
grid, as defined by the user (eg 3 x 3). In this grid, it could be
said that one partition is physically closer to another partition than
it is to a 3rd partition. eg the top left partition in a 3 x 3 grid
is closer to the middle left partition, than it is to the bottom right
partition. This concept is important when we consider how we refine
the SOM. At the beginning the relationship of the partitions in
physical space has no bearing on the relationships between the vectors
that associate with them. Thus the map is unorganized.
- The SOM is refined
A gene from your list of genes in your data file is then picked at
random, and its expression pattern is compared to each of the vectors
that were initialized randomly in the first step, one vector for each
partition. The vector to which the expression of the picked gene is
most similar is then modified, so that it more closely resembles the
expression pattern of that gene. In addition the vectors that belong
to the partitions that are physically closest (in the 2 dimensional
grid) to the partition whose vector was just modified are also
modified, so that they too resemble the gene's expression pattern a
little more closely. This process (picking a gene at random; seeing
to which vector it is most similar; modifying that vector; modifying
the vectors of the partitions that are close to the partition which had
the most similar vector) is repeated 100,000 times. As the iteration
number increases, the amount by which the vectors are modified
decreases. Also the definition of which partitions are close to each
other changes. eg in the first iteration all partitions may be
considered close to each other, but after 50,000 iterations a
partition may be considered close to another, only if it is less than
half of the width of the entire grid away. At the end no partitions
may be considered close to each other. Essentially a circle is drawn
on the 2 dimensional grid, with it's center being in the center of the
partition whose vector was most similar to the picked gene's
expression pattern. Any partitions that fall within that circle are
considered 'close', and so their vectors are modified. The radius of
this circle decreases with each iteration. Hence with each iteration
fewer vectors are modified by smaller amounts, so that the map
eventually settles down. It is important to note that algorithm leads
to the vectors of neighbouring partitions being somewhat similar to
each other, and vectors of partitions that are physically distant are
dissimilar to each other. Thus the map has become organized.
- The genes are partitioned
After the 100,000 iterations have occured, the genes are then
partitioned. This is simply done by assigning a gene to the partition
whose associated vector is most similar to its expression pattern. A
gene may currently only be put in one partition (though this could be
changed), and a partition may theoretically contain all or none of the genes. In
practice though this rarely occurs, unless there are few genes, or few
partitions.
- The partitions are clustered
The contents of each partition is then clustered by the hierarchical
clustering algorithm which will be detailed below.
- K-means Clustering
K-means clustering is a simple partitioning method that has been used
for decades, and is similar in concept to SOMs, though it is
mechanistically different. In addition the different partitions that
are created by k-means are not connected to each other, which is
unlike SOMs.
- Initialization
XCluster will ask you how many partitions (k) you want to make.
Similarly to SOMs, XCluster will set up a vector for each partition,
each of which is randomly initialized.
- The genes are iteratively partitioned
All the genes are then compared to all of the vectors that correspond
to each partition. Each gene is partitioned to 'belong' to the
partition which has the most similar associated vector. After
partioning of all genes, the vectors of each partition are calculated
as the mean of the genes that were partitioned to that vector. This
process is repeated iteratively, with repartioning of the genes, and
recalculation of the vectors until all genes map to the same
partitions twice in a row. I have noticed occassionally that XCluster
appears to loop forever when doing k-means. I haven't tracked this to
whether it flips between two or more states forever, or between
infinite states (unlikely) or whether it's simply a bug. In a future
version, I'll probably just limit the number of iterations, so that if
convergence isn't reached, it won't turn your computer into an
expensive paperweight until you kill it. After final assignment of
the genes to the individual partitions, the contents of each partition
is clustered, per the settings that you entered, using the algorithm
detailed below.
Hierarchical Clustering
XCluster currently supports only a single method for organizing the
data, Hierarchical Clustering. The algorithm employed by XCluster for
hierarchical clustering is Average Linkage, as described in Eisen
et al (which contains the formulae for both centered and
uncentered Pearson correlation), and is as follows:
- Calculate all pairwise distances between each genes' vectors.
Using the specified distance metric (usually Pearson correlation), the
program treats the expression data for each gene as a vector, and then
calculates the 'distance' between each gene's expression vector, and
every other gene's expression vector. When using Pearson correlation,
two expression vectors which point in exactly the same direction will
have a correlation of 1, whereas those which point in opposite
directions will have a correlation of -1 (ie are anti-correlated).
Those vectors at 90° to each other are uncorrelated, and will
have a correlation of zero. If using Euclidean distance as the
distance metric, those vectors that are identical (ie define the same
point in space) will have a Euclidean distance of zero between them.
Vectors pointing in different directions, or having different lengths
(even if they point in the same direction) will define points with a
distance larger than 0 between them.
- Join the most similar two expression patterns into a node.
The two expression patterns that have either the highest Pearson
correlation, or the smallest Euclidean distance between them
(depending on which metric you're using), will be 'joined' into a
node. A new vector, for which each coordinate is the average of the
corresponding coordinates of the contributing vectors will be made.
The 'distance' of this vector to all other unjoined vectors is then
calculated.
- Repeat step (ii) until all expression vectors are joined into a
single hierarchical tree.
As the algorithm proceeds, nodes are created that contain more and
more expression vectors, until only one node, containing all vectors,
remains. During node joining, the history of which genes/nodes are
joined, and with what correlation, is recored in the .gtr file. This
file can then be used to construct the tree that is often seen next to
clustered expression data.
What about the rotation of the nodes in the tree?
There are (somthing like) 2n-2 possible topologically
equivalent trees, when clustering n genes, and considering the two
alternative rotations of every single node. While there may be an
optimal (or several optimal) trees, depending on how you measure
whether the tree is optimal, XCluster does nothing to systematically
solve the problem of finding an optimal tree. It does however employ
a simple node-flipping algorithm, such that when two nodes are joined,
it rotates them such that the two most similar outermost members of
the nodes are placed adjacent to each other. While there is no
evidence that this produces a 'better' tree, anecdotal accounts
suggest that the tree produces expression data that looks smoother, or
less dicontinuous.