A Tutorial for Clustering with XCluster

Many people have requested additional documentation for using XCluster (not surprising since there wasn't any). This is a first attempt at a tutorial, and is based around using the Mac version.

Create An Input File
The input file must be tab-delimited. You should check that this is in fact the case for your input file. If it is space delimited, it simply will not work. The easiest way (unless you can program in Perl) to generate your input file is by cutting and pasting your columns of data in Excel. Your file needs to have the first three columns as follows:
- Name
  This column will contain a name you want to ascribe an entity, such as ORF name, or CLONEID. The column itself can be named anything, but MUST contain some text on every row.
- Description
  This column can contain descriptive information about the entity, eg process or function, or gene symbol. It too can be named anything. It can optionally be left blank, but the column itself must be present.
- GWEIGHT
  This column allows you to weight genes with different weights, for instance if a gene appears on an array twice, you may want to give them a weight of 0.5 each. For the most part people leave this column with a value of 1 for every gene. This column must be present, and each row must have an entry.
In addition the file must begin with the following two rows:
- Row 1
  This contains the column headers as described above for columns 1, 2 and 3, then contains the experiment names for all the data columns that exist in the file. Each data column must have a text entry as a name for that column.
- Row 2
  This is the EWEIGHT row. The first column should say EWEIGHT, then for each experiment, there should be an EWEIGHT value. This will usually be 1, but if the same experiment is duplicated twice, you may want to give these repeats an EWEIGHT of 0.5.
The remaining cells in the file contain the actual data, such that the row and column specifies to which gene and which experiment a particular piece of data corresponds. If you had created your file in Excel, it would look something like this:

You should then choose Save As... from the File menu, and elect to the the file as type Text (Tab delimited), as indicated below:
The Data
In general, if you are using ratio data, you will want to cluster the data after it has been log transformed. XCluster can log transform your data for you (see below). Hence there must be no negative or zero values in you dataset prior to log transformation. You should remove them prior to log transforming. If you are using Affymetrix style data, which are absolute values, one way of converting such measurements to a ratio style measurement is to calculate row averages for each gene (this is easy in Excel), then divide the measurement for each gene by it's row average. These values can then be log transformed.
Clustering
Place your input data file into the same folder as the clustering application :

then double click on the clustering application.
- Enter the filename
- Specify whether the data need log transformation
- Specify whether you want to partition the data
- Specify a distance metric
- Specify whether you want to cluster genes
  - If so, and you are using the Pearson Correlation, specify whether you want to use a centered metric.
- Specify whether you want to cluster experiments
  - If so, and you are using the Pearson Correlation, specify whether you want to use a centered metric.
  Clustering will then proceed. An example of how your window may look is shown below, where no log transformation was needed, no partitioning was used, pearson correlation was used as a similarity metric, and only genes were clustered, using a non-centered version of the pearson correlation:
The output
For each cluster you make, you will generate a .cdt file, which will contain the original data, but reordered, to reflect the clustering. In addition, if you clustered by genes, you will get a .gtr file (gene tree), and if you clustered by experiments you will get a .atr file (array tree). These tree files reflect the history of how the cluster was built, and can be used to contruct how the tree(s) should look. An example of these three files is shown below, as they would look if you opened them in Excel:
1. The .cdt file
2. The .gtr file
3. The .atr file
So how do I view the data?
The best software to do this is Mike Eisen's TreeView, or Java TreeView.

Partitioning Methods

XCluster currently supports two different partitioning methods, Self Organizing Maps, and k-means clustering. The first question is why would you want to partition the data anyway?
When you simply hierarchically cluster your entire dataset, you are asking that all things be joined at some level, regardless of whether they have any similarity to each other. Consequently the closer you get to the root of the tree, the less it means. In addition hierachical clustering, when joining two nodes, takes the node as being equal to the average of the genes it contains. As nodes start to contain unrelated things, their being joined to something else becomes meaningless. For these reasons, it may seem a reasonable idea to split the dataset up into several smaller datasets, where the members of each dataset show at least some similarity to each other. This idea is akin to if you were going to build trees of protein similarity. You would likely first split up your data into protein families, based on BLAST scores, then you would cluster the contents of the families separately. XCluster does the equivalent of this for gene expression data. Using either SOMs or k-means it splits the data up into smaller subsets, and then applies hierarchical clustering to each of the subsets.

Self Organizing Maps
SOMs were devised by Tuevo Kohonen, and first used by Tamayo et al to analyze gene expression data. The description here pertains to my SOM implementation, which may or may not be identical to Tamayo's one.
1. The SOM is initialized
  XCluster will ask you what you want to use as the x and y dimensions of your self organizing map. The x and y dimensions will determine the number of partitions that you segregate your data into, eg. if you choose 3 x 3, you will get 9 partitions. Internally XCluster will set up 9 vectors, one for each partition. The dimensionality of the vectors is determined by the number of experiments that there are in your datafile. For instance is you have 10 experiments, then the vectors will be 10 dimensional, ie have 10 coordinates. Each of these coordinates is randomly initialized. In addition these partitions have a relationship to each other, in that they sit in a 2-dimensional grid, as defined by the user (eg 3 x 3). In this grid, it could be said that one partition is physically closer to another partition than it is to a 3rd partition. eg the top left partition in a 3 x 3 grid is closer to the middle left partition, than it is to the bottom right partition. This concept is important when we consider how we refine the SOM. At the beginning the relationship of the partitions in physical space has no bearing on the relationships between the vectors that associate with them. Thus the map is unorganized.
2. The SOM is refined
  A gene from your list of genes in your data file is then picked at random, and its expression pattern is compared to each of the vectors that were initialized randomly in the first step, one vector for each partition. The vector to which the expression of the picked gene is most similar is then modified, so that it more closely resembles the expression pattern of that gene. In addition the vectors that belong to the partitions that are physically closest (in the 2 dimensional grid) to the partition whose vector was just modified are also modified, so that they too resemble the gene's expression pattern a little more closely. This process (picking a gene at random; seeing to which vector it is most similar; modifying that vector; modifying the vectors of the partitions that are close to the partition which had the most similar vector) is repeated 100,000 times. As the iteration number increases, the amount by which the vectors are modified decreases. Also the definition of which partitions are close to each other changes. eg in the first iteration all partitions may be considered close to each other, but after 50,000 iterations a partition may be considered close to another, only if it is less than half of the width of the entire grid away. At the end no partitions may be considered close to each other. Essentially a circle is drawn on the 2 dimensional grid, with it's center being in the center of the partition whose vector was most similar to the picked gene's expression pattern. Any partitions that fall within that circle are considered 'close', and so their vectors are modified. The radius of this circle decreases with each iteration. Hence with each iteration fewer vectors are modified by smaller amounts, so that the map eventually settles down. It is important to note that algorithm leads to the vectors of neighbouring partitions being somewhat similar to each other, and vectors of partitions that are physically distant are dissimilar to each other. Thus the map has become organized.
3. The genes are partitioned
  After the 100,000 iterations have occured, the genes are then partitioned. This is simply done by assigning a gene to the partition whose associated vector is most similar to its expression pattern. A gene may currently only be put in one partition (though this could be changed), and a partition may theoretically contain all or none of the genes. In practice though this rarely occurs, unless there are few genes, or few partitions.
4. The partitions are clustered
  The contents of each partition is then clustered by the hierarchical clustering algorithm which will be detailed below.
K-means Clustering
K-means clustering is a simple partitioning method that has been used for decades, and is similar in concept to SOMs, though it is mechanistically different. In addition the different partitions that are created by k-means are not connected to each other, which is unlike SOMs.
1. Initialization
  XCluster will ask you how many partitions (k) you want to make. Similarly to SOMs, XCluster will set up a vector for each partition, each of which is randomly initialized.
2. The genes are iteratively partitioned
  All the genes are then compared to all of the vectors that correspond to each partition. Each gene is partitioned to 'belong' to the partition which has the most similar associated vector. After partioning of all genes, the vectors of each partition are calculated as the mean of the genes that were partitioned to that vector. This process is repeated iteratively, with repartioning of the genes, and recalculation of the vectors until all genes map to the same partitions twice in a row. I have noticed occassionally that XCluster appears to loop forever when doing k-means. I haven't tracked this to whether it flips between two or more states forever, or between infinite states (unlikely) or whether it's simply a bug. In a future version, I'll probably just limit the number of iterations, so that if convergence isn't reached, it won't turn your computer into an expensive paperweight until you kill it. After final assignment of the genes to the individual partitions, the contents of each partition is clustered, per the settings that you entered, using the algorithm detailed below.

Hierarchical Clustering

XCluster currently supports only a single method for organizing the data, Hierarchical Clustering. The algorithm employed by XCluster for hierarchical clustering is Average Linkage, as described in Eisen et al (which contains the formulae for both centered and uncentered Pearson correlation), and is as follows:

Calculate all pairwise distances between each genes' vectors.
Using the specified distance metric (usually Pearson correlation), the program treats the expression data for each gene as a vector, and then calculates the 'distance' between each gene's expression vector, and every other gene's expression vector. When using Pearson correlation, two expression vectors which point in exactly the same direction will have a correlation of 1, whereas those which point in opposite directions will have a correlation of -1 (ie are anti-correlated). Those vectors at 90° to each other are uncorrelated, and will have a correlation of zero. If using Euclidean distance as the distance metric, those vectors that are identical (ie define the same point in space) will have a Euclidean distance of zero between them. Vectors pointing in different directions, or having different lengths (even if they point in the same direction) will define points with a distance larger than 0 between them.
Join the most similar two expression patterns into a node.
The two expression patterns that have either the highest Pearson correlation, or the smallest Euclidean distance between them (depending on which metric you're using), will be 'joined' into a node. A new vector, for which each coordinate is the average of the corresponding coordinates of the contributing vectors will be made. The 'distance' of this vector to all other unjoined vectors is then calculated.
Repeat step (ii) until all expression vectors are joined into a single hierarchical tree.
As the algorithm proceeds, nodes are created that contain more and more expression vectors, until only one node, containing all vectors, remains. During node joining, the history of which genes/nodes are joined, and with what correlation, is recored in the .gtr file. This file can then be used to construct the tree that is often seen next to clustered expression data.

What about the rotation of the nodes in the tree?

There are (somthing like) 2^n-2 possible topologically equivalent trees, when clustering n genes, and considering the two alternative rotations of every single node. While there may be an optimal (or several optimal) trees, depending on how you measure whether the tree is optimal, XCluster does nothing to systematically solve the problem of finding an optimal tree. It does however employ a simple node-flipping algorithm, such that when two nodes are joined, it rotates them such that the two most similar outermost members of the nodes are placed adjacent to each other. While there is no evidence that this produces a 'better' tree, anecdotal accounts suggest that the tree produces expression data that looks smoother, or less dicontinuous.