Simply double-click the application, and answer the questions as they come up. Your input file must conform to the expected file format, described below.
UNIX
You can either simply type cluster, and answer the questions as they come up, or use the following command line arguments:
* -f filename
* -g 0|1|2 ; 0 indicates no gene clustering, 1 indicates non-centered metric, 2 indicates centered metric when clustering genes. 1 is default.
* -e 0|1|2 ; 0 indicates no experiment clustering, see above for 1 and 2. 0 is the default.
* -p 0|1 ; whether to use pearson correlation (1), or Euclidean distance (0). 1 is the default.
* -s 0|1 ; whether to make a SOM (1). 0 is the default.
* -x specify x dimension of SOM
* -y specify y dimension of SOM
* -r 0|1 ; whether to seed the random number generator with the time when making a SOM. 1 is the default.
* -k num ; how many k-means clusters to make. num is an integer, greater than 1, and preferably less than a gazillion.
* -l 0|1 ; whether to log transform the data.
* -u string ; a unique identifier by which to name the output files, instead of basing their names on the input file. eg -u 888 will generate 888.cdt as an output file.
If there's any questions over:

History:
9-21-99
- Finished version 2.51. Minor changes to code, to streamline parsing of user input.
- Made it so that genes don't have to be clustered.

7-19-99
- Finished version 2.5. Has the ability to generate k-means clusters.
- Added in code to log transform the data (in base 2) in case it wasn't already log transformed.
- Optimized SOM code somewhat.
- Changed parsing code so that it will compile and run unmodified on all platforms.

5-13-99
- Finished Version 2. Has the ability to generate Self Organizing Maps.
- The input data is filtered when making a SOM, such that all expression profiles that do not vary at least 2 fold from their max to min are removed. These are printed out into a file with a .fil extension.
- The user has the ability to decide whether or not to seed the random number generator with the current time. If they want to compare SOMs with and without clustering by experiment, this is useful, because you can guarantee the same organization if you don't seed the random number generator.
- Fixed bug where using relative paths gave the wrong output filenames. Now can use:
  1. /home/users/sherlock/sample.txt
  OR,
  2. ../../sample.txt
  equally well.
- Did some optimisations to the CalculateCorrelation function to speed it up.

5-6-99
- Added in the ability to choose between using Euclidean distance, or Pearson correlation for use as a similarity metric. Can be set via the command line option -p, where 0 means use Euclidean, and 1 means use Pearson. 1 is the default.

5-1-99
- Found stupid and nasty bug such that even though the code was correctly switching the node orders to bring together similar genes/experiments, these switches were not reflected in the tree files. This bug was fixed, and the fix tested, and it's now better!

4-16-99
- Was storing each correlation in two places, which Paul pointed out was unnecessary, so now only store it in one place. Makes it slightly faster.

4-12-99
- Fixed a couple of small bugs eg.
forgetting to malloc the extra byte for the '\0' in ** all ** the strings!
- Made the code somewhat more readable, since *(darray+x*y) is equivalent to darray[x*y], and the second version is a lot more readable I think.
- Made linked list implementation more efficient, such that when a correlation is to be inserted into a full list, then the memory of the lowest correlationRec is reused, and the pointers manipulated accordingly. Saves a call to malloc and free.
- Added the ability to take command line arguments, so can make automatable:

  Usage: cluster
      -f filename   /* filename (duh!) */
      -g 1|2        /* genes: 2=use centered metric 1 is default */
      -e 0|1|2      /* experiments: 0=don't cluster 1=cluster 2=use centered metric 0 is default */

  eg.
      cluster -f sample.txt -g 2 -e 2
      cluster -f sample.txt
      cluster -e 2 -f sample.txt
  ...you get the idea

  If you supply any command line arguments, you must supply the filename.

4-12-99
The clustering program is implemented in C. To use resources more efficiently it only stores the 13 highest correlations for any particular gene (I played around with this number on a set of ~6200 genes by 80 experiments, and 13 was the fastest). Also, since sqrt(a) * sqrt(b)=sqrt(a*b), I was able to save a call to sqrt. The program was compiled using:

    gcc -Wall latest_cluster.c -o cluster -lm -O3 -ansi

As a couple of benchmarking exercises, cluster takes:

    11 seconds: 800 genes x 82 experiments (alberich)
    8 minutes: 6200 genes x 82 experiments (alberich)
    1 hour: 16384 genes x 250 experiments (daisy)

The program prompts the user for an input file, asks whether the data should be clustered by experiments, in addition to genes, and asks whether to cluster the data using a centered correlation metric. Progress is reported after every 100 correlations that have been calculated, and after every 100 nodes that have been clustered (for genes) or every node that has been clustered (for experiments).
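The notes above mention two details of the similarity calculation: the choice between a centered and a non-centered metric, and folding sqrt(a) * sqrt(b) into a single sqrt(a*b). The sketch below illustrates both. It is written in Python for brevity rather than in the program's C, and it is not taken from the program's source. It assumes that "centered" means the profile means are subtracted first (as in the usual Pearson correlation) and that "non-centered" means they are treated as zero. The function name correlation, the use of None for a missing value, and the choice to return 0.0 for a constant profile are all hypothetical, and GWEIGHT/EWEIGHT weighting is ignored.

```python
import math

def correlation(x, y, centered=True):
    """Similarity between two expression profiles (lists of floats or None).

    Hypothetical sketch: positions where either profile is None are skipped,
    and weights are ignored.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return 0.0
    n = len(pairs)
    # Centered metric: subtract each profile's mean. Non-centered: use 0.
    mean_x = sum(a for a, _ in pairs) / n if centered else 0.0
    mean_y = sum(b for _, b in pairs) / n if centered else 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    denom = sxx * syy
    if denom == 0.0:
        return 0.0  # a constant profile has no defined correlation
    # One sqrt call on the product instead of sqrt(sxx) * sqrt(syy).
    return sxy / math.sqrt(denom)
```

For example, correlation([1, 2, 3], [2, 4, 6]) gives 1.0 with either metric, while two profiles with the same shape but different offsets only reach 1.0 under the centered metric.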
The program then outputs 2 or 3 files, with the same name as the input file, but with the extensions .cdt, .gtr, and .atr (if experiments were clustered).

File formats:
The program expects a specific input file format. If this is violated the program will give unexpected results. If the input format specified here is unsatisfactory, or too limiting, we can discuss making changes.

Line 1: UID\tNAME\tGWEIGHT\tExp1\tExp2...Expn
Line 2: EWEIGHT\t\t\t1\t1\t...1
Line 3: ORF1\tNAME1\t1\t0.23\t0.45\t...0.43
Line 4: ORF2\tNAME2\t1\t0.56\t-0.10\t...0.34

ie: The first line contains three columns (they can be named anything), followed by the names of all the experiments (tab delimited). The second line contains three columns (they can be named anything), followed by the EWEIGHTS for each experiment (these usually have the value 1, but are treated internally as floats). The third and subsequent lines have the ORF name as the first column, the gene name as the second column, and the GWEIGHT as the third column (again, its value is usually 1, but it is treated internally as a float). The ORF name and GENE name fields can be up to 1024 characters in length, so can hold descriptive data instead. The remainder of the line is made up of tab delimited experimental data. Where there is a null data point, there will be two tabs next to each other.

An example file is:

UID      NAME     GWEIGHT  spo0   spo30  spo2   spo5   spo7   spo9   spo11
EWEIGHT                    1      1      1      1      1      1      1
YAL003W  EFB1     1        0.23   -1.79  -1.29  -1.56  -0.27
YAL004W  YAL004W  1        0.41   -0.38  -0.89  -1.06  -1.6   -1.84  -1.6
YAL005C  SSA1     1        0.61   -0.07  -1.29  -1.29  -2     -1.84  -2.25
YAL010C  MDM10    1        0.16   -0.15  -0.76  -1.25  -1.89  -1.74  -1.6
YAL012W  CYS3     1        0.03   1.39   -0.84  -1.64  -2.84  -2.47  -2.4
YAL015C  NTG1     1        -0.18  -0.18  -0.62  -1.32  -1.69  -1.43  -1.79
...etc

Output files:

The .cdt file looks like this:

GID      YORF     NAME     GWEIGHT  spo0    spo2    spo30   spo11   spo9    spo7    spo5
AID                                 ARRY0X  ARRY2X  ARRY1X  ARRY6X  ARRY5X  ARRY4X  ARRY3X
GENE37X  YBR025C  YBR025C  1        0.50    -1.94   -2.64   0.73    0.51    -0.18   -1.00
GENE26X  YBL042C  YBL042C  1        0.26    -1.06   -2.06   1.58    1.33    0.54    -0.43
...etc

where everything is tab delimited. The second line is only printed if experiments, as well as genes, were clustered.
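A .cdt file laid out as above can be read back with a few lines of code. The following is a minimal sketch in Python, not part of the cluster program: the function name parse_cdt and the returned tuple layout are hypothetical. It assumes the first four columns are GID, YORF, NAME and GWEIGHT as in the sample, that the optional second line starts with AID, and that an empty field between two tabs is a missing value. Note that it splits on an explicit "\t", because splitting on arbitrary whitespace would silently drop the empty fields.

```python
def parse_cdt(text):
    """Parse .cdt text into (experiments, array_ids, genes).

    genes is a list of (gid, orf, name, gweight, values) tuples, where
    values holds floats, or None for an empty (missing) field.
    array_ids is None when the file has no AID line.
    """
    # Keep empty fields: split on an explicit tab, never on generic whitespace.
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    experiments = header[4:]

    array_ids = None
    if body and body[0][0] == "AID":
        # Assumption: any padding fields after AID are empty, so keep
        # only the non-empty identifiers.
        array_ids = [field for field in body[0][1:] if field]
        body = body[1:]

    genes = []
    for fields in body:
        gid, orf, name = fields[0], fields[1], fields[2]
        gweight = float(fields[3])
        values = [float(f) if f.strip() else None for f in fields[4:]]
        genes.append((gid, orf, name, gweight, values))
    return experiments, array_ids, genes
```

A caller would typically read the whole file into a string and pass it in, for example parse_cdt(open("sample.cdt").read()), where sample.cdt is a placeholder name.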
The ARRY0X and GENE37X identifiers are used by the TreeViewer program to determine how to draw the binary trees, in conjunction with the .gtr and .atr files (below):

.gtr:

NODE1X  GENE51X  GENE24X  0.997965
NODE2X  GENE41X  GENE40X  0.997813
NODE3X  GENE90X  GENE89X  0.997782
NODE4X  GENE84X  GENE32X  0.996293
NODE5X  GENE8X   GENE3X   0.995884
NODE6X  GENE91X  GENE54X  0.995747
NODE7X  GENE80X  GENE42X  0.995286
NODE8X  GENE70X  GENE31X  0.994681
NODE9X  GENE63X  GENE36X  0.994233
...etc

.atr:

NODE1X  ARRY4X  ARRY3X  0.948027
NODE2X  ARRY6X  ARRY5X  0.940996
NODE3X  NODE2X  NODE1X  0.902631
NODE4X  ARRY2X  ARRY1X  0.785001
NODE5X  NODE4X  NODE3X  0.662225
NODE6X  NODE5X  ARRY0X  -0.082621
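The tree files above can be turned back into a tree structure quite directly. The sketch below is in Python and is only an illustration, not code from cluster or TreeViewer. It assumes each row holds a node identifier, the two items joined at that node, and the similarity at which they were joined, separated by whitespace. The names parse_tree and leaves are hypothetical.

```python
def parse_tree(text):
    """Map each node id to (first child, second child, similarity)."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank lines and anything that is not a node row
        node, first, second, score = fields
        nodes[node] = (first, second, float(score))
    return nodes

def leaves(nodes, node_id):
    """List the GENE..X or ARRY..X ids below node_id.

    The order follows the columns of the tree file. It is not necessarily
    the display order, which is given by the .cdt file.
    """
    found, stack = [], [node_id]
    while stack:  # iterative walk, so deep gene trees cannot hit a recursion limit
        current = stack.pop()
        if current in nodes:
            first, second, _ = nodes[current]
            stack.append(second)
            stack.append(first)
        else:
            found.append(current)  # no row of its own, so it is a leaf
    return found
```

Applied to the .atr rows shown above, leaves(nodes, "NODE6X") returns all seven ARRY identifiers, and nodes["NODE6X"][2] is the similarity of the final join.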