There are two ways of generating gene-level expression indexes.
GeneBASE | In the first method, the background-corrected probe intensities are written to files, which are then processed by Python and R scripts to produce GeneBASE estimates. For how to write out the background-corrected probe intensities, see the instructions to output background-corrected probe intensities page. Instructions for processing the files with Python and R are given in the file ReadMe_ProbeSelection. |
GeneBASE and GeneBASE-xhyb | The second method outputs gene expression estimates directly. The options are GeneBASE estimates or GeneBASE-xhyb estimates. GeneBASE-xhyb summarizes expression in the same manner as GeneBASE but excludes probes believed to be affected by cross-hybridization. Details of these two options are discussed below. |
Instructions are available on the download page.
A log file to store the progress of the computation and to output any error messages. |
Exon array annotation, including pgf, clf and probeset annotation files. These can be downloaded from the annotation page. |
The exon array data. We recommend a diverse set of samples for probe selection. When only a small number of exon arrays is available, they may be combined with the Affymetrix tissue panel data. |
An output file to store the resulting gene-level expression indexes. |
The model parameters specify a choice of background correction, normalization and gene summarization algorithm. |
We provide several sample parameter files which can be modified. Detailed descriptions of the parameters are given below.
***Note that flags and parameter values are separated by tabs***
GeneBASE parameter file (MAT background correction, scalar normalization). Expression is computed using the MAT background-corrected, scalar-normalized probe intensities. A subset of probes with highly correlated intensities across the different samples is used for summarization of gene-level estimates. |
GeneBASE parameter file (MAT background correction, no normalization). Expression is computed using the MAT background-corrected probe intensities. A subset of probes with highly correlated intensities across the different samples is used for summarization of gene-level estimates. |
GeneBASE-xhyb parameter file (MAT background correction, scalar normalization). Expression is computed using the MAT background-corrected, scalar-normalized probe intensities. Probes whose intensities correlate highly with the expression patterns of matching off-target transcripts are excluded. From the remaining probes, a subset with highly correlated intensities across the different samples is selected for summarization of gene-level estimates. |
GeneBASE-xhyb parameter file (MAT background correction, no normalization). Expression is computed using the MAT background-corrected probe intensities. Probes whose intensities correlate highly with the expression patterns of matching off-target transcripts are excluded. From the remaining probes, a subset with highly correlated intensities across the different samples is selected for summarization of gene-level estimates. |
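Because flags and values must be separated by tabs, a modified parameter file can be checked for lines where a space was typed instead. The following is a minimal Python sketch, not part of the distributed scripts. It assumes that section names such as "[log]" sit on their own lines and that every other non-blank line is a flag followed by a tab and a value; the function name `lines_missing_tab` and the sample text are hypothetical.

```python
def lines_missing_tab(text):
    """Return (line_number, line) pairs for flag lines that contain no tab."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: blank lines and bracketed section headers carry no value.
        if not stripped or (stripped.startswith("[") and stripped.endswith("]")):
            continue
        # A flag line needs a tab between the flag and its value.
        if "\t" not in stripped:
            problems.append((number, line))
    return problems


# Hypothetical parameter text: line 4 uses a space instead of a tab.
sample = "[log]\nlogfile\trun.log\n[model]\narray_type exon\n"
for number, line in lines_missing_tab(sample):
    print(f"line {number}: no tab found in {line!r}")
```

To check a file on disk, read it into a string first and pass that string to the function.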
[log] | logfile | The name of the file to log progress, errors, etc. |
[exon_annotation] | probeset_annotation | The probeset annotation file specifies the grouping of probesets into transcript clusters and the level of annotation supporting each probeset. See the annotation page to download. |
[exon_annotation] | pgf_file | The pgf file specifies the grouping of probes into probesets. See the annotation page to download. |
[exon_annotation] | clf_file | The clf file describes the position of each probe on the chip. Clf files whose description contains "crosshyb_x" also include the mapping of probes to off-targets, allowing an edit distance of "x" base pairs. To generate GeneBASE-xhyb estimates, a clf file with "crosshyb_x" must be specified. See the annotation page for details. |
[exon_data] | folder | The folder storing the array cel files. |
[exon_data] | exon_cel_files | A list of cel files, separated by a single "," with no spaces. |
[output] | output_model_fit | The output file. |
[model] | array_type | The type of array analyzed. Here a value of "exon" should be specified. |
[model] | method | Background-correction method. One of (mat, median_gc, none). |
[model] | summarize_expression | This value should be set to "true" to output gene expression indexes. |
[model] | summary_method | Method for summarizing gene expression. One of ("selection", "correlation_filter", "liwong"). |
[model] | mat_training_probe_type | The probes used for training the background model. One of (background, full). Defaults to background. |
[model] | normalization_method | Normalization method. One of (core_probe_scaling, none, quantile). The core_probe_scaling method applies a scalar to each array so that the median of background-corrected core probe intensities is equal to 100. The none method applies no normalization in addition to the background correction. The quantile method applies a quantile normalization (followed by background correction). |
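The arithmetic behind core_probe_scaling can be illustrated with a short Python sketch. This is not the program's implementation. It assumes that the scaling factor is computed from the core probes only and then applied to every probe on the array; the function name, the `core_indices` argument and the example intensities are hypothetical.

```python
from statistics import median


def core_probe_scaling(intensities, core_indices, target=100.0):
    """Scale one array so that the median core probe intensity equals target."""
    # Assumption: the factor comes from the core probes only.
    core = [intensities[i] for i in core_indices]
    factor = target / median(core)
    # Assumption: the same scalar is applied to every probe on the array.
    return [value * factor for value in intensities]


# Hypothetical background-corrected intensities; probes 0, 1 and 2 are "core".
array = [50.0, 200.0, 400.0, 80.0]
scaled = core_probe_scaling(array, core_indices=[0, 1, 2])
print(scaled)  # the core median was 200, so every value is halved
```

Each array gets its own scalar, so after scaling every array has the same median core probe intensity while the relative intensities within an array are unchanged.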
The program is run using the parameter file "parameterFile.txt" on the command line with the following command:
./ProbeEffects2.0 -par parameterFile.txt
The program writes a log file, which should be checked for errors.
The gene expression estimates are written to the file specified by the "output_model_fit" parameter in the parameter file.
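Running the program and checking the log can be scripted. The Python sketch below wraps the command shown above and then scans the log file. The format of the log is not documented here, so matching lines that contain the word "error" is an assumption, and the helper names are hypothetical.

```python
import subprocess


def build_command(parameter_file, executable="./ProbeEffects2.0"):
    """Return the command line shown above as an argument list."""
    return [executable, "-par", parameter_file]


def find_error_lines(log_text):
    """Return the log lines that mention an error."""
    # Assumption: error messages contain the word "error" in some letter case.
    return [line for line in log_text.splitlines() if "error" in line.lower()]


def run_and_check(parameter_file, log_file):
    """Run the program, then return any error lines found in its log file."""
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_command(parameter_file), check=True)
    with open(log_file) as handle:
        return find_error_lines(handle.read())
```

A call such as `run_and_check("parameterFile.txt", "run.log")` would return an empty list when no matching line is found, where "run.log" stands for whatever name was given to the logfile flag in the parameter file.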