Output gene expression indexes


Overview

There are two ways of generating gene-level expression indexes.
GeneBASE
In the first method, the background-corrected probe intensities are written to files, which are then processed by Python and R scripts to produce GeneBASE estimates. How to output background-corrected probe intensities is described on the instructions to output background-corrected probe intensities page. Instructions for processing the files with Python and R are given in the file ReadMe_ProbeSelection.
GeneBASE and GeneBASE-xhyb
The second method outputs gene expression estimates directly. The available options are GeneBASE estimates and GeneBASE-xhyb estimates. GeneBASE-xhyb summarizes expression in the same manner as GeneBASE but excludes probes believed to be affected by cross-hybridization. Details of these two options are discussed below.

Step 1: Download the program

Instructions are available on the download page.

Step 2: Create a parameter file

A parameter file specifies the following types of information:
A log file to store the progress of the computation and to output any error messages.
Exon array annotation, including pgf, clf and probeset annotation files. These can be downloaded from the annotation page.
The exon array data. We recommend a diverse set of samples for probe selection. If your exon array data set contains only a few samples, it may be combined with the Affymetrix tissue panel data.
An output file stores the resulting gene-level expression indexes.
The model parameters specify a choice of background correction, normalization and gene summarization algorithm.

Examples

We provide several sample parameter files which can be modified. Detailed descriptions of the parameters are given below.

***Note that flags and parameter values are separated by tabs***

GeneBASE parameter file (MAT background correction, scalar normalization).
Expression is computed using the MAT background-corrected, scalar-normalized probe intensities. A subset of probes with highly correlated intensities across the different samples is used for summarization of gene-level estimates.
GeneBASE parameter file (MAT background correction, no normalization).
Expression is computed using the MAT background-corrected probe intensities. A subset of probes with highly correlated intensities across the different samples is used for summarization of gene-level estimates.
GeneBASE-xhyb parameter file (MAT background correction, scalar normalization).
Expression is computed using the MAT background-corrected, scalar-normalized probe intensities. Probes whose intensities are highly correlated with the expression patterns of matching off-target transcripts are excluded. From the remaining probes, a subset with highly correlated intensities across the different samples is selected for summarization of gene-level estimates.
GeneBASE-xhyb parameter file (MAT background correction, no normalization).
Expression is computed using the MAT background-corrected probe intensities. Probes whose intensities are highly correlated with the expression patterns of matching off-target transcripts are excluded. From the remaining probes, a subset with highly correlated intensities across the different samples is selected for summarization of gene-level estimates.
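
As a rough illustration of how such a parameter file could be assembled, the Python sketch below renders section headers and tab-separated flag/value pairs. It is not taken from the distributed sample files: the exact layout (a bracketed section name on its own line, followed by one flag per line) is an assumption based on the section and flag names listed under Descriptions, and every file and folder name is a hypothetical placeholder. The model values are simply picked from the permitted lists; which summary_method corresponds to GeneBASE or GeneBASE-xhyb is not stated here, so check the provided sample files before relying on them.

```python
from pathlib import Path


def join_cel_files(names):
    """Join cel file names with a single ',' and no spaces."""
    return ",".join(name.strip() for name in names)


def render_parameter_file(sections):
    """Render {section: {flag: value}} as text.

    Assumed layout: '[section]' on its own line, then 'flag<TAB>value' lines.
    """
    lines = []
    for section, flags in sections.items():
        lines.append(f"[{section}]")
        for flag, value in flags.items():
            lines.append(f"{flag}\t{value}")  # flag and value separated by a tab
    return "\n".join(lines) + "\n"


def write_parameter_file(path, sections):
    """Write the rendered parameter file to disk."""
    Path(path).write_text(render_parameter_file(sections))


# All file and folder names below are hypothetical placeholders.
params = {
    "log": {"logfile": "run.log"},
    "exon_annotation": {
        "probeset_annotation": "probeset_annotation.csv",
        "pgf_file": "array.pgf",
        "clf_file": "array.clf",
    },
    "exon_data": {
        "folder": "cel_files",
        "exon_cel_files": join_cel_files(["sample1.CEL", "sample2.CEL"]),
    },
    "output": {"output_model_fit": "expression_indexes.txt"},
    "model": {
        "array_type": "exon",
        "method": "mat",
        "summarize_expression": "true",
        "summary_method": "selection",  # illustrative choice from the permitted values
        "normalization_method": "core_probe_scaling",
    },
}

text = render_parameter_file(params)
```

Calling write_parameter_file("parameterFile.txt", params) would save the result under the file name used in Step 3.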

Descriptions
[log]
logfile
The name of the file to log progress, errors, etc.
[exon_annotation]
probeset_annotation
The probeset annotation file specifies the grouping of probesets into transcript clusters and the level of annotation supporting each probeset. See the annotation page to download.
pgf_file
The pgf file specifies the grouping of probes into probesets. See the annotation page to download.
clf_file
The clf file describes the position of each probe on the chip. Clf files with "crosshyb_x" in their description also include the mapping of probes to off-targets, allowing an edit distance of "x" base pairs. To generate GeneBASE-xhyb estimates, a clf file with "crosshyb_x" must be specified. See the annotation page for details.
[exon_data]
folder
The folder storing the array cel files.
exon_cel_files
A list of cel files, one per array, separated by a single "," with no spaces.
[output]
output_model_fit
The output file.
[model]
array_type
The type of array analyzed. Here a value of "exon" should be specified.
method
Background-correction method. One of (mat, median_gc, none).
summarize_expression
This value should be set to "true" to output gene expression indexes.
summary_method
Method for gene-level summarization of expression. One of ("selection", "correlation_filter", "liwong").
mat_training_probe_type
The probes used for training the background model. One of (background, full). Defaults to background.
normalization_method
Normalization method. One of (core_probe_scaling, none, quantile). The core_probe_scaling method applies a scalar to each array so that the median of background-corrected core probe intensities is equal to 100. The none method applies no normalization in addition to the background correction. The quantile method applies a quantile normalization (followed by background correction).
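
The core_probe_scaling description can be made concrete with a small Python sketch. This is not the program's own implementation, only an illustration of the stated rule: one array is multiplied by a single scalar so that the median of its background-corrected core probe intensities becomes 100. The function name, the is_core flags, and the toy intensities are assumptions introduced for the example.

```python
from statistics import median


def core_probe_scaling(intensities, is_core, target=100.0):
    """Scale one array so the median of its core probe intensities equals target.

    intensities: background-corrected probe intensities for a single array.
    is_core: booleans marking which probes are core probes (hypothetical input).
    """
    core = [x for x, flag in zip(intensities, is_core) if flag]
    if not core:
        raise ValueError("no core probes supplied")
    core_median = median(core)
    if core_median <= 0:
        raise ValueError("core probe median must be positive")
    factor = target / core_median  # one scalar per array
    return [x * factor for x in intensities]


# Toy values, not real data: the core probes are the first three entries.
scaled = core_probe_scaling([50.0, 200.0, 400.0, 80.0], [True, True, True, False])
```

In this toy case the core median is 200, so every intensity on the array, core or not, is multiplied by 0.5.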

Step 3: Run the program

With a parameter file named "parameterFile.txt", run the program from the command line as follows:

./ProbeEffects2.0 -par parameterFile.txt
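
When several parameter files need to be processed, the same command can be launched from a script. The Python sketch below is one possible wrapper, not part of the distribution. It uses only the executable name and the -par flag shown above; the function names are hypothetical. Whether the program reports failure through its exit status is not documented here, so the log file should still be checked after each run.

```python
import subprocess
from pathlib import Path


def build_command(parameter_file, executable="./ProbeEffects2.0"):
    """Return the argument list for one run with the given parameter file."""
    return [executable, "-par", str(parameter_file)]


def run_program(parameter_file, executable="./ProbeEffects2.0"):
    """Run the program once; raises if the parameter file is missing."""
    if not Path(parameter_file).is_file():
        raise FileNotFoundError(f"parameter file not found: {parameter_file}")
    # check=True raises CalledProcessError on a non-zero exit status
    # (assumes the program uses its exit status to signal failure).
    return subprocess.run(build_command(parameter_file, executable), check=True)


command = build_command("parameterFile.txt")
```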

Step 4: Examine results

The program outputs a log file which should be checked for errors.

The gene expression estimates will be output in the file specified by the "output_model_fit" parameter in the parameter file.