|
ErmineJ is a software tool for generating "functional class scores" based on scores given to individual genes
or based on the similarity of data vectors (e.g., expression profiles) for each gene.
There is also a command line tool, which is very primitive and largely undocumented. This page describes how to use the graphical user interface (GUI), but
it also contains general information relevant to the command line tool.
Installing the software
You must have a Java runtime installed on your machine. You can get one from Sun, but
there is a good chance it is already installed on your machine.
Once you have a Java runtime, just copy the software files to your hard drive in some convenient location.
Running the software
Just double-click on 'ermineJ.bat'. A plain text 'console' window will appear first, and shortly afterwards a graphical
window should appear containing several fields for input. To perform a run, set the fields appropriately and then press "OK". You
can perform multiple runs without closing and restarting the software.
Inputs
The following fields must be set by the user on startup.
- Gene Score File (Gene score based) or Data File (Correlation-based):
A user-supplied file containing the data. This is either a set of gene scores (currently these should
be p-values), or the raw data for the genes. See below.
- GO Biological Names File: A file containing the names of the gene classes. We supply this.
- Probe to Unigene Mapping File: A file containing a list of the 'duplicates' for the array. This is needed if you use 'weights'. We supply this.
- Probe to GO mapping file: A file containing a list of the class membership of probes on the array. We supply this.
- Output File: The name of the output file. Your results will be saved to this file.
Options
- Mean, Quantile, or Mean above Quantile: Choose one method to be used for class scoring. See descriptions
below for details. Note that ROC and hypergeometric p-values are always generated as well. Default=mean
- Use weights: Usually you will want to use this. Default=yes (checked)
- Iterations: This sets the number of random trials which will be used to generate background distributions.
The higher you set this, the longer it will take to run, but your p-values may be more precise. We don't suggest setting
this above 100,000 unless you find you have many classes which have 'maxed out' p values. Default=10000
- Quantile: This option only applies if you use the "quantile" or "mean above quantile" method. Default=50 (median).
- Min class size: Classes with fewer genes than this will not be evaluated. Default=4
- Max class size: Classes with more genes than this will not be evaluated. Default=100
- P-value cutoff: This option sets the gene p-value threshold which will
be applied for the hypergeometric distribution p values. Default=0.00001
Outputs
- A text file containing 8 columns: class, size, raw score, pval, virtual_size, hyper pval, aroc rate, rocpval. The meanings:
- class : Name of the class, with the GO id number.
- size : How many probes on the array are in the class.
- raw score : The raw class statistic based on the selected method (mean, quantile or mean above quantile). High values are better.
- pval : The p value based on the raw score. Low values are better.
- virtual_size : How many independent genes are in the class. This is always less than or equal to the size since some genes are represented
multiple times on an array. This is the value used in the calculations.
- hyper pval : The p-value for the class based on the hypergeometric distribution (see below for details).
- aroc rate : The area under the receiver operating characteristic (AROC).
- rocpval : The p-value for the AROC.
The output file can be opened in Excel and sorted easily.
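The same sorting can be done in a script. The following is a minimal sketch, assuming the output file is tab-delimited and starts with a header row that uses the column names listed above; neither assumption is confirmed on this page, so adjust the delimiter and names to match your file. The function name `sort_by_pval` and the example rows are invented for illustration.

```python
import csv
import io

def sort_by_pval(text, delimiter="\t"):
    """Parse delimited output text and return the rows sorted by ascending pval.

    Assumes a header row containing a column named "pval".
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return sorted(reader, key=lambda row: float(row["pval"]))

# Invented two-row example showing only four of the eight columns.
example = (
    "class\tsize\traw score\tpval\n"
    "class_A\t10\t1.2\t0.04\n"
    "class_B\t12\t2.5\t0.001\n"
)
for row in sort_by_pval(example):
    print(row["class"], row["pval"])
```

To use it on a real file, read the file contents into a string first and pass that string to the function.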
More information
ErmineJ
computes a raw score, class p-value, hypergeometric p-value, and ROC rate for
every class, using the unigenes (or single genes) in the gene score file. The class p-value is
a way to estimate how interesting a class is, but it has a drawback: it relies on
randomization trials, which makes the output unstable, especially when
the number of iterations is limited. The results can be stabilized by using a large number of
iterations (100k+), but this greatly increases the running time. Therefore we
also provide measures that are stable and still give a good estimate of how
interesting the classes are. The hypergeometric p-value and the ROC rate are two
such measures. Neither computation involves
randomization, so the results are repeatable under fixed settings. More
details about the methods are given in the Methods section.
Methods
I. Class p-values based on the sampling distribution of gene scores.
The Gene Ontology (GO) class p-value is
computed under the assumption that class members are drawn at random. The steps of
the calculation are:
- Generate the background distribution -- given the size of a GO class (S) and
the number of iterations (I), in each trial the program randomly picks S
unigenes/genes from the gene score file, then computes the chosen statistic (mean or median) of their scores
as the result of this trial. The trial is repeated I times to
generate the background distribution. This distribution describes how the score
is distributed when that many elements are chosen independently.
- Transform the background distribution into p-values -- the score
distribution is split into bins (columns) according to the score, and each
bin gets its own p-value, which is the fraction of trials that have
better scores than this bin. The program then generates a look-up
table with one row per class size, each row holding the p-value for every
column.
- Determine the class p-value -- for each GO class in the gene score file,
calculate its raw class score according to the method (mean or median), then
look it up in the background distribution table to find the corresponding
p-value. With enough iterations (100k+), this value can be a good estimate of how
special the class is. The precision of this method is 0.5/(number of
iterations), so there is a trade-off between
precision and efficiency.
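The resampling steps above can be sketched in a few lines. This is a minimal illustration, not ErmineJ's implementation: it assumes higher scores are better (for example, gene p-values already converted with -log10), it compares the class score directly against the random trials instead of going through a binned look-up table, and the function names are hypothetical.

```python
import random
from statistics import mean, median

def background_scores(gene_scores, class_size, iterations, method=mean, seed=0):
    """Score `iterations` random classes of `class_size` genes each."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    return [method(rng.sample(gene_scores, class_size)) for _ in range(iterations)]

def class_pvalue(raw_score, background):
    """Fraction of random trials that score at least as well as the class."""
    as_good = sum(1 for s in background if s >= raw_score)
    # Floor at 0.5/iterations, the precision limit mentioned in the text.
    return max(as_good, 0.5) / len(background)

# Toy data: 100 genes with scores 0.0 .. 99.0 (assumed higher = better).
scores = [float(i) for i in range(100)]
bg = background_scores(scores, class_size=5, iterations=1000)
print(class_pvalue(95.0, bg))
print(class_pvalue(50.0, bg, ))
```

Pass `method=median` to use the median instead of the mean. With 1000 iterations the smallest p-value this sketch can report is 0.5/1000, which is why classes with very strong scores "max out" until the iteration count is raised.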
II. Hypergeometric p-value
This p-value is computed from the hypergeometric distribution. The steps of
the calculation are:
- 1. The user sets a threshold (T) for this method. In the whole gene score
dataset, N1 unigenes/genes have a p-value greater than T and N2
unigenes/genes have a p-value smaller than T. Within each GO class, n1
unigenes/genes have a p-value greater than T and n2 unigenes/genes have a
p-value smaller than T.
- 2. The hypergeometric p-value is:
choose(n1, N1)*choose(n2, N2)/choose(n1+n2, N1+N2), where choose(x, y) =
y!/(x!*(y-x)!) is the number of ways to choose x items from y.
In the implementation, it is not practical to compute the factorials directly,
since the numbers can be very large, which may cause
overflow and a high time cost. However, the reciprocal of choose(x, y) can
be written as a product of ratios:
1/choose(x, y) = (x!*(y-x)!)/y! = x!/(y*(y-1)*(y-2)*...*(y-x+1)) =
(x/y)*((x-1)/(y-1))*((x-2)/(y-2))*...*((1)/(y-x+1))
This computation is easy to implement, it is fairly fast, and it avoids
the overflow problem since the value stays between 0 and 1 throughout the
computation. But some fundamental drawbacks inherent to the
hypergeometric method cannot be avoided:
- The choice of threshold affects the result tremendously, and there is no gold
standard for the threshold. This means we have to change the threshold again
and again, then check all the outputs to see which threshold gives the
most interesting result. However, even if we get an extremely interesting
output from a certain threshold, we cannot say it is the best threshold,
since it may have exaggerated the relationship in the data.
- The threshold is not consistent across data sets. Even if you find a
good threshold for one data set,
the same threshold is quite likely not a good one for other data.
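The product-of-ratios calculation described in this section, (x/y)*((x-1)/(y-1))*...*(1/(y-x+1)), can be sketched as follows. This is an illustration under stated assumptions, not ErmineJ's code: it assumes the hypergeometric formula uses the standard binomial coefficient y!/(x!*(y-x)!), so the running product gives its reciprocal, and the function names are hypothetical.

```python
def inverse_choose(x, y):
    """Return x!*(y-x)!/y!, the reciprocal of the binomial coefficient,
    as a running product of ratios that never exceeds 1."""
    result = 1.0
    for i in range(x):
        result *= (x - i) / (y - i)  # each factor lies in (0, 1]
    return result

def hypergeometric_probability(n1, n2, N1, N2):
    """Probability of a class holding n1 of the N1 genes above the threshold
    and n2 of the N2 genes below it, built from reciprocal coefficients."""
    numerator = inverse_choose(n1 + n2, N1 + N2)
    denominator = inverse_choose(n1, N1) * inverse_choose(n2, N2)
    return numerator / denominator

# Toy counts: 8 genes in total, 5 above and 3 below the threshold;
# the class contains 2 of the former and 1 of the latter.
print(hypergeometric_probability(2, 1, 5, 3))
```

For very large classes or gene lists the reciprocals can become small enough to underflow to zero in floating point; summing logarithms of the ratios instead of multiplying them is one way around that.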
III. Area under Receiver Operating Characteristic (AROC)
This is a new method and is thus quite experimental. AROC means area under the ROC curve. This method considers only the rank of each
gene in each class. The algorithm for ROC is (modified from text written
by Dr. Paul Pavlidis):
inputs: set T of positives (e.g., the 10 genes which are in the class)
set F of negatives (e.g., the other 10000 genes)
list P of all genes in T and F, ranked so the best score is first (e.g.,
by the gene p-values)
count = 0
area = 0
For each gene in P (from best to worst)
if gene is in T then count++
else if gene is in F then area += count
if count == T.size then area += count * (number of genes in F not yet visited), done
AROC = area/(T.size * F.size)
return AROC
end
If we get an extremely interesting class (all the elements are ranked at the
very top), the ROC rate will be 1; in the worst case it can be as low as 0.
The ROC rate can be transformed into a p-value (rocpval), which is an
indicator of how special the class is.
ROC scores are converted to p-values using an approximate formula that holds as long as the number of genes in the class
is reasonably small compared to the total number of genes (a fraction less than 0.5 or so).
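The rank-based AROC calculation from this section can be written as a short runnable sketch. It assumes every class member appears in the ranked list and that there are no tied scores, and it omits the early-exit shortcut for simplicity; the function name `aroc` and the gene labels are hypothetical.

```python
def aroc(ranked_genes, positives):
    """Area under the ROC curve for a list of genes ranked best-first.

    ranked_genes: all gene ids, best score first.
    positives: set of gene ids that belong to the class.
    """
    n_pos = len(positives)
    n_neg = len(ranked_genes) - n_pos
    count = 0  # class members seen so far
    area = 0
    for gene in ranked_genes:
        if gene in positives:
            count += 1
        else:
            area += count  # this non-member is outranked by `count` members
    return area / (n_pos * n_neg)

# Toy rankings of four genes, two of which ("a" and "b") are in the class.
print(aroc(["a", "b", "c", "d"], {"a", "b"}))  # members ranked at the top
print(aroc(["c", "d", "a", "b"], {"a", "b"}))  # members ranked at the bottom
```

The first call shows the best case described above and the second the worst case; a class whose members are scattered evenly through the ranking lands near 0.5.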
Related
How the program runs |