Introduction to ErmineJ

ErmineJ is a software tool for generating "functional class scores" based on scores given to individual genes or based on the similarity of data vectors (e.g., expression profiles) for each gene.

A command-line tool also exists, but it is very primitive and largely undocumented. This page describes the graphical user interface (GUI), but much of the information also applies to the command-line tool.

Installing the software

You must have a Java runtime installed on your machine. You can get one from Sun, but there is a good chance one is already installed.

Once you have a Java runtime, just copy the software files to a convenient location on your hard drive.

Running the software

Just double-click on 'ermineJ.bat'. You will see a plain text 'console' window but shortly a graphical window should appear containing several fields for input. To perform a run, you have to set the fields appropriately and then press "OK". You can perform multiple runs without closing and restarting the software.

Inputs

The following fields must be set by the user on startup.

  • Gene Score File (Gene score based) or Data File (Correlation-based): A user-supplied file containing the data. This is either a set of gene scores (currently these should be p-values), or the raw data for the genes. See below.
  • GO Biological Names File: A file containing the names of the gene classes. We supply this.
  • Probe to Unigene Mapping File: A file containing a list of the 'duplicates' for the array. This is needed if you use 'weights'. We supply this.
  • Probe to GO mapping file: A file containing a list of the class membership of probes on the array. We supply this.
  • Output File: A name of an output file. Your results will be saved to this file.

Options

  • Mean, Quantile, or Mean above Quantile: Choose one method to be used for class scoring. See descriptions below for details. Note that ROC and hypergeometric p-values are always generated as well. Default=mean
  • Use weights: Usually you will want to use this. Default=yes (checked)
  • Iterations: This sets the number of random trials used to generate background distributions. The higher you set this, the longer the run will take, but the more precise your p-values may be. We don't suggest setting this above 100,000 unless you find that many classes have 'maxed out' p-values. Default=10000
  • Quantile: This option only applies if you use the "quantile" or "mean above quantile" method. Default=50 (median).
  • Min class size: Classes with fewer genes than this will not be evaluated. Default=4
  • Max class size: Classes with more genes than this will not be evaluated. Default=100
  • P-value cutoff: This option sets the gene p-value threshold used for the hypergeometric p-values. Default=0.00001

Outputs

  • A text file containing 8 columns: class, size, raw score, pval, virtual_size, hyper pval, aroc rate, rocpval. The meanings:
    • class : Name of the class, with the GO id number.
    • size : How many probes on the array are in the class.
    • raw score : The raw class statistic based on the selected method (mean, quantile or mean above quantile). High values are better.
    • pval : The p value based on the raw score. Low values are better.
    • virtual_size : How many independent genes are in the class. This is always less than or equal to the size since some genes are represented multiple times on an array. This is the value used in the calculations.
    • hyper pval : The p-value for the class based on the hypergeometric distribution (see below for details).
    • aroc rate : The area under the receiver operating characteristic curve (AROC).
    • rocpval : The p-value for the AROC.

The output file can be opened in Excel and sorted easily.

More information

ErmineJ computes a raw score, class p-value, hypergeometric p-value, and ROC rate for every gene class, using the unigenes (or single genes) in the gene score file. The class p-value estimates how interesting a class is, but it has a drawback: it relies on randomization trials, which make the output unstable, especially when the number of iterations is limited. Using a large number of iterations (100,000 or more) improves stability, but at a considerable cost in running time. We therefore also provide measures that are stable and still give a good estimate of how interesting a class is. The hypergeometric p-value and the ROC rate are two candidates: neither involves randomization, so results are repeatable under fixed settings. Both methods are described in more detail in the Methods section.

Methods

I. Class p-values based on the sampling distribution of gene scores

The Gene Ontology (GO) class p-value is computed by comparing each class with randomly selected sets of genes. The steps are:

  1. Generate the background distribution -- given the size of a GO class (S) and the number of iterations (I), each trial randomly picks S unigenes/genes from the gene score file and applies the selected method (mean or median) to their scores to get the result of that trial. The trial is repeated I times to generate the background distribution, which shows how the class score is distributed when S elements are chosen independently at random.
  2. Transform the background distribution into p-values -- the score distribution is split into bins (columns) according to score, and each bin is assigned a p-value: the fraction of trials with scores better than that bin. The program then builds a look-up table with one row per class size and, in each row, the p-value for every column.
  3. Determine the class p-value -- for each GO class, calculate the raw class score using the selected method (mean or median), then look up the corresponding p-value in the background distribution table. With enough iterations (100,000 or more), this value is a good estimate of how unusual the class is. The precision of this method is 0.5/(number of iterations), so there is a trade-off between precision and running time.
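
The three steps above can be sketched in Python. This is only an illustration under stated assumptions, not ErmineJ's actual code: the function name class_pvalue and its parameters are hypothetical, gene scores are assumed to be oriented so that higher is better (for example, -log10 of the gene p-values), and the binned look-up table is replaced by a direct comparison against the sampled scores. Trials that tie the class score are counted as "at least as good".

```python
import random
from statistics import mean


def class_pvalue(gene_scores, class_genes, iterations=10000,
                 method=mean, seed=None):
    """Estimate a class p-value by random sampling (illustrative sketch).

    gene_scores: dict mapping gene name -> score, higher = better (assumed).
    class_genes: names of the genes that belong to the class under test.
    method:      function reducing a list of scores to one class score.
    """
    rng = random.Random(seed)
    all_scores = list(gene_scores.values())
    members = [gene_scores[g] for g in class_genes if g in gene_scores]
    size = len(members)
    raw_score = method(members)

    # Step 1: background distribution for random classes of the same size.
    background = [method(rng.sample(all_scores, size))
                  for _ in range(iterations)]

    # Steps 2 and 3: fraction of random classes scoring at least as well.
    at_least_as_good = sum(1 for s in background if s >= raw_score)

    # Never report exactly zero; the floor follows the
    # 0.5/(number of iterations) precision quoted in step 3.
    pval = max(at_least_as_good / iterations, 0.5 / iterations)
    return raw_score, pval
```

Passing a fixed seed makes a run repeatable. Without one, two runs with the same settings can give slightly different p-values, which is the instability discussed under "More information".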

II. Hypergeometric p-value

This p-value is computed from the hypergeometric distribution. The steps are:
  1. The user sets a threshold (T) for this method. In the whole gene score dataset, N1 unigenes/genes have a p-value greater than T and N2 unigenes/genes have a p-value smaller than T. In each GO class, n1 unigenes/genes have a p-value greater than T and n2 unigenes/genes have a p-value smaller than T.
  2. The hypergeometric p-value is:
    choose(n1, N1)*choose(n2, N2)/choose(n1+n2, N1+N2), where choose(x, y) = y!/(x!*(y-x)!) is the number of ways to choose x items from y.
    In an implementation it is not practical to compute the factorials directly, since the numbers can be very large, which may cause overflow and a huge time cost. However, the reciprocal of each term can be rewritten as a product of ratios:
    1/choose(x, y) = (x!*(y-x)!)/y! = x!/(y*(y-1)*(y-2)*...*(y-x+1)) = (x/y)*((x-1)/(y-1))*((x-2)/(y-2))*...*((1)/(y-x+1))
    This computation is easy to implement and fairly fast, and it avoids the overflow problem because every factor, and hence the running product, stays between 0 and 1. However, the hypergeometric method has some fundamental drawbacks that cannot be avoided:
  • The choice of threshold affects the result tremendously, and there is no gold standard for choosing it. In practice this means changing the threshold repeatedly and checking all the outputs to see which threshold gives the most interesting result. Even if a certain threshold gives an extremely interesting output, we cannot say it is the best threshold, since it may have exaggerated the relationships in the data.
  • A good threshold is not consistent across data sets: even if you find a good threshold for one data set, the same threshold is very probably not a good one for other data.
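
The product-of-ratios computation in step 2 can be sketched in Python as follows. The function names inverse_choose and hypergeometric_probability are hypothetical and are not taken from ErmineJ; the sketch evaluates the formula from step 2 using reciprocal binomial coefficients, each built up as a running product that stays between 0 and 1. For very large classes that product could underflow to zero, in which case working with logarithms would be safer.

```python
from math import comb


def inverse_choose(x, y):
    """Return 1 / (y choose x) as a product of ratios, without factorials."""
    result = 1.0
    for i in range(x):
        # Factors are x/y, (x-1)/(y-1), ..., 1/(y-x+1); each is <= 1.
        result *= (x - i) / (y - i)
    return result


def hypergeometric_probability(n1, n2, N1, N2):
    """Probability that a class of n1 + n2 genes contains exactly n1 of the
    N1 genes above the threshold and n2 of the N2 genes below it."""
    return (inverse_choose(n1 + n2, N1 + N2)
            / (inverse_choose(n1, N1) * inverse_choose(n2, N2)))


# Cross-check against exact integer binomial coefficients.
exact = comb(20, 3) * comb(80, 2) / comb(100, 5)
assert abs(hypergeometric_probability(3, 2, 20, 80) - exact) < 1e-12
```

Dividing by the two small reciprocals is equivalent to multiplying by choose(n1, N1)*choose(n2, N2), so the result matches the formula in step 2 while no factorial is ever formed.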

III. Area under Receiver Operating Characteristic (AROC)

This is a new method and is thus quite experimental. AROC stands for area under the ROC curve; this method considers only the rank of each gene in each class. The algorithm for ROC is (modified from text written by Dr. Paul Pavlidis):

inputs: set T of positives (e.g., the 10 genes which are in the class)
        set F of negatives (e.g., the other 10000 genes)
        list P of scores for all genes in T and F, ranked so the best score is first (e.g., the gene p-values)
count = 0
area = 0
For each entry g in P (from best to worst)
    if g is in T then count++
    else if g is in F then area += count
    if count == T.size then area += T.size * (number of entries in P not yet visited), done
AROC = area / (T.size * F.size)
return AROC
end
	 

If we get an extremely interesting class (all of its elements are ranked at the very top), the ROC rate will be 1; in the worst case it can be as low as 0. The ROC rate can be transformed into a p-value (rocpval), which indicates how unusual the class is.
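
The pseudocode above can be turned into a short runnable Python sketch. The function name aroc and its arguments are hypothetical, not ErmineJ's API. The sketch assumes the genes are already sorted from best score to worst, and it leaves out the early exit once all positives have been seen, which changes only the speed and not the result.

```python
def aroc(ranked_genes, class_genes):
    """Area under the ROC curve for one class (illustrative sketch).

    ranked_genes: all genes, ordered from best score to worst.
    class_genes:  the positives, i.e. the genes in the class.
    """
    positives = set(class_genes)
    n_pos = sum(1 for g in ranked_genes if g in positives)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    count = 0  # positives seen so far
    area = 0
    for gene in ranked_genes:
        if gene in positives:
            count += 1
        else:
            area += count  # positives ranked above this negative
    return area / (n_pos * n_neg)


ranked = list("abcdefghij")             # 'a' has the best score
print(aroc(ranked, ["a", "b", "c"]))    # class at the very top: 1.0
print(aroc(ranked, ["h", "i", "j"]))    # class at the very bottom: 0.0
```

A class whose genes are scattered evenly through the ranking gives a value near 0.5.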

ROC scores are converted to p-values using an approximate formula that holds as long as the number of genes in the class is reasonably small compared to the total number of genes (a fraction of less than 0.5 or so).

Related

How the program runs