ErmineJ: Projects

Overview

ErmineJ is a port of the existing Ermgene code into Java, providing a GUI as well as more consistency and flexibility in the structure of the program. There are several research directions to pursue in this project. They center primarily on establishing which method for scoring classes 'works best'.

Implement gene weighting methods

DONE: This is basically the last thing that needs to be done before we can really use the software.

Additional class scoring methods (experiment score)

These are the methods we already have available. Here 'score' means the raw score for a gene, typically the -log(p value) for group comparisons.

  • DONE: Mean of scores in the class
  • DONE: Median of scores in the class
  • DONE: Quantile: Variant of median: using an arbitrary quantile, such as the 75th.
  • DONE: Variant of quantile/mean: Using the mean of values above a quantile.
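
As a rough illustration of the four class scores listed above, here is a minimal Python sketch. It is not the ErmineJ implementation: the function names, the linear-interpolation quantile, and the choice to include values equal to the cutoff in the "mean above a quantile" variant are all assumptions made for the example.

```python
import statistics


def quantile(scores, q):
    """Quantile of scores for 0 <= q <= 1, using linear interpolation (an assumption)."""
    ordered = sorted(scores)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def class_score(scores, method="mean", q=0.75):
    """Raw class score computed from the gene scores of the class members."""
    if method == "mean":
        return statistics.fmean(scores)
    if method == "median":
        return statistics.median(scores)
    if method == "quantile":
        return quantile(scores, q)
    if method == "mean_above_quantile":
        cutoff = quantile(scores, q)
        # Values equal to the cutoff are kept; this is a choice made for the sketch.
        return statistics.fmean([s for s in scores if s >= cutoff])
    raise ValueError("unknown method: " + method)
```

For a class with gene scores 1, 2, 3, 4, 5, the mean and median are both 3, the 75th percentile is 4, and the mean above that quantile is 4.5.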

Additional methods for experiment scores:

These require binning the scores, and may also need additional translation into methods we can apply. See Mirnics et al., 2000. These have the advantage of not requiring any randomization trials.

  • T-test comparing binned scores in the class to all the genes.
  • Chi-squared test, comparing the distribution of the binned scores in the class to all the genes. The Kolmogorov-Smirnov test may be more appropriate.
  • NO: ANOVA with post-hoc testing, comparing genes in the class with the rest of the data. (This will not work correctly when the classes overlap, as they do for us)

How the binning is done: Mirnics binned the expression ratios for each gene. We can consider more generally the distribution of raw scores for genes in the experiment. The procedure would break the distribution into bins (perhaps 10), and compare the distribution of scores in the class to the distribution for the whole data set.
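
The binning procedure described above might look like the following sketch, which computes a chi-squared statistic for one class against the whole data set. Equal-width bins spanning the range of all scores are an assumption (the text only says "perhaps 10" bins), and the function names are hypothetical. Converting the statistic to a p value would need a chi-squared distribution with (number of non-empty bins - 1) degrees of freedom, which is left out here.

```python
def bin_counts(scores, lo, width, n_bins):
    """Count how many scores fall into each of n_bins equal-width bins starting at lo."""
    counts = [0] * n_bins
    for s in scores:
        idx = int((s - lo) / width) if width > 0 else 0
        # Clamp so the maximum score lands in the last bin.
        counts[min(max(idx, 0), n_bins - 1)] += 1
    return counts


def binned_chi_squared(class_scores, all_scores, n_bins=10):
    """Chi-squared statistic comparing the class's binned scores to the whole data set."""
    lo = min(all_scores)
    width = (max(all_scores) - lo) / n_bins
    observed = bin_counts(class_scores, lo, width, n_bins)
    background = bin_counts(all_scores, lo, width, n_bins)
    n_class, n_all = len(class_scores), len(all_scores)
    stat = 0.0
    for obs, bg in zip(observed, background):
        # Expected count if the class followed the overall distribution.
        expected = n_class * bg / n_all
        if expected > 0:
            stat += (obs - expected) ** 2 / expected
    return stat
```

A class whose scores are spread like the full data set gives a statistic near zero; a class concentrated in one or two bins gives a large one.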

DONE: An additional method uses the hypergeometric distribution (no binning required). It differs from all the other methods in that it requires selecting a set of genes with a threshold, and then testing whether the genes in a class are concentrated in that section of the data. Note that this is similar in flavor to the knn method described in our paper.
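
The hypergeometric calculation can be sketched as an upper-tail probability: given the number of genes passing the threshold, how likely is it that a class of this size contains at least as many of them as observed? The function and parameter names below are hypothetical, and the sketch ignores gene weighting.

```python
from math import comb


def hypergeometric_pvalue(n_total, n_selected, class_size, n_hits):
    """P(at least n_hits class members are among the n_selected genes above threshold).

    n_total: genes in the data set; class_size: genes in the class.
    """
    upper = min(class_size, n_selected)
    denom = comb(n_total, class_size)
    tail = sum(
        comb(n_selected, k) * comb(n_total - n_selected, class_size - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / denom
```

For example, with 10 genes of which 5 pass the threshold, a class of 2 genes that both pass has probability 10/45 (about 0.22) under random selection.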

DONE: A final method, inspired by the previous one, is to measure the AUC (area under the receiver operating characteristic curve) for the genes in the class with respect to the ranking provided by the gene scores. This does not require using a threshold. However, it does require calibration. Unlike the other methods, this is not data dependent (but it is still class size dependent, as a proportion of the full data set).
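
One standard way to compute such an AUC is from the rank sum of the class members (the Mann-Whitney relationship). The sketch below assumes that higher scores are better and that tied scores share an average rank; the names are hypothetical, and the calibration step mentioned above is not covered.

```python
def class_auc(scores, class_genes):
    """AUC for the class members given a dict of gene -> score (higher is better)."""
    ordered = sorted(scores, key=scores.get)  # ascending by score
    ranks = {}
    i = 0
    while i < len(ordered):
        # Find the run of genes tied with ordered[i] and give them the average rank.
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1
    members = [g for g in scores if g in class_genes]
    n_in = len(members)
    n_out = len(scores) - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("need at least one gene inside and one outside the class")
    rank_sum = sum(ranks[g] for g in members)
    return (rank_sum - n_in * (n_in + 1) / 2) / (n_in * n_out)
```

An AUC of 1.0 means every class member outranks every non-member, 0.5 is what a randomly placed class would give on average, and 0.0 means the class sits entirely at the bottom of the ranking.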

Randomization trials for multiple test correction

DONE (sort of): An interesting issue is how, or whether, to correct for multiple testing when considering hundreds of classes. I have implemented Westfall-Young, Benjamini-Hochberg, and Bonferroni.
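
For reference, the two simpler corrections can be sketched in a few lines of Python. This is a generic illustration of the standard Bonferroni and Benjamini-Hochberg adjustments, not the ErmineJ code; Westfall-Young is omitted because it depends on the randomization trials themselves.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With class p values 0.01, 0.04, 0.03, 0.5, Bonferroni gives 0.04, 0.16, 0.12, 1.0, while Benjamini-Hochberg gives roughly 0.04, 0.053, 0.053, 0.5.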

Optimization of background distribution calculation

DONE: The code runs reasonably fast.

The slow step in class scoring, where empirical p-values are found, is determining the background distribution. There are several approaches to improving performance:

  • Only calculate backgrounds for class sizes which are populated: if no classes have a given size, skip it.
  • For larger classes, we can probably compute the background only for every second or third class size.
  • DONE: How many iterations do we need before the ranking at the top stops changing? 10k? 100k? 500k?
  • We can interactively refine the background distributions for classes which have high scores. In other words, we would first calculate the raw class scores. Then, the background calculation would proceed on the basis of how well the distribution accounts for the score. This is complex because it requires interaction between two parts of the software.
  • For large classes (>100?) we can appeal to the central limit theorem and use a Gaussian distribution, the mean and standard deviation of which we would estimate from a small number of iterations.
  • Similarly, there may be a reasonable function which can be fit to the distributions.
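
Two of the ideas above (skipping unpopulated class sizes, and a Gaussian approximation for large classes) can be sketched as follows. The sketch assumes the class score is the mean of the gene scores, that random classes are drawn without replacement, and that the empirical p value uses a +1 correction to avoid returning zero; all names and default iteration counts are hypothetical.

```python
import random
import statistics
from bisect import bisect_left


def background_by_size(gene_scores, class_sizes, iterations=10000, seed=0):
    """Sorted random mean scores, computed only for class sizes that actually occur."""
    rng = random.Random(seed)
    backgrounds = {}
    for size in sorted(set(class_sizes)):  # unpopulated sizes are never visited
        backgrounds[size] = sorted(
            statistics.fmean(rng.sample(gene_scores, size)) for _ in range(iterations)
        )
    return backgrounds


def empirical_pvalue(background, observed):
    """Fraction of random class scores >= observed, with a +1 correction (an assumption)."""
    n_at_least = len(background) - bisect_left(background, observed)
    return (n_at_least + 1) / (len(background) + 1)


def gaussian_pvalue(gene_scores, size, observed, iterations=200, seed=0):
    """Upper-tail p value from a Gaussian fitted to a small number of random classes."""
    rng = random.Random(seed)
    samples = [statistics.fmean(rng.sample(gene_scores, size)) for _ in range(iterations)]
    fitted = statistics.NormalDist(statistics.fmean(samples), statistics.stdev(samples))
    return 1.0 - fitted.cdf(observed)
```

The Gaussian variant trades accuracy in the tail for speed, so whether it is acceptable for classes above a given size would need to be checked against the full empirical background.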

Evaluation on multiple data sets

Factors which I want to quantify and evaluate:

  • DONE: The effect of weighting/not weighting repeated appearances of genes. (You have to do weighting.)
  • The effect of different scoring methods on the results obtained.
  • The effect of filtering the data first: removing genes which are not expressed, for example.

Enumerate the classes which have top scores in each experiment, and set up comparisons between data sets.

Evaluation

  • One suggestion is that the class scores should reflect the data, and similar data sets should get similar class scores. For example, 'cell-growth' shows up in most cancer data sets: which methods do a better job of recovering this?
  • Class scores should be robust: they should not be affected by removing genes. Bootstrap resampling?
  • :-( Expert evaluation.
