ErmineJ: Projects
Overview

ErmineJ is a port of the existing Ermgene code to Java. It provides a GUI and gives the program a more consistent and flexible structure. There are several research directions to pursue in this project. They center primarily on establishing which method for scoring classes 'works best'.

Implement gene weighting methods

DONE: This is basically the last thing that needs to be done before we can really use the software.

Additional class scoring methods (experiment score)

The methods we already have available. Here 'score' means the raw score for a gene, typically the -log(p value) for group comparisons.
Additional methods for experiment scores: These require binning the scores, and may also need additional translation into methods we can apply. See Mirnics et al., 2000. These have the advantage of not requiring any randomization trials.
How the binning is done: Mirnics binned the expression ratios for each gene. More generally, we can consider the distribution of raw scores for the genes in the experiment. The procedure would break that distribution into bins (perhaps 10) and compare the distribution of scores in the class to the distribution for the whole data set.

DONE: An additional method uses the hypergeometric distribution (no binning required). It differs from all the other methods in that it requires selecting a set of genes with a threshold, and then checking whether the genes in a class are concentrated in that section of the data. Note that this is similar in flavor to the knn method described in our paper.

DONE: A final method, inspired by the previous one, is to measure the AUC (area under the receiver operating characteristic curve) for the genes in the class with respect to the ranking provided by the gene scores. This does not require a threshold. However, it does require calibration. Unlike the other methods, this is not data dependent (but it is still class size dependent, as a proportion of the full data set).

Randomization trials for multiple test correction

DONE (sort of): An interesting issue is how, or if, to correct for multiple testing when considering 100s of classes. I have implemented Westfall-Young, Benjamini-Hochberg, and Bonferroni.

Optimization of background distribution calculation

DONE: The code runs reasonably fast. The slow step in class scoring, where empirical p-values are found, is determining the background distribution. There are several approaches to improving performance:
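For orientation, the slow step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ErmineJ's actual code. It assumes that a class is scored by the mean of its gene scores, that the background is built from random gene sets of the same size drawn without replacement, and that the empirical p-value uses a +1 correction so it is never exactly zero. The names `BackgroundSketch`, `meanScore`, `background`, and `empiricalPValue` are hypothetical.

```java
import java.util.Random;

/** Illustrative sketch only; names and scoring choices are assumptions, not ErmineJ's API. */
class BackgroundSketch {

    /** Class score (assumed here): mean of the raw scores of the member genes. */
    static double meanScore(double[] geneScores, int[] members) {
        double sum = 0.0;
        for (int idx : members) {
            sum += geneScores[idx];
        }
        return sum / members.length;
    }

    /**
     * Background distribution for one class size: the mean score of
     * 'trials' random gene sets, each drawn without replacement.
     */
    static double[] background(double[] geneScores, int classSize, int trials, Random rng) {
        int n = geneScores.length;
        if (classSize < 1 || classSize > n) {
            throw new IllegalArgumentException("classSize must be between 1 and " + n);
        }
        int[] pool = new int[n];
        for (int i = 0; i < n; i++) {
            pool[i] = i;
        }
        double[] result = new double[trials];
        for (int t = 0; t < trials; t++) {
            double sum = 0.0;
            // Partial Fisher-Yates shuffle: the first classSize slots
            // end up holding a uniformly random gene set.
            for (int i = 0; i < classSize; i++) {
                int j = i + rng.nextInt(n - i);
                int tmp = pool[i];
                pool[i] = pool[j];
                pool[j] = tmp;
                sum += geneScores[pool[i]];
            }
            result[t] = sum / classSize;
        }
        return result;
    }

    /** Fraction of background scores at least as large as the observed one (+1 correction). */
    static double empiricalPValue(double observed, double[] background) {
        int atLeast = 0;
        for (double b : background) {
            if (b >= observed) {
                atLeast++;
            }
        }
        return (atLeast + 1.0) / (background.length + 1.0);
    }

    public static void main(String[] args) {
        // Toy data: 1000 genes with random scores; a made-up class of the first 20 genes.
        Random rng = new Random(42);
        double[] scores = new double[1000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = -Math.log10(1.0 - rng.nextDouble());
        }
        int[] members = new int[20];
        for (int i = 0; i < members.length; i++) {
            members[i] = i;
        }
        double observed = meanScore(scores, members);
        double[] bg = background(scores, members.length, 10000, rng);
        System.out.println("class score = " + observed);
        System.out.println("empirical p = " + empiricalPValue(observed, bg));
    }
}
```

Because the background in this sketch depends only on the class size and the gene scores, it can be computed once per class size and reused for every class of that size, which is one obvious way to reduce the cost of this step.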
Evaluation on multiple data sets

Factors which I want to quantify and evaluate:
Enumerate the classes which have top scores in each experiment, and set up comparisons between data sets.

Evaluation
References
Paul Pavlidis.