ErmineJ: class scoring code

Preliminaries

Definitions: (operational for this project, not general)

  • Gene: a biological genetic entity that is studied using a probe or probe set on a microarray. Sometimes, unfortunately, used synonymously with probe or probe set.
  • Probe or probe set: A unique thing on a microarray that is designed to test for the level of a particular gene. Each probe is uniquely identified with a probe id.
  • Probe id: an alphanumeric identifier for a probe set. For Affymetrix arrays, these look like, for example, X1890_at, but the format varies.
  • Probe pval: A p-value or score for a probe. Sometimes referred to incorrectly by Paul as a gene pval. Assumption: each probe on a microarray can be assigned a pvalue. These pvalues are GIVEN. Not to be confused with class pvals.
  • Probe data, also referred to as "data", "raw data", or "microarray data": The expression measurements for the probes, typically across multiple microarrays. Thus for each probe, there is a set of data points. Together, the data for all probes is referred to as the probe (or, mistakenly, gene) data. The data for a particular probe is referred to specifically as "the data for a particular probe".
  • Replicate or duplicate: The occurrence of the same "actual gene" multiple times on a microarray, when a gene is represented by multiple probes.
  • Replicate group: A group of probes which are all replicates of one another (i.e., each represents the same gene). In our algorithms, we attempt to consider each replicate group as a single "gene".
  • Gene ontology id: An alphanumeric identifier for a biological function. The format is GO:NNNNNNN, where NNNNNNN is a seven-digit number, padded with leading zeros if necessary. A gene ontology id merely refers to a term. The mapping of gene ontology ids to probes and genes is a major input to the software.
  • Class: short for a gene class, a set of genes which have related function as determined by having a common gene ontology annotation.
  • Data for genes in a class: The probe data for the genes in a class. This means the set of probes referring to the genes in the class.
  • pvalues for the probes for genes in a class: The probe pvals for the genes in a class.
  • Class pval: short for class p-value: The score generated by our class scoring software for a class. A class pval is calculated from a class score, using a background distribution to convert the score to a pval. The starting inputs are either the probe pvals for the genes in the class (experiment score) or the raw microarray data for the genes in the class (correlation score), together with the background distribution and the class size of the class we are trying to get a class pval for.
  • class score: Also Raw class score or raw score: A value calculated from the probe pvals or probe data for a class, in an attempt to quantify how interesting the class is based on the microarray data. Must be converted to a class pval.
  • Class size: Nominally, the number of probes that belong in a class. However, we are really interested in the number of genes in a class. Because of replicates, we refer to the 'virtual class size'.
  • Probe weight: The weight a probe is given based on how many replicates are in its group. Thus, if probes A, B, and C are all replicates of a gene, then each gets a weight of 1/3 when calculating averages. (The use of weighting varies depending on the algorithm.)
  • Virtual class size: The class size based on reweighting of replicates.
  • Random class: A set of probe pvals or probe data which are selected at random from the probe pvals or probe data. The size of the random class is chosen in advance.
  • Real class: A set of probes determined by actual annotations, not random selection of probes. This is the key input to the class scoring algorithm; however, to convert the resulting scores to class pvals we need the background distribution calculated from random classes.
  • Background distribution or histogram: The output of repeatedly making random classes and calculating class scores for each random class.
  • Matrix object: (capitalized Matrix) an instance of the Matrix class. This class allows accessing rows and columns both by a numerical index, or by a string key (i.e., probe id).
  • matrix (lowercase matrix): A mathematical abstraction of data which is accessed by 'row' and 'column' indexes.
  • Array: a block of memory we can access using [i] notation. Always used for storing primitives such as doubles or ints. A Matrix object internally stores its data as a two-dimensional array, i.e., [x][y].
  • Vector: A Java data structure which is like an array, but slower and useful only for storing objects such as strings.
  • vector or array in 'lowercase' context: A mathematical abstraction of a data structure containing a list of things.
  • Hash: A hashmap or hash table, allowing O(1) searching for keys. Values are data objects such as strings, Vectors, or arrays.
  • Experiment score: Any class score which is calculated using probe pvals.
  • Correlation score: Any class score which is calculated on the basis of pairwise comparisons of probe data.
  • Class browser: A graphical user interface allowing point-and-click surfing of classes. Click on a class id, get the probe data and probe pvals for the probes in the class, as well as the class score(s) for the class.
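
The definitions of probe weight and virtual class size can be illustrated with a short Java sketch. It assumes the replicate information is available as a map from probe id to gene, and it takes the virtual class size to be the sum of the probe weights in the class (matching effective_size in the pseudocode further down). The class and method names are hypothetical, not part of the existing code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: probe weights and virtual class size from replicate groups. */
class ReplicateWeights {

    /**
     * probeToGene maps each probe id to the gene it measures (an assumed
     * representation of the replicate mapping). Returns probe id -> 1/groupSize.
     */
    static Map<String, Double> weights(Map<String, String> probeToGene) {
        // Count how many probes share each gene (the replicate group size).
        Map<String, Integer> groupSize = new HashMap<>();
        for (String gene : probeToGene.values()) {
            groupSize.merge(gene, 1, Integer::sum);
        }
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, String> e : probeToGene.entrySet()) {
            w.put(e.getKey(), 1.0 / groupSize.get(e.getValue()));
        }
        return w;
    }

    /** Virtual class size, assumed here to be the sum of the weights of the class's probes. */
    static double virtualClassSize(List<String> classProbes, Map<String, Double> w) {
        double size = 0.0;
        for (String probe : classProbes) {
            size += w.getOrDefault(probe, 1.0); // probes with no known replicates weigh 1
        }
        return size;
    }
}
```

With probes A, B, and C as replicates of one gene and D measuring another, each of A, B, and C gets a weight of 1/3, and a class containing all four probes has a virtual class size of 2.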

On mappings of genes to classes

  • A gene can belong to zero or more classes.
  • A class can contain zero or more genes.
  • Classes are not mutually exclusive: a class can overlap with another class in terms of the genes it contains.

On the mappings of genes to probes/probe sets

  • Each probe set measures a particular gene. (in theory anyway)
  • Each gene is measured by one or more probe sets. (we ignore the fact that many genes are not even considered in our experiment)

Based on the above, we note that there is a more complex mapping of probes to classes

  • A probe can belong to zero or more classes.
  • A class can contain zero or more probes.
  • A class can contain more probes than genes (thanks to replicates). Note that a class can contain more than one replicate group.
  • Classes are not mutually exclusive at the probe level.
  • The probe grouping rule: If probes A and B are replicates, and A and B are in class J, and B is in class K, then A is in class K. This means that all replicates of a gene are always found together in a class. If the rule is found to be broken, this is an error.
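
The probe grouping rule can be checked mechanically. The following Java sketch assumes the replicate groups are given as lists of probe ids and the class mapping as a map from probe id to class ids; both representations and all names are illustrative assumptions.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical check of the probe grouping rule. */
class ProbeGroupingRule {

    /**
     * Returns true if, within every replicate group, all probes belong to
     * exactly the same set of classes. A probe missing from the mapping is
     * treated as belonging to no classes.
     */
    static boolean holds(List<List<String>> replicateGroups,
                         Map<String, List<String>> probeToClasses) {
        for (List<String> group : replicateGroups) {
            Set<String> expected = null;
            for (String probe : group) {
                Set<String> classes = new HashSet<>(
                        probeToClasses.getOrDefault(probe, Collections.<String>emptyList()));
                if (expected == null) {
                    expected = classes;          // first probe sets the reference
                } else if (!expected.equals(classes)) {
                    return false;                // a replicate is missing from some class
                }
            }
        }
        return true;
    }
}
```

A false result corresponds to the error condition described in the rule; how the error is reported is left open here.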

Inputs to class scoring

  • probe pvalues: A tab-delimited file, where the first column contains the probe ids, and another column contains the probe pvalues. Although not needed for correlation scoring, we need it for display of the results.
  • Probe data: a tab-delimited file. Although not needed for experiment scoring, we need it for display of the results.
  • probe->class mapping file: A tab-delimited file, where each row contains the mapping for one probe. The list of classes the probe belongs to is a |-delimited list of class ids.
  • Replicate probe mapping: The format is to be determined, but this file lets us see which probes are duplicates of which other probes.
  • Gene ontology definitions: The mapping of GO ids to human-readable descriptions, such as "protein kinase". This is needed to display the results. The native format is the GO XML, parsed by the GO API.
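
As an illustration of the probe->class mapping file, here is a minimal Java parsing sketch. It assumes the probe id is in the first column, the |-delimited class list is in the second column, and there is no header row; the actual column layout is not specified above, so these are assumptions, as is the class name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for lines of the form "probeId<TAB>GO:0000001|GO:0000002". */
class ProbeClassMappingParser {

    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> probeToClasses = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t", -1);
            String probeId = fields[0].trim();
            List<String> classes = new ArrayList<>();
            if (fields.length > 1) {
                // The pipe must be escaped because split() takes a regular expression.
                for (String id : fields[1].split("\\|")) {
                    if (!id.trim().isEmpty()) {
                        classes.add(id.trim());
                    }
                }
            }
            probeToClasses.put(probeId, classes); // a probe may belong to zero classes
        }
        return probeToClasses;
    }
}
```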

Internal storage of the inputs.

  • probe pvalues: We must be able to rapidly find the pvalue for any probe, as well as efficiently select pvalues at random for making background distributions. Thus the probe pvalues are stored in a Matrix object. The keys to its hashmap are the probe ids. The values are the rows of the Matrix (in this case, each containing a single value, the probe pvalue).
  • probe to class mapping: A hash where keys are probe ids and values are Vectors of class ids.
  • Replicate probe mapping: Probably, a hash where keys are probe ids and values are Vectors of probe ids which are replicates of the key. For some purposes (experiment score) we can simplify this by using a mapping just to the weight for the probe id, but this won't work for the correlation score.
  • probe data: A Matrix object.
  • Gene ontology definitions: provided by the GO API.
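
The probe to class hash is keyed by probe id, whereas scoring starts from a class id. One possible way to bridge the two, sketched below in Java, is to invert the hash once after loading. This is an assumption about how lookups by class id might be served, not a description of the existing code, and it uses List and HashMap rather than Vector for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index from class id to the probe ids in that class. */
class ClassIndex {

    static Map<String, List<String>> invert(Map<String, List<String>> probeToClasses) {
        Map<String, List<String>> classToProbes = new HashMap<>();
        for (Map.Entry<String, List<String>> e : probeToClasses.entrySet()) {
            for (String classId : e.getValue()) {
                // Create the probe list for a class the first time the class is seen.
                classToProbes.computeIfAbsent(classId, k -> new ArrayList<>())
                             .add(e.getKey());
            }
        }
        return classToProbes;
    }
}
```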

Some Requirements for the code and software

  • We must be able to rapidly retrieve the probe data given a class id.
  • We must be able to rapidly retrieve probe pvals given a class id.
  • Rapidly calculate the class score and class pval for a given class id.
  • Display, print, save to disk, or list the probe data given a class id.
  • Display, print, save to disk, or list the class NAME given a class id, and vice versa.
  • Print or save background distributions. Read background distributions from a file.
  • Background distributions must be updatable. That is, we can add additional random classes to it.
  • Background distributions can be calculated in the background, i.e., in a separate thread from the main program. This allows us to update the class scores and class pvals even though the background distribution is not finished. This is only relevant in the context of a class browser.
  • (ideally) Easily add class scoring methods and background distribution determination methods.
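
The requirements on background distributions (updatable, usable while still being filled by another thread) could be met along the following lines. This Java sketch keeps one distribution per class size and assumes that a larger raw class score is more interesting, so the class pval is the fraction of random scores at least as large as the observed one. That direction, and all names used, are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical updatable background distribution for one class size. */
class BackgroundDistribution {
    private final List<Double> scores = new ArrayList<>();

    /** Adds the raw score of one more random class; safe to call from a worker thread. */
    synchronized void add(double randomClassScore) {
        scores.add(randomClassScore);
    }

    /**
     * Builds a random class by drawing `size` values without replacement from
     * the pool (a partial Fisher-Yates shuffle). Requires size <= pool.length.
     */
    static double[] randomClass(double[] pool, int size, Random rng) {
        double[] copy = pool.clone();
        for (int i = 0; i < size; i++) {
            int j = i + rng.nextInt(copy.length - i);
            double tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return Arrays.copyOf(copy, size);
    }

    /** Fraction of random scores >= observed (assumes larger scores are more interesting). */
    synchronized double pValue(double observed) {
        if (scores.isEmpty()) {
            throw new IllegalStateException("background distribution is empty");
        }
        int atLeast = 0;
        for (double s : scores) {
            if (s >= observed) {
                atLeast++;
            }
        }
        return (double) atLeast / scores.size();
    }
}
```

Because add and pValue are synchronized, a class browser can query a pval while a background thread is still adding random classes; the pval simply reflects whatever has been accumulated so far.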

Steps in the analysis from the user point of view.

  1. Read in probe data, select the identity of the microarray type used so appropriate class mappings can be used.
  2. Read in probe pvals from a file.
  3. Read in probe to class mapping. (should be done automatically based on the microarray type)
  4. Read in replicate probe mapping. (should be done automatically based on the microarray type)
  5. Read in a background distribution if one has been done already.
  6. Choose a class scoring method and parameters for background distribution calculation, using a GUI or the command line.
  7. Calculate a background distribution, or extend an existing background distribution if one was loaded.
  8. Calculate class scores for the REAL classes of a given size range (e.g., 5 to 100).
  9. Calculate class pvalues for REAL classes. (done in conjunction with previous step)
  10. Display, print, save to disk or list the class pvalues for the REAL classes, also including raw scores and diagnostics such as virtual class sizes.
  11. Display, print, save to disk or list the background distributions (for diagnostic purposes, or for loading later.)
  12. Allow browsing of the results in a class browser.

About weighting and detecting replicates

The main complication in calculating class scores is the weighting of replicates. The first stage is detecting replicates and counting them; the second is using this information in the class score calculation.

See class-correls and class-pvals for the way this is done in Perl for correlation and experiment scores respectively.

Given: replicate probe mapping, the class definitions, and the pvalues for the probes or the data for the probes (experiment and correl scores respectively).

That information is used as follows. In the pseudocode, k: x => y means to define (or add to) a hash table named k where x is a key and y is the value for x. k(x) means to retrieve y from k given x.

Correlation scoring:


	/* Get the information about replicates established: */
	foreach class c
		get the probes p that are in the class
		q = 0;
		foreach p in the class c
			count the size of the replicate group for p (set to 1 if no replicates or weighting is not used)
			weights: p => 1/(size of replicate group)    (1/weight = the size of the replicate group for p)
			groups: p => q
		end
		
		assign each grouping in c a unique index. groupnum: p => index, and then store in groupnums: c => groupnum.
		(see code of class-correls)
		
		numgroups: c => number of replicate groups for c (number of groupings in the class) (this is used for diagnostics only)
	end
	
	/* Get the class correls */
	foreach class c
		clear arrays D, W, R.
		effective_size = 0;
		nominal_size = 0;
		foreach member m of c
			add the probe data or probe pvalues for m to array D
			add weights(m) to array W.
			add groupnum(m) to array R.
			effective_size += weights(m)
			nominal_size++;
		end
		if nominal_size is out of the allowable range, go to the next class
		C = classcorrel(D, R, W)
		print C etc.
	end
	
	/* The function classcorrel calculates the average correlation between members of the class; the basis for this
	is essentially the correlation matrix for the data in the class. Replicates are not compared
	to each other. And, if A and B are replicates to be compared to C, the correlations A:C and B:C are given lower weights.
	Note that the Perl version does an optimization to make the correlation calculation faster; this is not shown here. */
	function classcorrel {
		arguments D, R, W
		count = 0; tw = 0; tc = 0;
		foreach item j in the class
			foreach item i in the class starting from j (so we don't do both j vs i and i vs j)
				if i == j, skip it.
				if R(j) == R(i) skip it (i and j are in the same replicate group)
				wi = W(i)
				wj = W(j)
				tw = wi + wj;
				correl = correlation(D(i),D(j))
				tc += tw * correl;
				count+= tw;
			end
		end
		return tc / count
	}
	
	

The following assertions must hold for the above: if there are no replicates, the total weight (the effective size) must equal the class size, and the average correlation must equal the one calculated from the results of the regular correl matrix function.
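
The classcorrel pseudocode might translate to Java roughly as follows. The sketch assumes Pearson correlation (the text above only says "correlation") and takes the pair weight as the sum of the two probe weights, as in the pseudocode. The class and method names are hypothetical, and the Perl optimization is not reproduced.

```java
/** Hypothetical Java rendering of classcorrel. */
class ClassCorrel {

    /** Pearson correlation of two equal-length profiles (an assumed choice of measure). */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int k = 0; k < n; k++) {
            mx += x[k];
            my += y[k];
        }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int k = 0; k < n; k++) {
            double dx = x[k] - mx, dy = y[k] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /**
     * data[i] is the profile of probe i, group[i] its replicate group index,
     * weight[i] its probe weight. Returns the weighted average pairwise
     * correlation, or NaN if no pair from different groups exists.
     */
    static double classCorrel(double[][] data, int[] group, double[] weight) {
        double weightedSum = 0, totalWeight = 0;
        for (int j = 0; j < data.length; j++) {
            for (int i = j + 1; i < data.length; i++) { // each unordered pair once
                if (group[i] == group[j]) {
                    continue; // replicates are not compared to each other
                }
                double pairWeight = weight[i] + weight[j];
                weightedSum += pairWeight * pearson(data[i], data[j]);
                totalWeight += pairWeight;
            }
        }
        return weightedSum / totalWeight;
    }
}
```

When every probe is in its own group with weight 1, all pairs get the same weight and the result is the plain average of the off-diagonal correlations, which is the second assertion stated above.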

For experiment scores

See class-pvals for the Perl implementation.


	/* we start similar to class correl, except we don't need to worry about replicate groups. Assume we already
	have the weights for all the probes in W, probe pvals in P */
	/* not shown: the complexity added by doing quantiles. See perl version */
	foreach class c
		weight = 0;
		sum = 0;
		count = 0;
		foreach p in c
			weight += W(p)
			sum += P(p) * W(p)
			count++;
		end
		score = sum / weight   (weighted mean; weight is the virtual class size, count the nominal class size)
	end
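
	/* A Java sketch of the experiment score for a single class follows. It divides the weighted sum by the
	total weight, which is an assumption; without replicates the total weight equals the probe count, so
	both readings agree. Quantiles are not handled, and the names are hypothetical. */

```java
import java.util.List;
import java.util.Map;

/** Hypothetical experiment score: weighted mean of the probe pvals in a class. */
class ExperimentScore {

    /**
     * classProbes: probe ids in the class; pvals: probe id -> probe pval;
     * weights: probe id -> probe weight (probes absent from the map weigh 1).
     * Every probe in classProbes must have an entry in pvals.
     */
    static double score(List<String> classProbes,
                        Map<String, Double> pvals,
                        Map<String, Double> weights) {
        double totalWeight = 0, sum = 0;
        for (String p : classProbes) {
            double w = weights.getOrDefault(p, 1.0);
            totalWeight += w;
            sum += pvals.get(p) * w;
        }
        return sum / totalWeight; // assumed normalization: the virtual class size
    }
}
```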
	
	
