ErmineJ: Requirements
This is out of date!

This document provides only a user-level outline of requirements. It stays away from issues of implementation and system architecture, which will be described in other documents.

Project Overview

ErmineJ (the name is provisional) will be a software tool for gene expression data analysis. It will be implemented in Java and include a graphical user interface. The target platform is Windows, with Unix (Solaris/Linux) and Macintosh compatibility a secondary consideration.

ErmineJ will encompass two main functionalities: Analysis and Visualization/Browsing. The overall goal, however, is to provide a software framework for expanding the functionality. The initial goals are aimed at implementing methods that we currently run in Unix from the command line or via web browser interfaces, implemented in C and Perl.

Analysis section

The analysis aspect of ErmineJ is aimed at the tools we have developed that are not available elsewhere, rather than at reinventing the wheel. However, some simple "redundant" tools will be necessary to make this a really useful tool. These include:
Visualization/Browsing

This will provide a point-and-click interface to simple "block" visualizations such as those provided by matrix2png. It will include a GO browser as well.

Further details (in overview) of the requirements for these two aspects follow. The overall design of the system is still to be determined: will the visualization live in a separate window frame from the analysis, or will the two coexist with a common set of menus, etc.? Prioritization of the requirements is also to be determined.

Analysis

Reading data files

The analysis software must be able to read data files in the simple text format specified here. Parsing of cluster format files must be available as well.

The software must load the annotation files provided here (gene ontology classes) and here (gene annotations for GenBank etc.). The files are found either in a user-configured location or in a default location. Loading of these files should happen 'behind the scenes', but the information in them must be available to the user (via the interface) and to the analysis methods.

Template matching:

Reading template files: the format for these is specified here. Not described there is the 'multi-template file', which is similar except that there is one template per line and an additional column of template names (the first column).

Reading layout and classification files: these files are described here (layouts) and here (classification files); they describe the various categories of samples present.

Reading p-value files: the software must open and read score files provided by the user. These will consist of one column of gene names (typically probe set IDs) and one column of scores. A header line may optionally be present.

Output

The analysis software must provide text output in the same data format used for input, mentioned above (i.e., for data from specific gene classes etc.). The option of exporting cluster format files must be available, so that the output can be used in the Stanford cluster software.
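
As an illustration of the score file format described under "Reading p-value files" above, the following is a minimal sketch of a parser, not a definitive implementation. It assumes the two columns are tab-separated, which this document does not specify, and it treats a first non-blank line whose second field is not numeric as the optional header. The class and method names (ScoreFileSketch, parseScores) are hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of any existing ErmineJ code.
final class ScoreFileSketch {

    /**
     * Parses score-file lines of the form "geneName<TAB>score".
     * Assumption: columns are tab-separated (the requirements do not fix a delimiter).
     * Assumption: if the first non-blank line has a non-numeric second field,
     * it is the optional header and is skipped.
     */
    static Map<String, Double> parseScores(List<String> lines) {
        // LinkedHashMap keeps the genes in file order.
        Map<String, Double> scores = new LinkedHashMap<>();
        boolean firstContentLine = true;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                throw new IllegalArgumentException(
                        "Expected two tab-separated columns: " + raw);
            }
            boolean mayBeHeader = firstContentLine;
            firstContentLine = false;
            try {
                scores.put(fields[0].trim(), Double.parseDouble(fields[1].trim()));
            } catch (NumberFormatException e) {
                if (mayBeHeader) {
                    continue; // optional header line
                }
                throw new IllegalArgumentException("Non-numeric score: " + raw, e);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up example data: a header line followed by two probe sets.
        List<String> example = Arrays.asList(
                "Probe\tPvalue",
                "1000_at\t0.003",
                "1001_at\t0.47");
        System.out.println(parseScores(example));
    }
}
```

Taking a list of lines rather than a file keeps the sketch independent of where the file lives; in practice the lines could come from the user-selected score file.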
(Visualization output issues are described elsewhere.)

P-values output: the p-values or other analysis results are exported in a text file.
Score distribution output: background score histograms are exported in a text file.
Preferences: the settings used in the analysis are saved in a file and read in at the beginning of a session.

Analysis methods

The primary goal will be to implement the "correlation" and "experiment" gene class score methods described in the paper found here. The latter requires the use or generation of gene-wise p-values. Initially we will probably import p-values that the user provides. Later we would like some simple methods for generating p-values, or for using p-values generated by other tools in the system, such as template matching.

For the experiment score, the user will have the option of using the median or another quantile instead of the mean for calculating raw class scores.

Both methods are computationally intensive, as they have to calculate empirical background distributions; the software should provide a quick heuristic method (or even raw scores) so that a quick result can be viewed before deciding to calculate the full distributions.

Display of the results in the visualization, organized by GO classification: ideally we will show a classification tree. The mechanism for the display must be sufficiently flexible to show any classification scheme (including those which are not really trees) and not be restricted to GO.

Database connectivity

In the long term, the software must be capable of getting data from a database, storing the results in a database, and providing output in XML or other database-readable formats (MAGE-ML etc.) as appropriate.

User interface

status bar: a status line is provided to show the state of the program.
progress screens: during lengthy tasks, a progress indicator is shown (perhaps on the status bar).
menu driven: all functions are driven by menus or toolbar buttons.
web links: HTML links are provided to all gene annotations so that a web browser can be invoked automatically to view publicly available details about genes.
tables: in some cases, the most efficient way to display numerical data is in a table. This will be used to show lists of classes ranked by p-value.

Example of how a user session might go
Jotted notes

web links for everything
stubs for database connectivity
display of GO tree as a 'file tree'
plotting gene graphs
output: PostScript, PNG for everything graphical (very important!)
templates: choose a gene; compare to other datasets
mechanism for adding data sets for template matching
mechanism for designing templates

References
Paul Pavlidis.