Computational Diagnostics Group (compdiag), MPI for Molecular Genetics


Molecular Symptoms

Claudio Lottaz


From a machine learning point of view, classification of gene expression patterns is a very particular task. Typically, the training data consists of few samples (a small number of experiments) but contains many variables (the expression levels measured in each experiment). In this setting, classical machine learning methods run into various difficulties. For instance:

  • Statistical models, particularly those with many parameters, may overfit the training data: they adapt to noise in the data rather than learning the desired phenomenon.
  • Moreover, common machine learning methods do not provide an intuitive and biologically meaningful explanation of their results. However, such explanations help users to trust a computational analysis.
In the Molecular Symptoms project, we try to cope with these two problems in the context of medical diagnosis. We suggest an algorithm called StAM (Structured Analysis of Microarrays), which is based on the Gene Ontology.

In the diagnostic context, we also focus on using our method for additional molecular stratification of patients. Patients presenting a homogeneous phenotype may show different gene expression profiles because their common phenotype is caused by different molecular mechanisms. We suggest linking classification results to biological aspects in order to provide a more finely resolved diagnosis and to give the corresponding stratification a biological meaning.

Gene Ontology driven classification

We conjecture that the problems mentioned above can be tackled by giving the classifier a biologically meaningful structure, i.e., by dividing the classification task into subtasks according to biological criteria. Structuring biological knowledge is one of the central goals of the Gene Ontology database. Biological terms related to molecular functions, biological processes and cellular components are collected into a directed acyclic graph where each node represents a term and child-terms are either members or representatives of their parent-terms. Moreover, genes are attributed to GO-nodes according to their functions, involvement in biological processes and location within the cell. We suggest using this structure in a classifier as follows.

For each GO-node, one classifier is implemented using a classical machine learning method, providing a probability for each class as classification result. Each of these classifiers is given the same classification task, while input variables are the expression values corresponding to the genes annotated to the classifier's GO-node as well as the classification results of its children. Thus, the classifiers corresponding to the leaf-nodes of the Gene Ontology must be trained first. The overall classification result is provided by the root node's classifier.
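The bottom-up training order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the StAM implementation: the four-node toy graph, the gene names, the expression values and the nearest-centroid stand-in classifier are all hypothetical, and child results are reused in-sample for simplicity.

```python
import math
from graphlib import TopologicalSorter  # Python 3.9+


def fit_centroids(X, y):
    """Stand-in classifier: mean feature vector per class (0 and 1)."""
    centroids = {}
    for cls in (0, 1):
        rows = [x for x, label in zip(X, y) if label == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def prob_class1(centroids, x):
    """Turn squared distances to both centroids into a probability for class 1."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, centroids[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, centroids[1]))
    z = max(-50.0, min(50.0, d1 - d0))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(z))


def train_structured(children, genes, expr, labels):
    """Train one classifier per node, children before parents."""
    samples = sorted(expr)
    y = [labels[s] for s in samples]
    models, probs = {}, {}
    # Passing node -> children makes static_order() yield children first.
    for node in TopologicalSorter(children).static_order():
        X = [
            [expr[s][g] for g in genes.get(node, [])]          # annotated genes
            + [probs[c][s] for c in children.get(node, [])]    # child results
            for s in samples
        ]
        models[node] = fit_centroids(X, y)
        probs[node] = {s: prob_class1(models[node], x) for s, x in zip(samples, X)}
    return models, probs


# Hypothetical toy ontology (a DAG: node "C" has two parents) and data.
children = {"root": ["A", "B"], "A": ["C"], "B": ["C"], "C": []}
genes = {"A": ["g1"], "B": ["g2"], "C": ["g3", "g4"]}
expr = {
    "p1": {"g1": 2.0, "g2": 0.1, "g3": 1.5, "g4": 1.2},
    "p2": {"g1": 1.8, "g2": 0.3, "g3": 1.7, "g4": 1.0},
    "p3": {"g1": 0.2, "g2": 0.2, "g3": 0.1, "g4": 0.3},
    "p4": {"g1": 0.1, "g2": 0.4, "g3": 0.2, "g4": 0.1},
}
labels = {"p1": 1, "p2": 1, "p3": 0, "p4": 0}

models, probs = train_structured(children, genes, expr, labels)
print({s: round(p, 2) for s, p in probs["root"].items()})
```

The topological ordering guarantees that every child's output exists before its parent is trained, which is the only structural requirement the description imposes. Any classifier that returns class probabilities could replace the centroid stand-in.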

In this procedure, each classifier bases its decision only on information about the biological function it is attributed to. Therefore, when considering an overall classification result, its rationale can be deduced from the results of the individual classifiers. Moreover, partitioning the input variables among many classifiers mitigates the overfitting problem mentioned above.
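One possible way to read a rationale off the per-node results is to walk down from the root and keep only the nodes whose classifiers also support the call. The sketch below assumes a single patient, a hypothetical toy graph with made-up scores, and an arbitrary threshold of 0.5; it is not how the StAM package reports its results.

```python
def supporting_nodes(children, scores, root, threshold=0.5):
    """Depth-first walk from the root through nodes scoring at least `threshold`."""
    found, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        # Skip nodes already visited (DAG: several parents) or below threshold.
        if node in seen or scores[node] < threshold:
            continue
        seen.add(node)
        found.append(node)
        stack.extend(reversed(children.get(node, [])))
    return found


# Hypothetical per-node probabilities for one patient.
children = {"root": ["A", "B"], "A": ["C"], "B": ["C", "D"]}
scores = {"root": 0.8, "A": 0.9, "B": 0.3, "C": 0.7, "D": 0.95}

path = supporting_nodes(children, scores, "root")
print(path)
```

In this toy case node "D" scores highly but is not reported, because its only parent "B" does not support the call. Whether such nodes should be reported is a design choice left open here.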


Illustration of patient stratification using molecular symptoms. Rows correspond to GO-node based classifiers; columns represent patients with mixed lineage leukemia. Colors in the image encode classifier results: green regions represent presence and black regions absence of molecular symptoms. Presence/absence patterns of molecular symptoms provide a molecular patient stratification.
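The step from classifier results to a stratification can be illustrated with a small Python sketch. It assumes, purely for illustration, that a molecular symptom counts as present when a node's probability is at least 0.5 and that patients sharing the same presence/absence pattern form one stratum. The node names and values are hypothetical.

```python
from collections import defaultdict


def stratify(node_probs, threshold=0.5):
    """Group patients by their presence/absence pattern across nodes."""
    nodes = sorted(node_probs)
    patients = sorted({p for scores in node_probs.values() for p in scores})
    groups = defaultdict(list)
    for patient in patients:
        # One boolean per node: True means the symptom is called present.
        pattern = tuple(node_probs[n][patient] >= threshold for n in nodes)
        groups[pattern].append(patient)
    return nodes, dict(groups)


# Hypothetical classifier outputs: rows are nodes, columns are patients.
node_probs = {
    "node_A": {"p1": 0.9, "p2": 0.8, "p3": 0.2, "p4": 0.85},
    "node_B": {"p1": 0.1, "p2": 0.2, "p3": 0.9, "p4": 0.15},
}

nodes, groups = stratify(node_probs)
for pattern, members in groups.items():
    present = [n for n, flag in zip(nodes, pattern) if flag]
    print(members, "show symptoms at", present)
```

Grouping by identical patterns is the simplest choice; with many nodes, a clustering of the patterns would be a natural alternative.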


Try the suggested method using our Bioconductor-compliant R package.

Publications, talks and posters

  • Molecular Decomposition of Complex Clinical Phenotypes using Biologically Structured Microarray Analysis
    Claudio Lottaz and Rainer Spang
    Bioinformatics, 2005, 21(9):1971-1978
    [ advance access published on January 27, 2005 ]
  • stam - a Bioconductor compliant R package for structured analysis of microarray data
    Lottaz C, Spang R
    BMC Bioinformatics. 2005 Aug 25;6(1):211
    [ PMID: 16122395 | doi:10.1186/1471-2105-6-211 ]
  • Structured Analysis of Microarrays - User's Guide to the Bioconductor package StAM
    Claudio Lottaz and Rainer Spang
    CompDiag Technical Report Nr. 2004/03 Nov. 2004
  • Decomposing Complex Clinical Phenotypes using Biologically Structured Microarray Analysis
    Claudio Lottaz
    Workshop on Statistics in Functional Genomics, Ascona, Switzerland, 2004 Jun 27 - Jul 2.
  • Gene Expression Based Tumor Classification using Biologically Informed Models
    Rainer Spang and Claudio Lottaz
    54th Session of the International Statistical Institute, 2003 Aug 13-20, Berlin, Germany.
  • Gene Ontology Driven Classification of Gene Expression Patterns
    Claudio Lottaz, Stefan Bentink and Rainer Spang
    European Conference on Computational Biology, October 6-9, 2002, Saarbrücken, Germany
