Computational Diagnostics Group @ Max Planck Institute for Molecular Genetics, Berlin

Molecular Symptoms

Claudio Lottaz

Introduction

From a machine learning point of view, classification of gene expression patterns is a very particular task. Typically, training data consists of few samples (small number of experiments) but contains many variables (expression levels measured in each experiment). In this context classical machine learning methods may cause various difficulties. For instance:

Statistical models, particularly those with many parameters, may overfit the training data: they adapt to noise in the data rather than learning the phenomenon of interest.
Moreover, common machine learning methods do not provide an intuitive and biologically meaningful explanation of their results, although such explanations help users to trust a computational analysis.

In the Molecular Symptoms project, we try to cope with these two problems in the context of medical diagnosis. We suggest an algorithm based on the Gene Ontology, which we call StAM (Structured Analysis of Microarrays).

In the diagnostic context, we also focus on using our method for additional molecular stratification of patients. Patients presenting a homogeneous phenotype may nevertheless show different gene expression profiles, because their common phenotype may be caused by different molecular mechanisms. We propose to link classification results to biological aspects, which provides a more finely resolved diagnosis and gives the corresponding stratification a biological meaning.

Gene Ontology driven classification

We conjecture that the problems mentioned above can be tackled by giving the classifier a biologically meaningful structure, i.e., by dividing the classification task into subtasks according to biological criteria. Structuring biological knowledge is one of the central goals of the Gene Ontology (GO) database. Biological terms related to molecular functions, biological processes and cellular components are collected in a directed acyclic graph in which each node represents a term and child-terms are either members or representatives of their parent-terms. Moreover, genes are attributed to GO-nodes according to their functions, their involvement in biological processes and their location within the cell. We suggest using this structure in a classifier as follows.
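The structure described above can be sketched in a few lines of Python. This is only an illustration, not the data model of our R package: the term names, gene names and the helper `leaves_first` are hypothetical, and the toy graph stands in for a real Gene Ontology excerpt. It shows a term-to-children mapping, a term-to-genes annotation, and an ordering in which every child-term precedes its parent-terms.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy fragment of a GO-like DAG: term -> list of child-terms.
children = {
    "root": ["metabolism", "signaling"],
    "metabolism": ["lipid metabolism"],
    "signaling": ["lipid metabolism"],  # a term may have several parents
    "lipid metabolism": [],
}

# Hypothetical annotation: term -> genes directly attributed to that term.
annotation = {
    "root": [],
    "metabolism": ["geneA"],
    "signaling": ["geneB", "geneC"],
    "lipid metabolism": ["geneD"],
}


def leaves_first(children):
    """Return all terms ordered so that every child precedes its parents."""
    # TopologicalSorter expects node -> predecessors; here the child-terms
    # play the role of predecessors, so leaves are emitted first.
    return list(TopologicalSorter(children).static_order())


order = leaves_first(children)
print(order)
```

Such a leaves-first ordering is convenient whenever a computation at a node depends on results from its children.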

For each GO-node, one classifier is implemented using a classical machine learning method, providing a probability for each class as classification result. Each of these classifiers is given the same classification task, while input variables are the expression values corresponding to the genes annotated to the classifier's GO-node as well as the classification results of its children. Thus, the classifiers corresponding to the leaf-nodes of the Gene Ontology must be trained first. The overall classification result is provided by the root node's classifier.
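The following Python sketch illustrates this bottom-up scheme on toy data. It makes several assumptions that are not part of the description above: a simple nearest-centroid learner with softmax probabilities stands in for the "classical machine learning method", child results are computed on the training samples themselves (a real implementation would probably guard against the resulting optimism), and all term names, gene names, class labels and function names are hypothetical.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class CentroidClassifier:
    """Stand-in base learner: class centroids, softmax over negative squared distances."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.centroids = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        d = [sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
             for c in self.classes]
        m = min(d)  # shift distances for numerical stability
        w = [math.exp(-(di - m)) for di in d]
        return [wi / sum(w) for wi in w]


def features(node, sample, children, annotation, node_probs):
    """Expression values of the node's genes plus its children's class probabilities."""
    x = [sample[g] for g in annotation[node]]
    for child in children[node]:
        x.extend(node_probs[child])
    return x


def train_structured(children, annotation, samples, labels):
    models = {}
    probs = [{} for _ in samples]  # per sample: node -> class probabilities
    # Children act as predecessors, so leaf-nodes are trained first.
    for node in TopologicalSorter(children).static_order():
        X = [features(node, s, children, annotation, p)
             for s, p in zip(samples, probs)]
        models[node] = CentroidClassifier().fit(X, labels)
        for x, p in zip(X, probs):
            p[node] = models[node].predict_proba(x)
    return models


def classify(models, children, annotation, sample):
    """Return node -> class probabilities; the root entry is the overall result."""
    p = {}
    for node in TopologicalSorter(children).static_order():
        x = features(node, sample, children, annotation, p)
        p[node] = models[node].predict_proba(x)
    return p


# Hypothetical toy graph, annotation and expression data.
children = {"root": ["metabolism", "signaling"], "metabolism": ["lipid metabolism"],
            "signaling": ["lipid metabolism"], "lipid metabolism": []}
annotation = {"root": [], "metabolism": ["geneA"],
              "signaling": ["geneB", "geneC"], "lipid metabolism": ["geneD"]}
samples = [
    {"geneA": 1.0, "geneB": 0.2, "geneC": 0.1, "geneD": 2.0},
    {"geneA": 1.2, "geneB": 0.1, "geneC": 0.3, "geneD": 2.2},
    {"geneA": -1.0, "geneB": 0.0, "geneC": 0.2, "geneD": -2.0},
    {"geneA": -1.1, "geneB": 0.3, "geneC": 0.1, "geneD": -1.8},
]
labels = ["A", "A", "B", "B"]

models = train_structured(children, annotation, samples, labels)
result_a = classify(models, children, annotation,
                    {"geneA": 0.9, "geneB": 0.2, "geneC": 0.2, "geneD": 1.9})
result_b = classify(models, children, annotation,
                    {"geneA": -0.9, "geneB": 0.2, "geneC": 0.2, "geneD": -1.9})
print(result_a["root"], result_b["root"])
```

Because `classify` returns the probabilities of every node and not only those of the root, the intermediate results remain available for inspection.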

In this procedure, each classifier bases its decision only on information about the biological function it is attributed to. Therefore, the rationale behind an overall classification result can be deduced from the results of the individual classifiers. Moreover, partitioning the input variables among many classifiers mitigates the overfitting problem mentioned above.

Illustration of patient stratification using molecular symptoms. Rows correspond to GO-node based classifiers, columns represent patients with mixed lineage leukemia. Colors in the image encode classifier results: green regions represent presence and black regions absence of molecular symptoms. The presence/absence patterns of molecular symptoms provide a molecular patient stratification.
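The idea of reading a stratification off such presence/absence patterns can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to produce the figure: the patient and node names, the probabilities, the threshold of 0.5 and both helper functions are hypothetical.

```python
from collections import defaultdict


def presence_patterns(node_probs, nodes, threshold=0.5):
    """Map each patient to a tuple of booleans, one per node (True = symptom present)."""
    return {
        patient: tuple(probs[n] >= threshold for n in nodes)
        for patient, probs in node_probs.items()
    }


def stratify(patterns):
    """Group patients that share an identical presence/absence pattern."""
    groups = defaultdict(list)
    for patient, pattern in patterns.items():
        groups[pattern].append(patient)
    return dict(groups)


# Hypothetical classifier results: patient -> node -> probability of the symptom.
nodes = ["node1", "node2", "node3"]
node_probs = {
    "patient1": {"node1": 0.91, "node2": 0.12, "node3": 0.80},
    "patient2": {"node1": 0.85, "node2": 0.30, "node3": 0.75},
    "patient3": {"node1": 0.10, "node2": 0.95, "node3": 0.70},
}

strata = stratify(presence_patterns(node_probs, nodes))
for pattern, patients in strata.items():
    print(pattern, patients)
```

Grouping by exact pattern is the crudest possible choice; with many nodes, clustering similar patterns would likely be more useful.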

Software

Try the suggested method using stam, our Bioconductor-compliant R package.

Publications, talks and posters

Molecular Decomposition of Complex Clinical Phenotypes using Biologically Structured Microarray Analysis
Claudio Lottaz and Rainer Spang
Bioinformatics, 2005, 21(9):1971-1978
[ advance access published on January 27, 2005 ]

stam - a Bioconductor compliant R package for structured analysis of microarray data
Lottaz C, Spang R
BMC Bioinformatics. 2005 Aug 25;6(1):211
[ PMID: 16122395 | doi:10.1186/1471-2105-6-211 ]

Structured Analysis of Microarrays - User's Guide to the Bioconductor package StAM
Claudio Lottaz and Rainer Spang
CompDiag Technical Report Nr. 2004/03 Nov. 2004
[ pdf ]

Decomposing Complex Clinical Phenotypes using Biologically Structured Microarray Analysis
Claudio Lottaz
Workshop on Statistics in Functional Genomics, Ascona, Switzerland, 2004 Jun 27 - Jul 2.
[ pps | Workshop website ]

Gene Expression Based Tumor Classification using Biologically Informed Models
Rainer Spang and Claudio Lottaz
54th Session of the International Statistical Institute, 2003 Aug 13-20, Berlin, Germany.
[ pps | conference website ]

Gene Ontology Driven Classification of Gene Expression Patterns
Claudio Lottaz, Stefan Bentink and Rainer Spang
European Conference on Computational Biology, October 6-9, 2002, Saarbrücken, Germany
[ abstract: pdf | poster: pdf ]
