From a machine learning point of view, classification of gene expression patterns is a very particular task. Typically, training data consists of few samples (small number of experiments) but contains many variables (expression levels measured in each experiment). In this context classical machine learning methods may cause various difficulties. For instance:
In the diagnostic context we moreover focus our attention to the possibility to use our method for additional molecular stratification of patients. Patients presenting a homogenous phenotype may show different gene expression profiles. This is because their common phenotype is caused by different molecular mechanisms. We suggest classification results which are linked to biological aspects to provide a resolved diagnosis and give the corresponding stratification a biological meaning.
Gene Ontology driven classification
We conjecture that the mentioned problems can be tackled by giving the classifier a biologically meaningful structure, i.e., by dividing the classification task into subtasks according to biological criteria. Structuring biological knowledge is one of the central goals of the Gene Ontology database. Biological terms related to molecular functions, biological processes and cellular components are collected into a directed acyclic graph where each node represents a term and child-terms are either members or representatives of their parent-terms. Moreover, genes are attributed to GO-nodes according to their functions, involvement into biological processes and location within the cell. We suggest to use this structure in a classifier as follows.
For each GO-node, one classifier is implemented using a classical machine learning method, providing a probability for each class as classification result. Each of these classifiers is given the same classification task, while input variables are the expression values corresponding to the genes annotated to the classifier's GO-node as well as the classification results of its children. Thus, the classifiers corresponding to the leaf-nodes of the Gene Ontology must be trained first. The overall classification result is provided by the root node's classifier.
In this procedure each classifier bases its decision only on information about the biological function it is attributed to. Therefore, when considering an overall classification result, its rationale can be deduced from the various classifier results. Moreover, the partitioning of the input variables among many classifiers, weakens the mentioned overfitting problem.
Hover over a node description to see its ID, click on it to see detailed results.
Illustration of patient stratification using molecular symptoms. Rows correspond to GO-node based classifiers, columns represent patients with mixed lineage leukemia. Colors in the image encode classifiers results. Green regions represent presence and black regions absence of molecular symptoms. Presence/absence patterns of molecular symptoms provide a molecular patient stratification.
Try the suggested method using our Bioconductor compliant R package
Publications, talks and posters