「Machine Learning for Exploratory Fusion and Visualization of High-throughput Data」

Samuel Kaski
Laboratory of Computer and Information Science
Helsinki University of Technology, Finland

  High-throughput measurement data, and the data banks in which they are stored, have brought a new data analysis problem to biology and medicine: how to infer the relevant effects from the data mass. Each single information source, be it gene expression, protein interaction, or Gene Ontology, contains unknown amounts and types of noise, that is, irrelevant or uninteresting variation. Distinguishing relevant from irrelevant variation is particularly hard in the initial exploratory task of "looking at the data," when the hypotheses are still vague and hence there are no strong models yet to help constrain the exploration. I will discuss how to make sense of data masses with information visualization methods and with machine learning methods designed to bring out relevant clusters and components by fusing several data sources. Supervised mining, or "supervised unsupervised learning," searches for clusters or components that are relevant to, or informative about, classes such as Gene Ontology categories. Methods such as learning metrics, discriminative clustering, and discriminative components follow this principle. Mutual dependency mining separates task- or source-specific variation from variation shared by all sources. For instance, treatment-specific variation is less relevant than shared variation in defining yeast stress response. Methods include local and non-parametric dependent components. I will pick examples from one of the main application areas, analysis of gene expression, where we have combined data from several treatments, from different organisms, or from different measurement techniques, to focus the analyses according to the task.
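
As a rough illustration of the idea behind mutual dependency mining, the sketch below finds the variation shared by two data sources using classical linear canonical correlation analysis (CCA). This is a swapped-in textbook technique, not the local or non-parametric dependent component methods of the talk. The function name `linear_cca`, the regularization parameter `reg`, and the synthetic two-source data are all assumptions made for this example.

```python
import numpy as np

def linear_cca(X, Y, reg=1e-6):
    """Classical linear CCA (illustrative stand-in, not the talk's method).

    X, Y: arrays of shape (n_samples, dx) and (n_samples, dy), paired by row.
    Returns canonical correlations and projection matrices for each source.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariances; a small ridge term (assumed value) keeps them invertible.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten each source with its Cholesky factor: M = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U)     # projections for source X
    Wy = np.linalg.solve(Ly.T, Vt.T)  # projections for source Y
    return s, Wx, Wy

# Synthetic example (hypothetical data): one shared signal z drives both
# sources, and each source adds its own independent, source-specific noise.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 1))
X = z @ np.array([[1.0, -1.0, 0.5, 2.0, 1.5]]) + 0.5 * rng.normal(size=(n, 5))
Y = z @ np.array([[1.0, 0.5, -2.0, 1.0]]) + 0.5 * rng.normal(size=(n, 4))

s, Wx, Wy = linear_cca(X, Y)
print("canonical correlations:", np.round(s, 3))
```

In this toy setting the first canonical correlation should be large and the remaining ones small, reflecting a single shared component. Source-specific variation does not raise the correlations because it is independent across the two sources.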

More information at http://www.cis.hut.fi/projects/mi.