High-throughput measurement data, and the data banks in which they are
stored, have brought a new data analysis problem to biology and
medicine: how to infer the relevant effects from the mass of data. Each
single information source, be it gene expression, protein interaction,
or Gene Ontology, contains unknown amounts and types of noise, that
is, irrelevant or uninteresting variation. Distinguishing relevant
from irrelevant variation is particularly hard in the initial
exploratory phase of "looking at the data," when the hypotheses
are still vague and hence there are no strong models yet to help
constrain the exploration. I will discuss how to make sense of data
masses with information visualization methods and with machine learning
methods designed to bring out relevant clusters and components by
fusing several data sources. Supervised mining, or "supervised
unsupervised learning," searches for clusters or components that are
relevant to, or informative of, classes such as Gene Ontology
categories. Methods such as learning
metrics, discriminative clustering, and discriminative components
follow this principle. Mutual dependency mining separates task- or
source-specific variation from variation shared by all sources. For
instance, treatment-specific variation is less relevant in defining the
yeast stress response than variation shared across treatments.
Methods include local and non-parametric
dependent components. I will pick examples from one of the main
application areas, analysis of gene expression, where we have
combined data from several treatments or different organisms, or from
different measurement techniques, to focus the analyses according to
the task.
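
As a rough illustration of separating shared from source-specific
variation, the sketch below applies classical linear canonical
correlation analysis (CCA) to synthetic data. CCA is swapped in here as
a simple stand-in; it is not the local or non-parametric
dependent-components method named above. The function name
`canonical_correlations`, the `reg` parameter, and the simulated
two-source data are all assumptions made for this example.

```python
import numpy as np


def canonical_correlations(X, Y, reg=1e-8):
    """Canonical correlations between two row-aligned data matrices.

    X and Y hold the same samples (rows) measured in two sources
    (for example, two treatments). `reg` is a small ridge term,
    an assumption added here for numerical stability.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Within-source and between-source covariance estimates.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Whiten each source with its Cholesky factor:
    # M = Lx^{-1} Cxy Ly^{-T}, whose singular values are the
    # canonical correlations.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    return np.linalg.svd(M, compute_uv=False)


# Synthetic data (hypothetical): 500 "genes", two sources with three
# measurements each. One latent factor is shared by both sources;
# each source also has its own independent, source-specific factor.
rng = np.random.default_rng(0)
n = 500
shared = rng.normal(size=(n, 1))
specific_x = rng.normal(size=(n, 1))
specific_y = rng.normal(size=(n, 1))

X = (shared @ np.array([[1.0, 0.5, -1.0]])
     + specific_x @ np.array([[1.0, -1.0, 0.5]])
     + 0.1 * rng.normal(size=(n, 3)))
Y = (shared @ np.array([[0.8, -1.0, 0.3]])
     + specific_y @ np.array([[0.5, 1.0, -1.0]])
     + 0.1 * rng.normal(size=(n, 3)))

rho = canonical_correlations(X, Y)
print(rho)
```

With this setup, the first canonical correlation should be close to one
(the shared factor), while the remaining ones should be small, since
the source-specific factors are independent of each other. Linear CCA
only captures linear dependencies, which is one motivation for
non-parametric alternatives.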
More information at
http://www.cis.hut.fi/projects/mi.