「Informatics Innovations in Glycogenomics」

Hiroshi Mamitsuka
Pathway Engineering
Bioinformatics Center
Institute for Chemical Research
Kyoto University

  Until quite recently, informatics research on glycobiology was slow in comparison to DNA and proteins, largely due to difficulties in the biological analysis of glycan structures. This situation has been changed, overlapping with the five years of the COE project. The major informatics tools for glycobiology, which have been developed in these five years, can be summarized into the following three types: 1) Databases, 2) Tree matching tools and 3) Mining tools from glycan databases. For example, KEGG (Kyoto Encyclopedia of Genes and Genomes) now has more than 10,000 glycans in part as KEGG Glycan, and is equipped with a tool for finding a similar (or the same) glycan structure to the input query using a very sophisticated user-interface. This tool is based on a dynamic-programming procedure for globally or locally matching given two labeled unordered trees. With these advancements on databases and accompanying tools, our group has developed a variety of data mining methods for glycan structures, based on probabilistic modeling. In this short report, focusing on the third of the above three types, we briefly describe the models and learning algorithms we have developed in these five years:
  The most well-known probabilistic model for mining strings is a hidden Markov model (HMM), which has been widely used in a lot of real applications including speech recognition, natural language processing, handwritten letter recognition and bioinformatics. HMM, which is for strings, was already extended to a model for labeled trees, known as HTMM (or hidden tree Markov model) in another application. This extension is straight-forward from the left-to-right dependency in strings to the parent-to-child dependency to labeled trees. However, as is shown by that linkages for a monosaccharide are ordered by carbon numbers, HTMM is not enough for glycans or labeled ordered trees. In light of this nature of glycans, we have extended HTMM to a model, which we call PSTMM (or probabilistic sibling-dependent tree Markov model), in which not only parent-to-child dependencies but also sibling dependencies are considered. Furthermore, we have developed a new algorithm for estimating probability parameters of this model from a given set of labeled ordered trees, which is extended from a dynamic programming procedure employed by the so-called Baum-Welch algorithm for training HMMs. A noteworthy feature of our model and learning algorithm is that they achieved a higher predictive performance than that of the methods described above such as HTMM, keeping the computational complexity of a practical bound. In addition, our method significantly reduced the computational complexity of a standard algorithm of estimating parameters of a Bayesian probabilistic network, a more general probabilistic model. However, PSTMM still has a rather high time-complexity, though it is a practical bound. Thus we further developed new probabilistic models from PSTMM in two different ways. One is that we restrict the state transitions of PSTMM, like Profile HMM which aligns multiple sequences, to obtain a new model which we call Profile PSTMM, by which we can align labeled ordered trees with a practically high speed. The other approach is that we defined a new Markov model, which we call OTMM (or ordered tree Markov model), in which each state is dependent upon its elder sibling except the state for the eldest sibling which depends on its parent. As well as PSTMM, we further developed a new algorithm for estimating probability parameters of OTMM, based on a dynamic programming manner again, which can reduce the computational complexity of PSTMM significantly, and we experimentally showed that this model keeps the expressive power of PSTMM at almost the same level using synthetic as well as real datasets. We finally obtained some empirical findings by running OTMM on KEGG-Glycan, part of which was totally consistent with current knowledge in glycobiology.

References on this work in 2003-2007
1. A New Efficient Probabilistic Model for Mining Labeled Ordered Trees, Hashimoto, K., Aoki-Kinoshita, K. F., Ueda, N., Kanehisa, M. and Mamitsuka, H., Proceedings of the Twelfth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (KDD 2006) , 177-186, Philadelphia, PA, USA, August 2006, ACM Press. Acceptance rate: 11% (50 long papers accepted out of 457)
2. ProfilePSTMM: Capturing Tree-structure Motifs in Carbohydrate Sugar Chains, Aoki, K. F., Ueda, N., Mamitsuka, H. and Kanehisa, M. Proceedings of the Fourteenth International Conference on Intelligent Systems for Molecular Biology (ISMB 2006) , (Bioinformatics, 22 (14), e25-e34), Fortaleza, Brazil, August, 2006, Oxford University Press. Acceptance rate: 16.6% (67 papers accepted out of 404)
3. Probabilistic Model for Mining Labeled Ordered Trees: Capturing Patterns in Carbohydrate Sugar Chains. Ueda, N., Aoki-Kinoshita, K. F., Yamaguchi, A., Akutsu, T. and Mamitsuka, H. IEEE Transactions on Knowledge and Data Engineering , 17 (8), 1051-1064, 2005.
4. A Score Matrix to Reveal the Hidden Links in Glycans. Aoki, K. F., Mamitsuka, H.,Akutsu, T. and Kanehisa, M. Bioinformatics , 21 (8), 1457-1463, 2005.
5. Application of a New Probabilistic Model for Recognizing Complex Patterns in Glycans. Aoki, K. F., Ueda, N., Yamaguchi, A., Kanehisa, M., Akutsu, T. and Mamitsuka, H. Proceedings of the Twelfth International Conference on Intelligent Systems for Molecular Biology (ISMB/ECCB 2004) , (Bioinformatics, 20, Supplement 1, i6-i14), Glasgow, UK, August, 2004, Oxford University Press. Acceptance rate: 13.4% (67 papers accepted out of around 500)
6. KCaM (KEGG Carbohydrate Matcher): A Software Tool for Analyzing the Structures of Carbohydrate Sugar Chains. Aoki, K. F., Yamaguchi, A., Ueda, N., Akutsu, T., Mamitsuka, H., Goto, S. and Kanehisa, M. Nucleic Acids Research , 32, W267-W272, 2004.
7. Managing and Analyzing Carbohydrate Data. Aoki, K. F., Ueda, N., Yamaguchi, A., Akutsu, T., Kanehisa, M. and Mamitsuka, H. ACM SIGMOD Record , 33 (2), 33-38, 2004.
8. A General Probabilistic Framework for Mining Labeled Ordered Trees. Ueda, N., Aoki, K. F. and Mamitsuka, H. Proceedings of the Fourth SIAM International Conference on Data Mining (SDM 2004) , 357-368, Orlando, FL, April, 2004, SIAM. Acceptance rate: 14.3% (23 full papers accepted out of 161)
9. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Aoki, K. F., Yamaguchi, A., Okuno, Y., Akutsu, T., Ueda, N., Kanehisa, M. and Mamitsuka, H. Proceedings of the Fourteenth International Conference on Genome Informatics (GIW 2003, Genome Informatics , 14), 134-143, Yokohama, Japan, December, 2003, Universal Academy Press.