Weiterentwicklung maschineller Lernmethoden für Sequenzen mit Anwendung zur rechnergestützten Generkennung (Advancing Machine Learning Methods for Sequences with Application to Computational Gene Finding)
Final Report Abstract
In the course of this project, we significantly improved upon the state of the art in sequence learning. Specifically, we provided solutions for the following problems.

We significantly improved the memory and runtime behavior of binary SVMs, making their application to large-scale genomes possible. For the hidden Markov SVM, we reduced the training time to a fraction of that of the previous state of the art. Furthermore, the quality of optimization can now be measured in terms of the duality gap without additional computational cost.

We increased the flexibility of machine learning models and, in turn, gained a better representation of the underlying problem, which ultimately resulted in higher prediction performance. During the project, we developed various extensions and enhancements, e.g. to go beyond simple decomposable loss functions such as the Hamming loss. We also developed efficient algorithms for slack rescaling in structured SVMs and sped up the training of structured SVMs by devising a novel optimization method based on bundle methods.

Complex machine learning models demand a certain amount of training data to achieve high detection performance. We developed new binary SVMs and hidden Markov SVMs that share information between tasks according to a given structure (i.e. a task taxonomy).

Building upon the previous successful approaches POIM and FIRM, we developed an automatic motif reconstruction method that identified human splice site motif factors much more accurately than competing methods. Furthermore, we extended the previous constraint methodology to arbitrary learning machines and feature representations.

We applied our novel methods to a variety of biologically relevant tasks, e.g. de novo gene finding for human and mouse genomes, human splice site detection, and motif extraction for human splice sites. Indeed, most of our work features at least one real-world genomic application.
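To make the loss-function terminology used above concrete, the following minimal sketch contrasts the Hamming loss with the structured hinge loss under margin rescaling and under slack rescaling. It is an illustration under stated assumptions, not the project's implementation: the names `hamming_loss`, `sequence_score`, and `structured_hinge`, the two-state labelling, and the hand-picked toy weights are all hypothetical, and the maximization is done by brute-force enumeration rather than by any of the efficient algorithms developed in the project.

```python
from itertools import product

LABELS = (0, 1)  # hypothetical two-state labelling, e.g. 0 = intergenic, 1 = genic

# Toy weights chosen by hand purely for illustration (assumption, not learned).
EMISSION = {(0, 0): 1.0, (1, 1): 1.0}    # (observation, label) -> weight
TRANSITION = {(0, 0): 0.5, (1, 1): 0.5}  # (previous label, label) -> weight


def hamming_loss(y_true, y_pred):
    """Count the positions where two label sequences differ.

    The loss is a sum of independent per-position terms, i.e. it is decomposable.
    """
    return sum(a != b for a, b in zip(y_true, y_pred))


def sequence_score(x, y):
    """Linear chain score: emission weights plus transition weights."""
    emission = sum(EMISSION.get((obs, lab), 0.0) for obs, lab in zip(x, y))
    transition = sum(TRANSITION.get(pair, 0.0) for pair in zip(y, y[1:]))
    return emission + transition


def structured_hinge(x, y_true, rescaling="margin"):
    """Structured hinge loss, computed by enumerating every label sequence.

    margin rescaling: max_y  delta(y_true, y) + score(x, y) - score(x, y_true)
    slack rescaling:  max_y  delta(y_true, y) * (1 + score(x, y) - score(x, y_true))
    """
    if rescaling not in ("margin", "slack"):
        raise ValueError("rescaling must be 'margin' or 'slack'")
    true_score = sequence_score(x, y_true)
    worst = 0.0  # y = y_true always contributes 0, so the loss is non-negative
    for y in product(LABELS, repeat=len(x)):
        delta = hamming_loss(y_true, y)
        violation = sequence_score(x, y) - true_score
        if rescaling == "margin":
            value = delta + violation
        else:
            value = delta * (1.0 + violation)
        worst = max(worst, value)
    return worst


x = (0, 1, 1, 0)       # toy observation sequence
y_true = (0, 1, 1, 0)  # toy ground-truth labelling
margin_loss = structured_hinge(x, y_true, "margin")
slack_loss = structured_hinge(x, y_true, "slack")
print("margin rescaling:", margin_loss)
print("slack rescaling: ", slack_loss)
```

The enumeration is exponential in the sequence length and only serves to show the two objectives side by side. Because the Hamming loss decomposes over positions, the margin-rescaled maximization can be folded into a Viterbi-style dynamic program, whereas under slack rescaling the loss multiplies the score difference, so the objective no longer splits into per-position terms; this is why dedicated algorithms are needed in that case.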
With our Oqtans online web service, we provide a convenient and sophisticated tool for quantitative transcriptome analysis to researchers around the globe. Furthermore, the SHOGUN machine learning toolbox attracted much interest and, thanks to the help of many volunteers, was expanded significantly. Moreover, we are committed to Open Science: the developed methods and source code are publicly available to the scientific community via
• https://git.ratschlab.org
• https://github.com/nicococo
• https://github.com/shogun-toolbox
With the rise of deep learning techniques and their recent success in computational biology tasks, we would like to extend our newly derived models to learn feature representations automatically, using deep neural networks to provide the input features. Naturally, these more expressive models will pose a challenge to our explanation methods, which need to be adapted accordingly. Another major challenge is the massively increasing amount and diversity of data available for genomic tasks. One possible extension of our evidence-driven mtim method, which segments gene transcripts based on RNA-seq data, would be to support multiple RNA-seq observations from, e.g., various tissues.
Publications
- “mGene: Accurate SVM-based gene finding with an application to nematode genomes,” Genome Research, vol. 19, no. 11, pp. 2133–2143, 2009
G. Schweikert, A. Zien, G. Zeller, J. Behr, C. Dieterich, C. S. Ong, P. Philips, F. De Bona, L. Hartmann, A. Bohlen, N. Krüger, S. Sonnenburg, and G. Rätsch
- “Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization,” Journal of Machine Learning Research, vol. 10, M. Sebag, Ed., pp. 2157–2192, Oct. 2009
V. Franc and S. Sonnenburg
- “Leveraging sequence classification by taxonomy-based multitask learning,” in Research in Computational Molecular Biology (RECOMB), vol. 6044 LNBI, 2010, pp. 522–534
C. Widmer, J. Leiva, Y. Altun, and G. Rätsch
- “The SHOGUN Machine Learning Toolbox,” Journal of Machine Learning Research, vol. 11, pp. 1799–1802, 2010
S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. De Bona, A. Binder, C. Gehl, and V. Franc
- “lp-Norm Multiple Kernel Learning,” Journal of Machine Learning Research, vol. 12, pp. 953–997, 2011
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien
- “Multiple reference genomes and transcriptomes for Arabidopsis thaliana,” Nature, vol. 477, no. 7365, pp. 419–423, 2011
X. Gan, O. Stegle, J. Behr, J. G. Steffen, P. Drewe, K. L. Hildebrand, R. Lyngsoe, S. J. Schultheiss, E. J. Osborne, V. T. Sreedharan, A. Kahles, R. Bohnert, G. Jean, P. Derwent, P. Kersey, E. J. Belfield, N. P. Harberd, E. Kemen, C. Toomajian, P. X. Kover, R. M. Clark, G. Rätsch, and R. Mott
- “Efficient Training of Graph-Regularized Multitask SVMs,” in ECML, 2012, pp. 1–16
C. Widmer, M. Kloft, N. Görnitz, and G. Rätsch
- “SVM2Motif — Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor,” Tech. Rep., 2012, pp. 1–23
M. M.-C. Vidovic, N. Görnitz, K.-R. Müller, G. Rätsch, and M. Kloft
- “MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples,” Bioinformatics, vol. 29, no. 20, pp. 2529–2538, 2013
J. Behr, A. Kahles, Y. Zhong, V. T. Sreedharan, P. Drewe, and G. Rätsch
(See online at https://doi.org/10.1093/bioinformatics/btt442)
- “Toward Supervised Anomaly Detection,” Journal of Artificial Intelligence Research (JAIR), vol. 46, pp. 235–262, 2013
N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld
- “Oqtans: the RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis,” Bioinformatics, p. 731, 2014
V. T. Sreedharan, S. J. Schultheiss, G. Jean, A. Kahles, R. Bohnert, P. Drewe, P. Mudrakarta, N. Görnitz, G. Zeller, and G. Rätsch
(See online at https://doi.org/10.1093/bioinformatics/btt731)
- “Regularization-based multitask learning: With applications to genome biology and biological imaging,” KI - Künstliche Intelligenz, vol. 28, pp. 29–33, 2014
C. Widmer, M. Kloft, X. Lou, and G. Rätsch
(See online at https://doi.org/10.1007/s13218-013-0283-y)
- “Hidden Markov Anomaly Detection,” in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015
N. Görnitz, M. Braun, and M. Kloft
- “SVM2Motif — Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor,” PLOS ONE, pp. 1–23, 2015
M. M.-C. Vidovic, N. Görnitz, K.-R. Müller, G. Rätsch, and M. Kloft
(See online at https://doi.org/10.1371/journal.pone.0144782)
- “Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies,” Scientific Reports, vol. 6, p. 36671, 2016
B. Mieth, M. Kloft, J. A. Rodriguez, S. Sonnenburg, R. Vobruba, C. Morcillo-Suárez, X. Farré, U. M. Marigorta, E. Fehr, T. Dickhaus, G. Blanchard, D. Schunk, A. Navarro, and K.-R. Müller
(See online at https://doi.org/10.1038/srep36671)
- “Feature Importance Measure for Non-linear Learning Algorithms,” in NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems, Nov. 2016 (Best Paper Award)
M. M.-C. Vidovic, N. Görnitz, K.-R. Müller, and M. Kloft
- “ML2Motif—Reliable extraction of discriminative sequence motifs from learning machines,” PLOS ONE, vol. 12, no. 3, e0174392, 2017
M. M.-C. Vidovic, M. Kloft, K.-R. Müller, and N. Görnitz
(See online at https://doi.org/10.1371/journal.pone.0174392)
- “Accurate maximum-margin training for parsing with context-free grammars,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 28, no. 1, pp. 44–56, 2017
A. Bauer, M. Braun, and K.-R. Müller
(See online at https://doi.org/10.1109/TNNLS.2015.2497149)