Preference-based Monte Carlo Tree Search
Final Report Abstract
The main outcome of this project is a family of algorithms that make it possible to apply Monte Carlo tree search (MCTS), a very successful strategy for solving sequential decision problems, in scenarios where numerical feedback is not naturally available or has to be carefully tuned. Instead, we only assume that pairwise comparisons between different lines of action are available (in the general case) or, more specifically, that the available reward signals can be ordered, but without numerical distances between them.

We developed a proof-of-concept implementation that works for the general case but, as expected, was too inefficient for practical purposes, the main problem being the quadratic number of comparisons that have to be performed. To tackle this problem, we developed an algorithm for the case of ordinal preference signals, an important subclass of the problem with many practical applications. For this setting, we worked out an efficient base algorithm, which we subsequently improved. As the computational complexity of this algorithm crucially depends on the number of different ordinal feedback levels, a key idea for the improvement was an on-line algorithm that dynamically bins feedback levels into a set of ordinal buckets, thereby simplifying the decision problem via abstraction. We showed that our ordinal MCTS algorithm obtains competitive results in various games of the General Video Game AI (GVGAI) competition, for which we developed one of the leading systems.

We also worked on employing Monte Carlo tree search in sequential recommendation problems, but developed a slightly different solution for this scenario, in which iterative exploration is limited by the data points available off-line. The key idea of this approach was to first abstract the available preferences into a numerical utility function, which can then be used by a conventional MCTS variant to efficiently select the best sequence, and hence a product, that maximizes the learned utility.

Finally, we transposed our framework for handling ordinal feedback values to the related problem of reinforcement learning, which has recently received increased attention in the context of deep learning. The main result of this line of work was the first algorithm for deep ordinal reinforcement learning, which we showed to be competitive with its numerical counterparts in many cases, even though it uses only the order information of the received feedback signals and is therefore more robust and less sensitive to a careful tuning of a numerical reward signal.
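To make the bucketing idea more concrete, the following Python sketch shows one possible way of binning a stream of feedback values into a fixed number of ordinal levels on-line. It is only an illustrative assumption, not the dynamic quantile approximation developed in the project: the class name OrdinalBucketer, the reservoir-based quantile estimate, and all parameter values are hypothetical.

    import bisect
    import random

    class OrdinalBucketer:
        """Illustrative sketch (not the project's algorithm): approximate the
        quantiles of the feedback seen so far with a bounded reservoir sample
        and map each value to one of `num_buckets` ordinal levels."""

        def __init__(self, num_buckets=5, reservoir_size=256):
            self.num_buckets = num_buckets
            self.reservoir_size = reservoir_size
            self.reservoir = []  # sorted, bounded sample of observed feedback
            self.seen = 0        # total number of feedback values observed

        def observe(self, value):
            """Record a feedback value using standard reservoir sampling."""
            self.seen += 1
            if len(self.reservoir) < self.reservoir_size:
                bisect.insort(self.reservoir, value)
            elif random.random() < self.reservoir_size / self.seen:
                # replace a random element to keep the sample roughly uniform
                self.reservoir.pop(random.randrange(len(self.reservoir)))
                bisect.insort(self.reservoir, value)

        def bucket(self, value):
            """Return the ordinal level (0 .. num_buckets - 1) of the
            approximate quantile that `value` falls into."""
            if not self.reservoir:
                return 0
            rank = bisect.bisect_right(self.reservoir, value) / len(self.reservoir)
            return min(int(rank * self.num_buckets), self.num_buckets - 1)

    # Hypothetical usage inside an MCTS backup: each rollout return is first
    # observed and then reduced to an ordinal level before updating node statistics.
    bucketer = OrdinalBucketer(num_buckets=5)
    for value in [0.1, 3.2, 0.4, 7.9, 2.5]:   # placeholder rollout returns
        bucketer.observe(value)
        level = bucketer.bucket(value)         # ordinal level used by the backup

In this sketch only the relative order of the observed values matters for the resulting levels, which mirrors the abstraction step described above; the concrete quantile-approximation scheme used in the project is described in the CoG-19 publication listed below.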
Publications
- Informed hybrid game tree search for general video game playing. IEEE Transactions on Games, 10(1):78–90, March 2018
  Joppen, T., Moneke, M., Schröder, N., Wirth, C., and Fürnkranz, J.
  (See online at https://doi.org/10.1109/TCIAIG.2017.2722235)
- Monte Carlo tree search without numerical rewards. In From Multiple Criteria Decision Aid to Preference Learning (Proc. of DA2PL-18), Poznan, Poland, 2018
  Joppen, T.
- Preference-based Monte Carlo tree search. In Trollmann, F. and Turhan, A.-Y. (eds.), Proceedings of the 41st German Conference on Artificial Intelligence (KI-18), volume 11117 of Lecture Notes in Computer Science, pp. 327–340, Berlin, Germany, 2018. Springer
  Joppen, T., Wirth, C., and Fürnkranz, J.
  (See online at https://doi.org/10.1007/978-3-030-00111-7_28)
- Ordinal bucketing for game trees using dynamic quantile approximation. In Proceedings of the IEEE Conference on Games (CoG-19), London, United Kingdom, 2019. IEEE
  Joppen, T., Strübig, T., and Fürnkranz, J.
  (See online at https://doi.org/10.1109/CIG.2019.8847965)
- Personalized transaction kernels for recommendation using MCTS. In Benzmüller, C. and Stuckenschmidt, H. (eds.), Proceedings of the 42nd German Conference on AI (KI-19), volume 11793 of Lecture Notes in Computer Science, pp. 338–352, Kassel, Germany, 2019. Springer
  Tavakol, M., Joppen, T., Brefeld, U., and Fürnkranz, J.
  (See online at https://doi.org/10.1007/978-3-030-30179-8_31)
- Deep ordinal reinforcement learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-19), 2019
  Zap, A., Joppen, T., and Fürnkranz, J.
  (See online at https://doi.org/10.1007/978-3-030-46133-1_1)