Presentation 2014-11-17
An Online Policy Gradient Algorithm for Continuous State and Action Markov Decision Processes with Bandit Feedback
Yao MA, Masashi SUGIYAMA
Abstract(in English) We consider the online Markov decision process (MDP) learning problem, in which the goal is to learn a time-dependent decision-making policy for an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) over T steps under a certain concavity assumption, and O(log T) under a strong concavity assumption, with bandit feedback. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.
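The sketch below is only meant to illustrate the kind of update an online policy gradient method with bandit feedback performs; it is not the paper's exact algorithm. It assumes a linear-Gaussian policy pi_theta(a|s) = N(theta^T s, sigma^2), a REINFORCE-style gradient estimate built from the single trajectory observed in each round (the bandit feedback), a 1/√t step size (matching the O(√T) regret regime mentioned in the abstract), and projection onto a bounded parameter set; all names and parameters are illustrative assumptions.

    # Illustrative sketch (not the paper's exact algorithm): one online policy
    # gradient step per round, using only the scalar rewards observed along the
    # trajectory generated by the current policy (bandit feedback).
    import numpy as np

    def log_policy_grad(theta, s, a, sigma=0.5):
        # Gradient of log N(a; theta^T s, sigma^2) with respect to theta.
        return (a - theta @ s) / (sigma ** 2) * s

    def online_policy_gradient(env_rounds, dim, eta0=0.1, radius=1.0, sigma=0.5):
        # env_rounds yields, for each round t, the rollout generated by the
        # current policy as a list of (state, action, reward) tuples.
        theta = np.zeros(dim)
        for t, rollout in enumerate(env_rounds, start=1):
            ret = sum(r for _, _, r in rollout)          # observed return
            grad = np.zeros(dim)
            for s, a, _ in rollout:
                grad += log_policy_grad(theta, s, a, sigma) * ret
            # Gradient ascent with a decaying step size eta_t = eta0 / sqrt(t).
            theta = theta + (eta0 / np.sqrt(t)) * grad
            # Project back onto an l2 ball to keep the parameter feasible.
            norm = np.linalg.norm(theta)
            if norm > radius:
                theta = theta * (radius / norm)
        return theta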
Keyword(in English) Markov decision process / online learning / reinforcement learning
Paper # IBISML2014-53
Date of Issue

Conference Information
Committee IBISML
Conference Date 2014/11/10 (1 day)
Place (in English)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Information-Based Induction Sciences and Machine Learning (IBISML)
Language ENG
Title (in English) An Online Policy Gradient Algorithm for Continuous State and Action Markov Decision Processes with Bandit Feedback
Sub Title (in English)
Keyword(1) Markov decision process
Keyword(2) online learning
Keyword(3) reinforcement learning
1st Author's Name Yao MA
1st Author's Affiliation Department of Computer Science, Tokyo Institute of Technology
2nd Author's Name Masashi SUGIYAMA
2nd Author's Affiliation Department of Complexity Science and Engineering, University of Tokyo
Date 2014-11-17
Paper # IBISML2014-53
Volume (vol) vol.114
Number (no) 306
Page
#Pages 8
Date of Issue