Presentation 2014-11-17
An Online Policy Gradient Algorithm for Continuous State and Action Markov Decision Processes with Bandit Feedback
Yao MA, Masashi SUGIYAMA
Abstract(in English) We consider the online Markov decision process (MDP) learning problem, in which the goal is to learn a time-dependent decision-making policy for an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) over T steps under a certain concavity assumption, and O(log T) under a strong concavity assumption, with bandit feedback. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.
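The sketch below is only meant to illustrate the kind of update an online policy gradient method with bandit feedback performs; it is not the paper's exact algorithm. It assumes a linear-Gaussian policy pi_theta(a|s) = N(theta^T s, sigma^2), a REINFORCE-style gradient estimate built from the single trajectory observed in each round (the bandit feedback), a 1/√t step size (matching the O(√T) regret regime mentioned in the abstract), and projection onto a bounded parameter set; all names and parameters are illustrative assumptions.

    # Illustrative sketch (not the paper's exact algorithm): one online policy
    # gradient step per round, using only the scalar rewards observed along the
    # trajectory generated by the current policy (bandit feedback).
    import numpy as np

    def log_policy_grad(theta, s, a, sigma=0.5):
        # Gradient of log N(a; theta^T s, sigma^2) with respect to theta.
        return (a - theta @ s) / (sigma ** 2) * s

    def online_policy_gradient(env_rounds, dim, eta0=0.1, radius=1.0, sigma=0.5):
        # env_rounds yields, for each round t, the rollout generated by the
        # current policy as a list of (state, action, reward) tuples.
        theta = np.zeros(dim)
        for t, rollout in enumerate(env_rounds, start=1):
            ret = sum(r for _, _, r in rollout)          # observed return
            grad = np.zeros(dim)
            for s, a, _ in rollout:
                grad += log_policy_grad(theta, s, a, sigma) * ret
            # Gradient ascent with a decaying step size eta_t = eta0 / sqrt(t).
            theta = theta + (eta0 / np.sqrt(t)) * grad
            # Project back onto an l2 ball to keep the parameter feasible.
            norm = np.linalg.norm(theta)
            if norm > radius:
                theta = theta * (radius / norm)
        return theta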
Keyword(in English) Markov decision process / online learning / reinforcement learning
Paper # IBISML2014-53
Date of Issue

Conference Information
Committee IBISML
Conference Date 2014/11/10 (1 day)
Place (in English)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Information-Based Induction Sciences and Machine Learning (IBISML)
Language ENG
Title (in English) An Online Policy Gradient Algorithm for Continuous State and Action Markov Decision Processes with Bandit Feedback
Sub Title (in English)
Keyword(1) Markov decision process
Keyword(2) online learning
Keyword(3) reinforcement learning
1st Author's Name Yao MA
1st Author's Affiliation Department of Computer Science, Tokyo Institute of Technology
2nd Author's Name Masashi SUGIYAMA
2nd Author's Affiliation Department of Complexity Science and Engineering, University of Tokyo
Date 2014-11-17
Paper # IBISML2014-53
Volume (vol) vol.114
Number (no) 306
Page
#Pages 8
Date of Issue