An Online Policy Gradient Algorithm for Continuous State and Action Markov Decision Processes with Bandit Feedback

Presentation title	2014-11-17 An Online Policy Gradient Algorithm for Continuous State and Action Markov Decision Processes with Bandit Feedback
Abstract (Japanese)
Abstract (English)	We consider the learning problem under an online Markov decision process (MDP), which aims to learn a time-dependent decision-making policy for an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption with bandit feedback. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with a regret guarantee. We also illustrate the behavior of the online policy gradient method through experiments.
Keywords (Japanese)
Keywords (English)	Markov decision process / online learning / reinforcement learning
Report number	IBISML2014-53
Date of issue

Detailed paper information
Technical committee	Information-Based Induction Sciences and Machine Learning (IBISML)
Language of text	ENG
Title (Japanese)
Subtitle (Japanese)
Title (English)	An Online Policy Gradient Algorithm for Continuous State and Action Markov Decision Processes with Bandit Feedback
Subtitle (Japanese)
Keyword (1) (Japanese/English)	/ Markov decision process
First author name (Japanese/English)	/ Yao MA
First author affiliation (Japanese/English)	Department of Computer Science, Tokyo Institute of Technology
Presentation date	2014-11-17
Report number	IBISML2014-53
Volume (vol)	vol.114
Issue number (no)	306
Page range	pp.-
Number of pages	8
Date of issue