Off-Policy Natural Actor-Critic

Takeshi MORI; Yutaka NAKAMURA; Shin ISHII

Presentation	2005/7/20 Off-Policy Natural Actor-Critic, Takeshi MORI, Yutaka NAKAMURA, Shin ISHII
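Abstract (translated from Japanese)	The recently proposed Natural Actor-Critic (NAC) method, which uses the natural policy gradient method to train the actor and LSTD-Q(λ) to train the critic, has attracted attention as a comparatively efficient model-free reinforcement learning method for high-dimensional dynamical systems. Because NAC is an on-policy learning method, however, it has two problems. First, sequences generated under past policies cannot be used to estimate the current policy gradient. Second, introducing control of exploration and exploitation is heavily constrained. To solve these problems, we propose an off-policy LSTD-Q(λ) method and adopt it as the learning method for the NAC critic. We call this off-policy NAC. The proposed method can lower the variance of the policy gradient estimate by using many sequences generated under past policies to estimate the policy gradient for the current policy. In addition, by controlling exploration separately from policy optimization, it can control exploration and exploitation effectively. Computer experiments with a snake-like motion simulator show that the method learns with fewer samples than NAC, and more stably.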
Abstract (English)	The recently developed Natural Actor-Critic (NAC), which employs natural policy gradient learning for the actor and LSTD-Q(λ) for the critic, provides a good model-free reinforcement learning scheme applicable to high-dimensional systems. Because NAC is an on-policy learning method, however, past sample sequences cannot be reused for estimating the policy gradient under the current policy. Moreover, introducing an exploratory factor to control exploration and exploitation is heavily constrained. To overcome these problems, we propose an off-policy NAC in this study, in which the policy gradient is estimated by using past system trajectories, and exploration can be controlled from outside the policy optimization. Computer experiments using a snake-like robot simulator show that the new method requires far fewer trajectories than the on-policy method.
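Keywords (translated from Japanese)	reinforcement learning / off-policy method / natural policy gradient method / least-squares policy evaluation method / actor-critic method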
Keywords (English)	reinforcement learning / off-policy method / natural policy gradient / least squares policy evaluation / actor-critic
Report number	NC2005-34
Publication date

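Technical committee information
Technical committee	NC
Dates held	2005/7/20 (1 day)
Venue (Japanese)
Venue (English)
Theme (Japanese)
Theme (English)
Chair name (Japanese)
Chair name (English)
Vice chair name (Japanese)
Vice chair name (English)
Secretary name (Japanese)
Secretary name (English)
Assistant secretary name (Japanese)
Assistant secretary name (English)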

Paper details
Submitted to technical committee	Neurocomputing (NC)
Language of text	ENG
Title (Japanese)	方策オフ型Natural Actor-Critic法
Subtitle (Japanese)
Title (English)	Off-Policy Natural Actor-Critic
Subtitle (English)
Keyword (1) (Japanese/English)	強化学習 / reinforcement learning
Keyword (2) (Japanese/English)	方策オフ型手法 / off-policy method
Keyword (3) (Japanese/English)	自然方策勾配法 / natural policy gradient
Keyword (4) (Japanese/English)	最小二乗方策評価法 / least squares policy evaluation
Keyword (5) (Japanese/English)	actor-critic法 / actor-critic
Author 1 name (Japanese/English)	森健 / Takeshi MORI
Author 1 affiliation (Japanese/English)	奈良先端科学技術大学院大学 / Nara Institute of Science and Technology
Author 2 name (Japanese/English)	中村泰 / Yutaka NAKAMURA
Author 2 affiliation (Japanese/English)	奈良先端科学技術大学院大学 / Nara Institute of Science and Technology
Author 3 name (Japanese/English)	石井信 / Shin ISHII
Author 3 affiliation (Japanese/English)	奈良先端科学技術大学院大学 / Nara Institute of Science and Technology
Presentation date	2005/7/20
Report number	NC2005-34
Volume (vol)	vol.105
Issue (no)	211
Page range	pp.-
Number of pages	6
Publication date