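Off-Policy LSTD(λ) Method for Episodic Tasks and Its Convergence (Bio-Data Mining by Machine Learning, General)

Takeshi Mori; Shin-ichi Maeda; Shin Ishii

Presentation title	2007-06-15 Off-Policy LSTD(λ) Method for Episodic Tasks and Its Convergence (Bio-Data Mining by Machine Learning, General) Takeshi Mori, Shin-ichi Maeda, Shin Ishii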
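Abstract (translated from Japanese)	The recently proposed off-policy TD(λ) method with a linear function approximator has attracted attention because it may effectively address important problems in reinforcement learning, such as sample reuse and the exploration-exploitation trade-off. However, the variance of the value function estimated from a sample trajectory grows exponentially with the trajectory length and diverges in the limit. The variance must therefore be suppressed by assuming an episodic task with a finite trajectory length, but then the conditions under which TD(λ) converges are no longer satisfied, so the method may still diverge. In other words, off-policy TD(λ) with a linear function approximator has no convergence guarantee whether the sample trajectory length is finite or infinite. This report proposes an off-policy LSTD(λ) method based on least squares and shows that its convergence is guaranteed when the sample trajectory length is finite. Simulation experiments confirm that the proposed method converges on an episodic task on which off-policy TD(λ) diverges.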
Abstract (English)	Recently developed off-policy temporal difference (TD) learning with linear function approximation has attracted attention because it may enable sample reuse and deal effectively with exploration and exploitation. However, the variance of the estimated value function grows exponentially with the trajectory length, and hence learning diverges. The trajectory must then be truncated, but the bias of such a finite-horizon trajectory can be so harmful that the value function still diverges. Therefore, off-policy TD learning has no convergence guarantee in either infinite- or finite-horizon problems. In this study, we propose an off-policy least-squares temporal difference (LSTD) learning method and show its convergence in finite-horizon problems. Computer simulation shows that our method converges in a finite-horizon problem in which off-policy TD learning diverges.
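As a rough illustration of the kind of estimator the abstract describes, the following is a minimal sketch of a least-squares TD(λ) solve with per-decision importance weights over finite episodes. It is not taken from the paper: the update form (trace `z`, matrix `A`, vector `b`), the function names `off_policy_lstd_lambda` and `solve_linear`, the transition layout, and the optional `ridge` term are all assumptions chosen for illustration, and the authors' formulation may weight trajectories differently.

```python
from typing import List, Sequence, Tuple

# Hypothetical transition layout: (phi, rho, reward, phi_next), where
# rho = pi(a|s) / b(a|s) is the importance ratio of the target policy pi
# to the behavior policy b, and phi_next is all zeros at termination.
Transition = Tuple[Sequence[float], float, float, Sequence[float]]


def solve_linear(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is numerically singular; add data or ridge")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def off_policy_lstd_lambda(
    episodes: Sequence[Sequence[Transition]],
    n_features: int,
    gamma: float = 1.0,
    lam: float = 0.0,
    ridge: float = 0.0,
) -> List[float]:
    """Return weights theta with V(s) ~ phi(s) . theta (illustrative sketch).

    Assumed accumulation, per transition t within an episode:
        z_t = gamma * lam * rho_{t-1} * z_{t-1} + phi_t
        A  += z_t (phi_t - gamma * rho_t * phi_{t+1})^T
        b  += z_t * rho_t * r_t
    """
    A = [[ridge if i == j else 0.0 for j in range(n_features)]
         for i in range(n_features)]
    b = [0.0] * n_features
    for episode in episodes:
        z = [0.0] * n_features  # eligibility trace is reset at episode start
        prev_rho = 0.0          # unused on the first step because z is zero
        for phi, rho, reward, phi_next in episode:
            z = [gamma * lam * prev_rho * zi + p for zi, p in zip(z, phi)]
            d = [p - gamma * rho * pn for p, pn in zip(phi, phi_next)]
            for i in range(n_features):
                for j in range(n_features):
                    A[i][j] += z[i] * d[j]
                b[i] += z[i] * rho * reward
            prev_rho = rho
    return solve_linear(A, b)


# Toy one-step task: action 0 pays 1, action 1 pays 0; the behavior policy
# is uniform and the target policy picks action 0 with probability 0.9.
toy_episodes = [
    [([1.0], 0.9 / 0.5, 1.0, [0.0])],  # action 0 was taken
    [([1.0], 0.1 / 0.5, 0.0, [0.0])],  # action 1 was taken
]
theta = off_policy_lstd_lambda(toy_episodes, n_features=1)
```

On this balanced toy sample, `theta[0]` comes out at approximately 0.9, the target policy's expected reward. Because the weights are obtained from a single linear solve rather than from step-size-driven incremental updates, the estimate is well defined whenever the accumulated matrix is invertible.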
Keywords (translated from Japanese)	reinforcement learning / off-policy methods / importance sampling / LSTD(λ) method / episodic tasks
Keywords (English)	reinforcement learning / off-policy learning / importance sampling / least-squares temporal difference learning / finite horizon problem
Report number	NC2007-14

Technical committee information
Technical committee	NC
Meeting dates	2007/6/7 (1 day)

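Paper details
Submitted to committee	Neurocomputing (NC)
Language of text	JPN
Title (translated from Japanese)	Off-Policy LSTD(λ) Method for Episodic Tasks and Its Convergence (Bio-Data Mining by Machine Learning, General)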
Title (English)	Off-policy least-squares temporal difference learning and its convergence guarantee in finite horizon problems
Keyword (1) (Japanese/English)	強化学習 / reinforcement learning
Keyword (2) (Japanese/English)	方策オフ型法 / off-policy learning
Keyword (3) (Japanese/English)	重点サンプリング / importance sampling
Keyword (4) (Japanese/English)	LSTD(λ)法 / least-squares temporal difference learning
Keyword (5) (Japanese/English)	エピソードタスク / finite horizon problem
Author 1 name (Japanese/English)	森健 / Takeshi MORI
Author 1 affiliation (Japanese/English)	奈良先端科学技術大学院大学 / Nara Institute of Science and Technology
Author 2 name (Japanese/English)	前田新一 / Shin-ichi MAEDA
Author 2 affiliation (Japanese/English)	奈良先端科学技術大学院大学 / Nara Institute of Science and Technology
Author 3 name (Japanese/English)	石井信 / Shin ISHII
Author 3 affiliation (Japanese/English)	奈良先端科学技術大学院大学 / Nara Institute of Science and Technology
Presentation date	2007-06-15
Report number	NC2007-14
Volume (vol)	vol.107
Issue (no)	92
Page range	pp.-
Number of pages	6