{{Machine learning bar}}
'''Temporal difference learning''' ('''TD learning''') is a model-free [[reinforcement learning]] method that learns by [[Bootstrapping|bootstrapping]] from the current estimate of the state-value function. Like the [[Monte Carlo method]], it draws samples from the environment; like dynamic programming, it updates the state-value function based on the current estimates.<ref>{{cite book |title=Reinforcement Learning: An Introduction |first1=Richard S. |last1=Sutton |first2=Andrew G. |last2=Barto |edition=2nd |publisher=MIT Press |place=Cambridge, MA |year=2018 |url=http://www.incompleteideas.net/book/the-book.html |page=133}}</ref>

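The state-value function <math>V(s)</math> is trained to estimate the sum of the current and future rewards obtained starting from state <math>s</math>. Future rewards are weighted by a [[discount rate]], a concept also used in [[economics]]: a reward received <math>k</math> steps later is multiplied by <math>\gamma^k</math>. This weighted sum is called the discounted return.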
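
As an illustration of the discounted return, the following is a minimal sketch that computes it for a finite sequence of rewards. The function name <code>discounted_return</code> and the sample rewards are assumptions made for this example and do not come from the cited sources.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma*rewards[1] + gamma**2*rewards[2] + ...

    `rewards` is a finite list of rewards in the order they are received;
    `gamma` is the discount rate.
    """
    g = 0.0
    # Work backwards so that each step applies one more factor of gamma
    # to everything that comes later.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Accumulating from the last reward backwards avoids computing powers of the discount rate explicitly.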

The underlying idea was used at least as early as 1959 by Arthur Samuel in his [[checkers]]-playing artificial intelligence program, but the name "temporal difference learning" was coined by Richard Sutton in 1988.<ref>{{cite journal
| last1 = Sutton
| first1 = Richard S.
| date = 1988-08-01
| title = Learning to predict by the methods of temporal differences
| journal = Machine Learning
| volume = 3
| issue = 1
| pages = 9–44
| doi = 10.1007/BF00115009
| url = https://doi.org/10.1007/BF00115009
}}</ref>

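== Algorithm ==
Suppose an agent in state <math>S_t</math> selects action <math>A_t</math>, receives reward <math>R_{t+1}</math>, and transitions to state <math>S_{t+1}</math>. The state-value function <math>V(S_t)</math> is then updated by
:<math>V(S_t) \leftarrow (1 - \alpha) V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1})\right]</math>
The new value is thus a weighted average of the old estimate and the one-step target <math>R_{t+1} + \gamma V(S_{t+1})</math>. Here <math>\alpha</math> is the learning rate, with <math>0 < \alpha < 1</math>, and <math>\gamma</math> is the [[discount rate]], a constant with <math>0 < \gamma < 1</math>.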
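
A single application of this update can be sketched as follows. This is only an illustration: the dictionary-based value table, the function name <code>td0_update</code>, the state labels and the parameter values are all assumptions chosen for the example.

```python
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD update to the table V in place and return the new V[s].

    V      : dict mapping state -> current value estimate (hypothetical layout)
    s      : state S_t
    reward : reward R_{t+1}
    s_next : state S_{t+1}
    """
    # One-step target: observed reward plus discounted estimate of the next state.
    target = reward + gamma * V[s_next]
    # Weighted average of the old estimate and the target.
    V[s] = (1 - alpha) * V[s] + alpha * target
    return V[s]


V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", reward=1.0, s_next="B")
print(V["A"])  # approximately 0.19, i.e. 0.9 * 0.0 + 0.1 * (1.0 + 0.9 * 1.0)
```

Only the entry for the visited state changes; the estimate for the successor state is read but not modified.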

The action <math>A_t</math> is selected using the state-value function.

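Collecting the <math>V(S_t)</math> terms, the update rule can equivalently be written as
:<math>V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]</math>
where the quantity <math>R_{t+1} + \gamma V(S_{t+1}) - V(S_t)</math> is called the TD error.<ref>{{Cite book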
| author    = Richard S. Sutton
| author2   = Andrew G. Barto
| year      = 2018
| title     = Reinforcement Learning, second edition: An Introduction
| publisher = Bradford Books
| isbn      = 978-0262039246
| url       = http://incompleteideas.net/book/the-book-2nd.html
}}</ref>
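
The TD-error form of the update can be sketched as a complete learning loop. The example below is a minimal sketch under stated assumptions, not a reference implementation: the environment (a five-state random walk with terminal states at both ends and a reward of 1 for exiting on the right), the uniformly random policy, the function name <code>td0_random_walk</code> and all parameter values are hypothetical choices made for illustration.

```python
import random


def td0_random_walk(num_episodes=2000, alpha=0.05, gamma=0.9, seed=0):
    """Estimate state values of a random walk with tabular TD learning.

    States 1..5 are non-terminal; states 0 and 6 are terminal.
    Returns the list of estimated values for states 1..5.
    """
    rng = random.Random(seed)
    n_states = 5
    # Value table including both terminal states, whose value stays 0.
    V = [0.0] * (n_states + 2)

    for _ in range(num_episodes):
        s = 3  # every episode starts in the middle state
        while s not in (0, n_states + 1):
            # Random policy: step left or right with equal probability.
            s_next = s + rng.choice((-1, 1))
            # Reward 1 only when exiting on the right-hand side.
            reward = 1.0 if s_next == n_states + 1 else 0.0
            # TD error: one-step target minus the current estimate.
            td_error = reward + gamma * V[s_next] - V[s]
            V[s] += alpha * td_error
            s = s_next

    return V[1:n_states + 1]


values = td0_random_walk()
print([round(v, 2) for v in values])
```

Because the reward is only obtained at the right-hand end, the estimated values should grow from the leftmost state to the rightmost state as learning proceeds.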

== References ==
{{reflist}}

== See also ==
* [[Reinforcement learning]]
* [[Q-learning]]
* [[SARSA]]

{{DEFAULTSORT:Temporal difference learning}}
[[Category:Machine learning algorithms]]
[[Category:Reinforcement learning]]