モンテカルロ木探索のソースを表示

'''モンテカルロ木探索'''（モンテカルロきたんさく、{{lang-en-short|Monte Carlo tree search}}、略称'''MCTS'''）とは、[[モンテカルロ法]]を使った[[木構造 (データ構造)|木]]の[[探索]]の事。[[決定過程]]に対する、[[ヒューリスティクス]]（＝途中で不要な探索をやめ、ある程度の高確率で良い手を導ける）な探索[[アルゴリズム]]である。

モンテカルロ木検索は、主に[[囲碁]]・[[チェス]]・[[将棋]]などのゲームの次の着手の決定などに使用される。また、リアルタイムPCゲームや、[[大富豪]]、[[ポーカー]]などの相手の手の内が全て分かるわけではないゲームへも使用される。

== 歴史 ==
=== モンテカルロ法 ===
他のアプローチでは解決不可能または困難な決定問題を、ランダム性を使用する[[モンテカルロ法]]で解決する試みは、[[1940年代]]に始まった。ブルース・アブラムソンは、1987年の博士論文で、通常の静的評価関数ではなく、[[ミニマックス法]]をランダムなゲームプレイアウトに基づく期待結果モデルと組み合わせた<ref name="Abramson">{{cite book|last=Abramson|first=Bruce|title=The Expected-Outcome Model of Two-Player Games|publisher=Technical report, Department of Computer Science, Columbia University|year=1987|url=http://academiccommons.columbia.edu/download/fedora_content/download/ac:142327/CONTENT/CUCS-315-87.pdf|accessdate=23 December 2013}}</ref>。アブラムソンは、期待結果モデルは「正確・高精度で簡単に推定でき、効率的に計算でき、ドメインに依存しないことが示された」と述べた<ref name="Abramson"/>。アブラムソンは、[[三目並べ]]、[[オセロ (ボードゲーム)|リバーシ]]、[[チェス]]の機械生成の評価関数を詳細に実験した<ref name="Abramson"/>。

この方法は、1989年に、W・エルテル・シューマンとC・ズットナーによって、定理の自動証明の分野に適用され、幅優先探索、深さ優先探索、反復深化などの探索アルゴリズムにおいて、指数関数的な探索時間を改善することが発見された<ref>{{cite book|editor1=J. Retti|editor2=K. Leidlmair|title=5. Österreichische Artificial-Intelligence-Tagung. Informatik-Fachberichte 208,pp. 87-95.|chapter=Learning Heuristics for a Theorem Prover using Back Propagation. |author2= Johann Schumann| author3=Christian Suttner |author1= Wolfgang Ertel |publisher=Springer |year=1989|chapter-url=http://www.hs-weingarten.de/~ertel/veroeff_bib.html#ESS89}}</ref><ref>{{cite book|title=CADE90, 10th Int. Conf. on Automated Deduction.pp. 470-484. LNAI 449.|chapter=Automatic Acquisition of Search Guiding Heuristics. | author1=Christian Suttner |author2= Wolfgang Ertel |publisher=Springer |year=1990|chapter-url=http://www.hs-weingarten.de/~ertel/veroeff_bib.html#ES90:CADE}}</ref><ref>{{cite journal|author1=Christian Suttner |author2= Wolfgang Ertel |title=Using Back-Propagation Networks for Guiding the Search of a Theorem Prover.|journal=Journal of Neural Networks Research & Applications|volume= 2|issue=1|pages=3–16|date=1991|url=http://www.hs-weingarten.de/~ertel/veroeff_bib.html#ES90:IJNN}}</ref>。

1992年、B・ブルークマンは、[[コンピュータ囲碁]]のプログラムに初めてそれを採用した<ref name="Bruegmann">{{cite book|last=Brügmann|first=Bernd|title=Monte Carlo Go|url=http://www.ideanest.com/vegos/MonteCarloGo.pdf|publisher=Technical report, Department of Physics, Syracuse University|year=1993}}</ref>。チャンら<ref name="Chang2005">{{cite journal |last=Chang|first=Hyeong Soo |last2=Fu|first2=Michael C.|last3=Hu|first3=Jiaqiao|last4=Marcus|first4=Steven I.|title=An Adaptive Sampling Algorithm for Solving Markov Decision Processes|journal=Operations Research |volume=53|pages=126–139 |year=2005 |url=http://scholar.rhsmith.umd.edu/sites/default/files/mfu/files/cfhm05.pdf?m=1449834091|doi=10.1287/opre.1040.0145|hdl=1903/6264 }}</ref>は、マルコフ決定プロセスモデルのアダプティブマルチステージサンプリング（AMS）アルゴリズムで「適応型」サンプリングを選択して、「再帰的ロールアウトとバックトラック」のアイデアを提案した。AMSは、サンプリング/シミュレーション（モンテカルロ）ツリーの構築におけるUCBベースの探査と開発のアイデアを探求した最初の試みであり、UCT（上位信頼性ツリー）のメインシードだった<ref name="changORMStoday">{{cite journal|author1=Hyeong Soo Chang |author2=Michael Fu |author3=Jiaqiao Hu|author4=Steven I. Marcus |title=Google DeepMind's Alphago: O.R.'s unheralded role in the path-breaking achievement|journal=OR/MS Today|volume=45|issue=5|pages=24–29|year=2016|url=https://www.informs.org/ORMS-Today/Public-Articles/October-Volume-43-Number-5}}</ref>。

=== モンテカルロ木検索===
{{See also|コンピュータ囲碁}}
2006年、これらの研究に触発されて<ref>{{cite book|author=Rémi Coulom|chapter=The Monte-Carlo Revolution in Go|title=Japanese-French Frontiers of Science Symposium|year=2008|chapter-url=http://remi.coulom.free.fr/JFFoS/JFFoS.pdf|author-link=Rémi Coulom}}</ref>、レミ・クーロンは、ゲームツリー検索へモンテカルロ法を適用させ、「Monte Carlo tree search（モンテカルロ木検索）」と命名した<ref>{{cite book|author=Rémi Coulom|chapter=Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search|pages=72–83|others=H. Jaap van den Herik, Paolo Ciancarini, H. H. L. M. Donkers (eds.)|title=Computers and Games, 5th International Conference, CG 2006, Turin, Italy, May 29–31, 2006. Revised Papers |publisher=Springer|year=2007|isbn=978-3-540-75537-1|doi=|citeseerx=10.1.1.81.6817|author-link=Rémi Coulom}}</ref>。L・コチシュとCs・セーペスヴァーリはUCT（ツリーに適用される上限信頼限界）アルゴリズム<ref name="Kocsis-Szepesvari" />を開発した。S・ゲーリーらは、彼らのプログラムMoGoにUCTを実装した<ref name="Gelly-et-al">{{cite book|author1=Sylvain Gelly |author2=Yizao Wang |author3=Rémi Munos |author4=Olivier Teytaud |title=Modification of UCT with Patterns in Monte-Carlo Go|date=November 2006|publisher=Technical report, INRIA|url=http://hal.inria.fr/docs/00/11/72/66/PDF/MoGoReport.pdf}}</ref>。2008年にMoGoは9路盤の囲碁でアマチュア有段者の域に到達し、Fuegoは9路盤でアマチュアの強豪プレーヤーに勝ち始めた<ref>{{cite book|url=http://pug.raph.free.fr/files/Fuego.pdf|title=Fuego – An Open-Source Framework for Board Games and Go Engine Based on Monte Carlo Tree Search|publisher=Technical report, University of Alberta|year=2008|isbn=|location=|pages=|quote=|via=|author1=Markus Enzenberger|author2=Martin Mūller}}</ref>。

2012年1月、モンテカルロ木探索を導入した[[DeepZenGo|Zen]]は、19路盤のアマチュア二段のプレーヤーとの番勝負に3対1で勝利した<ref>{{cite web|url=http://dcook.org/gobet/|title=The Shodan Go Bet|accessdate=2012-05-02}}</ref> 。また、クーロンが開発に携わっている[[Crazy Stone]]も2014年の第2回[[電聖戦]]で[[依田紀基]]九段に19路盤の4子局の[[置き碁]]（ハンデ戦）で勝利するなど、着実に進歩していった<ref>{{Cite web|和書|title=第２回電聖戦|url=http://entcog.c.ooco.jp/entcog/densei/denseisen-2nd.html|accessdate=2021-10-08|publisher=日本棋院}}</ref>。それでもなおトップ棋士に勝てるようになるには10年以上かかると考えられていたが<ref>{{Cite web|和書|title=なぜ「囲碁」だったのか。なぜ「10年かかる」と言われていたのか──AlphaGo前日譚|url=https://wired.jp/2016/03/15/the-mystery-of-go/|accessdate=2021-10-08|publisher=WIRED|date=2016-3-15}}</ref>、2016年1月、[[DeepMind|Google DeepMind社]]は開発を進めていた[[AlphaGo]]について公表した。AlphaGoは2015年10月に[[樊麾]]二段に19路盤でハンディキャップなしに勝利しており、初めてプロ棋士に[[互先]]で勝利したコンピュータ碁プログラムになっていた<ref name="alphago">{{Cite journal|title = Mastering the game of Go with deep neural networks and tree search|journal = [[Nature (journal)|Nature]]| issn= 0028-0836|pages = 484–489|volume = 529|issue = 7587|doi = 10.1038/nature16961|pmid = 26819042|first1 = David|last1 = Silver|author-link1=David Silver (programmer)|first2 = Aja|last2 = Huang|author-link2=Aja Huang|first3 = Chris J.|last3 = Maddison|first4 = Arthur|last4 = Guez|first5 = Laurent|last5 = Sifre|first6 = George van den|last6 = Driessche|first7 = Julian|last7 = Schrittwieser|first8 = Ioannis|last8 = Antonoglou|first9 = Veda|last9 = Panneershelvam|first10= Marc|last10= Lanctot|first11= Sander|last11= Dieleman|first12=Dominik|last12= Grewe|first13= John|last13= Nham|first14= Nal|last14= Kalchbrenner|first15= Ilya|last15= Sutskever|author-link15=Ilya Sutskever|first16= Timothy|last16= Lillicrap|first17= Madeleine|last17= Leach|first18= Koray|last18= Kavukcuoglu|first19= Thore|last19= Graepel|first20= Demis |last20=Hassabis|author-link20=Demis Hassabis|date= 28 January 2016|bibcode = 2016Natur.529..484S}}{{closed access}}</ref><ref>{{Cite web|url=http://googleresearch.blogspot.com/2016/01/alphago-mastering-ancient-game-of-go.html|title=Research Blog: AlphaGo: Mastering the ancient game of Go with Machine Learning|last=|first=|date=27 January 2016|website=Google Research Blog|access-date=2020-04-30}}</ref><ref>{{Cite web|url=https://www.bbc.com/news/technology-35420579|title=Google achieves AI 'breakthrough' by beating Go champion|last=|first=|date=27 January 2016|website=BBC News|access-date=2020-04-30}}</ref>。2016年3月、AlphaGoは国際棋戦優勝多数の[[李世ドル|李世乭]]九段を相手に5番勝負を行い、4勝1敗で勝利した（詳細は[[AlphaGo対李世ドル|AlphaGo対李世乭]]を参照）<ref>{{Cite web|url=https://www.youtube.com/watch?v=vFr3K2DORc8&t=1h57m|title=Match 1 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo|last=|first=|date=9 March 2016|website=Youtube|access-date=2020-04-30}}</ref>。AlphaGoは、以前の囲碁プログラムを大幅に改善しただけでなく、[[機械学習]]は、ポリシー（移動選択）と値に人工[[ニューラルネットワーク]]（[[ディープラーニング]]手法）を使用したモンテカルロ木検索を使用したため、以前のプログラムをはるかに上回る効率が得られた<ref>{{Cite web|url=http://www.zdnet.com/article/google-alphago-ai-clean-sweeps-european-go-champion/|title=Google AlphaGo AI clean sweeps European Go champion|last=|first=|date=28 January 2016|website=ZDNet|access-date=2020-04-30}}</ref>。

MCTSアルゴリズムは、他のボードゲーム（例えば、[[ヘックス (ボードゲーム)|ヘックス]]<ref>{{cite journal|author1=Broderick Arneson |author2=Ryan Hayward |author3=Philip Henderson |title=MoHex Wins Hex Tournament|journal=[[ICGA Journal]]|volume=32|issue=2|pages=114–116|date=June 2009|url=http://webdocs.cs.ualberta.ca/~hayward/papers/rptPamplona.pdf|doi=10.3233/ICG-2009-32218 }}</ref> 、{{仮リンク|ハバナ (ボードゲーム)|label=ハバナ|en|Havannah}}<ref>{{cite book|author=Timo Ewalds|title=Playing and Solving Havannah|publisher=Master's thesis, University of Alberta|year=2011|url=http://havannah.ewalds.ca/static/thesis.pdf}}</ref>、{{仮リンク|アマゾンズ (ボードゲーム)|label=アマゾンズ|en|Game of the Amazons}}<ref>{{cite book|author=Richard J. Lorentz|chapter=Amazons Discover Monte-Carlo|pages=13–24|others=H. Jaap van den Herik, Xinhe Xu, Zongmin Ma, Mark H. M. Winands (eds.)|title=Computers and Games, 6th International Conference, CG 2008, Beijing, China, September 29 – October 1, 2008. Proceedings|publisher=Springer|year=2008|isbn=978-3-540-87607-6}}</ref>、[[アリマア]]<ref>{{cite book|author=Tomáš Kozelek|title=Methods of MCTS and the game Arimaa|publisher=Master's thesis, Charles University in Prague|year=2009|url=http://arimaa.com/arimaa/papers/TomasKozelekThesis/mt.pdf}}</ref>）、リアルタイムビデオゲーム（例えば[[パックマン]]<ref>{{cite journal|author1=Xiaocong Gan |author2=Yun Bao |author3=Zhangang Han |title=Real-Time Search Method in Nondeterministic Game – Ms. Pac-Man|pages=209–222|journal=[[ICGA Journal]]|volume=34|issue=4|date=December 2011|doi=10.3233/ICG-2011-34404 }}</ref><ref>{{cite journal|author1=Tom Pepels |author2=Mark H. M. Winands |author3=Marc Lanctot |title=Real-Time Monte Carlo Tree Search in Ms Pac-Man|pages=245–257|journal=IEEE Transactions on Computational Intelligence and AI in Games|volume=6|issue=3|date=September 2014 |doi=10.1109/tciaig.2013.2291577|doi-access=free}}</ref>、{{仮リンク|Fable Legends|en|Fable Legends}}<ref>{{cite web|url=https://archives.nucl.ai/recording/tactical-planning-and-real-time-mcts-in-fable-legends/|title= Tactical Planning and Real-time MCTS in Fable Legends|last= Mountain|first=Gwaredd  |date= 2015|access-date= 2019-06-08|quote=..&nbsp;we implemented a simulation based approach, which involved modelling the game play and using MCTS to search the potential plan space. Overall this worked well,&nbsp;...}}</ref>）、不完全情報ゲーム（[[スカート (トランプゲーム)|スカート]]<ref>{{cite book|author1=Michael Buro |author2=Jeffrey Richard Long |author3=Timothy Furtak |author4=Nathan R. Sturtevant |chapter=Improving State Evaluation, Inference, and Search in Trick-Based Card Games|pages=1407–1413 |others=Craig Boutilier (ed.)|title=IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 11–17, 2009 |year=2009 |doi=|citeseerx=10.1.1.150.3077 }}</ref>、 [[ポーカー]]<ref name="cpr">{{cite journal|author1=Jonathan Rubin |author2=Ian Watson |title=Computer poker: A review|journal=Artificial Intelligence |volume=175|issue=5–6|date=April 2011|doi=10.1016/j.artint.2010.12.005|url=https://www.cs.auckland.ac.nz/~jrub001/files/CPReviewPreprintAIJ.pdf |archiveurl=https://web.archive.org/web/20120813081731/https://www.cs.auckland.ac.nz/~jrub001/files/CPReviewPreprintAIJ.pdf |url-status=dead |archive-date=2012-08-13 |pages=958–987}}</ref>、[[マジック・ザ・ギャザリング]]<ref>{{cite book|author1=C.D. Ward |author2=P.I. Cowling |chapter=Monte Carlo Search Applied to Card Selection in Magic: The Gathering|title=CIG'09 Proceedings of the 5th international conference on Computational Intelligence and Games|publisher=IEEE Press |year=2009 |chapter-url=http://scim.brad.ac.uk/staff/pdf/picowlin/CIG2009.pdf |archiveurl=https://web.archive.org/web/20160528074031/http://scim.brad.ac.uk/staff/pdf/picowlin/CIG2009.pdf |archivedate=2016-05-28}}</ref>、[[カタンの開拓者たち]]<ref>{{cite book|author1=István Szita |author2=Guillaume Chaslot |author3=Pieter Spronck |chapter=Monte-Carlo Tree Search in Settlers of Catan |pages=21–32 |editor1=Jaap Van Den Herik |editor2=Pieter Spronck |title=Advances in Computer Games, 12th International Conference, ACG 2009, Pamplona, Spain, May 11–13, 2009. Revised Papers |publisher=Springer |year=2010 |isbn=978-3-642-12992-6 |chapter-url=http://ticc.uvt.nl/icga/acg12/proceedings/Contribution100.pdf}}</ref>）などに応用された。

AlphaGo登場以降はディープラーニングを利用したプログラムが主流となったが、モンテカルロ木探索を利用したプログラムの開発も一部で行われている<ref>{{Cite web |title=Ray - Computer Go Program |url=https://computer-go-ray.com/ |website=computer-go-ray.com |access-date=2025-01-25}}</ref>。

== アルゴリズム ==
モンテカルロ木探索は、最も良い手を選択するために使われ、ランダムサンプリングの結果に基づいて探索木を構築する。ゲームでのモンテカルロ木検索は、最後までプレイしたシミュレーション結果に基づいて構築する。ゲームの勝敗の結果に基づいてノードの値を更新して、最終的に勝率が高いことが見込まれる手を選択する。

最も単純な方法は、それぞれの有効な選択肢に、同数ずつプレイアウトの回数を一様に割り振って、最も勝率が良かった手を選択する方法である<ref name="Bruegmann">{{cite book|last=Brügmann|first=Bernd|title=Monte Carlo Go|url=http://www.ideanest.com/vegos/MonteCarloGo.pdf|publisher=Technical report, Department of Physics, Syracuse University|year=1993}}</ref>。これは単純なモンテカルロ木探索（pure Monte Carlo tree search）と呼ばれる。過去のプレイアウト結果に基づき、よりプレイヤーの勝利に結びつく手にプレイアウトの回数をより多く割り振ると探索効率が向上する。

モンテカルロ木探索は4つのステップからなる<ref name="chaslot2008">{{cite journal|author1=G.M.J.B. Chaslot |author2=M.H.M. Winands |author3=J.W.H.M. Uiterwijk |author4=H.J. van den Herik |author5=B. Bouzy |title=Progressive Strategies for Monte-Carlo Tree Search|journal=New Mathematics and Natural Computation|volume=4|issue=3|pages=343–359|year=2008|url=https://dke.maastrichtuniversity.nl/m.winands/documents/pMCTS.pdf|doi=10.1142/s1793005708001094}}</ref>。
* 選択：根Rから始めて、葉ノードLにたどり着くまで、子ノードを選択し続ける。根が現在のゲームの状態で、葉ノードはシミュレーションが行われていないノード。より有望な方向に木が展開していくように、子ノードの選択を片寄らせる方法は、モンテカルロ木探索で重要なことであるが、探索と知識利用の所で後述する。
* 展開：Lが勝負を決するノードでない限り、Lから有効手の子ノードの中からCを1つ選ぶ。
* シミュレーション：Cから完全なランダムプレイアウトを行う。これはロールアウトとも呼ばれる。単純な方法としては、一様分布から手を選択してランダムプレイアウトを実行する。
* バックプロパゲーション：CからRへのパスに沿って、プレイアウトの結果を伝搬する。

[[File:MCTS (English) - Updated 2017-11-19.svg|center|frame|モンテカルロ木探索の4つのステップ]]

上記のグラフは各ステップの選択を表している。ノードの数字は、そのノードからのプレイアウトの"勝った回数/プレイアウトの回数"を表している<ref>{{Cite web|url=http://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/|title=Introduction to Monte Carlo Tree Search|last=Bradberry|first=Jeff|date=2015-09-07|website=|access-date=2019-04-12}}</ref>。Selectionのグラフでは、今、黒の手番である。根ノードの数字は白が21回中11回勝利していることを表している。裏を返すと黒が21回中10回勝利していることを表していて、根ノードの下の3つのノードは手が3種類あることを表していて、数字を合計すると10/21になる。

シミュレーションで白が負けたとする。白の0/1ノードを追加して、そこから根ノードまでのパスの全てのノードの分母（プレイアウトの回数）に1加算して、分子（勝った回数）は黒ノードだけ加算する。引き分けの際は、0.5加算する。こうすることで、プレイヤーは最も有望な手を自分の手番で選択することが出来る。

計算の制限時間に到達するまで、これを反復し、最も勝率が高い手を選択する。

== 探索と知識利用 ==
子ノードを選択する際の難しい点は、何回かのプレイアウトの結果により高い勝率であるという知識利用（{{lang-en-short|exploitation}}）とプレイアウトの回数が不足してるので探索（{{lang-en-short|exploration}}）することのバランスを取ることである。手法は無数にあり Cameron B. Browne らが2012年2月までに発表された物を論文にまとめている<ref>{{cite journal
|author=C. B. Browne
|author2=E. Powley
|author3=D. Whitehouse
|author4=S. M. Lucas
|author5=P. I. Cowling
|author6=P. Rohlfshagen
|author7=S. Tavener
|author8=D. Perez
|author9=S. Samothrakis
|author10=S. Colton
|journal=IEEE Transactions on Computational Intelligence and AI in Games
|title=A Survey of Monte Carlo Tree Search Methods
|year=2012
|volume=4
|number=
|pages=1-43 
|doi=10.1109/TCIAIG.2012.2186810
|issn=1943-068X
|month=February
|url=http://mcts.ai/pubs/mcts-survey-master.pdf
}}</ref>。

=== UCT (Upper Confidence Tree) ===
探索と知識利用のバランスを取る1つの方法は、Levente Kocsis と Csaba Szepesvári が[[2006年]]に発表した UCT（木に適用したUpper Confidence Bound 1）である<ref name="Kocsis-Szepesvari">{{cite conference |last=Kocsis|first=Levente|last2=Szepesvári|first2=Csaba|title=Bandit based Monte-Carlo Planning|editor-first=Johannes|editor-last=Fürnkranz|editor2-first=Tobias|editor2-last=Scheffer|editor3-first=Myra|editor3-last=Spiliopoulou |book-title=Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18–22, 2006, Proceedings|series=Lecture Notes in Computer Science |volume=4212|publisher=Springer|isbn=3-540-45375-X |pages=282–293|year=2006|doi=10.1007/11871842_29|citeseerx=10.1.1.102.1296|url=http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.102.1296}}</ref>。UCT は Auer, Cesa-Bianchi, Fischer が[[2002年]]に発表した UCB1 (Upper Confidence Bound 1)<ref>{{cite journal |last=Auer |first=Peter|last2=Cesa-Bianchi|first2=Nicolò|last3=Fischer|first3=Paul|title=Finite-time Analysis of the Multiarmed Bandit Problem|journal=Machine Learning|volume=47|issue=2/3|pages=235–256 |year=2002 |url=https://link.springer.com/content/pdf/10.1023/A:1013689704352.pdf}}</ref> に基づく方法である。Kocsis と Szepesvári は <math>\frac{w}{n} + c \sqrt{\frac{\ln N}{n}}</math> が最大となる子ノードを選択することを提案している。各変数は以下の通り。
* w は勝った回数
* n はこのノードのシミュレーションの回数
* N は全シミュレーション回数
* c は定数。理論上は <math>\sqrt{2}</math> であるが、実際は探索が上手く行くように調整する。

第1項の勝率は知識利用である。第2項は探索を表現していて、シミュレーション回数が少ないのを選択するようにする。

=== PUCT (Polynomial Upper Confidence Tree) ===
PUCT は David Auger, Adrien Couetoux, Olivier Teytaud が[[2013年]]に発表した手法<ref>{{cite journal
|last= Auger
|first= David 
|last2= Couetoux
|first2= Adrien 
|last3 = Teytaud 
|first3 = Olivier 
|date= 2013
|title= Continuous Upper Confidence Trees with Polynomial Exploration - Consistency
|url= 
|journal= Machine Learning and Knowledge Discovery in Databases
|publisher= Springer Berlin Heidelberg
|volume= 8188
|issue= 
|pages= 194-209
|doi= 10.1007/978-3-642-40988-2_13
|accessdate=2019-04-13
}}</ref>。

木は根は決断ノードとし、決断ノードとランダムノードを交互に繰り返す形で構築する。決断ノードで行為 a を選択し、ランダムノードに遷移する。

*決断ノード z を選択した場合、
** <math>\lfloor n(z)^\alpha \rfloor > \lfloor (n(z) - 1)^\alpha \rfloor</math> ならば、そのノードでシミュレーションを行う
** さもなければ <math>\hat{V}(z,a) + \sqrt{\frac{n(z)^{e(d)}}{n(z,a)}}</math> が最大となる子ノードを選択する
*ランダムノード w を選択した場合、
** <math>\lfloor n(w)^\alpha \rfloor = \lfloor (n(w) - 1)^\alpha \rfloor</math> ならば、最も訪れていない子ノードを選択する
** さもなければ、新しい子ノードを作成する

関数は以下の通り。
* <math>\hat{V}(z,a)</math> - 決断ノード z で行為 a を選択した際のランダムノードでの平均報酬（勝率など）
* <math>n(z)</math> - 決断ノード z の訪問回数
* <math>n(z, a)</math> - 決断ノード z で行為 a を選択した際のランダムノードの訪問回数
* <math>\alpha(d)</math> - 深さ d に対して定めた progressive  widening 係数（定数）
* <math>e(d)</math> - 深さ d に対して定めた探索係数（定数）

=== AlphaZero ===
David Silver らが [[AlphaZero]] にて[[2017年]]に採用した方法は PUCT を更に改変していて、以下の評価値で子ノードを選択する<ref>[https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/ AlphaZero: Shedding new light on the grand games of chess, shogi and Go | DeepMind]</ref>。
: <math>Q(s,a) + C(s)P(a \mid s)\frac{\sqrt{N(s)}}{1 + N(s, a)}</math>
: <math>C(s) = \log \frac{1 + N(s) + c_{\mbox{base}}}{c_{\mbox{base}}} + c_{\mbox{init}}</math>

関数は以下の通り。
* <math>Q(s,a)</math> - 状態 s で行為 a を行った際の平均報酬
* <math>P(a \mid s)</math> - 状態 s で行為 a を選択する確率。ニューラルネットワークの出力
* <math>N(s)</math> と <math>N(s, a)</math> - 訪問回数

== 参照 ==
{{reflist}}

{{DEFAULTSORT:もんてかるろきたんさく}}
[[Category:モンテカルロ法]]
[[Category:ヒューリスティック・アルゴリズム]]
[[Category:数学に関する記事]]