多腕バンディット問題のソースを表示

[[画像:Las Vegas slot machines.jpg|thumb|ラスベガスのスロットマシン]]
'''多腕バンディット問題'''（たわんばんでぃっともんだい、Multi-armed bandit problem）は、[[確率論]]と[[機械学習]]において、一定の限られた資源のセットを競合する選択肢間で、期待利得を最大化するように配分しなければならない問題。それぞれの選択肢の特性が、配分時には一部しか分かっておらず、時間が経過したり選択肢に資源が配分されることで理解できる可能性がある<ref name="Gittins89">{{Citation |title=Multi-armed bandit allocation indices |author=[[John C. Gittins]] |year=1989 |publisher=John Wiley & Sons, Ltd. |series=Wiley-Interscience Series in Systems and Optimization. |place=Chichester |isbn=978-0-471-92059-5}}</ref><ref name="BF">{{Citation |title=Bandit problems: Sequential allocation of experiments |author=[[Don Berry (statistician)|Don Berry]] |last2=Fristedt |first2=Bert |year=1985 |publisher=Chapman & Hall |series=Monographs on Statistics and Applied Probability |place=London |isbn=978-0-412-24810-8}}</ref>。これは、探索(exploration)と活用(exploitation)のトレードオフのジレンマを例証する古典的な[[強化学習]]の問題である。この名前は、[[スロットマシン]]（単腕バンディットとも呼ばれる）の列で、どのマシンをプレイするか、各マシンを何回プレイするか、どの順番でプレイするか、現在のマシンを続けるか別のマシンを試すかを決めなければならない[[賭博|ギャンブラー]]を想像することに由来している<ref name="weber">{{Citation |title=On the Gittins index for multiarmed bandits |last=Weber |first=Richard |year=1992 |journal=[[Annals of Applied Probability]] |volume=2 |number=4 |pages=1024-1033 |doi=10.1214/aoap/1177005588 |jstor=2959678}}</ref>。多腕バンディット問題も、広義の確率的スケジューリングに分類される。

== 経験的動機 ==
[[画像:The Jet Propulsion Laboratory (9416811752).jpg|thumb|結果を最大化するために、これらの研究部門間で特定の予算をどのように配分すべきか?]]
多腕バンディット問題は、新しい知識の取得（'''探索 exploration'''）と既存の知識に基づいた意思決定の最適化（'''活用 exploitation'''）を同時に試みるエージェントをモデル化したものである。エージェントは、これらの競合するタスクのバランスをとりながら、考慮される期間中の総価値を最大化しようとする。以下のような例がある。
* 患者の損失を最小限に抑えながら、さまざまな実験的治療の効果を調査する[[治験|臨床試験]]<ref name="Gittins89" /> <ref name="WHP">{{Citation |title=Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research |last=Press |first=William H. |year=2009 |postscript=. |journal=Proceedings of the National Academy of Sciences |volume=106 |number=52 |pages=22387-22392 |doi=10.1073/pnas.0912378106 |pmc=2793317 |bibcode=2009PNAS..10622387P |pmid=20018711}}</ref>
* ネットワークの遅延を最小化するための適応的なルーティングの取り組み
* [[ポートフォリオ (金融)|金融ポートフォリオ]]の設計<ref name="BrochuHoffmandeFreitas">{{Citation |title=Portfolio Allocation for Bayesian Optimization |last=Brochu |first=Eric |last2=Hoffman |first2=Matthew W. |last3=de Freitas |first3=Nando |date=2010-09 |arxiv=1009.5419 |bibcode=2010arXiv1009.5419B}}</ref><ref>{{Citation |title=Portfolio Choices with Orthogonal Bandit Learning |last=Shen |first=Weiwei|last2=Wang |first2=Jun |last3=Jiang |first3=Yu-Gang |last4=Zha |first4=Hongyuan |year=2015 |url=http://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/viewPDFInterstitial/10972/10798 |journal=Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI2015)}}</ref>
このような実用例では、すでに獲得した知識に基づく報酬の最大化と、さらに知識を増やすための新しい行動の思考とのバランスが問題となる。これは、[[機械学習]]における探索 exploration と活用 exploitation のトレードオフとして知られる。

このモデルは、さまざまなプロジェクトへのリソースの動的な配分を制御するために使用されており、それぞれの可能性の難易度と報酬に関する不確実性がある場合、どのプロジェクトに取り組むかという問題に答えている<ref name="farias2011irrevocable">{{Citation |title=The irrevocable multiarmed bandit problem |last=Farias |first=Vivek F |last2=Ritesh |first2=Madan |year=2011 |journal=[[Operations Research (journal)|Operations Research]] |volume=59 |number=2 |pages=383-399 |doi=10.1287/opre.1100.0891}}
</ref>。

[[第二次世界大戦]]で連合国の科学者によって検討されたが、それはあまりに難解なため、ピーター・ホイットルによれば、ドイツの科学者も時間を浪費できるようにと、この問題を[[ドイツ]]に投下することが提案されたのだという<ref name="Whittle79">{{Citation |title=Discussion of Dr Gittins' paper |author=[[Peter Whittle (mathematician)|Peter Whittle]] |year=1979|journal=[[Journal of the Royal Statistical Society]] |series=Series B |volume=41 |number=2| pages=148-177 |doi=10.1111/j.2517-6161.1979.tb01069.x}}</ref>。

現在一般的に分析されているのは、1952年にハーバート・ロビンスによって定式されたバージョンである。

== 多腕バンディットモデル ==
多腕バンディット（略称：バンディットまたは MAB）は、[[確率分布]] <math>B = \{R_1, \dots ,R_K\}</math> の集合と見做すことができる。各確率分布は、<math>K \in \mathbb{N}^+</math> 個のレバーのそれぞれによって配分される報酬に関連する。<math>\mu_1, \dots, \mu_K</math> を報酬分布の平均値とする。ギャンブラーは各ラウンドに1つのレバーを操作し、報酬を観察する。収集された報酬の合計を最大化することが目的である。地平線 <math>H</math> は残りのラウンド数である。バンディット問題は、形式的には1状態の[[マルコフ決定過程]]と同等である。<math>T</math> ラウンド後の後悔 <math>\rho</math> は、最適な戦略による報酬の合計と収集された報酬の合計との間の差の期待値として定義される。
:<math>\rho = T \mu^* - \sum_{t=1}^T \widehat{r}_t</math>
ここで、最大報酬平均 <math>\mu^*</math> は <math>\mu^* = \max_k \{ \mu_k \}</math> を満たす。<math>\widehat{r}_t</math> はラウンド ''t'' の報酬である。

ゼロ後悔戦略とは、ラウンドごとの平均後悔が <math>\rho / T</math> が確率 1 でゼロになる戦略である<ref name="Vermorel2005">{{Citation |title=Multi-armed bandit algorithms and empirical evaluation |last=Vermorel |first=Joannes |last2=Mohri |first2=Mehryar |year=2005 |url=https://bandit.sourceforge.net/Vermorel2005poker.pdf |publisher=Springer |series=In European Conference on Machine Learning |pages=437-448}}
</ref>。直感的には、十分なラウンドがプレイされれば、後悔ゼロの戦略は最適な戦略に収束することが保証される。

== 関連項目 ==
* [[Gittinsインデックス]] - 盗賊の問題を分析するための強力で一般的な戦略。
* [[貪欲法]]
* [[最適停止問題]]
* [[サーチ理論]]
* [[確率的スケジューリング]]

== 脚注 ==
{{Reflist}}

== 参考文献 ==
* {{Citation |title=Approximation algorithms for restless bandit problems |last1=Guha |first1=S. |last2=Munagala |first2=K.|last3=Shi |first3=P. |year=2010 |journal=Journal of the ACM |volume=58 |pages=1-50 |arxiv=0711.3861 |doi=10.1145/1870103.1870106 |s2cid=1654066}}
* {{Citation |title=Index policies for discounted bandit problems with availability constraints |last1=Dayanik |first1=S. |last2=Powell |first2=W. |last3=Yamazaki |first3=K. |year=2008 |journal=Advances in Applied Probability |volume=40 |issue=2 |pages=377-400 |doi=10.1239/aap/1214950209 |doi-access=free}}.
* {{Citation |title=Approximate Dynamic Programming: Solving the Curses of Dimensionality |last=Powell |first=Warren B. |year=2007 |publisher=John Wiley and Sons |contribution=Chapter 10 |location=New York |isbn=978-0-470-17155-4}}.
* {{Citation |title=Some aspects of the sequential design of experiments |author=[[Herbert Robbins]] |year=1952 |journal=[[Bulletin of the American Mathematical Society]] |volume=58 |issue=5 |pages=527-535 |doi=10.1090/S0002-9904-1952-09620-8 |doi-access=free}}.
* {{Citation |title=Reinforcement Learning |last1=Sutton |first1=Richard |last2=Barto |first2=Andrew |year=1998 |url=http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html |archive-url=https://web.archive.org/web/20131211192714/http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html |archive-date=2013-12-11 |publisher=MIT Press |isbn=978-0-262-19398-6 |url-status=dead}}

== 外部リンク ==
* [https://github.com/fmr-llc/mabwiser MABWiser], [[Open-Source|open source]] Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies with built-in parallelization and simulation capability.
* [https://mloss.org/software/view/415/ PyMaBandits], [[Open-Source|open source]] implementation of bandit strategies in Python and Matlab.
* [https://github.com/Nth-iteration-labs/contextual Contextual], [[open source]] [[R言語|R]] package facilitating the simulation and evaluation of both context-free and contextual Multi-Armed Bandit policies.
* [https://bandit.sourceforge.net/ bandit.sourceforge.net Bandit project], open source implementation of bandit strategies.
* [https://github.com/jkomiyama/banditlib Banditlib], [[Open-Source]] implementation of bandit strategies in C++.
* [https://archive.today/20121212095047/http://www.cs.washington.edu/research/jair/volume4/kaelbling96a-html/node6.html Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case].
* Tutorial: Introduction to Bandits: Algorithms and Theory. [http://techtalks.tv/talks/54451/ Part1]. [http://techtalks.tv/talks/54455/ Part2].
* [https://feynmanlectures.caltech.edu/info/exercises/Feynmans_restaurant_problem.html Feynman's restaurant problem], a classic example (with known answer) of the exploitation vs. exploration tradeoff.
* [http://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html Bandit algorithms vs. A-B testing].
* [http://homes.di.unimi.it/~cesabian/Pubblicazioni/banditSurvey.pdf S. Bubeck and N. Cesa-Bianchi A Survey on Bandits].
* [[arxiv:1508.03326|A Survey on Contextual Multi-armed Bandits]], a survey/tutorial for Contextual Bandits.
* [https://mpatacchiola.github.io/blog/2017/08/14/dissecting-reinforcement-learning-6.html Blog post on multi-armed bandit strategies, with Python code].
* [https://pavlov.tech/2019/03/02/animated-multi-armed-bandit-policies/ Animated, interactive plots] illustrating Epsilon-greedy, [[Thompson sampling]], and Upper Confidence Bound exploration/exploitation balancing strategies.

{{DEFAULTSORT:たわんはんていつともんたい}}
[[Category:機械学習]]
[[Category:確率的最適化]]
[[Category:数学に関する記事]]