
{{Expand English|Random forest|date=2024年5月}}
{{Machine learning bar}}
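'''Random forest''' (also called ''randomized trees'') is a [[機械学習|machine learning]] [[アルゴリズム|algorithm]] proposed in 2001 by {{仮リンク|レオ・ブレイマン|en|Leo Breiman}}<ref>{{cite journal
 |first=Leo |last=Breiman
 |title=Random Forests
 |journal=Machine Learning
 |year=2001 |volume=45 |issue=1
 |pages=5&ndash;32
 |doi=10.1023/A:1010933404324
}}</ref> and used for [[分類 (統計学)|classification]], [[回帰分析|regression]], and [[クラスタリング|clustering]]. It is an [[アンサンブル学習|ensemble learning]] algorithm that uses [[決定木|decision trees]] as weak learners; the name comes from its use of a large number of decision trees, each trained on randomly sampled training data. [[ディープ・フォレスト (機械学習)|Deep forest]] is an algorithm that stacks random forests into multiple layers. For some problems, random forests are considered more effective than [[ブースティング|boosting]], which is also an ensemble learning method.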

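== Algorithm ==
=== Training ===
# From the observed data to be learned, generate ''B'' subsamples by random sampling with the [[ブートストラップ法|bootstrap method]]
# Using each subsample as training data, build ''B'' decision trees
# Grow each tree by creating nodes as follows, until the minimum node size <math>n_\mathrm{min}</math> (the number of training samples in a node) is reached
## Randomly select ''m'' of the explanatory variables of the training data
## Among the selected variables, take the variable and threshold that best separate the training data, and use them as the split function of the node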

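The key point is that training on randomly sampled data with randomly selected explanatory variables produces a set of decision trees that are only weakly correlated with one another.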
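The training steps above can be sketched in plain Python as follows. This is a minimal illustration for classification, not Breiman's reference implementation: the Gini impurity criterion, the tuple-based tree representation, and the names <code>gini</code>, <code>best_split</code>, <code>grow_tree</code>, <code>fit_forest</code>, and <code>predict_tree</code> are all assumptions made for this example. Class labels are assumed not to be tuples, because tuples are used for internal nodes.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    """Return the (feature, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(y)
    for j in features:
        # The largest value is skipped so the right child is never empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [y[i] for i, row in enumerate(X) if row[j] <= t]
            right = [y[i] for i, row in enumerate(X) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow_tree(X, y, m, n_min, rng):
    """Grow one tree. A leaf is a class label; a node is (j, t, left, right)."""
    if len(y) <= n_min or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]
    features = rng.sample(range(len(X[0])), m)  # step 3.1: m random variables
    split = best_split(X, y, features)          # step 3.2: best variable and threshold
    if split is None:                           # no candidate split improves impurity
        return Counter(y).most_common(1)[0][0]
    j, t = split
    left = [i for i, row in enumerate(X) if row[j] <= t]
    right = [i for i, row in enumerate(X) if row[j] > t]
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, n_min, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, n_min, rng))

def fit_forest(X, y, B, m, n_min=1, seed=0):
    """Steps 1 and 2: draw B bootstrap samples and grow one tree on each."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                m, n_min, rng))
    return forest

def predict_tree(tree, x):
    """Follow the split functions from the root down to a leaf label."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree
```

Because each tree sees a different bootstrap sample and a different random subset of variables at every node, the trees in <code>forest</code> generally differ from one another, which is the source of the low correlation described above.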

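'''Recommended parameter values'''
* <math>n_\mathrm{min}</math> (minimum node size): 1 for classification, 5 for regression
* ''m'': with ''p'' the total number of explanatory variables, <math>\sqrt{p}</math> for classification and ''p/3'' for regression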
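As a small worked example, the recommended values can be computed from ''p'' as shown below. The function names are hypothetical, and rounding down to an integer with a lower bound of 1 is an assumption of this sketch; the list above gives only the unrounded expressions.

```python
import math

def default_m(p, task):
    """Number of variables tried at each node, for p explanatory variables."""
    if task == "classification":
        return max(1, math.floor(math.sqrt(p)))  # sqrt(p), rounded down (assumption)
    if task == "regression":
        return max(1, p // 3)                    # p/3, rounded down (assumption)
    raise ValueError("task must be 'classification' or 'regression'")

def default_n_min(task):
    """Recommended minimum node size."""
    if task == "classification":
        return 1
    if task == "regression":
        return 5
    raise ValueError("task must be 'classification' or 'regression'")

# default_m(16, "classification") -> 4
# default_m(9, "regression")      -> 3
```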

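=== Prediction ===
The final output is determined as follows:
* '''Classification:''' if the decision trees output class labels, the majority vote; if they output probability distributions, the class with the largest averaged probability
* '''Regression:''' the mean of the decision tree outputs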
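The three aggregation rules above can be sketched as follows. The function names and the representation of a probability distribution as a dictionary from class to probability are assumptions made for this illustration.

```python
from collections import Counter

def majority_vote(labels):
    """Classification from per-tree class labels: the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def argmax_mean_probability(distributions):
    """Classification from per-tree class distributions (dicts class -> prob):
    average the distributions and return the class with the largest mean."""
    classes = distributions[0].keys()
    mean = {c: sum(d[c] for d in distributions) / len(distributions)
            for c in classes}
    return max(mean, key=mean.get)

def mean_output(values):
    """Regression: the arithmetic mean of the per-tree outputs."""
    return sum(values) / len(values)
```

The two classification rules can disagree: for the per-tree distributions <code>{"a": 0.6, "b": 0.4}</code> and <code>{"a": 0.1, "b": 0.9}</code>, each tree votes for a different class, while averaging the probabilities clearly favours <code>"b"</code>.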

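== Characteristics ==
===Advantages===
* Works well even with a large number of explanatory variables
* Training and prediction are fast
* The decision trees are trained completely independently of one another, so training can be parallelized
* The importance (contribution) of each explanatory variable can be computed
* Computing the out-of-bag (OOB) error gives an evaluation similar to cross-validation
* Because it depends less on any particular explanatory variable than methods such as [[AdaBoost]], it still gives good outputs when some explanatory variables of the query data are missing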
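The out-of-bag error mentioned in the list above can be illustrated with the following sketch. Each bootstrap sample leaves some training samples unused, and each of those samples is scored only by the models that did not see it. The function <code>oob_error</code> and its <code>fit</code>/<code>predict</code> callback interface are assumptions of this example; the nearest-neighbour learner is only a stand-in that keeps the example short, and in a random forest the base learner would be a decision tree.

```python
import random
from collections import Counter

def oob_error(X, y, fit, predict, B, seed=0):
    """Misclassification rate estimated from out-of-bag votes.

    fit(X, y) returns a model; predict(model, x) returns a class label.
    """
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        for i in set(range(n)) - set(idx):           # samples left out of the bag
            votes[i][predict(model, X[i])] += 1
    scored = [i for i in range(n) if votes[i]]       # samples with at least one vote
    if not scored:
        raise ValueError("no sample was ever out of bag; increase B")
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored)

# Stand-in base learner (1-nearest neighbour), used only for illustration.
def fit_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    nearest = min(model, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

# Two well-separated clusters on a line.
X_demo = [[float(i)] for i in range(10)] + [[100.0 + i] for i in range(10)]
y_demo = [0] * 10 + [1] * 10
err = oob_error(X_demo, y_demo, fit_1nn, predict_1nn, B=25)
```

No separate validation set is needed, which is why the estimate is described as similar to cross-validation.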

===Disadvantages===
* Performs poorly when the meaningful explanatory variables are far outnumbered by noise variables

== Implementations ==
'''Open-source implementations'''
* [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm The Original RF] by Breiman and Cutler. Written in Fortran 77. May be difficult to configure.
* [http://www.alglib.net/dataanalysis/decisionforest.php ALGLIB] contains an implementation of a modified random forest algorithm in C#, C++, Pascal, VBA.
* [http://fast-random-forest.googlecode.com/ FastRandomForest] Efficient implementation in Java, utilizes multiple cores. Integrates into the [[Weka]] environment.
* [http://www.ailab.si/orange/doc/modules/orngEnsemble.htm orngEnsemble] module within [[Orange (ソフトウェア)|Orange]] data mining software suite
* [http://www.irb.hr/en/research/projects/it/2004/2004-111/ PARF] Written in Fortran 90. Can distribute work over a cluster of computers using MPI. 
* [http://cran.r-project.org/web/packages/party/index.html party] an implementation of Breiman's random forests based on conditional inference trees for [[R言語|R]]
* [http://cran.r-project.org/web/packages/randomForest/index.html randomForest] for R
* [http://www.randomjungle.org Random Jungle] is a fast implementation for high-dimensional data. (C++, parallel computing, sparse memory, Linux+Windows)
* [https://tmva.sourceforge.net/ TMVA] Toolkit for Multivariate Data Analysis implements random forests.
* [http://waffles.sourceforge.net Waffles] A C++ library of machine learning algorithms, including RF.
* [https://code.google.com/archive/p/randomforest-matlab Matlab version.]
'''Commercial implementations'''
* [http://www.salford-systems.com Random Forests.]

== External links ==
* [http://stat-www.berkeley.edu/users/breiman/RandomForests/cc_home.htm Random Forests classifier description] (description by Leo Breiman and Adele Cutler)
* [http://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf Liaw, Andy & Wiener, Matthew "Classification and Regression by randomForest" R News (2002) Vol. 2/3 p. 18] (Discussion of the use of the random forest package for [[R言語|R]])
* [http://cm.bell-labs.com/cm/cs/who/tkh/papers/compare.pdf Ho, Tin Kam (2002). "A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors". Pattern Analysis and Applications 5, p. 102-112] (Comparison of bagging and random subspace method) 
* [https://doi.org/10.1007/978-3-540-74469-6_35 Prinzie, A., Van den Poel, D. (2007). Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB, Dexa 2007, Lecture Notes in Computer Science, 4653, 349-358.] Generalizing Random Forests framework to other methods. The paper introduces Random MNL and Random NB as two generalizations of Random Forests.
* [https://doi.org/10.1016/j.eswa.2007.01.029 Prinzie, A., Van den Poel, D. (2008). Random Forests for multiclass classification: Random MultiNomial Logit, Expert Systems with Applications, 34(3), 1721-1732.] Generalization of Random Forests to choice models like the Multinomial Logit Model (MNL): Random Multinomial Logit.
* [http://kappa.math.buffalo.edu Stochastic Discrimination and Its Implementation] (Site of Eugene Kleinberg).

== References ==
{{Reflist}}

{{統計学}}

{{デフォルトソート:らんたむふおれすと}} 
[[Category:分類アルゴリズム]]
[[Category:教師あり学習]]
[[Category:機械学習]]
[[Category:決定木]]