
{{otheruses}}
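In [[数学|mathematics]], [[統計学|statistics]], and [[計算機科学|computer science]], and particularly in [[機械学習|machine learning]] and [[逆問題|inverse problems]], '''regularization''' (Japanese: 正則化, ''seisokuka'') is a technique that adds information in order to solve an [[不良設定問題|ill-posed problem]] or to prevent [[過学習|overfitting]]. It is usually introduced as a penalty on model complexity, for example a penalty on lack of smoothness or on the size of the [[ノルム|norm]] of the parameters.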

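The theoretical justification for regularization lies in [[オッカムの剃刀|Occam's razor]]. From a [[ベイジアン|Bayesian]] point of view, many regularization techniques correspond to placing prior information on the model parameters.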

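== Regularization in statistics and machine learning ==
In [[統計|statistics]] and [[機械学習|machine learning]], regularization is used when learning model parameters, in particular to prevent [[過学習|overfitting]] and to improve generalization.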

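The most common forms in machine learning are L1 regularization (''p''=1) and L2 regularization (''p''=2). Instead of the [[損失関数|loss function]] <math>E(\boldsymbol{w})</math>, one minimizes
: <math>E(\boldsymbol{w}) + \lambda \frac{1}{p} \| \boldsymbol{w} \|_p^p = E(\boldsymbol{w}) + \lambda \frac{1}{p} \sum_i |w_i|^p</math>
where <math>\boldsymbol{w}</math> is the parameter vector and <math>\| \cdot \|_p</math> is the L1 [[ノルム|norm]] (''p''=1), the L2 norm (''p''=2), and so on. <math>\lambda</math> is a hyperparameter, a positive constant: the larger it is, the stronger the regularization. Its value is chosen by [[交差確認|cross-validation]] or a similar procedure.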
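As a rough illustration of the regularized objective above, the following Python sketch evaluates <math>E(\boldsymbol{w}) + \lambda \frac{1}{p} \sum_i |w_i|^p</math> for ''p''=1 and ''p''=2. The function names and the toy loss <code>example_loss</code> are assumptions made for this example and do not come from any particular library.

```python
def lp_penalty(w, p):
    """Return (1/p) * sum_i |w_i|**p, i.e. the penalty term without lambda."""
    return sum(abs(wi) ** p for wi in w) / p


def regularized_loss(loss, w, lam, p):
    """Return E(w) + lam * (1/p) * ||w||_p^p for a given loss function E."""
    return loss(w) + lam * lp_penalty(w, p)


def example_loss(w):
    """Hypothetical loss: squared distance from a fixed target vector."""
    target = [1.0, -2.0]
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


w = [3.0, -4.0]
l1_value = regularized_loss(example_loss, w, lam=0.5, p=1)  # L1 regularization
l2_value = regularized_loss(example_loss, w, lam=0.5, p=2)  # L2 regularization
```

With the same <code>lam</code>, the two penalties differ only in how they weight large and small parameters: the L2 term grows quadratically with each parameter, the L1 term linearly.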

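Taking the partial derivative of this regularized objective with respect to a parameter gives
; For L2 regularization
: <math>\frac{ \partial E(\boldsymbol{w}) }{ \partial w_i } + \lambda w_i</math>
; For L1 regularization
: <math>\frac{ \partial E(\boldsymbol{w}) }{ \partial w_i } + \lambda \sgn(w_i)</math>
where the L1 expression holds for <math>w_i \neq 0</math>. Consequently, when [[最急降下法|steepest descent]] or [[確率的勾配降下法|stochastic gradient descent]] is used, each update moves a parameter toward 0 by an amount proportional to its magnitude under L2 regularization, and by a constant amount determined by <math>\lambda</math> (that is, <math>\lambda</math> times the learning rate) under L1 regularization.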
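The different shrinkage behaviour can be seen in a small Python sketch that applies only the penalty part of one gradient step. The learning rate <code>lr</code> and the helper names are assumptions for illustration; the gradient of <math>E</math> itself is left out, and the L1 subgradient at 0 is taken to be 0.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def penalty_step(w, lam, lr, p):
    """Apply the penalty part of one gradient step to every parameter.

    p=2: w_i <- w_i - lr * lam * w_i        (shrink in proportion to w_i)
    p=1: w_i <- w_i - lr * lam * sign(w_i)  (shrink by a constant amount)
    """
    if p == 2:
        return [wi - lr * lam * wi for wi in w]
    if p == 1:
        return [wi - lr * lam * sign(wi) for wi in w]
    raise ValueError("this sketch only covers p=1 and p=2")


w = [4.0, -0.5, 0.0]
after_l2 = penalty_step(w, lam=0.1, lr=1.0, p=2)  # large weights shrink more
after_l1 = penalty_step(w, lam=0.1, lr=1.0, p=1)  # every nonzero weight moves by 0.1
```

Note that a plain L1 step like this can overshoot 0 when a parameter is smaller in magnitude than <code>lr * lam</code>, which is one reason specialized update rules are used in practice.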

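This technique can be applied to many kinds of model. Applied to a [[線形回帰|linear regression]] model, the L1 case is called [[ラッソ回帰|lasso regression]]<ref name="lasso"/> and the L2 case [[リッジ回帰|ridge regression]]<ref name="ridge"/>. It is also used with [[ロジスティック回帰|logistic regression]], [[ニューラルネットワーク|neural networks]], [[サポートベクターマシン|support vector machines]], [[条件付き確率場|conditional random fields]], and other models. In the neural network literature, L2 regularization is also called weight decay (Japanese: 荷重減衰).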
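For a concrete case, consider ridge regression with a single feature and no intercept, assuming the loss <math>E(w) = \frac{1}{2} \sum_i (y_i - w x_i)^2</math>. Setting the derivative of the L2-regularized objective to zero gives <math>w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)</math>. The sketch below, whose function name is hypothetical, shows how a larger <math>\lambda</math> shrinks the fitted coefficient.

```python
def ridge_coefficient(x, y, lam):
    """Closed-form ridge solution for one feature and no intercept.

    Minimizes 0.5 * sum((y_i - w * x_i)**2) + 0.5 * lam * w**2.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly y = 2 * x

w_unregularized = ridge_coefficient(x, y, lam=0.0)
w_regularized = ridge_coefficient(x, y, lam=14.0)
```

Here the unregularized fit recovers the slope of the data, while the regularized coefficient is pulled toward 0 without reaching it.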

=== L1 regularization ===
With L1 regularization, some parameters can be driven exactly to 0. This amounts to performing [[特徴選択|feature selection]] and yields a [[スパースモデル|sparse model]]. When many parameters are 0, the model can be stored as a [[疎行列|sparse matrix]] and evaluated quickly. However, because the L1 norm involves absolute values, the objective is continuous but not differentiable wherever a parameter equals 0, so some gradient-based optimization algorithms have to be modified.<ref>{{cite journal
|author=Galen Andrew
|author2=Jianfeng Gao
|year=2007
|title=Scalable training of L₁-regularized log-linear models
|journal=Proceedings of the 24th International Conference on Machine Learning
|doi=10.1145/1273496.1273501
|isbn=9781595937933
}}</ref><ref>{{cite conference
 |last1=Tsuruoka
 |first1=Y.
 |last2=Tsujii
 |first2=J.
 |last3=Ananiadou
 |first3=S.
 |year=2009
 |title=Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty
 |conference=Proceedings of the AFNLP/ACL
 |url=http://aclweb.org/anthology-new/P/P09/P09-1054.pdf
}}</ref>

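When the loss function is the sum of squared errors, L1 regularization is equivalent to setting a parameter to 0 if its absolute value is at most ''λ'' and otherwise moving it toward 0 by ''λ''. Here the parameter means the value it would take without regularization, and the equivalence is exact when each parameter can be treated separately, for example when the features are orthonormal. This can be verified by taking the partial derivative of the objective with respect to the parameter. As a result, parameters with small values become exactly 0.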
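The rule described above is commonly known as soft thresholding. A minimal Python sketch follows, assuming each parameter can be handled independently (for example, orthonormal features with a squared-error loss); the function name is chosen for this example.

```python
def soft_threshold(z, lam):
    """Shrink z toward 0 by lam, returning 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Unregularized parameter values (hypothetical numbers).
unregularized = [3.0, 0.5, -0.2, -2.5]
shrunk = [soft_threshold(z, lam=1.0) for z in unregularized]
```

In this example the two parameters with absolute value at most 1.0 become exactly 0, and the other two each move toward 0 by 1.0.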

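Many machine learning methods do not work well unless the data are [[正規化|standardized]] to mean 0 and variance 1. L1 regularization in particular shrinks every parameter by the same amount ''λ'', which presupposes that all features are on a comparable scale, so it does not work well unless the data are standardized in this way.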
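Standardizing a feature to mean 0 and variance 1 can be sketched as follows, using only the Python standard library. The function name and the sample column are assumptions for illustration, and the population standard deviation is used as the scale.

```python
import statistics


def standardize(column):
    """Rescale a list of numbers to mean 0 and (population) variance 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        raise ValueError("a constant feature cannot be standardized")
    return [(value - mean) / std for value in column]


# Hypothetical feature column with mean 5 and standard deviation 2.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(feature)
```

In practice the mean and standard deviation would be computed on the training data only and then reused for any new data.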

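=== L0 regularization ===
L0 regularization penalizes the number of nonzero parameters. Because this turns the problem into a combinatorial optimization problem, it is computationally very expensive. When there are many parameters, a [[貪欲法|greedy algorithm]] is used to obtain an approximate solution. For linear models, generalized cross-validation can be used to decide which parameters to keep.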

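=== Information criteria ===
Bayesian learning methods, which use prior probabilities, can assign lower probability to more complex models. Commonly used model selection techniques include the [[赤池情報量規準|Akaike information criterion]] (AIC), [[最小記述長|minimum description length]] (MDL), and the [[ベイズ情報量規準|Bayesian information criterion]] (BIC).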

=== Techniques for linear models ===
The following table lists regularization techniques used with [[一般化線形モデル|generalized linear models]].

{|class="wikitable sortable"
!Model
!Fit measure
!Entropy measure<ref>{{cite book
|last1=Bishop
|first1=Christopher M.
|title=Pattern recognition and machine learning
|date=2007
|publisher=Springer
|location=New York
|isbn=978-0387310732
|edition=Corr. printing.
}}</ref><ref>{{cite book
|last1=Duda
|first1=Richard O.
|title=Pattern classification + computer manual : hardcover set
|date=2004
|publisher=Wiley
|location=New York [u.a.]
|isbn=978-0471703501
|edition=2.
}}</ref>
|-
|[[赤池情報量規準|AIC]]/[[ベイズ情報量規準|BIC]]
|<math>\|Y-X\beta\|_2</math>
|<math>\|\beta\|_0</math>
|-
| [[リッジ回帰|Ridge regression]]<ref name="ridge">{{Cite journal
 |title=Ridge regression: Biased estimation for nonorthogonal problems
 |author=Arthur E. Hoerl
 |author2=Robert W. Kennard
 |journal=Technometrics
 |volume=12
 |issue=1
 |year=1970
 |pages=55-67
}}</ref>
| <math>\|Y-X\beta\|_2</math> 
| <math>\|\beta\|_2</math>
|-
|[[ラッソ回帰|Lasso regression]]<ref name="lasso">{{Cite journal
 | last = Tibshirani
 | first = Robert
 | title = Regression Shrinkage and Selection via the Lasso
 | journal = Journal of the Royal Statistical Society, Series B
 | year = 1996
 | volume = 58
 | issue = 1
 | pages = 267&ndash;288
 | url = http://statweb.stanford.edu/~tibs/lasso/lasso.pdf
 | mr = 1379242 |jstor=2346178 |doi=10.1111/j.2517-6161.1996.tb02080.x |issn=1369-7412
}}</ref>
|<math>\|Y-X\beta\|_2</math>
|<math>\|\beta\|_1</math>
|-
|[[エラスティックネット|Elastic net]]<ref>{{cite journal
|title=Regularization and variable selection via the Elastic Net
|author=Hui Zou
|author2=Trevor Hastie
|journal=Journal of the Royal Statistical Society, Series B
|year=2005
|url=https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf
}}</ref>
|<math>\|Y-X\beta\|_2</math>
|<math>\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2</math>
|-
| Basis pursuit denoising
| <math>\|Y-X\beta\|_2</math> 
| <math>\lambda\|\beta\|_1</math>
|-
|Rudin-Osher-Fatemi model (TV) || <math>\|Y-X\beta\|_2</math> || <math>\lambda\|\nabla\beta\|_1</math>
|-
| Potts model
| <math>\|Y-X\beta\|_2</math> 
| <math>\lambda\|\nabla\beta\|_0</math>
|-
|RLAD<ref>{{Cite conference
 | author = Li Wang, Michael D. Gordon & Ji Zhu
 | title = Regularized Least Absolute Deviations Regression and an Efficient Algorithm for Parameter Tuning
 | book-title = Sixth International Conference on Data Mining
 | date= 2006
 | pages = 690&ndash;700
 | doi = 10.1109/ICDM.2006.134
}}</ref>
| <math>\|Y-X\beta\|_1</math> 
| <math>\|\beta\|_1</math>
|-
|Dantzig selector<ref>{{Cite journal
 | last = Candes
 | first = Emmanuel
 | author2=Tao, Terence
 | authorlink2=テレンス・タオ
 | title = The Dantzig selector: Statistical estimation when ''p'' is much larger than ''n''
 | journal = Annals of Statistics
 | year = 2007
 | volume = 35
 | issue = 6
 | pages = 2313&ndash;2351
 | doi = 10.1214/009053606000001523
 | mr = 2382644
 | arxiv = math/0506081
}}</ref>
| <math>\|X^\top (Y-X\beta)\|_\infty</math>
| <math>\|\beta\|_1</math>
|-
|SLOPE<ref>{{Cite journal
 | author = Małgorzata Bogdan, Ewout van den Berg, Weijie Su & Emmanuel J. Candes
 | title = Statistical estimation and testing via the ordered L1 norm
 | journal = arXiv preprint arXiv:1310.1969
 | year = 2013
 | arxiv = 1310.1969v2
}}</ref>
| <math>\|Y-X\beta\|_2</math> 
| <math>\sum_{i=1}^p \lambda_i|\beta|_{(i)}</math>
|}

== Regularization in inverse problems ==
{{see also|逆問題}}
In 1943, Andrey Nikolayevich Tikhonov published Tikhonov regularization, a generalization of L2 regularization, as a method for inverse problems.<ref>{{Cite journal
 | last=Tikhonov
 | first=Andrey Nikolayevich
 | year=1943
 | title=Об устойчивости обратных задач
 | trans-title=On the stability of inverse problems
 | journal=Doklady Akademii Nauk SSSR
 | volume=39
 | issue=5
 | pages=195–198
}}</ref> See [[逆問題|inverse problem]] for details.

== See also ==
* [[逆問題|Inverse problem]]
* [[オッカムの剃刀|Occam's razor]]
* [[過剰適合|Overfitting]]

== References ==
{{reflist}}

{{統計学}}

{{デフォルトソート:せいそくか}}
[[Category:統計学]]
[[Category:機械学習]]
[[Category:数学に関する記事]]