確率的勾配降下法のソースを表示

[[Image:stogra.png|350px|thumb|right|ミニバッチを使い上下に行ったり来たりしながら目的関数の値が減少していく例]]

'''確率的勾配降下法'''（かくりつてきこうばいこうかほう、{{lang-en-short|stochastic gradient descent, SGD}}）は、連続[[最適化問題]]に対する[[勾配法]]の[[乱択アルゴリズム]]。バッチ学習である[[最急降下法]]をオンライン学習に改良したアルゴリズムである。目的関数が[[微分可能]]な和の形であることを必要とする。

== 背景 ==
下記の和の形の目的関数を最小化する問題を扱う。
: <math>Q(w) = \sum_{i=1}^n Q_i(w)</math>
パラメータ <math>w^*</math> はQ(w) を最小化するように[[推定量|推定]]する。典型的には、<math>Q_i</math> は i 番目の訓練データ。

古典的な統計学において、和の最小化問題は、[[最小二乗法|最小二乗問題]]や[[最尤推定]]問題などにあらわれる。一般的なケースでは、和を最小化する推定量はM推定量と呼ぶ。しかしながら、[[:en:Thomas S. Ferguson|Thomas S. Ferguson]]の例<ref>{{cite journal
 | last = Ferguson
 | first = Thomas S.
 | title = An inconsistent maximum likelihood estimate
 | journal = Journal of the American Statistical Association
 | volume = 77
 | issue = 380
 | year = 1982
 | pages = 831–834
 | jstor = 2287314
 | doi = 10.1080/01621459.1982.10477894
 }}</ref>などで示されるように、いくつかの最尤推定の問題において、最小解ではなく局所解を要求するだけでも制限が厳しすぎると長い間認識され続けてきた。それゆえ、現代の統計理論家は最尤関数の停留点(微分がゼロになる点)を考慮する事が多くなった。

和の最小化問題は{{仮リンク|経験損失最小化|en|Empirical risk minimization}}の問題にも現れる。<math>Q_i(w)</math> の値が i 番目の訓練データであるならば、Q(w) が経験損失である。

上記の関数 Q を最小化する際、標準的な[[最急降下法]]（バッチ学習）では、下記の反復を繰り返す。
: <math>w := w - \eta \nabla Q(w) = w - \eta \sum_{i=1}^n \nabla Q_i(w)</math>
<math>\eta</math> はステップサイズと呼ばれる。[[機械学習]]においては{{仮リンク|学習率|en|Learning rate}}とも呼ばれる。

確率分布がパラメータが一つの[[指数型分布族]]などで、勾配の総和の計算が、小さな計算量で出来てしまう事もあるが、一つ一つの勾配を計算して総和を取らないといけない事も多い。そのような場合、和の全体ではなく、和の一部分だけを計算する事で、1回の反復の計算量を小さくする事ができる。これは大規模な機械学習の問題で効果的である<ref>{{Cite conference 
|first1=Léon 
|last1=Bottou 
|last2=Bousquet 
|first2=Olivier 
|title=The Tradeoffs of Large Scale Learning 
|url=http://leon.bottou.org/papers/bottou-bousquet-2008 
|conference=Advances in Neural Information Processing Systems
|volume=20 
|pages=161–168 
|year=2008
}}</ref>。

== 反復法 ==
確率的勾配降下法（オンライン学習）では、Q(w) の勾配は、1つの訓練データから計算した勾配で近似する。

上記の更新を1つ1つの訓練データで行い、訓練データ集合を一周する。収束するまで訓練データ集合を何周もする。一周するたびに訓練データはランダムにシャッフルする。AdaGrad などの適応学習率のアルゴリズムを使用すると収束が速くなる。

擬似コードでは、確率的勾配降下法は下記になる。

 パラメータ <math>w</math> と学習率 <math>\eta</math> の初期値を選ぶ
 '''while''' 収束するか所定の反復回数まで反復する '''do'''
     <math>Q_i</math>（訓練データ）をランダムにシャッフルする
     '''for each''' i = 1, 2, ..., n '''do'''
         <math>\! w := w - \eta \nabla Q_i(w)</math>

全てではないが複数の訓練データで勾配を計算する方法をミニバッチと言う。この方法は、コンピュータの[[SIMD]]を有効活用でき計算を高速化できる。また、複数の訓練データを使うので収束がよりなめらかになる事もある。

確率的勾配降下法の収束性は[[凸最適化]]と確率近似の理論を使い解析されている。目的関数が[[凸関数]]もしくは疑似凸関数であり、学習率が適切な速度で減衰し、さらに、比較的緩い制約条件を付ければ、確率的勾配降下法は[[ほとんど (数学)|ほとんど確実に]]最小解に収束する。目的関数が凸関数でない場合でも、ほとんど確実に局所解に収束する<ref>{{Cite book
     |last=Bottou
     |first=Léon
     |contribution=Online Algorithms and Stochastic Approximations
     |year=1998
     |title=Online Learning and Neural Networks
     |publisher=Cambridge University Press
     |url=http://leon.bottou.org/papers/bottou-98x
     |isbn=978-0-521-65263-6
     |postscript=
}}</ref><ref>{{cite news
     |last=Kiwiel
     |first=Krzysztof C.
     |title=Convergence and efficiency of subgradient methods for quasiconvex minimization
     |journal=Mathematical Programming  (Series A)
     |publisher=Springer|location=Berlin, Heidelberg
     |issn=0025-5610
     |pages=1–25
     |volume=90
     |issue=1
     |doi=10.1007/PL00011414
     |year=2001 
     |mr=1819784
}}</ref>。これは Robbins-Siegmund の定理による<ref>{{Cite book
     |last1=Robbins
     |first1=Herbert
     |last2=Siegmund
     |first2=David O.
     |contribution=A convergence theorem for non negative almost supermartingales and some applications
     |title=Optimizing Methods in Statistics
     |publisher=Academic Press
     |year=1971
     |editor-last=Rustagi
     |editor-first=Jagdish S.
     |postscript=
}}</ref>。

<math>Q_i</math>（訓練データ）がランダムにシャッフルされる事により、確率的に局所解にはまりにくくなる効果がある。

== 学習率の調整方法および変種 ==
基本的な確率的勾配降下法に対して多くの改良が提案されている。特に、機械学習において、ステップサイズ（学習率）の調整は重要問題として認識されている。学習率を大きくしすぎると発散し、小さくしすぎると収束まで遅くなる。

=== 確率的近似法 ===
1951年に Herbert Robbins と Sutton Monro が発表<ref>{{cite journal
|title=A Stochastic Approximation Method
|year=1951
|author=Herbert Robbins
|author2=Sutton Monro
|journal=Ann. Math. Statist. 
|volume=22
|issue=3
|pages=400-407
|doi=10.1214/aoms/1177729586
}}</ref>。学習率をイテレーション回数の逆数で減衰させる方法。Robbins-Monro法とも言われる。
: <math>\eta_t = \frac{\eta_0}{t}</math>

=== Nesterovの加速勾配降下法 ===
1983年に Yurii Nesterov が発表<ref>{{cite journal
|title=A method of solving a convex programming problem with convergence rate O(1/k<sup>2</sup>)
|author=Yurii Nestero
|year=1983
|journal=Soviet Mathematics Doklady
|volume=27
|pages=372–376
}}</ref>。
: <math>\begin{align}
x_0 &= w_0 \\
x_t &= w_{t-1} - \eta \nabla Q_i(w_{t-1}) \\
w_t &= x_t + \frac{t-1}{t+2}(x_t - x_{t-1})
\end{align}</math>

=== モメンタム法 ===
1986年に[[デビッド・ラメルハート]]らが[[バックプロパゲーション]]と共に提案した方法<ref name="Rumelhart1986">{{cite journal
 |last=Rumelhart
 |first=David E.
 |author2=Hinton, Geoffrey E.
 |author3=Williams, Ronald J.
 |title=Learning representations by back-propagating errors
 |journal=Nature
 |date=8 October 1986
 |volume=323
 |issue=6088
 |pages=533–536
 |doi=10.1038/323533a0
}}</ref>。
:<math>\begin{align}
\Delta w &:= \eta \nabla Q_i(w) + \alpha \Delta w \\
w &:= w - \Delta w
\end{align}</math>

=== 平均化法 ===
1988年に David Ruppert が提案した方法<ref>{{cite journal
 |author=David Ruppert
 |title=Efficient estimations from a slowly convergent Robbins-Monro process
 |journal=Technical Report
 |volume=781
 |publisher=Cornell University Operations Research and Industrial Engineering
 |year=1988
 |url=https://ecommons.cornell.edu/handle/1813/8664
}}</ref>。
:<math>\bar{w} = \frac{1}{t} \sum_{i=1}^{t} w_i</math>
を計算し、最終的にパラメータの平均値を学習結果とする。

=== Truncated Gradient ===
2009年に John Langford らが発表した方法<ref>{{cite journal
|author=John Langford
|author2=Lihong Li
|author3=Tong Zhang
|title=Sparse Online Learning via Truncated Gradient
|journal=Journal of Machine Learning Research
|volume=10
|year=2009
|pages=777-801
|url=http://www.jmlr.org/papers/volume10/langford09a/langford09a.pdf
}}</ref>。L1 [[正則化]]を含む場合、確率的勾配降下法だとパラメータが 0 になりにくいが、K 回毎にパラメータの大きさが θ 以下であれば、0にする方法。

=== 正則化双対平均化法（Regularized Dual Averaging Method） ===
2009年に Lin Xiao が発表した方法<ref>{{cite journal
|author=Lin Xiao
|title=Dual Averaging Method for Regularized Stochastic Learning and Online Optimization
|journal=In Advances in Neural Information Processing Systems
|volume=23
|year=2009
|url=http://papers.nips.cc/paper/3882-dual-averaging-method-for-regularized-stochastic-learning-and-online-optimization.pdf
}}</ref><ref name="note_adgrad">{{cite journal
|first=Joseph 
|last=Perla 
|year=2013
|title=Notes on AdaGrad 
|url=http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf
}}</ref>。目的関数が下記のように汎化能力を高めるために L1 [[正則化]]を含む場合、確率的勾配降下法だとパラメータが 0 になりにくく、そのための対策をした方法。以下、この手法では Q(w) には <math>\lambda \|w\|_1</math> を含めずに、L1 正則化の効果を実現する。
: <math>Q(w) + \lambda \|w\|_1</math>
まず、勾配の平均を計算する。
: <math>\overline{g}_t = \frac{1}{t} \sum_{t'=1}^t \nabla Q(w)_{t'}</math>
その上で、パラメータの更新は以下の通り。ここでパラメータの初期値は0としている。
:<math>w_i := \begin{cases}
  0 & \text{if } |\overline{g}_{t, i}| \le \lambda, \\
  - \dfrac{ \sqrt{t} }{\gamma} \left( \overline{g}_{t, i} - \lambda \sgn(\overline{g}_{t, i}) \right) & \text{otherwise.}
\end{cases}</math>

L1 正則化と L2 正則化を
:<math>Q(w) + \lambda \|w\|_1 + \frac{\sigma}{2} \|w\|_2^2</math>
の形で混ぜる場合は、このようになる。
:<math>w_i := \begin{cases}
  0 & \text{if } |\overline{g}_{t, i}| \le \lambda, \\
  - \dfrac{1}{\sigma} \left(\overline{g}_{t, i} - \lambda \sgn(\overline{g}_{t, i}) \right) & \text{otherwise.}
\end{cases}</math>

以下のように、<math>\lambda</math> を少しずつ大きくしていくと、疎になる度合いを徐々に高めていける。
:<math>\lambda = \lambda_0 + \rho / \sqrt{t}</math>

=== AdaGrad ===
2011年に John Duchi らが発表した方法<ref>{{cite journal
|title=Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
|author=John Duchi
|author2=Elad Hazan
|author3=Yoram Singer
|url=http://jmlr.org/papers/v12/duchi11a.html
|year=2011
|journal=The Journal of Machine Learning Research
|volume=12
|pages=2121-2159 
}}</ref>。<math>\circ</math> は[[アダマール積]]（要素ごとの積）。下記計算、全てパラメータごと（要素ごと）に計算する。<math>\epsilon</math> は無限大に発散させないための正の小さな定数。
:<math>\begin{align}
r_0 &= \epsilon \\
r_t &= r_{t - 1} + \nabla Q_i(w) \circ \nabla Q_i(w) \\
\eta_t &= \frac{\eta_0}{\sqrt{r_t}} \\
w_{t+1} &= w_t - \eta_t \circ \nabla Q_i(w)
\end{align}</math>

正則化双対平均化法と AdaGrad を組み合わせる方法が、AdaGrad の発表と共に2011年に出ている<ref name="note_adgrad"/>。
:<math>\begin{align}
u &:= u + \nabla Q(w) \\
r &:= r + \nabla Q_i(w) \circ \nabla Q_i(w) \\
w_i &:= \begin{cases}
  0 & \text{if } |u_i| / t \le \lambda, \\
  - \sgn(u_i) \dfrac{\eta t}{\sqrt{r_i}} \left( \dfrac{|u_i|}{t} - \lambda \right) & \text{otherwise.}
\end{cases}
\end{align}</math>

=== RMSProp ===
2012年に Tijmen Tieleman らが発表した方法<ref>{{cite journal
|title=Lecture 6.5 - rmsprop, COURSERA: Neural Networks for Machine Learning
|author=Tijmen Tieleman
|author2=G. Hinton
|year=2012
}}</ref>。AdaGrad の変形。勾配の2乗の指数[[移動平均]]を取るように変更。<math>\beta = 0.9</math> などを使用。
:<math>\begin{align}
r_t &= \beta r_{t - 1} + (1 - \beta) \nabla Q_i(w) \circ \nabla Q_i(w) \\
\eta_t &= \frac{\eta_0}{\sqrt{r_t + \epsilon}} \\
w_{t+1} &= w_t - \eta_t \circ \nabla Q_i(w)
\end{align}</math>

=== AdaDelta ===
2012年に Matthew D. Zeiler が発表した方法<ref>{{cite journal
|title=ADADELTA: An Adaptive Learning Rate Method
|author=Matthew D. Zeiler
|year=2012
|url=https://arxiv.org/abs/1212.5701
}}</ref>。AdaGrad や RMSProp の変形。初期学習率のハイパーパラメータがなくなっている。
:<math>\begin{align}
r_t &= \beta r_{t - 1} + (1 - \beta) \nabla Q_i(w) \circ \nabla Q_i(w) \\
v_t &= \frac{\sqrt{s_t} + \epsilon}{\sqrt{r_t} + \epsilon} \circ \nabla Q_i(w) \\
s_{t+1} &= \beta s_t + (1 - \beta) v_t \circ v_t \\
w_{t+1} &= w_t - v_t
\end{align}</math>

=== Sum of Functions Optimizer ===
2014年に Jascha Sohl-Dickstein らが発表した方法<ref>{{cite journal
|title=Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods
|author=Jascha Sohl-Dickstein
|author2=Ben Poole
|author3=Surya Ganguli
|year=2014
|url=https://arxiv.org/abs/1311.2115
|journal=Proceedings of the 31 st International Conference on Machine Learning
|volume=32
}}</ref>。確率的勾配降下法と記憶制限[[準ニュートン法]]の L-BFGS を組み合わせた方法。二次収束するようになり、収束が AdaGrad などよりも速くなった。

=== Adam ===
2015年に Diederik P. Kingma らが発表した方法<ref>{{cite journal
|title=Adam: A Method for Stochastic Optimization
|author=Diederik Kingma
|author2=Jimmy Ba
|journal=Proceedings of the 3rd International Conference for Learning Representations, San Diego
|year=2015
|url=https://arxiv.org/abs/1412.6980
}}</ref>。AdaGrad, RMSProp, AdaDelta の変形。AdaGrad や Sum of Functions Optimizer よりも収束が速くなった。ハイパーパラメータは <math>\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}</math> を推奨。イテレーション回数 t は 1 から始める。
:<math>\begin{align}
m_0 &= v_0 = 0 \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla Q_i(w) \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) \nabla Q_i(w) \circ \nabla Q_i(w) \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
w_t &= w_{t - 1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align}</math>

=== AdaBound ===
2019年のICLRでLiangchen Luoらが発表した方法<ref>{{cite journal
|title=Adaptive Gradient Methods with Dynamic Bound of Learning Rate
|authors=Luo, Liangchen and Xiong, Yuanhao and Liu, Yan and Sun, Xu
|journal=Proceedings of the 7th International Conference on Learning Representations, New Orleans, Louisiana
|year=2019
|url=https://www.luolc.com/publications/adabound/
}}</ref>。
Adamに学習率の制限(Bound)を加え、ステップごとにSGDへ連続的に変化させることによって、Adamの収束速度とSGDの汎化性能の両立を目指した。論文中でのハイパーパラメータと学習率の下限・上限は <math>\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999, \eta_l(t) = 0.1 - \frac{0.1}{(1 - \beta_2)t + 1} , \eta_u(t) = 0.1 + \frac{0.1}{(1 - \beta_2)t}</math>であり、Adamと同様にt=1から始める。 
:<math>\begin{align}
m_0 &= v_0 = 0 \\
g_t &= \nabla Q_i(w) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \text{and } V_t = \text{diag}(v_t) \\
\hat{\eta}_t &= \text{Clip}(\frac{\alpha}{\sqrt{V_t}}, \eta_l(t), \eta_u(t)) \\
w_t &= w_{t-1} - \frac{\hat{\eta}_t}{\sqrt{t}} \circ m_t
\end{align}</math>

== パラメータの初期値 ==
パラメータ <math>w</math> の初期値はなんらかの[[確率分布]]からランダムに選ぶ。どの確率分布を使うかは、最小値の近傍に収束する確率に影響がある。しかし、何が適切な確率分布かはモデル次第である。[[ニューラルネットワーク]]の場合については[[バックプロパゲーション]]の項目を参照。

== 平均・分散の調整 ==
確率的勾配降下法は入力の値に極端に平均・分散が異なる物が混じると、うまく行かなくなる確率が上がる。よって、モデル自体に線形変換をかけるなどして、訓練データの正規化をして、平均0分散1になるように調整する方が良い。

=== 正規化オンライン学習 ===
学習データがリアルタイムで手に入るなど、事前に平均0分散1に調整できない場合のために、2013年に Stephane Ross らが正規化オンライン学習を発表している<ref>{{cite journal
|title=Normalized Online Learning
|author=Stephane Ross
|author2=Paul Mineiro
|author3=John Langford
|journal=Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence
|year=2013
|url=https://arxiv.org/abs/1305.6646
}}</ref>。データの絶対値の最大値を追跡して、それを元に調整する。

== 脚注 ==
{{脚注ヘルプ}}
{{Reflist}}

== 関連項目 ==
* [[最適化問題]]
* [[最急降下法]]
* [[甘利俊一]]

{{最適化アルゴリズム}}

{{DEFAULTSORT:かくりつてきこうはいこうかほう}}
[[Category:勾配法]]
[[Category:統計学的近似]]
[[Category:機械学習アルゴリズム]]
[[Category:数学に関する記事]]
[[Category:確率的最適化]]