
{{Expand English|Autoencoder|date=2023年3月}}
{{Machine learning bar}}
An '''autoencoder''' (Japanese: オートエンコーダ, 自己符号化器) is an [[algorithm]] in [[machine learning]] that uses a [[neural network]] for [[dimensionality reduction]]. Its use for dimensionality reduction with deep networks was proposed in [[2006]] by [[Geoffrey Hinton]] and colleagues.<ref name="hinton2006">{{Cite journal
 |author=Geoffrey E. Hinton
 |author2=R. R. Salakhutdinov
 |title=Reducing the Dimensionality of Data with Neural Networks
 |journal=Science
 |volume=313
 |issue=5786
 |date=2006-07-28
 |pages=504-507
 |url=https://www.cs.toronto.edu/~hinton/absps/science.pdf
 }}</ref>

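== Overview ==
[[File:AutoEncoder.png|right|250px]]
In its basic form, an autoencoder is a three-layer neural network trained by [[unsupervised learning]], with the same data presented at the input layer and used as the target at the output layer. When the training data are real-valued and unbounded, the output-layer activation function is often chosen to be the [[identity function]], so that the output layer is a linear transformation. If the hidden-layer activation is also the identity, the result nearly coincides with [[principal component analysis]]: under squared error, the hidden layer spans the same subspace as the leading principal components. In practice, autoencoders are used for [[anomaly detection]] by taking the difference between the input and the reconstructed output, so that poorly reconstructed inputs are flagged as anomalous.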
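
The anomaly-detection use described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes identity activations and squared error, and for brevity it builds the linear encoder and decoder from the closed-form principal subspace (via SVD) rather than by gradient training. The toy data and the names <code>encode</code>, <code>decode</code> and <code>anomaly_score</code> are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "normal" data: 200 points lying near a 2-D plane inside a 5-D space.
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))

# Assumption: with identity activations and squared error, an optimal linear
# autoencoder spans the top principal subspace, so it is built here from the SVD.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
V = Vt[:2].T                      # (5, 2): top-2 principal directions

def encode(x):
    return (x - mean) @ V         # input (5-D) -> hidden code (2-D)

def decode(z):
    return z @ V.T + mean         # hidden code (2-D) -> reconstruction (5-D)

def anomaly_score(x):
    # Squared difference between the input and its reconstruction.
    return np.sum((x - decode(encode(x))) ** 2, axis=-1)

normal_scores = anomaly_score(X)
# An outlier: displaced from the mean along a direction orthogonal to the learned plane.
outlier = mean + 5.0 * Vt[-1]
outlier_score = anomaly_score(outlier)
print(normal_scores.max(), outlier_score)
```

Points close to the training distribution are reconstructed almost exactly, while the off-plane point keeps a large residual. Any threshold on the score is a separate, application-specific choice.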

== Properties and limitations ==
An autoencoder is designed to have the properties required for dimensionality reduction.

An autoencoder is constrained so that the dimensionality of the hidden layer, <math>d_m</math>, is smaller than that of the input and output layers, <math>d_{i,o}</math>. The reason is that when <math>d_{i,o} \leq d_m</math>, the autoencoder can reach zero reconstruction error with the [[identity function|identity mapping]] alone.<ref>"''autoencoder where Y is of the same dimensionality as X (or larger) can achieve perfect reconstruction simply by learning an identity mapping.''" Vincent. (2010). ''[https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion]''.</ref>
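
The degenerate case can be checked directly. The sketch below assumes a purely linear encoder and decoder, with hypothetical sizes <math>d_{i,o} = 3</math> and <math>d_m = 5</math>. The encoder copies the input into the first three hidden units and the decoder reads them back, so no compression takes place.

```python
import numpy as np

d_io, d_m = 3, 5                      # hidden layer wider than input/output (hypothetical sizes)

# Encoder copies the input into the first d_io hidden units; decoder reads them back.
W_enc = np.eye(d_m, d_io)             # shape (5, 3)
W_dec = np.eye(d_io, d_m)             # shape (3, 5)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_io))

z = x @ W_enc.T                       # hidden code, shape (10, 5)
x_hat = z @ W_dec.T                   # reconstruction, shape (10, 3)

reconstruction_error = np.sum((x - x_hat) ** 2)
print(reconstruction_error)
```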

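An autoencoder achieves dimensionality reduction, but this does not necessarily amount to good [[feature learning|representation learning]].<ref>"''The criterion that representation Y should retain information about input X is not by itself sufficient to yield a useful representation.''" Vincent. (2010). ''[https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion]''.</ref> Making <math>d_m</math> small is expected to preserve only the features that carry the most information about the input, that is, those from which the input (for example, an image) can be reconstructed with the least data (cf. [[lossy compression]]). Whether such features are also good features for other purposes cannot be stated in general.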

== Theory ==
The reasons why an autoencoder (AE) can learn reconstruction and dimensionality reduction have been analyzed theoretically.

An autoencoder network <math>AE_{\phi,\theta}(x)</math> consists of an encoder network <math>NN_{\phi}(x)</math> and a decoder network <math>NN_{\theta}(z)</math>. Under the deterministic interpretation, the AE directly outputs the "reconstructed input", that is, <math>\hat{x} = AE_{\phi,\theta}(x) = NN_{\theta}(NN_{\phi}(x))</math>.
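
The composition of encoder and decoder can be sketched in code. This is a minimal example under stated assumptions: a tanh encoder, a linear decoder, plain gradient descent on the squared reconstruction error, and toy data. The layer sizes, learning rate and the names <code>nn_phi</code>, <code>nn_theta</code> and <code>ae</code> are illustrative choices, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_io, d_m = 4, 2                                   # hypothetical layer sizes

# Toy data lying in a 2-D subspace of the 4-D input space.
X = 0.3 * rng.normal(size=(256, d_m)) @ rng.normal(size=(d_m, d_io))

phi = {"W": 0.1 * rng.normal(size=(d_io, d_m)), "b": np.zeros(d_m)}      # encoder parameters
theta = {"W": 0.1 * rng.normal(size=(d_m, d_io)), "b": np.zeros(d_io)}   # decoder parameters

def nn_phi(x):                                     # encoder: x -> z
    return np.tanh(x @ phi["W"] + phi["b"])

def nn_theta(z):                                   # decoder: z -> x_hat (linear output)
    return z @ theta["W"] + theta["b"]

def ae(x):                                         # AE(x) = NN_theta(NN_phi(x))
    return nn_theta(nn_phi(x))

def mse(x):
    return np.mean(np.sum((x - ae(x)) ** 2, axis=1))

loss_before = mse(X)
lr = 0.05
for _ in range(2000):
    z = nn_phi(X)
    x_hat = nn_theta(z)
    g_out = 2.0 * (x_hat - X) / len(X)             # gradient of the loss w.r.t. x_hat
    g_pre = (g_out @ theta["W"].T) * (1.0 - z ** 2)  # backpropagated through tanh
    theta["W"] -= lr * (z.T @ g_out)
    theta["b"] -= lr * g_out.sum(axis=0)
    phi["W"] -= lr * (X.T @ g_pre)
    phi["b"] -= lr * g_pre.sum(axis=0)
loss_after = mse(X)
print(loss_before, loss_after)
```

The same input <code>X</code> serves as both the network input and the training target, which is what makes the procedure unsupervised.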

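=== Probabilistic interpretation ===
From the viewpoint of [[statistical model|probabilistic models]], an AE can be regarded as a kind of [[latent variable model|deep latent variable model]] and formulated as follows: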

: <math>\begin{align}
z_{|x} \sim p_\phi(Z|X)
& = p(Z|\lambda = NN_\phi(X))
= \delta(Z - NN_\phi(X))    \\
\hat{x}_{|z} \sim p_\theta(\hat{X}|Z)
& = p(\hat{X}|\mu = NN_\theta(Z))
\end{align}</math>
That is, <math>NN_{\phi}(x)</math> and <math>NN_{\theta}(z)</math> can be interpreted as outputting the distribution parameters <math>\lambda, \mu</math>, with <math>z, \hat{x}</math> obtained through those distributions.<ref>"a deterministic mapping from X to Y, that is, ... equivalently <math>q(Y|X; \theta) = \delta(Y-f_\theta(X))</math> ... The deterministic mapping <math>f_\theta</math> that transforms an input vector <math>\boldsymbol{x}</math> into hidden representation <math>\boldsymbol{y}</math> is called the '''encoder'''." Vincent. (2010). ''[https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion]''.</ref><ref>"<math>\boldsymbol{z} = g_{\theta'}(\boldsymbol{y})</math>. This mapping <math>g_{\theta'}</math> is called the '''decoder'''. ... In general <math>\boldsymbol{z}</math> is not to be interpreted as an exact reconstruction of <math>\boldsymbol{x}</math>, but rather in probabilistic terms as the parameters (typically the mean) of a distribution <math>p(X|Z=\boldsymbol{z})</math>" Vincent. (2010). ''[https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion]''.</ref> Because the encoder of an AE behaves deterministically, it is expressed as the [[conditional probability distribution]] of a mapping, namely a [[Dirac delta function|delta function]] <math>\delta</math>. Since <math>\delta</math> is deterministic, <math>NN_{\phi}</math> and <math>NN_{\theta}</math> can be merged, and the AE is then described by the following probabilistic expression:

: <math>\hat{x}_{|x} \sim p(\hat{X} | \mu = AE_{\phi,\theta}(X))</math>
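Various [[loss function]]s, beginning with the [[mean squared error]] (MSE, L<sub>2</sub>), have been used to train AEs on empirical grounds, from the deterministic viewpoint. Because these choices are empirical, they do not necessarily come with a guarantee that training converges. Theoretical work has shown that, for some loss functions, training can be formulated as infomax learning in which a specific distribution is assumed for <math>p_\theta(\hat{X}|Z)</math>.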

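==== Fixed-variance normal distribution model ====
Consider a normal distribution with fixed variance, <math>N(X|\mu_\theta, \sigma)</math>. Its [[likelihood function|negative log-likelihood]] (NLL) <math>L_{n}(\theta)</math> is:

: <math>L_n(\theta) 
= \frac{\| x - \mu_\theta \|^2}{2\sigma^2} + \frac{D}{2} \log(2 \pi \sigma^2)
= \frac{1}{2\sigma^2} \| x - \mu_\theta \|^2 + \mathrm{const.}
</math>
where <math>D</math> is the dimensionality of <math>x</math>. Because <math>\sigma</math> is fixed, the second term does not depend on <math>\theta</math>.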
Up to a positive scale factor and an additive constant, this is the squared error between <math>x</math> and <math>\mu_\theta</math>. Minimizing the NLL of <math>N(X|\mu_\theta=AE_{\phi,\theta}(x), \sigma)</math> can therefore be regarded as equivalent to minimizing the squared error of <math>\hat{x} = AE_{\phi,\theta}(x)</math>.<ref>"<math>g_{\theta'}</math> is called the decoder ... <math>Z = g_{\theta'}(\boldsymbol{y})</math> ... associated loss function <math>L(\boldsymbol{x}, \boldsymbol{z})</math> ... <math>X|\boldsymbol{z} \sim N(\boldsymbol{z}, \boldsymbol{\sigma}^2 \boldsymbol{I})</math> ... This yields <math>L(\boldsymbol{x}, \boldsymbol{z}) = L_2(\boldsymbol{x}, \boldsymbol{z}) = C(\sigma^2) \| \boldsymbol{x} - \boldsymbol{z} \| ^2</math> ... This is the squared error objective found in most traditional autoencoders." Vincent. (2010). ''[https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion]''.</ref> In other words, an autoencoder trained with squared error can be regarded as a model that outputs the mode of the fixed-variance normal distribution <math>N(X|\mu_\theta=AE_{\phi,\theta}(x), \sigma)</math> fitted by maximum likelihood.
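
The equivalence can also be checked numerically. The sketch below assumes an isotropic normal distribution <math>N(\mu, \sigma^2 I)</math> and compares its negative log-likelihood with the plain squared error over a set of hypothetical candidate decoder outputs. The function names and the value of <code>sigma</code> are illustrative.

```python
import numpy as np

def gaussian_nll(x, mu, sigma):
    # Negative log-likelihood of x under an isotropic normal N(mu, sigma^2 I).
    d = x.shape[-1]
    quadratic = np.sum((x - mu) ** 2, axis=-1) / (2 * sigma**2)
    return quadratic + 0.5 * d * np.log(2 * np.pi * sigma**2)

def squared_error(x, mu):
    return np.sum((x - mu) ** 2, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
candidates = rng.normal(size=(50, 3))     # hypothetical decoder outputs mu_theta
sigma = 0.7

nll = gaussian_nll(x, candidates, sigma)
se = squared_error(x, candidates)

# The two criteria differ only by a positive scale and an additive constant,
# so they rank every candidate identically and share the same minimizer.
same_minimizer = np.argmin(nll) == np.argmin(se)
offset = nll - se / (2 * sigma**2)
print(same_minimizer, offset.max() - offset.min())
```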

== Variants ==
Many variants and derived models of the autoencoder exist. The following are some examples:

* [[Variational autoencoder]] (VAE)
* Contractive AutoEncoder
* Saturating AutoEncoder
* Nonparametrically Guided AutoEncoder
* Unfolding Recursive AutoEncoder

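=== Sparse autoencoder ===
A '''sparse autoencoder''' is an autoencoder to which a [[regularization (mathematics)|regularization]] term has been added in order to improve generalization when training a feedforward neural network. The penalty is applied not to the network weights but to the hidden-layer values themselves, which are driven toward 0.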
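
A minimal sketch of such a loss follows. It assumes an L1 penalty on the hidden activations, which is one common choice (a KL-divergence penalty on mean activations is another). The function name and the <code>sparsity_weight</code> parameter are hypothetical.

```python
import numpy as np

def sparse_ae_loss(x, x_hat, h, sparsity_weight=1e-3):
    # Reconstruction term: mean squared error between input and output.
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # Sparsity term: L1 penalty on the hidden activations h (not on the weights).
    sparsity = np.mean(np.sum(np.abs(h), axis=1))
    return reconstruction + sparsity_weight * sparsity

x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.0, 2.0], [3.0, 5.0]])
h_dense = np.array([[0.5, -0.5, 0.5], [0.5, 0.5, -0.5]])
h_sparse = np.array([[0.0, 0.0, 0.5], [0.5, 0.0, 0.0]])

loss_dense = sparse_ae_loss(x, x_hat, h_dense)
loss_sparse = sparse_ae_loss(x, x_hat, h_sparse)
print(loss_dense, loss_sparse)
```

With identical reconstructions, the code whose hidden values are mostly zero receives the lower loss, which is the pressure that pushes hidden activations toward 0 during training.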

=== Stacked autoencoder ===
[[File:Stacked Autoencoders.png|right|150px]]
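With backpropagation alone, a network with two or more hidden layers tends to converge to a poor [[maxima and minima|local minimum]]. To avoid this, an autoencoder with a single hidden layer is built and trained first. Its hidden layer is then treated as an input layer, and one more layer is stacked on top and trained in the same way. Building a multilayer autoencoder by repeating this procedure is called a stacked autoencoder.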
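
The layer-by-layer procedure can be sketched as follows. This is an illustration under assumptions: each layer uses a tanh encoder and a linear decoder trained by plain gradient descent on squared error, and the helper names, layer sizes and hyperparameters are hypothetical. Any subsequent fine-tuning of the whole stack is omitted.

```python
import numpy as np

def train_single_layer_ae(X, d_hidden, rng, lr=0.05, steps=500):
    """Train one tanh-encoder / linear-decoder autoencoder; return the encoder only."""
    n, d_in = X.shape
    W_enc = 0.1 * rng.normal(size=(d_in, d_hidden))
    b_enc = np.zeros(d_hidden)
    W_dec = 0.1 * rng.normal(size=(d_hidden, d_in))
    b_dec = np.zeros(d_in)
    for _ in range(steps):
        h = np.tanh(X @ W_enc + b_enc)
        g_out = 2.0 * (h @ W_dec + b_dec - X) / n       # gradient w.r.t. reconstruction
        g_pre = (g_out @ W_dec.T) * (1.0 - h ** 2)      # backpropagated through tanh
        W_dec -= lr * (h.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
    return W_enc, b_enc

def stack_autoencoders(X, hidden_sizes, rng):
    """Greedy layer-wise stacking: each new layer is trained on the previous layer's codes."""
    encoders, codes = [], X
    for d_hidden in hidden_sizes:
        W, b = train_single_layer_ae(codes, d_hidden, rng)
        encoders.append((W, b))
        codes = np.tanh(codes @ W + b)   # this hidden layer becomes the next input layer
    return encoders, codes

rng = np.random.default_rng(0)
X = 0.3 * rng.normal(size=(128, 8))
encoders, top_codes = stack_autoencoders(X, hidden_sizes=[6, 4, 2], rng=rng)
print([W.shape for W, _ in encoders], top_codes.shape)
```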

=== Denoising AutoEncoder ===
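An autoencoder trained with noise added to the data at the input layer, so that the network learns to reconstruct the original, noise-free data. Its result is reported to nearly coincide with that of a [[restricted Boltzmann machine]]. If the [[probability distribution]] of the noise is known, the added noise should follow it. If it is unknown, a [[uniform distribution (continuous)|uniform distribution]] is adequate.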
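
The corruption step can be sketched as follows. This is a minimal illustration following the usual denoising formulation, in which the clean input remains the reconstruction target. The function names, the noise scale and the placeholder <code>identity_ae</code> are hypothetical.

```python
import numpy as np

def corrupt(x, rng, noise="uniform", scale=0.1):
    """Return a noisy copy of x; the clean x remains the training target."""
    if noise == "uniform":
        return x + rng.uniform(-scale, scale, size=x.shape)
    if noise == "gaussian":
        return x + rng.normal(0.0, scale, size=x.shape)
    raise ValueError(f"unknown noise type: {noise}")

def denoising_loss(ae, x, rng, noise="uniform"):
    # The network sees the corrupted input but is scored against the clean one.
    x_noisy = corrupt(x, rng, noise)
    return np.mean(np.sum((x - ae(x_noisy)) ** 2, axis=1))

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))

def identity_ae(v):
    # Placeholder "autoencoder" that simply copies its input.
    return v

loss = denoising_loss(identity_ae, x, rng)
print(loss)
```

A network that merely copies its input is penalized by exactly the injected noise, so the denoising criterion discourages the identity solution.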

=== Generative Adversarial Network ===
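A type of generative model based on deep learning, proposed in 2014 by Ian Goodfellow and colleagues. In a GAN, two neural networks (a generator and a discriminator) are trained in competition with each other, which allows the generator to produce high-quality data.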

== Related techniques ==
*[[Deep belief network]]
*[[Deep Boltzmann machine]]

== Footnotes ==
{{脚注ヘルプ}}
=== References ===
{{Reflist}}

{{デフォルトソート:おおとえんこおた}}
[[Category:人工ニューラルネットワーク]]
[[Category:教師なし学習]]