{{情報理論}}
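'''Mutual information''' (相互情報量), also called '''transinformation''' (伝達情報量), is a [[量|quantity]] in [[確率論|probability theory]] and [[情報理論|information theory]] that measures the mutual dependence between two [[確率変数|random variables]]. The most common [[物理単位|unit]] of mutual information is the [[ビット|bit]], which corresponds to taking logarithms to base 2.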

== Definition ==
Formally, the mutual information of two discrete random variables <math>X</math> and <math>Y</math> is defined as

:<math> I(X;Y) = \sum_{y \in {\mathcal Y}} \sum_{x \in {\mathcal X}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}, \!</math>

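where <math>p(x,y)</math> is the [[同時分布|joint probability distribution]] function of <math>X</math> and <math>Y</math>, and <math>p(x)</math> and <math>p(y)</math> are the [[周辺確率|marginal probability]] distribution functions of <math>X</math> and <math>Y</math>, respectively.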
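
The double sum can be evaluated directly from a table of joint probabilities. The following is a minimal sketch in Python using only the standard library; the function name <code>mutual_information</code> and the example joint distribution are illustrative assumptions, not part of the definition.

```python
import math

def mutual_information(joint, base=2.0):
    """I(X;Y) for a joint pmf given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal p(x): row sums
    py = [sum(col) for col in zip(*joint)]    # marginal p(y): column sums
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                     # terms with p(x,y) = 0 contribute 0
                total += pxy * math.log(pxy / (px[i] * py[j]), base)
    return total

# Hypothetical joint distribution of two binary variables (illustrative values).
joint = [[0.4, 0.1],
         [0.1, 0.4]]
mi_bits = mutual_information(joint)
print(f"I(X;Y) = {mi_bits:.4f} bits")         # prints I(X;Y) = 0.2781 bits
```

Terms with <math>p(x,y) = 0</math> are skipped, following the usual convention <math>0 \log 0 = 0</math>.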

For continuous random variables, the double sum is replaced by a double integral:

:<math> I(X;Y) = \int_{\mathcal Y} \int_{\mathcal X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \; dx \,dy, \!</math>

where <math>p(x,y)</math> is now the joint probability density function of <math>X</math> and <math>Y</math>, and <math>p(x)</math> and <math>p(y)</math> are the marginal probability density functions of <math>X</math> and <math>Y</math>, respectively.
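
As an illustration of the integral form, the sketch below approximates <math>I(X;Y)</math> for a standard bivariate normal distribution with correlation coefficient <math>\rho</math> by a midpoint Riemann sum and compares it with the well-known closed form <math>-\tfrac{1}{2}\log(1-\rho^2)</math> (in nats). The choice of distribution, the grid size, the truncation of the domain to <math>[-6, 6]^2</math> and the function name are all assumptions made for this example.

```python
import math

def gaussian_mi_numeric(rho, half_width=6.0, n=240):
    """Approximate I(X;Y) in nats for a standard bivariate normal with
    correlation rho, using a midpoint Riemann sum on a square grid."""
    h = 2.0 * half_width / n                       # grid spacing
    norm_joint = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))
    norm_marginal = 1.0 / math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        px = norm_marginal * math.exp(-0.5 * x * x)
        for j in range(n):
            y = -half_width + (j + 0.5) * h
            py = norm_marginal * math.exp(-0.5 * y * y)
            q = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
            pxy = norm_joint * math.exp(-0.5 * q)  # joint density p(x, y)
            if pxy > 0.0:
                total += pxy * math.log(pxy / (px * py)) * h * h
    return total

rho = 0.6                                          # illustrative correlation
numeric_mi = gaussian_mi_numeric(rho)
closed_form_mi = -0.5 * math.log(1.0 - rho * rho)
print(numeric_mi, closed_form_mi)
```

The two values should agree to several decimal places; the residual difference comes from discretization and from truncating the integration domain.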

In both cases, mutual information is non-negative (<math>I(X; Y) \geq 0</math>) and [[対称性|symmetric]] (<math>I(X; Y) = I(Y; X)</math>).

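These definitions leave the base of the logarithm unspecified. For discrete random variables, mutual information is most commonly measured in bits, so base 2 is usually chosen. For continuous random variables, the base is often taken to be Napier's number <math>e = 2.718\ldots</math> (the natural logarithm), in which case the unit is the nat.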
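
Changing the base only rescales the value: a quantity in nats is converted to bits by dividing by <math>\ln 2</math>. A small sketch (the helper names are hypothetical):

```python
import math

def nats_to_bits(value_nats):
    """Convert an information quantity from nats (base e) to bits (base 2)."""
    return value_nats / math.log(2.0)

def bits_to_nats(value_bits):
    """Convert an information quantity from bits (base 2) to nats (base e)."""
    return value_bits * math.log(2.0)

one_bit_in_nats = bits_to_nats(1.0)   # ln 2, roughly 0.693 nats
print(one_bit_in_nats)
```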

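Intuitively, mutual information measures the information that <math>X</math> and <math>Y</math> share: it indicates how much knowing one variable improves our ability to infer the other. For example, if <math>X</math> and <math>Y</math> are independent, then knowing <math>X</math> gives no information about <math>Y</math> and vice versa, so the mutual information is zero. Conversely, if <math>X</math> and <math>Y</math> are identical, they share all of their information: knowing <math>X</math> determines <math>Y</math> and vice versa. In that case, the mutual information equals the information content ([[情報量|entropy]]) of <math>Y</math> (or, equivalently, of <math>X</math>) alone.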
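
The two extreme cases can be checked numerically. The sketch below builds a joint distribution for independent variables and one for identical variables; the probabilities and function names are illustrative assumptions.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a probability vector (zero entries are skipped)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

def mutual_information(joint, base=2.0):
    """I(X;Y) for a joint pmf given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log(pxy / (px[i] * py[j]), base)
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0.0)

px = [0.25, 0.75]                     # illustrative marginal of X
py = [0.5, 0.5]                       # illustrative marginal of Y

# Independent case: p(x, y) = p(x) p(y).
independent = [[a * b for b in py] for a in px]
mi_independent = mutual_information(independent)

# Identical case: Y = X, so all probability mass lies on the diagonal.
identical = [[0.25, 0.0],
             [0.0, 0.75]]
mi_identical = mutual_information(identical)
h_x = entropy(px)

print(mi_independent, mi_identical, h_x)
```

In the independent case the result is zero, and in the identical case it coincides with the entropy of <math>X</math>.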

Mutual information is also a measure of mutual dependence (non-independence) in the following sense. One direction is easy to see: if <math>X</math> and <math>Y</math> are independent, then <math>p(x, y) = p(x) p(y)</math>, and therefore

:<math> \log \frac{p(x,y)}{p(x)\,p(y)} = \log 1 = 0. \!</math>

Hence <math>I(X; Y) = 0</math> for both discrete and continuous random variables. The converse also holds: <math>I(X; Y) = 0</math> is [[同値|equivalent]] to <math>X</math> and <math>Y</math> being independent random variables.

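Moreover, as described below, mutual information can be regarded as a (pseudo-)distance between the joint distribution that <math>X</math> and <math>Y</math> would have if they were independent and their actual [[同時分布|joint distribution]].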

== Relation to other quantities ==
Mutual information can also be expressed as follows.

:<math>
\begin{align}
I(X;Y) 
& = H(X) - H\left(X \mathop{|} Y\right) \\ 
& = H(Y) - H\left(Y \mathop{|} X\right) \\ 
& = H(X) + H(Y) - H(X,Y)
\end{align}
</math>

Here <math>H(X)</math> and <math>H(Y)</math> are the marginal [[情報量|entropies]], <math>H(X \mathop{|} Y)</math> and <math>H(Y \mathop{|} X)</math> are the [[情報量|conditional entropies]], and <math>H(X, Y)</math> is the [[結合エントロピー|joint entropy]] of <math>X</math> and <math>Y</math>. Since <math>H(X) \geq H(X \mathop{|} Y)</math>, it follows that mutual information is always non-negative.
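
The three expressions can be compared numerically with the defining double sum. The following sketch uses a hypothetical joint distribution of two binary variables and computes each entropy directly from its definition.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a probability vector (zero entries are skipped)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

# Hypothetical joint pmf joint[x][y] (illustrative values only).
joint = [[0.3, 0.2],
         [0.1, 0.4]]
cells = [(i, j) for i in range(2) for j in range(2) if joint[i][j] > 0.0]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

h_x = entropy(px)
h_y = entropy(py)
h_xy = entropy([p for row in joint for p in row])   # joint entropy H(X,Y)

# Conditional entropies from their definitions, using p(x|y) = p(x,y)/p(y).
h_x_given_y = -sum(joint[i][j] * math.log(joint[i][j] / py[j], 2) for i, j in cells)
h_y_given_x = -sum(joint[i][j] * math.log(joint[i][j] / px[i], 2) for i, j in cells)

# Mutual information from the defining double sum.
mi = sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]), 2) for i, j in cells)

print(mi, h_x - h_x_given_y, h_y - h_y_given_x, h_x + h_y - h_xy)
```

All four printed values should agree up to floating-point rounding.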

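Intuitively, if the entropy <math>H(X)</math> is taken as a measure of the uncertainty about a random variable, then <math>H(X \mathop{|} Y)</math> can be viewed as "the amount of uncertainty about <math>X</math> that remains after <math>Y</math> is known". The right-hand side of the first line is therefore "the uncertainty about <math>X</math> minus the uncertainty about <math>X</math> that remains after <math>Y</math> is known", which is equivalent to "the amount of uncertainty about <math>X</math> that is removed by knowing <math>Y</math>". This agrees with the intuitive definition of mutual information as the amount of information that knowing either variable provides about the other.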

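In the discrete case, <math>H(X \mathop{|} X) = 0</math>, so <math>H(X) = I(X; X)</math>. Consequently <math>I(X; X) \geq I(X; Y)</math>, which formalizes the basic principle that a random variable provides at least as much information about itself as any other random variable does.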

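Mutual information can also be expressed as the [[カルバック・ライブラー情報量|Kullback-Leibler divergence]] of the [[同時分布|joint distribution]] <math>p(x, y)</math> from the product <math>p(x) p(y)</math> of the [[周辺分布|marginal distributions]] of the two random variables <math>X</math> and <math>Y</math>.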

:<math> I(X;Y) = D_{\mathrm{KL}} \left( p(x, y) \mathop{\|} p(x) p(y) \right) </math>

Furthermore, rewriting this using <math>p(x, y) = p(x \mathop{|} y) p(y)</math> gives the following.

:<math>
\begin{align}
I(X;Y) & {} = \sum_y p(y) \sum_x p(x \mathop{|} y) \log \frac{p(x \mathop{|} y)}{p(x)} \\
& {} =  \sum_y p(y) \; D_{\mathrm{KL}} \left( p(x \mathop{|} y)\mathop{\|} p(x) \right) \\
& {} = \mathbb{E}_Y\{D_{\mathrm{KL}} \left( p(x\mathop{|} y) \mathop{\|} p(x) \right)\}
\end{align}
</math>

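Thus mutual information can also be interpreted as the [[期待値|expected value]], taken over <math>Y</math>, of the Kullback-Leibler divergence of <math>p(x \mathop{|} y)</math> with respect to <math>p(x)</math>. Here <math>p(x \mathop{|} y)</math> is the conditional distribution of <math>X</math> given <math>Y</math>, and <math>p(x)</math> is the probability distribution of <math>X</math>. The more the distributions <math>p(x \mathop{|} y)</math> and <math>p(x)</math> differ, the larger the information gain (Kullback-Leibler divergence).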
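
Both Kullback-Leibler forms can be evaluated side by side. The sketch below assumes a hypothetical joint distribution of two binary variables; the helper <code>kl_divergence</code> is written for this example only.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(p || q) for two probability vectors on the same support."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical joint pmf joint[x][y] (illustrative values only).
joint = [[0.3, 0.2],
         [0.1, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Form 1: divergence of the joint distribution from the product of marginals.
flat_joint = [joint[i][j] for i in range(2) for j in range(2)]
flat_product = [px[i] * py[j] for i in range(2) for j in range(2)]
mi_joint_form = kl_divergence(flat_joint, flat_product)

# Form 2: expectation over Y of D_KL(p(x|y) || p(x)).
mi_expected_form = 0.0
for j in range(2):
    p_x_given_y = [joint[i][j] / py[j] for i in range(2)]   # conditional p(x|y)
    mi_expected_form += py[j] * kl_divergence(p_x_given_y, px)

print(mi_joint_form, mi_expected_form)
```

The two results should coincide up to rounding, mirroring the derivation above.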

== Multivariate case ==

The mutual information of several random variables is generally written as follows, where <math>\boldsymbol{y} = (y_1, \dots, y_q)</math> is a <math>q</math>-dimensional random vector.

:<math>
I(\boldsymbol{y})=\left\{\sum_{j=1}^q H\left(y_j\right)\right\}-H(\boldsymbol{y})
</math>

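This can be regarded as a natural extension of the mutual information of two random variables: for <math>q = 2</math> it reduces to <math>H(y_1) + H(y_2) - H(y_1, y_2)</math>, the third expression given in the previous section.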
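
A minimal sketch of this formula for discrete variables, where the joint distribution is represented as a dictionary from outcome tuples to probabilities. The representation, the function names and the XOR example are assumptions made for illustration.

```python
import math
from itertools import product

def entropy(probs, base=2.0):
    """Shannon entropy of a collection of probabilities (zeros are skipped)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

def multivariate_mutual_information(joint, base=2.0):
    """Sum of marginal entropies minus the joint entropy.
    joint maps outcome tuples (y_1, ..., y_q) to probabilities."""
    q = len(next(iter(joint)))                 # dimension of the random vector
    marginal_entropy_sum = 0.0
    for j in range(q):
        marginal = {}
        for outcome, p in joint.items():       # marginalize onto coordinate j
            marginal[outcome[j]] = marginal.get(outcome[j], 0.0) + p
        marginal_entropy_sum += entropy(marginal.values(), base)
    return marginal_entropy_sum - entropy(joint.values(), base)

# Example: y1 and y2 are independent fair bits and y3 = y1 XOR y2.
xor_joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
mi_xor = multivariate_mutual_information(xor_joint)

# Two independent fair bits share no information.
independent_joint = {(a, b): 0.25 for a, b in product((0, 1), repeat=2)}
mi_independent = multivariate_mutual_information(independent_joint)

print(mi_xor, mi_independent)
```

In the XOR example each marginal entropy is 1 bit and the joint entropy is 2 bits, so the formula gives 1 bit.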

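== Applications ==
In many applications, mutual information is maximized (that is, mutual dependence is strengthened) and, correspondingly, the [[情報量|conditional entropy]] is minimized. Examples include the following.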

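* [[通信路容量|Channel capacity]] is defined in terms of mutual information (transinformation).<!-- Mutual information depends on the input distribution; the capacity is its maximum. -->
* Prediction of [[リボ核酸|RNA]] secondary structure from [[多重配列アラインメント|multiple sequence alignments]].
* In [[機械学習|machine learning]], mutual information has been used as a criterion for [[特徴選択|feature selection]] and feature transformation.
* In [[コーパス言語学|corpus linguistics]], mutual information is often used as a weighting function when computing [[連語|collocations]].
* In [[医用画像処理|medical image processing]], mutual information is used for image registration: to align the [[座標|coordinates]] of one image with those of another, the alignment is chosen so that the mutual information between the two images is maximized.
* Detection of phase synchronization ({{仮リンク|位相同期|en|Phase synchronization}}) in [[時系列|time series]] analysis.
* It is also used in the [[情報量最大化|infomax]] [[独立成分分析|independent component analysis]] algorithm.
* In delay embedding based on Takens' theorem ({{仮リンク|ターケンスの定理|en|Takens' theorem}}), the average mutual information is used to determine the embedding delay parameter.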
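
As an illustration of the first item, the capacity of a channel is the maximum of the mutual information between input and output over all input distributions. The sketch below does this by a simple grid search for a binary symmetric channel and compares the result with the known closed form <math>1 - H_b(\varepsilon)</math>, where <math>H_b</math> is the binary entropy function. The choice of channel, the crossover probability 0.1, the grid resolution and the function names are assumptions made for this example.

```python
import math

def binary_entropy(p):
    """Binary entropy function H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def bsc_mutual_information(a, eps):
    """I(X;Y) in bits for a binary symmetric channel with input
    distribution P(X=1) = a and crossover probability eps."""
    joint = [[(1.0 - a) * (1.0 - eps), (1.0 - a) * eps],
             [a * eps, a * (1.0 - eps)]]       # joint[x][y] = p(x) p(y|x)
    px = [1.0 - a, a]
    py = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    return sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
               for i in range(2) for j in range(2) if joint[i][j] > 0.0)

eps = 0.1                                      # illustrative crossover probability
# Grid search over input distributions P(X=1) = 0, 0.001, ..., 1.
capacity = max(bsc_mutual_information(k / 1000, eps) for k in range(1001))
closed_form = 1.0 - binary_entropy(eps)
print(capacity, closed_form)
```

The maximum is attained at the uniform input distribution, which lies on the grid, so the two values agree up to rounding. A grid search is used here only for simplicity; it does not scale to larger input alphabets.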

== See also ==
* [[自己相互情報量|Pointwise mutual information]]

== External links ==
* {{高校数学の美しい物語|1403|相互情報量の意味とエントロピーとの関係}}
* {{Wayback|url=http://www.scholarpedia.org/article/Mutual_Information |title=Mutual Information |date=20061205091411}} - the "Mutual information" entry in the [[スカラーペディア|Scholarpedia]] encyclopedia.

{{確率論}}
{{DEFAULTSORT:そうこしようほうりよう}}
[[Category:情報理論]]
[[Category:数学に関する記事]]