局所性鋭敏型ハッシュのソースを表示

'''局所性鋭敏型ハッシュ'''（きょくしょせいえいびんがたハッシュ、{{lang-en|locality sensitive hashing}}）とは高次元のデータを確率的な処理によって次元圧縮するための手法である。[[ハッシュ関数|ハッシュ]]の基本的な考え方は類似したデータが高確率で同じバケットに入るようにデータを整理するというものである。多くの場合においてこのバケットの数は入力されるデータサンプルの数よりもずっと小さくなる。

== 定義 ==
局所性鋭敏型ハッシュを行うためのパラメータの集合をLSH族(Locality Sensitive Hashing Family)と呼ぶ。LSH族は[[距離空間]]<math>\mathcal M =(M,d)</math>と閾値<math>R>0</math>、近似因子<math>c>1</math>によって定義される。LSH族<ref name=GIM1999>{{cite journal
| author= Gionis, A.
| coauthors = [[Piotr Indyk|Indyk, P.]], [[Rajeev Motwani|Motwani, R.]]
| year = 1999
| title = Similarity Search in High Dimensions via Hashing
| url= http://people.csail.mit.edu/indyk/vldb99.ps ,
| journal = Proceedings of the 25th Very Large Database (VLDB) Conference
}}</ref><ref name=IndykMotwani98>{{cite journal
 | author = [[Piotr Indyk|Indyk, Piotr]].
 | coauthors = [[Rajeev Motwani|Motwani, Rajeev]].
 | year = 1998
 | title = Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality.
 | url= http://people.csail.mit.edu/indyk/nndraft.ps ,
 | journal = Proceedings of 30th Symposium on Theory of Computing
}}
</ref>は2点<math>p,q\in {\mathcal M}</math>について次の2つの性質、
* <math>d(p,q) \le R</math>ならば<math>h(p)=h(q)</math>となる確率は<math>P_1 \,</math>以上である。
* <math>d(p,q) \ge cR</math>ならば<math>h(p)=h(q)</math>となる確率は<math>P_2 \,</math>以下である。
を満たす関数<math>h:{\mathcal M}\to S</math>により与えられる[[族 (数学)|族]]であり，<math>h</math>は<math>\mathcal F</math>から一様乱数にしたがって選択される。このとき<math>d(p,q)</math>は2点<math>p, q</math>の距離を表す関数であり、<math>P_1 > P_2</math>となるよう設計する。このような族<math>\mathcal F</math>は<math>(R,cR,P_1,P_2)</math>に鋭敏であるという。

これに準ずる定義として、領域<math>U</math>における類似度関数<math>\phi : U \times U \to [0,1]</math>によるものがある<ref name=Charikar2002>{{cite journal
 | author = Charikar, Moses S..
 | coauthors = 
 | year = 2002
 | title = Similarity Estimation Techniques from Rounding Algorithms
 | journal = Proceedings of the 34th Annual ACM Symposium on Theory of Computing 2002
 | pages = (ACM 1–58113–495–9/02/0005)…
 | url = http://portal.acm.org/citation.cfm?id=509965
 | accessdate = 2007-12-21
 | doi = 10.1145/509907.509965
}}</ref>。局所性鋭敏型ハッシュの性質は、ハッシュ関数の集合<math>H</math>と確率分布<math>D</math>により与えられる。あるハッシュ関数<math>h</math>は集合<math>H</math>から確率分布<math>D</math>により選ばれるが、<math>D</math>とは領域<math>U</math>に存在する2点<math>a, b</math>について、
: <math>Pr_{h \in H} [h(a) = h(b)] = \phi(a,b)</math>
を満たすような確率分布である。

== 手法 ==
=== ハミング距離に基づく標本化 ===
LSH族を構築するためのもっとも単純な手法は[[ハミング距離]]に基づくものである。これは<math>d</math>次元のベクトル<math>\{0,1\}^d</math>に対して適応できる。この手法は<math>d</math>次元のベクトルについて<math>i</math>番目の座標値をハッシュ値として与えるような族<math>\mathcal F</math>により定義され、<math>\mathcal F</math>とは例えば<math>{\mathcal F}=\{h:\{0,1\}^d\to \{0,1\}\mid h(x)=x_i,i =1 ... d\}</math>のように与えられる。ここで<math>\mathcal F</math>から<math>h</math>を任意に選ぶということは、入力点から任意にビットを選択するということに他ならない。この時、族は次の性質を持つ。
: <math>P_1=1-R/d \,</math>, <math>P_2=1-cR/d \,</math>

=== 安定分布に基づく手法 ===
ハッシュ関数<math>h_{\mathbf{a},b} (\boldsymbol{\upsilon}) : 
\mathcal{R}^d
\to \mathcal{N} </math>を<math>d</math>次元のベクトル<math>v</math>を整数の集合に移すような関数であると定義する<ref name=DIIM04>{{cite journal
 | author = Datar, M.
 | coauthors =  Immorlica, N., [[Piotr Indyk|Indyk, P.]], Mirrokni, V.S.
 | year=2004
 | title = Locality-Sensitive Hashing Scheme Based on p-Stable Distributions
 | url = http://theory.csail.mit.edu/~mirrokni/pstable.ps 
 | journal = Proceedings of the Symposium on Computational Geometry
}}</ref>。ハッシュ関数<math>h</math>は2つの乱数<math>a, b</math>によって定義される。ここで<math>a</math>とは安定分布から独立に選ばれる乱数であり、<math>b</math>とは<math>[0, r]</math>から一様に選ばれる実乱数である。<math>a</math>および<math>b</math>が選ばれたとき、ハッシュ関数<math>h_{\mathbf{a},b}</math>は
: <math>h_{\mathbf{a},b} (\boldsymbol{\upsilon}) = \left \lfloor \frac{\mathbf{a}\cdot \boldsymbol{\upsilon}+b}{r} \right \rfloor </math>
のように与えられる。

この他にもデータをより適切に対応させるハッシュ関数が提案されている<ref name=PJA10>
{{cite journal
 | author = Pauleve, L.
 | coauthors =  Jegou, H., Amsaleg, L.
 | year=2010
 | title = Locality sensitive hashing: A comparison of hash function types and querying mechanisms
 | url = http://hal.inria.fr/inria-00567191/en/ 
 | journal = Pattern recognition Letters
}}</ref>。例えば[[k-平均法]]に基づくハッシュ関数などは大域的最適解を与えることが保証されていないものの実用的なハッシュ関数として知られている。

== 出典 ==
{{脚注ヘルプ}}
<references />

== 関連項目 ==
* [[最近傍探索]]
* [[ハッシュ関数]]
* [[データ構造]]

{{Tech-stub}}
{{DEFAULTSORT:きよくしよせいえいひんかたはつしゆ}}
[[Category:検索アルゴリズム]]