Okapi BM25のソースを表示

'''Okapi BM25'''は、[[情報検索]]における順位付けの手法である。[[検索エンジン]]が[[クエリ]]との関連性に応じて、文書を順位付けするのに用いられる。1970年代から1980年代にかけて、{{仮リンク|スティーブン・ロバートソン (コンピュータ科学者)|lavel=スティーブン・ロバートソン|en|Stephen Robertson (computer scientist)}}や{{仮リンク|カレン・スパーク・ジョーンズ|en|Karen Spärck Jones}}らが{{仮リンク|確率適合モデル|en|Probabilistic relevance model}}に基づいて開発した。BM25の"BM"は、"Best Matching"の略である。

[[ロンドン大学シティ校]]が1980年代から1990年代にかけて開発したオカピ情報検索システム (Okapi information retrieval system) に最初に実装されたため、 "Okapi BM25" と呼ばれるが、単に、この手法自体の名称である'''BM25'''とも呼ばれる。

== 順位付け手法 ==
BM25は、bag-of-wordsを拡張した手法であり、文書内のクエリの単語同士の相互関係ではなく、文書におけるクエリの単語の[[tf-idf|出現頻度]]に基づいて、[[コーパス|文書集合]]を順位付けする。

単語<math>q_1, ..., q_n</math>を含むクエリ{{mvar|Q}}が与えられたとき、文書{{mvar|D}}のBM25スコアは、

:<math> \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)},</math>

と定義される。このとき、<math>f(q_i, D)</math>を文書{{mvar|D}}における単語の出現頻度、<math>|D|</math>を文書{{mvar|D}}の単語数、{{math|avgdl}}を文書集合の平均単語数とする。<math>k_1</math>および{{mvar|b}}は任意の[[媒介変数|パラメータ]]であり、<math>k_1 \in [1.2,2.0]</math>、<math>b = 0.75</math>とされることが多い<ref>Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. ''An Introduction to Information Retrieval'', Cambridge University Press, 2009, p. 233.</ref>。また、単語<math>q_i</math>のidf値は、

:<math>\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5},</math>

と定義される。このとき、{{mvar|N}}を全文書数、<math>n(q_i)</math>を<math>q_i</math>を含む文書数とする。また、<math>\text{IDF}(q_i)</math>には複数の定義があり、上記の定義式はその1つである。BM25では[[二項独立モデル]]に基づいて導出された。

ただし、上記の定義式では、半分以上の文書集合に出現する単語のidf値が負になるため、ほぼ同一の2つの文書について、半分以上の文書集合に出現する単語を含む文書と含まない文書とでは、後者のBM25スコアが大きくなってしまうことがある。そのため、実用上は、
* idf値の最小値を0とし、一般的な用語を完全に無視する
* idf値の最小値を定数<math>\epsilon</math>とし、一般的な用語を完全に無視することを避けつつ、影響を減らす
* idfが必ず正となる定義式に変える
といった処理がなされる。

== idfの情報理論的な解釈 ==
クエリの単語<math>q</math>が<math>n(q)</math>個の文書に出現したとき、無作為に選択した文書<math>D</math>に単語<math>q</math>が含まれる確率は<math>\frac{n(q)}{N}</math>である（<math>N</math>は全文書数）。したがって、「<math>D</math>が<math>q</math>を含む」という事象の[[情報量]]は、

:<math>-\log \frac{n(q)}{N} = \log \frac{N}{n(q)}.</math>

である。このとき、2つのクエリの単語<math>q_1</math>, <math>q_2</math>が与えられたとする。2つの単語が完全に独立して文書内に存在するとき、無作為に選択した文書<math>D</math>に2つの単語が出現する確率は、

:<math>\frac{n(q_1)}{N} \cdot \frac{n(q_2)}{N},</math>

となる。したがって、このときの情報量は、

:<math>\sum_{i=1}^{2} \log \frac{N}{n(q_i)}.</math>

となり、BM25のidf値の定義式と似た式が現れる。

== BM25の改変版 ==
* BM25の係数{{mvar|b}}を極値に変えたものは、'''BM11'''（<math>b=1</math>のとき）や'''BM15'''（<math>b=0</math>のとき）という<ref>http://xapian.org/docs/bm25.html</ref>。
* '''BM25F'''<ref>Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. [http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf ''Microsoft Cambridge at TREC-13: Web and HARD tracks.''] In Proceedings of TREC-2004.</ref><ref name="robertson2009">{{Cite journal2 |author1=Stephen Robertson |author2=Hugo Zaragoza |name-list-style=amp |date=2009 |title=The Probabilistic Relevance Framework: BM25 and Beyond |url=http://dl.acm.org/citation.cfm?id=1704810 |journal=Foundations and Trends in Information Retrieval |volume=3 |issue=4 |pages=333–389 |citeseerx=10.1.1.156.5282 |doi=10.1561/1500000019}}</ref>：重要度や長さが異なるフィールド（見出しや本文、アンカーテキストなど）で構成される文書を考慮した、BM25の改変版である。
* '''BM25+'''<ref>Yuanhua Lv and ChengXiang Zhai. [http://sifaka.cs.uiuc.edu/~ylv2/pub/cikm11-lowerbound.pdf ''Lower-bounding term frequency normalization.''] In Proceedings of CIKM'2011, pages 7-16.</ref>：BM25では、単語出現頻度が文書長で適切に正規化されていないため、クエリを含む長い文書よりクエリを含まない短い文書の方がBM25スコアが高くなることがある。BM25+は、BM25の上記について改良するために開発された。BM25+の定義式では任意のパラメータ<math>\delta</math>が追加されている（訓練データがないときの初期値は1.0）。BM25+の定義式は、下式である。

:<math> \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \left[ \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} + \delta \right]</math>

== 出典 ==
{{Reflist}}

== 参考文献 ==
* {{cite conference|author1=Stephen E. Robertson |author2=Steve Walker |author3=Susan Jones |author4=Micheline Hancock-Beaulieu |author5=Mike Gatford  |name-list-style=amp | title=Okapi at TREC-3 | conference=Proceedings of the Third Text REtrieval Conference (TREC 1994)|conference-url=http://trec.nist.gov/pubs/trec3/t3_proceedings.html |location=Gaithersburg, USA|date=November 1994|url=http://trec.nist.gov/pubs/trec3/papers/city.ps.gz}}
* {{cite conference|author1=Stephen E. Robertson |author2=Steve Walker |author3=Micheline Hancock-Beaulieu  |name-list-style=amp |title=Okapi at TREC-7|conference=Proceedings of the Seventh Text REtrieval Conference|conference-url=http://trec.nist.gov/pubs/trec7/t7_proceedings.html|location=Gaithersburg, USA|date=November 1998|url=http://trec.nist.gov/pubs/trec7/papers/okapi_proc.pdf.gz}}
* {{Cite journal2 | doi = 10.1016/S0306-4573(00)00015-7| title = A probabilistic model of information retrieval: Development and comparative experiments: Part 1| journal = Information Processing & Management| volume = 36| issue = 6| pages = 779–808| year = 2000| last1 = Spärck Jones | first1 = K. | authorlink1 = :en:Karen Spärck Jones| last2 = Walker | first2 = S.| last3 = Robertson | first3 = S. E. | authorlink3 = :en:Stephen Robertson (computer scientist)| citeseerx = 10.1.1.134.6108}}
* {{Cite journal2 | doi = 10.1016/S0306-4573(00)00016-9| title = A probabilistic model of information retrieval: Development and comparative experiments: Part 2| journal = Information Processing & Management| volume = 36| issue = 6| pages = 809–840| year = 2000| last1 = Spärck Jones | first1 = K. | authorlink1 = :en:Karen Spärck Jones| last2 = Walker | first2 = S. | last3 = Robertson | first3 = S. E. | authorlink3 = :en:Stephen Robertson (computer scientist)}}
* {{Cite journal2 |author1=Stephen Robertson  |author2=Hugo Zaragoza  |name-list-style=amp | title=The Probabilistic Relevance Framework: BM25 and Beyond |journal=Foundations and Trends in Information Retrieval  | date=2009 | url=http://dl.acm.org/citation.cfm?id=1704810 | volume=3 | issue=4 | pages=333–389 | doi=10.1561/1500000019 |citeseerx=10.1.1.156.5282 |s2cid=207178704 }}

== 関連項目 ==
* [[自然言語処理]]
* [[tf-idf]]

== 外部リンク ==
* {{cite book|last1=[[Stephen Robertson (computer scientist)|Robertson]]|first1=Stephen|last2=Zaragoza|first2=Hugo|title=The Probabilistic Relevance Framework: BM25 and Beyond|date=2009|publisher=NOW Publishers, Inc.|isbn=978-1-60198-308-4|url=http://staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf}}

[[Category:アルゴリズム]]
[[Category:自然言語処理]]