Okapi BM25 is a ranking function that scores documents against a given query.
It can also be used as a replacement for TF-IDF to weight the terms of each document.
The ranking function
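Given a query $Q$, containing keywords $q_1, \ldots, q_n$, the BM25 score of a document $D$ is:

$$
\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i,D) \cdot (k_1 + 1)}{f(q_i,D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $f(q_i,D)$ is $q_i$'s term frequency in the document $D$, $|D|$ is the length of the document $D$ in words, $\mathrm{avgdl}$ is the average document length in the text collection from which documents are drawn, and $\mathrm{IDF}(q_i)$ is the inverse document frequency weight of $q_i$. $k_1$ and $b$ are free parameters, usually chosen, in the absence of an advanced optimization, as $k_1 \in [1.2, 2.0]$ and $b = 0.75$.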
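To make the formula concrete, here is a minimal Python sketch that scores tokenized documents against a query. It is an illustration, not the implementation behind `bm25()`: the helper names, the toy corpus, and in particular the smoothed IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))` are assumptions, since the text does not specify how IDF is computed.

```python
import math


def idf(num_docs, docs_with_term):
    # Assumed IDF variant (smoothed, always positive); the text does not fix one.
    return math.log(1.0 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a list of query terms."""
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs
    # Length normalization: documents longer than average are penalized.
    norm = 1.0 - b + b * len(doc) / avgdl
    score = 0.0
    for term in query:
        freq = doc.count(term)  # f(q_i, D)
        if freq == 0:
            continue  # a term absent from the document contributes nothing
        docs_with_term = sum(1 for d in corpus if term in d)
        score += idf(num_docs, docs_with_term) * freq * (k1 + 1.0) / (freq + k1 * norm)
    return score


# Hypothetical toy corpus of already tokenized documents.
corpus = [
    ["justice", "is", "a", "virtue"],
    ["wisdom", "and", "justice", "and", "courage"],
    ["a", "long", "discussion", "about", "nothing", "in", "particular"],
]
query = ["wisdom", "justice"]

# Rank document indices by descending score.
ranked = sorted(range(len(corpus)),
                key=lambda i: bm25_score(query, corpus[i], corpus),
                reverse=True)
print(ranked)  # [1, 0, 2]
```

The second document ranks first because it matches both query terms, and the rarer term (`wisdom`) carries the larger IDF weight.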
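BM25 can also be used for term weighting, which shows how important a word is to a document in a collection or corpus, as follows:

$$
\mathrm{score}(D,t) = \mathrm{IDF}(t) \cdot \frac{f(t,D) \cdot (k_1 + 1)}{f(t,D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $t$ is a term that appears in document $D$.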
As with TF-IDF, you need to prepare a relation consisting of (docid, word) tuples to compute BM25 scores.
    -- Table holding one page of text per document
    create external table wikipage (
      docid int,
      page string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE;

    # Run in a shell (not in Hive) to download the sample data
    cd ~/tmp
    wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv

    -- Back in Hive: load the downloaded file into the table
    LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage;

    -- One (docid, word) row per non-stopword token
    create or replace view wikipage_exploded
    as
    select
      docid,
      word
    from
      wikipage LATERAL VIEW explode(tokenize(page,true)) t as word
    where
      not is_stopword(word);
Define views of term/doc frequency
    create or replace view term_frequency
    as
    select
      t1.docid,
      t2.word,
      t2.freq
    from (
      select
        docid,
        tf(word) as word2freq
      from
        wikipage_exploded
      group by
        docid
    ) t1
    LATERAL VIEW explode(word2freq) t2 as word, freq;

    create or replace view document_frequency
    as
    select
      word,
      count(distinct docid) docs
    from
      wikipage_exploded
    group by
      word;

    create or replace view doc_len
    as
    select
      docid,
      count(1) as dl,
      avg(count(1)) over () as avgdl,
      count(distinct docid) over () as total_docs
    from
      wikipage_exploded
    group by
      docid
    ;
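As a rough illustration of what these three views hold, the following Python sketch derives the same quantities from a handful of hypothetical (docid, word) tuples. It assumes raw occurrence counts for the term frequency; whether `tf()` returns raw counts or a normalized frequency is not stated here, so treat that as an assumption of the sketch.

```python
from collections import Counter

# Hypothetical (docid, word) tuples standing in for the wikipage_exploded view.
rows = [
    (1, "justice"), (1, "virtue"), (1, "justice"),
    (2, "wisdom"), (2, "justice"),
    (3, "discussion"),
]

# term_frequency: occurrences of each word within each document (raw counts assumed).
term_frequency = Counter(rows)

# document_frequency: number of distinct documents that contain each word.
document_frequency = Counter(word for _, word in set(rows))

# doc_len: number of words per document, plus the corpus-wide statistics.
doc_len = Counter(docid for docid, _ in rows)
total_docs = len(doc_len)
avgdl = sum(doc_len.values()) / total_docs

print(term_frequency[(1, "justice")], document_frequency["justice"], doc_len[1], avgdl)
```

Every row of the later score computation combines one entry from each structure: the term's frequency in the document, the number of documents containing the term, and the document's length relative to the average.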
Compute Okapi BM25 score
A BM25 (or TF-IDF) score, which represents the importance of a term in each document, is useful as a feature weight in feature engineering.
    create table scores
    as
    select
      tf.docid,
      tf.word,
      bm25(
        tf.freq,
        dl.dl,
        dl.avgdl,
        dl.total_docs,
        df.docs
        -- , '-k1 1.5 -b 0.75'
      ) as bm25,
      tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
    from
      term_frequency tf
      JOIN document_frequency df ON (tf.word = df.word)
      JOIN doc_len dl ON (tf.docid = dl.docid)
    ;
bm25()'s function signature and hyperparameters are as follows:
    hive> select bm25();
    FAILED: SemanticException Line 1:7 Wrong arguments 'bm25': #arguments must be greater than or equal to 5: 0
    usage: bm25(double termFrequency, int docLength, double avgDocLength,
           int numDocs, int numDocsWithTerm [, const string options]) -
           Return an Okapi BM25 score in double [-b <arg>] [-d <arg>] [-k1 <arg>] [-min_idf <arg>]
     -b <arg>                   Hyperparameter with type double in range 0.0 and 1.0 [default: 0.75]
     -d,--delta <arg>           Hyperparameter delta of BM25+ [default: 0.0]
     -k1 <arg>                  Hyperparameter with type double, usually in range 1.2 and 2.0 [default: 1.2]
     -min_idf,--epsilon <arg>   Hyperparameter delta of BM25+ [default: 1e-8]
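The following Python sketch shows one way these options could enter a per-term score. It assumes the BM25+ form, in which `delta` is added to the saturated term-frequency factor, and it assumes that `min_idf` acts as a lower bound on the IDF weight. The function name `bm25_term` and the choice to pass a precomputed `idf` are illustrative only and do not describe Hivemall's actual implementation.

```python
def bm25_term(freq, doc_len, avg_doc_len, idf,
              k1=1.2, b=0.75, delta=0.0, min_idf=1e-8):
    """Per-term BM25 weight; delta > 0 gives the BM25+ variant (assumed form)."""
    # Assumption: min_idf floors the IDF so the weight never becomes zero or negative.
    idf = max(min_idf, idf)
    # b controls length normalization: b=0 ignores length, b=1 normalizes fully.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences of a term saturate.
    return idf * (freq * (k1 + 1.0) / (freq + k1 * norm) + delta)


# Same term frequency in a short and a long document (hypothetical numbers).
short_doc = bm25_term(freq=2, doc_len=50, avg_doc_len=100, idf=1.0)
long_doc = bm25_term(freq=2, doc_len=200, avg_doc_len=100, idf=1.0)
print(short_doc > long_doc)  # True: the longer document is penalized

# With b=0 the document length no longer matters.
print(bm25_term(2, 50, 100, 1.0, b=0.0) == bm25_term(2, 200, 100, 1.0, b=0.0))  # True
```

Under these assumptions, raising `delta` lifts the score of every matching term by `idf * delta`, which mainly benefits long documents whose normalized term-frequency factor would otherwise be small.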
Show important terms for each document
    select
      docid,
      to_ordered_list(feature(word,bm25), bm25, '-k 10') as bm25_scores,
      to_ordered_list(feature(word,tfidf), tfidf, '-k 10') as tfidf_scores
    from
      scores
    group by
      docid
    limit 10;
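For readers less familiar with this pattern, the query keeps, for each document, the terms with the highest scores. The Python sketch below mimics that grouping on hypothetical (docid, word, score) rows. It assumes `to_ordered_list` with `-k 10` returns the ten values with the largest keys in descending order; the helper name `top_terms` is made up for the example.

```python
import heapq
from collections import defaultdict

# Hypothetical rows of the scores table: (docid, word, bm25).
rows = [
    (1, "justice", 0.52), (1, "virtue", 0.98), (1, "is", 0.01),
    (2, "wisdom", 1.01), (2, "justice", 0.48),
]


def top_terms(rows, k=10):
    """Return the k highest-scoring (word, score) pairs for each document."""
    by_doc = defaultdict(list)
    for docid, word, score in rows:
        by_doc[docid].append((word, score))
    # nlargest returns the pairs sorted by score in descending order.
    return {docid: heapq.nlargest(k, terms, key=lambda t: t[1])
            for docid, terms in by_doc.items()}


top = top_terms(rows, k=2)
print(top[1])  # [('virtue', 0.98), ('justice', 0.52)]
```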
Retrieve relevant documents for given search terms
You can retrieve the documents most relevant to the search terms 'wisdom', 'justice', and 'discussion' as follows:
    WITH scores as (
      select
        tf.docid,
        tf.word,
        bm25(
          tf.freq,
          dl.dl,
          dl.avgdl,
          dl.total_docs,
          df.docs
          -- , '-k1 1.5 -b 0.75'
        ) as bm25,
        tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
      from
        term_frequency tf
        JOIN document_frequency df ON (tf.word = df.word)
        JOIN doc_len dl ON (tf.docid = dl.docid)
      where
        tf.word in ('wisdom', 'justice', 'discussion')
    )
    select
      docid,
      sum(bm25) as score
    from
      scores
    group by
      docid
    order by
      score DESC
    LIMIT 10
    ;
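The retrieval step amounts to filtering the per-term scores down to the query terms, summing them per document, and sorting. A minimal Python sketch of that aggregation, using hypothetical precomputed (docid, word, bm25) rows and an illustrative helper named `search`, might look like this:

```python
from collections import defaultdict

# Hypothetical precomputed per-term scores: (docid, word, bm25).
rows = [
    (1, "justice", 0.5), (1, "virtue", 0.9),
    (2, "wisdom", 1.0), (2, "justice", 0.5),
    (3, "discussion", 1.2),
    (4, "virtue", 0.7),
]


def search(rows, terms, limit=10):
    """Sum the scores of matching terms per document and return the top results."""
    wanted = set(terms)
    totals = defaultdict(float)
    for docid, word, score in rows:
        if word in wanted:  # corresponds to the WHERE ... IN (...) filter
            totals[docid] += score
    # Highest total first, truncated to the requested number of documents.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


results = search(rows, ["wisdom", "justice", "discussion"])
print(results)  # [(2, 1.5), (3, 1.2), (1, 0.5)]
```

Documents that contain none of the search terms (document 4 here) never enter the result, which matches the effect of filtering before the aggregation.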