Okapi BM25 is a ranking function for documents for a given query.

It can also be used for a better replacement of TF-IDF and can be used for term-weight for each document.

# The ranking function

Given a query $Q$, containing keywords $q1,....,q_n$, the BM25 score of a document $D$ is:

$score(Q, D) = \sum_{i=1}^{n}IDF(q_{i}) \cdot \frac{tf(q_{i},D) \cdot (k_{1}+1)}{tf(q_{i},D) + k_{1} \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}$

where $tf(q_{i}, D)$ is $q_{i}$'s term frequency in the document $D$, $|D|$ is the length of the document $D$ in words, and $avgdl$ is the average document length in the text collection from which documents are drawn. $k_{1}$ and $b$ are free parameters, usually chosen, in absence of an advanced optimization, as $k_{1} \in [1.2,2.0]$ and $b = 0.75$.

BM25 can also be applied for term weighing, showing how important a word is to a document in a collection or corpus, as follows:

$score(t_{i}, D) = IDF(t_{i}) \cdot \frac{tf(t_{i},D) \cdot (k_{1}+1)}{tf(t_{i},D) + k_{1} \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}$

where $t_{i}$ is a term appeared in document $D$.

# Data preparation

In similar to TF-IDF, you need to prepare a relation consists of (docid,word) tuples to compute BM25 score.

create external table wikipage (
docid int,
page string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

cd ~/tmp
wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv

LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage;

create or replace view wikipage_exploded
as
select
docid,
word
from
wikipage LATERAL VIEW explode(tokenize(page,true)) t as word
where
not is_stopword(word);


# Define views of term/doc frequency

create or replace view term_frequency
as
select
t1.docid,
t2.word,
t2.freq
from (
select
docid,
tf(word) as word2freq
from
wikipage_exploded
group by
docid
) t1
LATERAL VIEW explode(word2freq) t2 as word, freq;

create or replace view document_frequency
as
select
word,
count(distinct docid) docs
from
wikipage_exploded
group by
word;

create or replace view doc_len
as
select
docid,
count(1) as dl,
avg(count(1)) over () as avgdl,
count(distinct docid) over () as total_docs
from
wikipage_exploded
group by
docid
;


# Compute Okapi BM25 score

BM25 (and TF-IDF) score that represents importance of term for each document is useful for feature weight in feature engineering.

create table scores
as
select
tf.docid,
tf.word,
bm25(
tf.freq,
dl.dl,
dl.avgdl,
dl.total_docs,
df.docs
-- , '-k1 1.5 -b 0.75'
) as bm25,
tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
from
term_frequency tf
JOIN document_frequency df ON (tf.word = df.word)
JOIN doc_len dl ON (tf.docid = dl.docid)
;


## Hyperparameters

bm25()'s function signature and hyperparameters are as follows:

hive> select bm25();
FAILED: SemanticException Line 1:7 Wrong arguments 'bm25':

#arguments must be greater than or equal to 5: 0

usage: bm25(double termFrequency, int docLength, double avgDocLength, int
numDocs, int numDocsWithTerm [, const string options]) - Return an
Okapi BM25 score in double [-b <arg>] [-d <arg>] [-k1 <arg>]
[-min_idf <arg>]
-b <arg>                   Hyperparameter with type double in range 0.0
and 1.0 [default: 0.75]
-d,--delta <arg>           Hyperparameter delta of BM25+ [default: 0.0]
-k1 <arg>                  Hyperparameter with type double, usually in
range 1.2 and 2.0 [default: 1.2]
-min_idf,--epsilon <arg>   Hyperparameter delta of BM25+ [default: 1e-8]


## Show important terms for each document

select
docid,
to_ordered_list(feature(word,bm25), bm25, '-k 10') as bm25_scores,
to_ordered_list(feature(word,tfidf),tfidf, '-k 10') as tfidf_scores
from
scores
group by
docid
limit 10;


# Retrive relevant documents for a given search terms

You can retrieve relevant documents for a given search query wisdom, justice, discussion as follows:

WITH scores as (
select
tf.docid,
tf.word,
bm25(
tf.freq,
dl.dl,
dl.avgdl,
dl.total_docs,
df.docs
-- , '-k1 1.5 -b 0.75'
) as bm25,
tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
from
term_frequency tf
JOIN document_frequency df ON (tf.word = df.word)
JOIN doc_len dl ON (tf.docid = dl.docid)
where
tf.word in ('wisdom', 'justice', 'discussion')
)
select
docid,
sum(bm25) as score
from
scores
group by
docid
order by
score DESC
LIMIT 10
;

docid score
1 0.14190456024682774
2 0.02197354085722251