Topic modeling is a way to analyze a large collection of documents by clustering them into topics. Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling techniques; the following papers introduce the method:
- D. M. Blei, et al. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp. 993-1022, 2003.
- M. D. Hoffman, et al. Online Learning for Latent Dirichlet Allocation. NIPS 2010.
Hivemall lets you apply LDA to your data, including but not limited to documents. This page explains how to use the feature.
This feature is supported in Hivemall v0.5-rc.1 or later.
Prepare document data
Assume that we already have a table docs which contains many documents in string format:
| docid | doc |
| --- | --- |
| 1 | "Fruits and vegetables are healthy." |
| 2 | "I like apples, oranges, and avocados. I do not like the flu or colds." |
Hivemall has several functions that are particularly useful for text processing. More specifically, by using tokenize() and is_stopword(), you can directly convert the documents to a bag-of-words-like format:
with word_counts as (
  select
    docid,
    feature(word, count(word)) as word_count
  from
    docs t1
    LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
)
select
  docid,
  collect_list(word_count) as features
from
  word_counts
group by
  docid
;
Note that, as long as your data can be represented in this feature format, LDA can be applied to arbitrary data as a generic clustering technique.
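As an illustration of non-text input, the sketch below builds the same kind of feature vectors from purchase records. The table purchases and its columns user_id, item (a string), and quantity are hypothetical and do not come from this page; only feature() and collect_list() are taken from the query above.

```sql
-- Hypothetical input table: purchases(user_id, item, quantity)
with item_counts as (
  select
    user_id,
    -- one "item:quantity" feature per (user, item) pair
    feature(item, sum(quantity)) as item_count
  from
    purchases
  group by
    user_id, item
)
select
  user_id,
  collect_list(item_count) as features
from
  item_counts
group by
  user_id
;
```

Here each user plays the role of a document and each purchased item plays the role of a word, so the resulting topics would group items that tend to be bought together.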
Building Topic Models and Finding Topic Words
Each feature vector is input to the train_lda() function:
with word_counts as (
  select
    docid,
    feature(word, count(word)) as word_count
  from
    docs t1
    LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
),
input as (
  select
    docid,
    collect_list(word_count) as features
  from
    word_counts
  group by
    docid
)
select
  train_lda(features, '-topics 2 -iter 20') as (label, word, lambda)
from
  input
;
Here, the option -topics 2 specifies the number of topics we assume in the set of documents.
If the input is sorted with order by docid, all rows pass through a single reducer, which ensures that one LDA model is built precisely on a single node. If you would rather launch train_lda in parallel, the following query should return a similar, though possibly slightly approximated, result:
with word_counts as (
  -- same as above
),
input as (
  select
    docid,
    collect_list(word_count) as features
  from
    word_counts
  group by
    docid
)
select
  label,
  word,
  avg(lambda) as lambda
from (
  select
    train_lda(features, '-topics 2 -iter 20') as (label, word, lambda)
  from
    input
) t2
group by
  label, word
-- order by lambda desc -- ordering is optional
;
Eventually, a new table lda_model with the columns label, word, and lambda is generated from this result.
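The training queries on this page only select the model rows. One possible way to materialize them as lda_model is a create table ... as select statement, sketched here. Both the table-creation step and the placement of order by docid in a subquery are assumptions for illustration, not something this page prescribes.

```sql
-- Sketch: store the trained model as lda_model (the CTAS step is an assumption)
create table lda_model as
with word_counts as (
  select
    docid,
    feature(word, count(word)) as word_count
  from
    docs t1
    LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
),
input as (
  select
    docid,
    collect_list(word_count) as features
  from
    word_counts
  group by
    docid
)
select
  train_lda(features, '-topics 2 -iter 20') as (label, word, lambda)
from (
  -- order by imposes a total order through one reducer in Hive,
  -- so train_lda should see every document in a single task
  select docid, features from input order by docid
) t
;
```

The parallel variant could be stored the same way by placing its select statement after create table lda_model as.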
In the table, label indicates a topic index, and lambda is a value that represents how strongly each word characterizes a topic. In other words, the top-N words of a topic in terms of lambda are its topic words.
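A minimal sketch of listing the top-N topic words, assuming the model is stored in lda_model with the columns label, word, and lambda. It relies only on Hive's standard row_number() window function; the cutoff of 5 and the alias rnk are arbitrary choices made for this example.

```sql
-- Top 5 words per topic, ranked by lambda
select
  label,
  word,
  lambda
from (
  select
    label,
    word,
    lambda,
    -- rank words within each topic, largest lambda first
    row_number() over (partition by label order by lambda desc) as rnk
  from
    lda_model
) t
where
  rnk <= 5
order by
  label, rnk
;
```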
In this example, we can observe that topic 0 corresponds to document 1, and topic 1 represents the words in document 2.
Predicting Topic Assignments of Documents
Once you have constructed a topic model as described above, the function lda_predict() allows you to predict topic assignments of documents.
For example, if we consider the docs table, exactly the same set of documents as used for training, the probability that a document is assigned to each topic can be computed by:
with test as (
  select
    docid,
    word,
    count(word) as value
  from
    docs t1
    LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
)
select
  t.docid,
  lda_predict(t.word, t.value, m.label, m.lambda, '-topics 2') as probabilities
from
  test t
  JOIN lda_model m ON (t.word = m.word)
group by
  t.docid
;
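The result has two columns, docid and probabilities, where probabilities holds the topic labels together with their probabilities, sorted by probability.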
Importantly, the -topics option must be set to the same value as the one used for training.
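Since the probabilities are sorted in descending order, the label of the most probable topic is the first element. Assuming the prediction result above is stored as a table named topic, that label is easily obtained as:
select docid, probabilities[0].label from topic;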
Of course, a different set of documents can be used for prediction. Predicting topic assignments of newly observed documents is a more realistic scenario.
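For instance, prediction on unseen documents might look like the following sketch. The table new_docs, with the same docid and doc columns as docs, is hypothetical; everything else mirrors the prediction query above and reuses the trained lda_model.

```sql
-- Hypothetical table: new_docs(docid, doc), same layout as docs
with test as (
  select
    docid,
    word,
    count(word) as value
  from
    new_docs t1
    LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
)
select
  t.docid,
  -- -topics must match the value used for training
  lda_predict(t.word, t.value, m.label, m.lambda, '-topics 2') as probabilities
from
  test t
  JOIN lda_model m ON (t.word = m.word)
group by
  t.docid
;
```

Because this is an inner join, words that do not occur in lda_model are dropped, and a new document that shares no words with the model produces no row at all.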