Hivemall has a generic function for classification: train_classifier. Compared to the other functions we will see in later chapters, train_classifier provides a simpler, configurable, generic interface that can be used to build binary classification models in a variety of settings.
This feature is supported in Hivemall v0.5-rc.1 and later.
create table classification_model as
select
  feature,
  avg(weight) as weight
from (
  select
    train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no') as (feature, weight)
  from
    a9a_train
) t
group by feature;
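The outer `avg(weight) ... group by feature` step collapses the (feature, weight) rows emitted by the training function into a single weight per feature. The following is a minimal Python sketch of that aggregation only. It assumes that several rows can exist for the same feature (for example, one per parallel training task); the function name `average_weights` and the sample rows are hypothetical and not part of Hivemall.

```python
from collections import defaultdict

def average_weights(rows):
    """Average the weights of rows that share a feature (mirrors avg(weight) ... group by feature)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in rows:
        sums[feature] += weight
        counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}

# Hypothetical (feature, weight) rows, as if emitted by two training tasks.
rows = [("1", 0.2), ("2", -0.4), ("1", 0.4), ("0", 0.1)]
model = average_weights(rows)
```

Here feature "1" appears twice, so its final weight is the mean of the two emitted values, while features "0" and "2" keep their single value.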
Prediction & evaluation
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    a9a_test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob,
    (case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as label
  from
    test_exploded t LEFT OUTER JOIN
    classification_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    p.label as predicted,
    p.prob as probability
  from
    a9a_test t JOIN predict p on (t.rowid = p.rowid)
)
select
  sum(if(actual = predicted, 1, 0)) / count(1) as accuracy
from
  submit;
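The prediction query can be read as three steps: sum `weight * value` over the features of each row, pass the sum through a sigmoid, and label the row 1.0 when the probability is at least 0.5. Accuracy is then the fraction of rows whose predicted label equals the actual one. The sketch below restates that logic in plain Python under stated assumptions: features are already split into (feature, value) pairs, the model is a dict from feature to weight, and features missing from the model contribute nothing to the sum (as with the LEFT OUTER JOIN). The helper names and sample data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(features, model):
    """Return (probability, label) for one row of (feature, value) pairs."""
    # Features absent from the model are skipped, so they add nothing to the score.
    score = sum(model[f] * v for f, v in features if f in model)
    prob = sigmoid(score)
    return prob, (1.0 if prob >= 0.5 else 0.0)

def accuracy(rows, model):
    """Fraction of (features, actual_label) rows whose predicted label matches."""
    hits = sum(1 for features, actual in rows if predict(features, model)[1] == actual)
    return hits / len(rows)

# Hypothetical model and test rows; feature "0" plays the role of the bias term.
model = {"0": 0.0, "1": 2.0, "2": -3.0}
rows = [
    ([("0", 1.0), ("1", 1.0)], 1.0),
    ([("0", 1.0), ("2", 1.0)], 0.0),
    ([("0", 1.0), ("3", 1.0)], 0.0),  # feature "3" is not in the model
]
print(accuracy(rows, model))
```

In the third row only the zero-weight bias feature matches, so the score is 0, the probability is exactly 0.5, and the `>= 0.5` rule assigns label 1.0.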
Comparison with the other binary classifiers
In the next part of this user guide, our binary classification tutorials introduce many different functions:
- Logistic Regression
- Passive Aggressive
All of them share the same interface, but their mathematical formulations and implementations differ.
In particular, the sample queries above are almost the same as those in the a9a tutorial using Logistic Regression. The only difference is the choice of training function: the a9a tutorial trains with logress, whereas the query above uses train_classifier.
At the same time, the options -loss logloss -opt SGD -reg no passed to train_classifier tell Hivemall to run the generic classifier as logress. Hence, under this configuration, the prediction accuracy of logress and train_classifier should be (almost) the same.
train_classifier also supports the -mini_batch option in the same way as logress. Thus, the following two training queries produce the same results:
select logress(add_bias(features), label, '-mini_batch 10') as (feature, weight) from a9a_train
select train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -mini_batch 10') as (feature, weight) from a9a_train
Likewise, you can build many different classifiers by changing the options passed to train_classifier.
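To make the `-loss logloss -opt SGD -reg no -mini_batch` combination discussed above more concrete, here is a minimal Python sketch of mini-batch stochastic gradient descent on the logistic loss with no regularization. It illustrates the general technique only and is not Hivemall's implementation: the fixed learning rate `eta`, the `epochs` parameter, the function name, and the sample data are all assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_logloss_sgd(rows, eta=0.1, mini_batch=1, epochs=1):
    """Mini-batch SGD on logistic loss; rows are (features, label) with label in {0.0, 1.0}."""
    weights = {}
    for _ in range(epochs):
        for start in range(0, len(rows), mini_batch):
            batch = rows[start:start + mini_batch]
            grad = {}
            # Accumulate the logistic-loss gradient over the batch.
            for features, label in batch:
                score = sum(weights.get(f, 0.0) * v for f, v in features)
                err = sigmoid(score) - label
                for f, v in features:
                    grad[f] = grad.get(f, 0.0) + err * v
            # Apply one update per batch, using the averaged gradient.
            for f, g in grad.items():
                weights[f] = weights.get(f, 0.0) - eta * g / len(batch)
    return weights

# Hypothetical data: feature "1" marks positives, feature "2" marks negatives.
rows = [
    ([("0", 1.0), ("1", 1.0)], 1.0),
    ([("0", 1.0), ("2", 1.0)], 0.0),
] * 5
weights = train_logloss_sgd(rows, eta=0.1, mini_batch=2, epochs=50)
```

With `mini_batch=1` the loop reduces to plain per-row SGD. Larger values trade more rows per update for fewer updates per pass.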