Area Under the ROC Curve
Binary classifiers generally predict how likely a sample is to be positive by computing a probability. We can therefore evaluate a classifier by comparing its predicted probabilities with the ground-truth positive/negative labels.
Assume that there is a table which contains predicted scores (i.e., probabilities) and truth labels as follows:

prob    label
0.5     0
0.3     1
0.2     0
0.8     1
0.7     1
Once the rows are sorted by probability in descending order, AUC measures how many positive (label=1) samples are ranked higher than negative (label=0) samples. If many positive rows receive larger scores than negative rows, AUC is large, which indicates that the classifier performs well.
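The ranking interpretation can be checked by hand with a short Python sketch. It counts, over every (positive, negative) pair, how often the positive row has the larger score. The function name `pairwise_auc` and the half-credit rule for tied scores are assumptions made for illustration, and the rows are the five sample values used in the queries on this page. This is not Hivemall's implementation.

```python
# Sample rows as (predicted probability, truth label).
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]


def pairwise_auc(rows):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Tied scores count as half a pair (an assumption of this sketch).
    """
    pos = [p for p, y in rows if y == 1]
    neg = [p for p, y in rows if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


auc_value = pairwise_auc(rows)
print(round(auc_value, 5))  # 0.83333
```

Five of the six positive/negative pairs are ordered correctly (only the positive row with score 0.3 ranks below the negative row with score 0.5), which gives 5/6.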
Compute AUC on Hivemall
In Hivemall, the function auc(double score, int label) computes AUC for pairs of probability and truth label.
Sequential AUC computation on a single node
For instance, the following query computes AUC of the table which was shown above:
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc
from (
  select prob, label
  from data
  ORDER BY prob DESC
) t;
This query returns 0.83333 as AUC.
Because AUC is a metric based on ranked probability-label pairs, as mentioned above, the input rows must be sorted by score in descending order.
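To see why the ordering matters, here is a minimal Python sketch of a single-pass AUC computation over pre-sorted rows. The function name `auc_from_sorted` is hypothetical, and the sketch assumes there are no tied scores. It illustrates the general idea of a sequential computation and is not taken from Hivemall's source.

```python
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]


def auc_from_sorted(sorted_rows):
    """One pass over rows that are expected to be sorted by score, descending.

    Assumes no tied scores. Every negative row contributes the number of
    positive rows already seen, i.e. the positives ranked above it.
    """
    positives_seen = 0
    negatives_seen = 0
    ordered_pairs = 0
    for _score, label in sorted_rows:
        if label == 1:
            positives_seen += 1
        else:
            negatives_seen += 1
            ordered_pairs += positives_seen
    if positives_seen == 0 or negatives_seen == 0:
        raise ValueError("AUC needs at least one positive and one negative row")
    return ordered_pairs / (positives_seen * negatives_seen)


# Correct usage: sort by score in descending order first.
sorted_auc = auc_from_sorted(sorted(rows, reverse=True))
# Incorrect usage: feeding rows in arbitrary order changes the result.
unsorted_auc = auc_from_sorted(rows)

print(round(sorted_auc, 5))    # 0.83333
print(round(unsorted_auc, 5))  # 0.16667
```

The same function returns a very different value when the rows arrive unsorted, because a one-pass computation treats the arrival order as the ranking.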
Parallel approximate AUC computation
The DISTRIBUTE BY clause allows you to compute AUC in parallel:
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select
  auc(prob, label) as auc
from (
  select prob, label
  from data
  DISTRIBUTE BY floor(prob / 0.2)
  SORT BY prob DESC
) t;
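floor(prob / 0.2) distributes the rows into bins of width 0.2 for the AUC computation. Because the column prob is in the [0, 1] range, this yields 5 bins (keys 0 to 4); a row whose prob is exactly 1 gets key 5 and forms an additional bin.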
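The bin assignment can be reproduced outside Hive with a short Python sketch. It only mirrors the key expression floor(prob / 0.2) and groups the sample rows by that key. The names `BIN_WIDTH` and `bin_key` are hypothetical, and how Hivemall combines the per-bin partial results is not shown here.

```python
import math
from collections import defaultdict

rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]
BIN_WIDTH = 0.2  # same constant as in DISTRIBUTE BY floor(prob / 0.2)


def bin_key(prob, width=BIN_WIDTH):
    """Key that decides which bin a row is sent to."""
    return math.floor(prob / width)


# Group rows by key, as DISTRIBUTE BY would route them.
bins = defaultdict(list)
for prob, label in rows:
    bins[bin_key(prob)].append((prob, label))

# Within each bin, rows are sorted by score descending, as SORT BY prob DESC does.
for key in sorted(bins, reverse=True):
    print(key, sorted(bins[key], reverse=True))
```

With the sample data, keys 4, 3, and 2 each receive one row and key 1 receives two rows (0.3 and 0.2). A narrower bin width would create more bins and therefore more potential parallelism, at the cost of fewer rows per bin.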
Difference between AUC and Logarithmic Loss
Hivemall has another metric called Logarithmic Loss for binary classification. Both AUC and Logarithmic Loss compute scores for probability-label pairs.
The score produced by AUC is a relative metric based on the ordering of the sorted pairs. Logarithmic Loss, on the other hand, compares each probability with its truth label one by one.
To give an example, auc(prob, label) and logloss(prob, label) return 0.83333 and 0.54001, respectively, in the above case. Note that larger AUC and smaller Logarithmic Loss are better.
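The Logarithmic Loss value can be cross-checked with a small Python sketch that averages the negative log-likelihood over the five sample rows. The function name `log_loss` is chosen for illustration, and the sketch assumes the metric is the mean over rows, which agrees with the value reported above up to rounding in the last digit.

```python
import math

rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]


def log_loss(rows):
    """Mean negative log-likelihood of the truth labels under the predictions.

    Assumes every probability lies strictly between 0 and 1.
    """
    total = 0.0
    for prob, label in rows:
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(rows)


loss = log_loss(rows)
print(f"{loss:.4f}")  # 0.5400
```

Unlike AUC, each row contributes to the loss independently of the others, so the order of the rows does not affect the result.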