Area Under the ROC Curve

The ROC curve and the Area Under the ROC Curve (AUC) are widely used metrics for evaluating binary (i.e., positive or negative) classifiers such as logistic regression.

Binary classifiers generally predict how likely a sample is to be positive by computing a probability. We can therefore evaluate a classifier by comparing its predicted probabilities with the true positive/negative labels.

Assume that we have a table containing predicted scores (i.e., probabilities) and truth labels as follows:

probability (predicted score)    truth label
0.5                              0
0.3                              1
0.2                              0
0.8                              1
0.7                              1

Once the rows are sorted by probability in descending order, AUC measures how often positive (label=1) samples are ranked above negative (label=0) samples. The more positive rows that score higher than negative rows, the larger the AUC, and the better the classifier separates the two classes.
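The ranking idea can be checked by hand on the five rows above. The following is a minimal Python sketch of the textbook pairwise definition of AUC (the fraction of positive/negative pairs in which the positive row has the higher score). The function name `pairwise_auc` is made up for illustration, and the sketch does not claim to reproduce how Hivemall computes the value internally.

```python
from itertools import product

# (probability, truth label) rows from the table above
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]


def pairwise_auc(rows):
    """Fraction of (positive, negative) pairs in which the positive row
    has the larger score; a tie counts as half a correct pair."""
    pos = [p for p, label in rows if label == 1]
    neg = [p for p, label in rows if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")
    correct = 0.0
    for p, n in product(pos, neg):
        if p > n:
            correct += 1.0
        elif p == n:
            correct += 0.5
    return correct / (len(pos) * len(neg))


print(round(pairwise_auc(rows), 5))  # 0.83333
```

Three positive rows and two negative rows give six pairs. Only the pair (0.3, 0.5) is ranked the wrong way round, so the result is 5/6.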

Compute AUC on Hivemall

In Hivemall, the function auc(double score, int label) computes AUC from pairs of predicted probability and truth label.

Sequential AUC computation on a single node

For instance, the following query computes the AUC of the table shown above:

with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select 
  auc(prob, label) as auc
from (
  select prob, label
  from data
  ORDER BY prob DESC
) t;

This query returns 0.83333 as the AUC, because 5 of the 6 positive/negative pairs are ranked correctly.

Since AUC is a metric based on ranked probability-label pairs, as mentioned above, the input rows need to be sorted by score in descending order.
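To see why the ordering matters, consider a single-pass computation that only works on ranked input. The Python sketch below is an illustration under stated assumptions (no tied scores, a hypothetical function name `streaming_auc`). It is not taken from Hivemall's source and is only meant to show how a sequential aggregation depends on row order.

```python
# (probability, truth label) rows from the table above
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]


def streaming_auc(ranked_rows):
    """One pass over rows that are already sorted by score, highest first.
    Assumes there are no tied scores."""
    positives_seen = 0
    negatives_seen = 0
    correct_pairs = 0
    for _, label in ranked_rows:
        if label == 1:
            positives_seen += 1
        else:
            negatives_seen += 1
            # Every positive seen so far is ranked above this negative.
            correct_pairs += positives_seen
    if positives_seen == 0 or negatives_seen == 0:
        raise ValueError("AUC needs at least one positive and one negative row")
    return correct_pairs / (positives_seen * negatives_seen)


ranked = sorted(rows, key=lambda r: r[0], reverse=True)
print(round(streaming_auc(ranked), 5))  # 0.83333
print(round(streaming_auc(rows), 5))    # 0.16667, wrong: rows were not sorted
```

Feeding the same rows in their original, unsorted order yields a different and meaningless number, which is why the subquery sorts by prob before the aggregation.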

Parallel approximate AUC computation

Alternatively, Hive's distribute by clause allows you to compute an approximate AUC in parallel:

with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select 
  auc(prob, label) as auc
from (
  select prob, label
  from data
  DISTRIBUTE BY floor(prob / 0.2)
  SORT BY prob DESC
) t;

Note that floor(prob / 0.2) distributes the rows into bins of width 0.2 for the AUC computation. Because the column prob lies in the [0, 1] range, this gives 5 bins, plus one extra bin for rows whose prob is exactly 1.0.
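The bin assignment can be previewed outside Hive. This short Python sketch mirrors the floor(prob / 0.2) expression for the sample rows. The helper name `bin_key` is hypothetical, and the sketch assumes that Hive evaluates the division in the same double-precision arithmetic as Python.

```python
import math

# Probabilities from the sample table, in their original order
probs = [0.5, 0.3, 0.2, 0.8, 0.7]


def bin_key(prob, width=0.2):
    """Mirror of the floor(prob / 0.2) expression used in DISTRIBUTE BY."""
    return math.floor(prob / width)


print([bin_key(p) for p in probs])  # [2, 1, 1, 4, 3]
print(bin_key(1.0))                 # 5
```

Rows with the same key are sent to the same reducer, so 0.3 and 0.2 end up together here. A narrower width would spread the rows over more bins.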

Difference between AUC and Logarithmic Loss

Hivemall has another metric called Logarithmic Loss for binary classification. Both AUC and Logarithmic Loss compute scores for probability-label pairs.

The score produced by AUC is relative: it depends only on the ordering of the sorted pairs. Logarithmic Loss, on the other hand, compares each probability with its truth label one by one.

To give an example, auc(prob, label) and logloss(prob, label) return 0.83333 and 0.54001, respectively, in the above case. Note that a larger AUC and a smaller Logarithmic Loss are better.
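The Logarithmic Loss figure can be reproduced with the standard definition, the mean negative natural-log likelihood of the truth labels. The Python sketch below assumes every probability is strictly between 0 and 1 and uses a made-up function name `log_loss`. Whether Hivemall's logloss clamps or otherwise preprocesses its input is not covered here.

```python
import math

# (probability, truth label) rows from the table above
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]


def log_loss(rows):
    """Mean negative log-likelihood of the truth labels (natural log).
    Assumes 0 < probability < 1 for every row."""
    total = 0.0
    for p, label in rows:
        # Penalize the probability assigned to the class that actually occurred.
        total += -math.log(p) if label == 1 else -math.log(1.0 - p)
    return total / len(rows)


print(f"{log_loss(rows):.6f}")  # 0.540016
```

Unlike AUC, each row contributes independently of the others: the positive row scored 0.3 alone adds about 1.204 to the sum, regardless of how the other rows are ranked.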
