Approximate Counting using HyperLogLog

count(distinct value) can often cause memory exhausted errors where input data and the cardinarity of value are large.

HyperLogLog is an efficient algorithm for approximating the number of distinct elements in a multiset. Hivemall implements HyperLogLog++ in approx_count_distinct.

Usage

approx_count_distinct is less accurate than COUNT(DISTINCT expression), but performs better on huge input.

select
    count(distinct rowid) as actual,
    approx_count_distinct(rowid) as default_p 
from
    train;
actual default_p
45840617 45567770
select
    approx_count_distinct(rowid, '-p 4') as p4,
    approx_count_distinct(rowid, '-p 6 -sp 6') as p6_sp6,
    approx_count_distinct(rowid, '-p 14') as p14,
    approx_count_distinct(rowid, '-p 15') as p15,
    approx_count_distinct(rowid, '-p 16') as p16,
    approx_count_distinct(rowid, '-p 24') as p24,
    approx_count_distinct(rowid, '-p 25') as p25,
    approx_count_distinct(rowid, '-p 15 -sp 15') as p15_sp15
from
    train;
p4 p6_sp6 p14 p15 p16 p24 p25 p15_sp15
38033066 49332600 45051015 45567770 45614484 45831359 45832280 45567770

Note

p controls expected precision and memory consumption tradeoff and default p=15 generally works well. Find More information on this paper.

Function Signature

You can find the function signature and options of approx_count_distinct is as follows:

select 
    approx_count_distinct(rowid, '-help')
from
    train;
usage: HLLEvaluator [-help] [-p <arg>] [-sp <arg>]
 -help       Show function help
 -p <arg>    The size of registers for the normal set. `p` MUST be in the
             range [4,sp] and 15 by the default
 -sp <arg>   The size of registers for the sparse set. `sp` MUST be in the
             range [4,32] and 25 by the defaul

results matching ""

    No results matching ""