Hivemall supports feature hashing (a.k.a. the hashing trick) through the feature_hashing and mhash functions. The examples below show how the two functions differ.

feature_hashing function

feature_hashing applies MurmurHash3 hashing to features.

select feature_hashing('aaa');


select feature_hashing('aaa','-features 3');


select feature_hashing(array('aaa','bbb'));


select feature_hashing(array('aaa','bbb'),'-features 10');


select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'));


select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-libsvm');


select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10');


select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10 -libsvm');


select feature_hashing(array(1,2,3));


select feature_hashing(array('1','2','3'));


select feature_hashing(array('1:0.1','2:0.2','3:0.3'));


select feature_hashing(features), features from training_fm limit 2;

["1803454","6630176"]   ["userid#5689","movieid#3072"]
["1828616","6238429"]   ["userid#4505","movieid#2331"]

select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999", "movieid#2331"));


select feature_hashing();

usage: feature_hashing(array<string> features [, const string options]) -
       returns a hashed feature vector in array<string> [-features <arg>]
 -features,--num_features <arg>   The number of features [default:
                                  16777217 (2^24)]
 -libsvm                          Returns in libsvm format
                                  (<index>:<value>)* sorted by index
                                  ascending order


Hash values start from 1; 0 is reserved by the system for the bias clause. The default number of features is 16777217 (2^24 + 1; the usage text labels it 2^24). You can control the number of features with the -num_features (or -features) option.
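
The index range and the handling of name:value features can be sketched in Python. This is a minimal illustration under stated assumptions, not Hivemall's implementation: zlib.crc32 stands in for MurmurHash3 (so the indices will not match Hivemall's output), the helper names hash_index and hash_features are hypothetical, splitting on the last ':' is an assumed parsing rule, and assigning 1.0 to features without a value in libsvm mode is also an assumption.

```python
import zlib

NUM_FEATURES = 2 ** 24  # assumed default for this sketch only


def hash_index(name, num_features=NUM_FEATURES):
    """Map a feature name to an index in [1, num_features]; 0 is never produced."""
    # zlib.crc32 is a stand-in for MurmurHash3.
    return zlib.crc32(name.encode("utf-8")) % num_features + 1


def hash_features(features, num_features=NUM_FEATURES, libsvm=False):
    """Hash the name part of each "name" or "name:value" feature and keep the value."""
    hashed = []
    for feature in features:
        name, sep, value = feature.rpartition(":")
        if not sep:  # plain "name" without a ":value" part
            name, value = feature, None
        hashed.append((hash_index(name, num_features), value))
    if libsvm:
        # <index>:<value> pairs sorted by index in ascending order.
        # Assumption: a missing value is written as 1.0.
        hashed.sort(key=lambda pair: pair[0])
        return [f"{i}:{v if v is not None else '1.0'}" for i, v in hashed]
    return [str(i) if v is None else f"{i}:{v}" for i, v in hashed]


print(hash_features(["aaa:1.0", "aaa", "bbb:2.0"], num_features=10))
print(hash_features(["aaa:1.0", "aaa", "bbb:2.0"], num_features=10, libsvm=True))
```

In this sketch 'aaa:1.0' and 'aaa' map to the same index because only the name is hashed, and a small num_features such as 10 makes collisions between different names likely.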

mhash function

describe function extended mhash;

mhash(string word) returns a murmurhash3 INT value starting from 1

select mhash('aaa');


Note: The default number of features is 16777216 (2^24).

set hivevar:num_features=16777216;
select mhash('aaa',${num_features});


Note: mhash returns a MurmurHash3 value incremented by 1, so results start from 1. It never returns 0, which is reserved by the system.

set hivevar:num_features=1;
select mhash('aaa',${num_features});
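
The effect of the +1 shift can be sketched in Python. This is an illustration only: zlib.crc32 stands in for MurmurHash3, mhash_sketch is a hypothetical name, and reducing the hash with a modulo before adding 1 is an assumption about how the range is enforced.

```python
import zlib


def mhash_sketch(word, num_features=2 ** 24):
    # zlib.crc32 stands in for MurmurHash3; only the range logic matters here.
    bucket = zlib.crc32(word.encode("utf-8")) % num_features  # 0 .. num_features - 1
    return bucket + 1  # shifted to 1 .. num_features, so 0 is never returned


# With a single feature the only bucket is 0, so the result is always 1.
print(mhash_sketch("aaa", 1))  # prints 1
```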


Note: mhash does not consider feature values.

select mhash('aaa:2.0');


Note: mhash always returns a scalar INT value.

select mhash(array('aaa','bbb'));


Note: The mhash value of an array is sensitive to element order.

select mhash(array('bbb','aaa'));
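
One way a single hash over an array becomes order-sensitive is to join the elements before hashing. The Python sketch below illustrates that idea and the fact that a ':value' suffix is hashed as part of the word. It is not Hivemall's implementation: zlib.crc32 stands in for MurmurHash3, the function names are hypothetical, and joining the elements with ',' is an assumption.

```python
import zlib


def mhash_sketch(word, num_features=2 ** 24):
    # zlib.crc32 stands in for MurmurHash3, so values differ from Hivemall's.
    return zlib.crc32(word.encode("utf-8")) % num_features + 1


def mhash_array_sketch(words, num_features=2 ** 24):
    # Assumption: elements are joined in order and hashed as one string,
    # which yields a single INT that depends on element order.
    return mhash_sketch(",".join(words), num_features)


# The whole string is hashed, so "aaa:2.0" is simply a different word than "aaa".
print(mhash_sketch("aaa"), mhash_sketch("aaa:2.0"))

# Swapping the elements changes the joined string and therefore the hash input.
print(mhash_array_sketch(["aaa", "bbb"]), mhash_array_sketch(["bbb", "aaa"]))
```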


