This page explains the input format of training data in Hivemall. Here, we use EBNF-like notation for describing the format.
- Input Format for Classification
- Features format (for classification and regression)
- Label format in Binary Classification
- Label format in Multi-class Classification
- Input format in Regression
- Helper functions
Input Format for Classification
The classifiers of Hivemall takes 2 (or 3) arguments: features, label, and options (a.k.a. hyperparameters). The first two arguments of training functions represents training examples.
In Statistics, features and label are called Explanatory variable and Response Variable, respectively.
Features format (for classification and regression)
The format of features is common between (binary and multi-class) classification and regression.
ARRAY<INT|BIGINT|TEXT> for the type of features column.
The format of each feature in an array is as follows:
feature ::= <index>:<weight> or <index>
Each element of index or weight then accepts the following format:
index ::= <INT | BIGINT | TEXT> weight ::= <FLOAT>
The index are usually a number (INT or BIGINT) starting from 1. Here is an instance of a features.
10:3.4 123:0.5 34567:0.231
Note: As mentioned later, index "0" is reserved for a Bias/Dummy variable.
In addition to numbers, you can use a TEXT value for an index. For example, you can use array("height:1.5", "length:2.0") for the features.
Quantitative and Categorical variables
A quantitative variable must have an index entry.
Hivemall (v0.3.1 or later) provides add_feature_index function which is useful for adding indexes to quantitative variables.
select add_feature_index(array(3,4.0,5)) from dual;
You can omit specifying weight for each feature e.g. for Categorical variables as follows:
feature ::= <index>
Note 1.0 is used for the weight when omitting weight.
Bias/Dummy Variable in features
Note that "0" is reserved for a Bias variable (called dummy variable in Statistics).
The addBias function is Hivemall appends "0:1.0" as an element of array in features.
The mhash function takes a feature (i.e., index) of TEXT format and generates a hash number of a range from 1 to 2^24 (=16777216) by the default setting.
Feature hashing is useful where the dimension of feature vector (i.e., the number of elements in features) is so large. Consider applying mhash function) when a prediction model does not fit in memory and OutOfMemory exception happens.
In general, you don't need to use mhash when the dimension of feature vector is less than 16777216. If feature index is very long TEXT (e.g., "xxxxxxx-yyyyyy-weight:55.3") and uses huge memory spaces, consider using mhash as follows:
-- feature is v0.3.2 or before concat(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), ":", extract_weight("xxxxxxx-yyyyyy-weight:55.3")) -- feature is v0.3.2-1 or later feature(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), extract_weight("xxxxxxx-yyyyyy-weight:55.3"))
Feature (weight) normalization is important in machine learning. Please refer this article for detail.
Label format in Binary Classification
The label must be an INT typed column and the values are positive (+1) or negative (-1) as follows:
<label> ::= 1 | -1
Alternatively, you can use the following format that represents 1 for a positive example and 0 for a negative example:
<label> ::= 0 | 1
Label format in Multi-class Classification
You can used any PRIMITIVE type in the multi-class label.
<label> ::= <primitive type>
Typically, the type of label column will be INT, BIGINT, or TEXT.
Input format in Regression
In regression, response/predictor variable (we denote it as target) is a real number.
Before Hivemall v0.3, we accepts only FLOAT type for target.
<target> ::= <FLOAT>
You need to explicitly cast a double value of target to float as follows:
CAST(target as FLOAT)
On the other hand, Hivemall v0.3 or later accepts double compatible numbers in target.
<target> ::= <FLOAT | DOUBLE | INT | TINYINT | SMALLINT| BIGINT >
Target in Logistic Regression
Logistic regression is actually a binary classification scheme while it can produce probabilities of positive of a training example.
A target value of a training input must be in range 0.0 to 1.0, specifically 0.0 or 1.0.
-- hivemall v0.3.2 and before select concat("weight",":",55.0); -- hivemall v0.3.2-1 and later select feature("weight", 55.0);
select extract_feature("weight:55.0"), extract_weight("weight:55.0");
weight | 55.0
-- hivemall v0.4.0 and later select feature_index(array("10:0.2","7:0.3","9"));
select convert_label(-1), convert_label(1), convert_label(0.0f), convert_label(1.0f) from dual;
0.0f | 1.0f | -1 | 1
array<string> quantitative_features(array<string> featureNames, ...) is a helper function to create sparse quantitative features from a table.
array<string> categorical_features(array<string> featureNames, ...) is a helper function to create sparse categorical features from a table.
select categorical_features( array("is_cat","is_dog","is_lion","is_pengin","species"), 1, 0, 1.0, true, "dog" );
Preparing training data table
You can create a training data table as follows:
select rowid() as rowid, concat_array( array("bias:1.0"), categorical_features( array("id", "name"), id, name ), quantitative_features( array("height", "weight"), height, weight ) ) as features, click_or_not as label from table;