# What is a "prediction problem"?

In the context of machine learning, numerous tasks can be seen as prediction problems, and this user guide provides solutions for several of them.

For any kind of prediction problem, we generally provide a set of input-output pairs:

• Input: Set of features
  • e.g., ["1:0.001","4:0.23","35:0.0035",...]
• Output: Target value
  • e.g., 1, 0, 0.54, 42.195, ...
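To make the input format concrete, here is a minimal Python sketch that turns one feature array into a sparse mapping. It assumes every entry has the form "name:value", as in the examples above; the helper name `parse_features` is hypothetical and not part of Hivemall.

```python
def parse_features(features):
    """Turn ["1:0.001", "4:0.23"] into {"1": 0.001, "4": 0.23}."""
    parsed = {}
    for item in features:
        # Split on the last colon so the value is always the final part.
        name, _, value = item.rpartition(":")
        parsed[name] = float(value)
    return parsed


# One training sample is an (input, output) pair.
sample = (parse_features(["1:0.001", "4:0.23", "35:0.0035"]), 0.54)
```

Keeping feature names as strings is a design choice for this sketch: it works for numeric indices and for named features alike.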

Once a prediction model has been constructed from the samples, the model can make predictions for unseen inputs.

To train prediction models, an algorithm called stochastic gradient descent (SGD) is normally applied.

Importantly, depending on the type of output value, prediction problems are categorized into regression and classification problems.

# Regression

The goal of regression is to predict real values as shown below:

| features (input) | target real value (output) |
|:---|:---|
| ["1:0.001","4:0.23","35:0.0035",...] | 21.3 |
| ["1:0.2","3:0.1","13:0.005",...] | 6.2 |
| ["5:1.3","22:0.089","77:0.0001",...] | 17.1 |
| ... | ... |

In practice, target values can be any numeric values: small or large, floating-point or integer, negative or positive. For example, our CTR prediction tutorial solves a regression problem with small floating-point target values in the range 0 to 1.

While there are several ways to perform regression with Hivemall, train_regressor() is one of the most flexible functions. This function is explained on a separate page.

# Classification

In contrast to regression, the output of a classification problem is an (integer) label:

| features (input) | label (output) |
|:---|:---|
| ["1:0.001","4:0.23","35:0.0035",...] | 0 |
| ["1:0.2","3:0.1","13:0.005",...] | 1 |
| ["5:1.3","22:0.089","77:0.0001",...] | 1 |
| ... | ... |

When the number of possible labels is 2 (0/1 or -1/1), the problem is binary classification, and Hivemall's train_classifier() function lets you build binary classifiers. The Binary Classification page demonstrates how to use the function.

Another type of classification problem is multi-class classification, in which the number of possible labels is more than 2. Multi-class problems require different functions, and our news20 and iris tutorials are helpful starting points.

# Mathematical formulation of generic prediction model

Here, we briefly explain how a prediction model is constructed.

First and foremost, we represent input and output for prediction models as follows:

• Input: a vector $\mathbf{x}$
• Output: a value $y$

For a set of samples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_n, y_n)$, the goal of prediction algorithms is to find a weight vector (i.e., parameters) $\mathbf{w}$ by minimizing the following error:

$E(\mathbf{w}) := \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{w}; \mathbf{x}_i, y_i) + \lambda R(\mathbf{w})$

In the above formulation, there are two auxiliary functions we have to know:

• $L(\mathbf{w}; \mathbf{x}_i, y_i)$
  • Loss function for a single sample $(\mathbf{x}_i, y_i)$ and a given $\mathbf{w}$.
  • If this function produces small values, the parameter $\mathbf{w}$ has been learnt successfully.
• $R(\mathbf{w})$
  • Regularization function for the current parameter $\mathbf{w}$.
  • It prevents the model from falling into an undesirable condition called over-fitting.

($\lambda$ is a small value which controls the effect of the regularization function.)
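The error $E(\mathbf{w})$ can be evaluated directly from its definition. The following Python sketch does so under stated assumptions: a linear model whose prediction is the dot product of $\mathbf{w}$ and $\mathbf{x}$, a textbook squared loss, and a textbook L2 regularizer. All function names are hypothetical, and Hivemall's internal scaling may differ.

```python
def dot(w, x):
    # Sparse dot product; features missing from w count as weight 0.
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def squared_loss(w, x, y):
    # L(w; x, y) for a single sample, assuming a linear prediction w.x
    return 0.5 * (dot(w, x) - y) ** 2


def l2(w):
    # R(w): half of the squared Euclidean norm
    return 0.5 * sum(v * v for v in w.values())


def objective(w, samples, loss=squared_loss, reg=l2, lam=0.0001):
    # E(w) = (1/n) * sum_i L(w; x_i, y_i) + lambda * R(w)
    n = len(samples)
    return sum(loss(w, x, y) for x, y in samples) / n + lam * reg(w)
```

For instance, with the two samples `({"1": 1.0}, 2.0)` and `({"1": 2.0}, 4.0)`, the weight `{"1": 2.0}` fits both perfectly, so only the regularization term contributes to the error.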

Finally, minimizing the function $E(\mathbf{w})$ can be implemented by the SGD technique described above, and $\mathbf{w}$ itself is used as a "model" for future prediction.
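As an illustration of this idea, here is a minimal plain-SGD sketch in Python. It is not Hivemall's implementation: it assumes a linear model, squared loss, an L2 penalty applied only to the features present in the current sample, and a fixed learning rate. The names `sgd_train` and `predict` are hypothetical.

```python
import random


def dot(w, x):
    # Sparse dot product; features missing from w count as weight 0.
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def sgd_train(samples, eta=0.1, lam=0.0001, iterations=10, seed=0):
    rng = random.Random(seed)
    samples = list(samples)
    w = {}
    for _ in range(iterations):
        rng.shuffle(samples)  # visit the samples in random order
        for x, y in samples:
            error = dot(w, x) - y  # derivative of 0.5 * (w.x - y)^2 w.r.t. w.x
            for k, v in x.items():
                wk = w.get(k, 0.0)
                # gradient of the loss plus gradient of (lam / 2) * wk^2
                w[k] = wk - eta * (error * v + lam * wk)
    return w


def predict(w, x):
    # The learnt weight vector is the whole model.
    return dot(w, x)
```

Training on samples that follow y = 2 * x, for example, should drive the single weight close to 2, after which `predict` applies it to inputs that were not in the training set.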

Interestingly, depending on the choice of loss and regularization functions, the prediction model you obtain behaves differently; one combination may work as a classifier, while another may be appropriate for regression.

Below we list the possible options for train_regressor and train_classifier; this configurability is the reason why these two functions are the most flexible in Hivemall:

• Loss function: -loss, -loss_function
  • For train_regressor
    • SquaredLoss (synonym: squared)
    • QuantileLoss (synonym: quantile)
    • EpsilonInsensitiveLoss (synonym: epsilon_insensitive)
    • SquaredEpsilonInsensitiveLoss (synonym: squared_epsilon_insensitive)
    • HuberLoss (synonym: huber)
  • For train_classifier
    • HingeLoss (synonym: hinge)
    • LogLoss (synonym: log, logistic)
    • SquaredHingeLoss (synonym: squared_hinge)
    • ModifiedHuberLoss (synonym: modified_huber)
    • The following losses are mainly designed for regression but can sometimes be useful in classification as well:
      • SquaredLoss (synonym: squared)
      • QuantileLoss (synonym: quantile)
      • EpsilonInsensitiveLoss (synonym: epsilon_insensitive)
      • SquaredEpsilonInsensitiveLoss (synonym: squared_epsilon_insensitive)
      • HuberLoss (synonym: huber)
• Regularization function: -reg, -regularization
  • L1
  • L2
  • ElasticNet
  • RDA
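To give a feel for what some of these options compute, the sketch below writes out textbook forms of three losses and three regularizers in Python. These are common definitions, not code taken from Hivemall, and Hivemall's exact scaling may differ. The function names are hypothetical, and the 0.5 default for `l1_ratio` is an assumption of this sketch.

```python
import math

# p is the model's raw prediction (for a linear model, the dot product w.x).


def squared(p, y):
    # Squared loss: penalizes the distance between prediction and target.
    return 0.5 * (p - y) ** 2


def hinge(p, y):
    # Hinge loss: expects a label y in {-1, +1}; zero once the margin y*p >= 1.
    return max(0.0, 1.0 - y * p)


def logistic(p, y):
    # Log loss log(1 + exp(-y*p)) for y in {-1, +1}, written to avoid overflow.
    margin = y * p
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def l1(w):
    return sum(abs(v) for v in w.values())


def l2(w):
    return 0.5 * sum(v * v for v in w.values())


def elastic_net(w, l1_ratio=0.5):
    # Convex mix of the L1 and L2 penalties.
    return l1_ratio * l1(w) + (1.0 - l1_ratio) * l2(w)
```

The squared loss depends on how far the prediction is from a real-valued target, whereas hinge and log loss only depend on the signed margin. That is one way to see why the former suits regression and the latter two suit classification.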

Additionally, there are several variants of the SGD technique, and the variant is also configurable:

• Optimizer: -opt, -optimizer
  • SGD
  • Momentum
    • Hyperparameters
      • -alpha 1.0 Learning rate.
      • -momentum 0.9 Exponential decay rate of the first order moment.
  • Nesterov
  • AdaGrad (default)
  • RMSprop
  • RMSpropGraves
    • Description: Alex Graves's RMSprop introducing weight decay and momentum.
    • See: https://arxiv.org/abs/1308.0850
    • Hyperparameters
      • -alpha 1.0 Learning rate.
      • -decay 0.95 Weight decay rate
      • -momentum 0.9 Exponential decay rate of the first order moment.
      • -eps 1.0 Constant for numerical stability
  • AdaDelta
    • See: https://arxiv.org/abs/1212.5701
    • Hyperparameters
      • -decay 0.95 Weight decay rate
      • -eps 1e-6f Constant for numerical stability
  • Adam
    • Hyperparameters
      • -alpha 1.0 Learning rate.
      • -beta1 0.9 Exponential decay rate of the first order moment.
      • -beta2 0.999 Exponential decay rate of the second order moment.
      • -eps 1e-8f Constant for numerical stability
      • -decay 0.0 Weight decay rate
  • Nadam
    • Description: Nadam is Adam optimizer with Nesterov momentum.
    • Hyperparameters
      • same as Adam except ...
      • -scheduleDecay 0.004 Scheduled decay rate (for each 250 steps by the default; 1/250=0.004)
  • Eve
    • See: https://openreview.net/forum?id=r1WUqIceg
    • Hyperparameters
      • same as Adam except ...
      • -beta3 0.999 Decay rate for Eve coefficient.
      • -c 10 Constant used for gradient clipping clip(val, 1/c, c)
  • AdamHD
    • Description: Adam optimizer with Hypergradient Descent. Learning rate -alpha is automatically tuned.
    • Hyperparameters
      • same as Adam except ...
      • -alpha 0.02 Learning rate.
      • -beta -1e-6 Constant used for tuning learning rate.
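To show how such optimizers differ from plain SGD, here is a hedged Python sketch of the textbook AdaGrad and Adam update rules for sparse weights. It follows the commonly published formulas and is not Hivemall's code. The Adam defaults mirror the values listed above, while the AdaGrad defaults, the per-feature step counter, and the class and method names are assumptions of this sketch.

```python
import math


class AdaGrad:
    def __init__(self, eta=0.1, eps=1e-8):
        self.eta, self.eps = eta, eps
        self.sum_sq = {}  # running sum of squared gradients per feature

    def update(self, w, key, grad):
        s = self.sum_sq.get(key, 0.0) + grad * grad
        self.sum_sq[key] = s
        # Features with large accumulated gradients take smaller steps.
        w[key] = w.get(key, 0.0) - self.eta * grad / (math.sqrt(s) + self.eps)


class Adam:
    def __init__(self, alpha=1.0, beta1=0.9, beta2=0.999, eps=1e-8):
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.m, self.v, self.t = {}, {}, {}  # moments and step count per feature

    def update(self, w, key, grad):
        t = self.t.get(key, 0) + 1
        m = self.beta1 * self.m.get(key, 0.0) + (1 - self.beta1) * grad
        v = self.beta2 * self.v.get(key, 0.0) + (1 - self.beta2) * grad * grad
        self.t[key], self.m[key], self.v[key] = t, m, v
        m_hat = m / (1 - self.beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - self.beta2 ** t)  # bias-corrected second moment
        w[key] = w.get(key, 0.0) - self.alpha * m_hat / (math.sqrt(v_hat) + self.eps)
```

Both rules scale each feature's step by a statistic of its past gradients, so the first step has a magnitude close to the learning rate regardless of how large the raw gradient is.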

In my experience, the default (AdaGrad+RDA), AdaDelta, Adam, and AdamHD are worth trying.

### Note

Option values are case-insensitive, so you can write sgd, rda, or huberloss in lower-case letters.

Furthermore, the optimizer accepts auxiliary options such as:

• Number of iterations: -iter, -iterations [default: 10]
  • Repeat the optimizer's learning procedure more than once to find a better result.
• Convergence rate: -cv_rate, -convergence_rate [default: 0.005]
  • Define a stopping criterion for the iterative training.
  • If the criterion is too small or too large, you may encounter over-fitting or under-fitting, depending on the value of the -iter option.
• Mini-batch size: -mini_batch, -mini_batch_size [default: 1]
  • Instead of learning from samples one by one, this option lets the optimizer use multiple samples at once to minimize the error function.
  • An appropriate mini-batch size leads to efficient training and an effective prediction model.
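The interplay of these three options can be sketched in a few lines of Python. The stopping rule used here, which stops when the relative change of the total loss between two consecutive iterations falls below `cv_rate`, is an assumption chosen for illustration; Hivemall's actual criterion may differ. The function name `train` is hypothetical, and the model is again linear with squared loss.

```python
def dot(w, x):
    # Sparse dot product; features missing from w count as weight 0.
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def train(samples, iterations=10, cv_rate=0.005, mini_batch=1, eta=0.1):
    w = {}
    prev_loss = None
    for it in range(1, iterations + 1):
        total_loss = 0.0
        for start in range(0, len(samples), mini_batch):
            batch = samples[start:start + mini_batch]
            grads = {}
            for x, y in batch:
                error = dot(w, x) - y
                total_loss += 0.5 * error * error
                for k, v in x.items():
                    grads[k] = grads.get(k, 0.0) + error * v
            # One update per mini-batch, using the averaged gradient.
            for k, g in grads.items():
                w[k] = w.get(k, 0.0) - eta * g / len(batch)
        if prev_loss is not None and prev_loss > 0:
            change = abs(prev_loss - total_loss) / prev_loss
            if change < cv_rate:
                return w, it  # converged before reaching the iteration limit
        prev_loss = total_loss
    return w, iterations
```

The function returns the weights together with the number of iterations actually run, which makes it easy to see whether training stopped because of the convergence check or because it hit the iteration limit.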

For details of the available options, the following queries list all of them:

select train_regressor('-help');
select train_classifier('-help');

SELECT train_regressor('-help');

FAILED: UDFArgumentException
train_regressor takes two or three arguments: List<Int|BigInt|Text> features, float target [, constant string options]

usage: train_regressor(list<string|int|bigint> features, double label [,
const string options]) - Returns a relation consists of
<string|int|bigint feature, float weight> [-alpha <arg>] [-amsgrad]
[-beta <arg>] [-beta1 <arg>] [-beta2 <arg>] [-beta3 <arg>] [-c
<arg>] [-cv_rate <arg>] [-decay] [-dense] [-dims <arg>]
[-disable_cv] [-disable_halffloat] [-eps <arg>] [-eta <arg>] [-eta0
<arg>] [-inspect_opts] [-iter <arg>] [-iters <arg>] [-l1_ratio
<arg>] [-lambda <arg>] [-loss <arg>] [-mini_batch <arg>] [-mix
<arg>] [-mix_cancel] [-mix_session <arg>] [-mix_threshold <arg>]
[-opt <arg>] [-power_t <arg>] [-reg <arg>] [-rho <arg>] [-scale
<arg>] [-ssl] [-t <arg>]
-alpha <arg>                            Coefficient of learning rate
[default: 1.0
(adam/RMSPropGraves), 0.02
(AdamHD/Nesterov)]
-amsgrad                                Whether to use AMSGrad variant of
Adam
-beta <arg>                             Hyperparameter for tuning alpha
in Adam-HD [default: 1e-6f]
-beta1,--momentum <arg>                 Exponential decay rate of the
first order moment used in Adam
[default: 0.9]
-beta2 <arg>                            Exponential decay rate of the
second order moment used in Adam
[default: 0.999]
-beta3 <arg>                            Exponential decay rate of alpha
value  [default: 0.999]
-c <arg>                                Clipping constant of alpha used
in Eve optimizer so that clipped
[default: 10]
-cv_rate,--convergence_rate <arg>       Threshold to determine
convergence [default: 0.005]
-decay                                  Weight decay rate [default: 0.0]
-dense,--densemodel                     Use dense model or not
-dims,--feature_dimensions <arg>        The dimension of model [default:
16777216 (2^24)]
-disable_cv,--disable_cvtest            Whether to disable convergence
check [default: OFF]
-disable_halffloat                      Toggle this option to disable the
use of SpaceEfficientDenseModel
-eps <arg>                              Denominator value of
AdaDelta/AdaGrad/Adam [default:
1e-8 (AdaDelta/Adam), 1.0
(Adagrad)]
-eta <arg>                              Learning rate scheme [default:
inverse/inv, fixed, simple]
-eta0 <arg>                             The initial learning rate
[default: 0.1]
-inspect_opts                           Inspect Optimizer options
-iter,--iterations <arg>                The maximum number of iterations
[default: 10]
-iters,--iterations <arg>               The maximum number of iterations
[default: 10]
-l1_ratio <arg>                         Ratio of L1 regularizer as a part
of Elastic Net regularization
[default: 0.5]
-lambda <arg>                           Regularization term [default
0.0001]
-loss,--loss_function <arg>             Loss function [SquaredLoss
(default), QuantileLoss,
EpsilonInsensitiveLoss,
SquaredEpsilonInsensitiveLoss,
HuberLoss]
-mini_batch,--mini_batch_size <arg>     Mini batch size [default: 1].
Expecting the value in range
[1,100] or so.
-mix,--mix_servers <arg>                Comma separated list of MIX
servers
-mix_cancel,--enable_mix_canceling      Enable mix cancel requests
-mix_session,--mix_session_name <arg>   Mix session name [default:
\${mapred.job.id}]
-mix_threshold <arg>                    Threshold to mix local updates in
range (0,127] [default: 3]
-opt,--optimizer <arg>                  Optimizer to update weights
[default: adagrad, sgd, momentum,
nesterov, rmsprop, rmspropgraves,
adadelta, adam, eve, adam_hd]
-power_t <arg>                          The exponent for inverse scaling
learning rate [default: 0.1]
-reg,--regularization <arg>             Regularization type [default:
rda, l1, l2, elasticnet]
-rho,--decay <arg>                       Exponential decay rate of the
first and second order moments
[default 0.95 (AdaDelta,
rmsprop)]
-scale <arg>                            Scaling factor for cumulative
weights [100.0]
-ssl                                    Use SSL for the communication
with mix servers
-t,--total_steps <arg>                  a total of n_samples * epochs
time steps


In practice, you can try different combinations of the options in order to achieve higher prediction accuracy.
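One way to organize such experiments is to generate the option strings programmatically. The Python sketch below enumerates combinations of the -loss, -opt, and -reg options and embeds each one in a query string. The table name `training` and the column names `features` and `label` are hypothetical placeholders, and the helper names are not part of Hivemall.

```python
from itertools import product


def option_strings(losses, optimizers, regularizers):
    # Yield one option string per (loss, optimizer, regularizer) combination.
    for loss, opt, reg in product(losses, optimizers, regularizers):
        yield f"-loss {loss} -opt {opt} -reg {reg}"


def regressor_query(options, table="training"):
    # Hypothetical table and column names; adapt them to your own schema.
    return (
        f"SELECT train_regressor(features, label, '{options}') "
        f"AS (feature, weight) FROM {table}"
    )


candidates = list(option_strings(["squared", "huber"], ["adagrad", "adam"], ["l1", "l2"]))
queries = [regressor_query(options) for options in candidates]
```

Each generated query trains one model, so the resulting models can then be compared on a held-out data set.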

You can also inspect the default optimizer hyperparameters with the -inspect_opts option as follows:

select train_regressor(array(), 0, '-inspect_opts -optimizer adam -reg l1');

FAILED: UDFArgumentException Inspected Optimizer options ...
{disable_cvtest=false, regularization=L1, loss_function=SquaredLoss, eps=1.0E-8, decay=0.0, iterations=10, eta0=0.1, lambda=1.0E-4, eta=Invscaling, optimizer=adam, beta1=0.9, beta2=0.999, alpha=1.0, cv_rate=0.005, power_t=0.1}