# What is a "prediction problem"?

In the context of machine learning, numerous tasks can be seen as prediction problems, and this user guide provides solutions for several of them.

For any kind of prediction problem, we generally provide a set of input-output pairs (a minimal Hive representation is sketched after this list):

• Input: a set of features
  • e.g., ["1:0.001","4:0.23","35:0.0035",...]
• Output: a target value
  • e.g., 1, 0, 0.54, 42.195, ...
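
In Hivemall, such pairs are normally stored in a Hive table whose features column is an array of "index:value" strings. The following is a minimal sketch, assuming a hypothetical table named `training`:

```sql
-- Hypothetical table holding one training sample per row.
-- Feature vectors are sparse: each element is an "index:value" string.
CREATE TABLE training (
  features array<string>,  -- e.g., array("1:0.001", "4:0.23", "35:0.0035")
  target   double          -- e.g., 21.3 for regression, or 0/1 for binary classification
);
```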

Once a prediction model has been constructed from these samples, the model can make predictions for unseen inputs.

In order to train prediction models, an algorithm called stochastic gradient descent (SGD) is normally applied; external tutorials and textbooks cover SGD in more depth.

Importantly, depending on the type of output value, prediction problems can be categorized into regression and classification problems.

# Regression

The goal of regression is to predict real values as shown below:

| features (input) | target real value (output) |
|:---|:---|
| ["1:0.001","4:0.23","35:0.0035",...] | 21.3 |
| ["1:0.2","3:0.1","13:0.005",...] | 6.2 |
| ["5:1.3","22:0.089","77:0.0001",...] | 17.1 |
| ... | ... |

In practice, target values can be any real numbers: small or large, negative or positive, integer or floating point. Our CTR prediction tutorial, for example, solves a regression problem whose targets are small floating-point values in the 0-1 range.

While there are several ways to perform regression with Hivemall, train_regressor() is one of the most flexible functions; this page explains how to use it.
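
For illustration, a typical invocation might look like the sketch below. It assumes the hypothetical `training` table introduced above; the option string and the averaging of per-task feature weights are just one common pattern, not the only way to call the function.

```sql
-- Train a linear regression model with squared loss and L2 regularization,
-- then average the per-task feature weights into the final model.
SELECT
  feature,
  avg(weight) AS weight
FROM (
  SELECT
    train_regressor(features, target, '-loss squaredloss -reg l2') AS (feature, weight)
  FROM
    training
) t
GROUP BY feature;
```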

# Classification

In contrast to regression, the output of a classification problem should be an (integer) label:

| features (input) | label (output) |
|:---|:---|
| ["1:0.001","4:0.23","35:0.0035",...] | 0 |
| ["1:0.2","3:0.1","13:0.005",...] | 1 |
| ["5:1.3","22:0.089","77:0.0001",...] | 1 |
| ... | ... |

If the number of possible labels is 2 (0/1 or -1/1), the problem is binary classification, and Hivemall's train_classifier() function enables you to build binary classifiers. The Binary Classification tutorial demonstrates how to use the function.
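
Its usage is almost identical to train_regressor(); the sketch below assumes a hypothetical training table like the one above, but with an int label column holding 0/1 values:

```sql
-- Train a binary classifier with logistic loss and L2 regularization.
SELECT
  feature,
  avg(weight) AS weight
FROM (
  SELECT
    train_classifier(features, label, '-loss logloss -reg l2') AS (feature, weight)
  FROM
    training
) t
GROUP BY feature;
```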

Another type of classification problem is multi-class classification, which assumes that the number of possible labels is greater than 2. Multi-class problems require different functions; our news20 and iris tutorials would be helpful.

# Mathematical formulation of generic prediction model

Here, we briefly explain how a prediction model is constructed.

First and foremost, we represent input and output for prediction models as follows:

• Input: a vector $\mathbf{x}$
• Output: a value $y$

For a set of samples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_n, y_n)$, the goal of prediction algorithms is to find a weight vector (i.e., parameters) $\mathbf{w}$ by minimizing the following error:

$E(\mathbf{w}) := \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{w}; \mathbf{x}_i, y_i) + \lambda R(\mathbf{w})$

In the above formulation, there are two auxiliary functions to be aware of:

• $L(\mathbf{w}; \mathbf{x}_i, y_i)$
  • Loss function for a single sample $(\mathbf{x}_i, y_i)$ and given $\mathbf{w}$.
  • If this function produces small values, the parameter $\mathbf{w}$ has been learnt successfully.
• $R(\mathbf{w})$
  • Regularization function for the current parameter $\mathbf{w}$.
  • It prevents the model from falling into an undesirable condition known as over-fitting.

($\lambda$ is a small value that controls the effect of the regularization function.)

Eventually, minimizing the function $E(\mathbf{w})$ can be implemented by the SGD technique described above, and $\mathbf{w}$ itself is used as a "model" for future prediction.
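
Concretely, a plain SGD step picks a sample $(\mathbf{x}_i, y_i)$ and moves $\mathbf{w}$ in the direction that decreases the error:

$\mathbf{w} \leftarrow \mathbf{w} - \eta \left( \nabla_{\mathbf{w}} L(\mathbf{w}; \mathbf{x}_i, y_i) + \lambda \nabla_{\mathbf{w}} R(\mathbf{w}) \right)$

where $\eta > 0$ is the learning rate. The optimizers listed later in this page are variants of this update that differ mainly in how the step is scaled and accumulated.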

Interestingly, depending on the choice of loss and regularization functions, the resulting prediction model behaves differently: one combination may work as a classifier, while another is more appropriate for regression.

Below we list the possible options for train_regressor and train_classifier; this flexibility is the reason why these two functions are the most versatile in Hivemall (an example query follows the list):

• Loss function: -loss, -loss_function
  • For train_regressor
    • SquaredLoss (synonym: squared)
    • QuantileLoss (synonym: quantile)
    • EpsilonInsensitiveLoss (synonym: epsilon_insensitive)
    • SquaredEpsilonInsensitiveLoss (synonym: squared_epsilon_insensitive)
    • HuberLoss (synonym: huber)
  • For train_classifier
    • HingeLoss (synonym: hinge)
    • LogLoss (synonym: log, logistic)
    • SquaredHingeLoss (synonym: squared_hinge)
    • ModifiedHuberLoss (synonym: modified_huber)
    • The following losses are mainly designed for regression but can sometimes be useful for classification as well:
      • SquaredLoss (synonym: squared)
      • QuantileLoss (synonym: quantile)
      • EpsilonInsensitiveLoss (synonym: epsilon_insensitive)
      • SquaredEpsilonInsensitiveLoss (synonym: squared_epsilon_insensitive)
      • HuberLoss (synonym: huber)
• Regularization function: -reg, -regularization
  • L1
  • L2
  • ElasticNet
  • RDA
Additionally, there are several variants of the SGD technique, and the optimizer is also configurable (an example query follows this list):

• Optimizer: -opt, -optimizer
  • SGD
  • Momentum
    • Hyperparameters
      • -alpha 1.0 Learning rate.
      • -momentum 0.9 Exponential decay rate of the first order moment.
  • Nesterov
  • RMSprop
  • RMSpropGraves
    • Description: Alex Graves's RMSprop introducing weight decay and momentum.
    • See: https://arxiv.org/abs/1308.0850
    • Hyperparameters
      • -alpha 1.0 Learning rate.
      • -decay 0.95 Weight decay rate.
      • -momentum 0.9 Exponential decay rate of the first order moment.
      • -eps 1.0 Constant for numerical stability.
  • AdaDelta
    • See: https://arxiv.org/abs/1212.5701
    • Hyperparameters
      • -decay 0.95 Weight decay rate.
      • -eps 1e-6f Constant for numerical stability.
  • Adam
    • Hyperparameters
      • -alpha 1.0 Learning rate.
      • -beta1 0.9 Exponential decay rate of the first order moment.
      • -beta2 0.999 Exponential decay rate of the second order moment.
      • -eps 1e-8f Constant for numerical stability.
      • -decay 0.0 Weight decay rate.
  • Nadam
    • Description: Nadam is the Adam optimizer with Nesterov momentum.
    • Hyperparameters
      • Same as Adam, except:
      • -scheduleDecay 0.004 Scheduled decay rate (applied every 250 steps by default; 1/250 = 0.004).
  • Eve
    • See: https://openreview.net/forum?id=r1WUqIceg
    • Hyperparameters
      • Same as Adam, except:
      • -beta3 0.999 Decay rate for the Eve coefficient.
      • -c 10 Constant used for gradient clipping: clip(val, 1/c, c).
  • AdamHD
    • Description: Adam optimizer with Hypergradient Descent; the learning rate -alpha is tuned automatically.
    • Hyperparameters
      • Same as Adam, except:
      • -alpha 0.02 Learning rate.
      • -beta -1e-6 Constant used for tuning the learning rate.
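
For example, switching the optimizer to Adam with a custom learning rate could look like the following sketch (the -alpha value is illustrative, not a recommendation):

```sql
-- Binary classification with the Adam optimizer and a manually chosen learning rate.
SELECT
  train_classifier(features, label, '-loss logloss -reg l2 -opt adam -alpha 0.01') AS (feature, weight)
FROM
  training;
```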

### Note

Option values are case insensitive; for example, you can write sgd, rda, or huberloss in lower-case letters.

Furthermore, the optimizer accepts auxiliary options such as the following (an example appears after the list):

• Number of iterations: -iter, -iterations [default: 10]
  • Repeats the optimizer's learning procedure over the training data multiple times to find a better result.
• Convergence rate: -cv_rate, -convergence_rate [default: 0.005]
  • Defines a stopping criterion for the iterative training.
  • If the criterion is too small or too large, you may encounter over-fitting or under-fitting, depending on the value of the -iter option.
• Mini-batch size: -mini_batch, -mini_batch_size [default: 1]
  • Instead of learning from samples one by one, this option lets the optimizer use multiple samples at once to minimize the error function.
  • An appropriate mini-batch size leads to efficient training and an effective prediction model.
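
These auxiliary options can be combined with the loss, regularization, and optimizer options in the same option string, for example (the values are illustrative):

```sql
-- Iterate 20 times over the data with mini-batches of 10 samples,
-- stopping early once the convergence criterion is met.
SELECT
  train_regressor(
    features, target,
    '-loss squaredloss -reg l2 -opt sgd -iter 20 -cv_rate 0.005 -mini_batch 10'
  ) AS (feature, weight)
FROM
  training;
```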

For details of all available options, the following queries list them:

```sql
select train_regressor('-help');
-- v0.5.0 or before
-- select train_regressor(array(), 0, '-help');

select train_classifier('-help');
-- v0.5.0 or before
-- select train_classifier(array(), 0, '-help');
```


In practice, you can try different combinations of the options in order to achieve higher prediction accuracy.