Feature Selection is the process of selecting a subset of relevant features for use in model construction.

It is a useful technique to 1) improve prediction results by omitting redundant features, 2) shorten training time, and 3) identify the features that matter most for prediction.

Note: This feature is supported in Hivemall v0.5-rc.1 or later.

# Supported Feature Selection algorithms

• Chi-square (Chi2)
• In statistics, the $\chi^2$ test is applied to test the independence of two events. The chi-square statistic between each feature variable and the target variable can be used for Feature Selection. Refer to this article for mathematical details.
• Signal Noise Ratio (SNR)
• The Signal Noise Ratio (SNR) is a univariate feature ranking metric that can be used as a feature selection criterion for binary classification problems. SNR is defined as $|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$, where $\mu_{k}$ is the mean value of the variable in class $k$, and $\sigma_{k}$ is the standard deviation of the variable in class $k$. Features with a larger SNR separate the two classes better and are therefore more useful for classification.
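
The SNR definition above can be checked by hand on a toy example. The following is a minimal Python sketch of the formula only, not Hivemall's implementation. The helper name `snr_binary` and the data are hypothetical, and the use of the population standard deviation is an assumption (the text does not say which estimator is used).

```python
from statistics import mean, pstdev


def snr_binary(values_class1, values_class2):
    """SNR of a single feature: |mu1 - mu2| / (sigma1 + sigma2).

    Assumes population standard deviation; raises ZeroDivisionError
    if the feature is constant within both classes.
    """
    mu1, mu2 = mean(values_class1), mean(values_class2)
    sigma1, sigma2 = pstdev(values_class1), pstdev(values_class2)
    return abs(mu1 - mu2) / (sigma1 + sigma2)


# Hypothetical toy data: values of one feature, split by class.
separating = snr_binary([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])   # class means far apart
overlapping = snr_binary([1.0, 5.0, 9.0], [2.0, 5.0, 8.0])  # identical class means

print(separating, overlapping)
```

The feature whose class means are far apart relative to its spread gets the larger score, which is why ranking by SNR favors it.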

# Usage

## Feature Selection based on Chi-square test

CREATE TABLE input (
X array<double>, -- features
Y array<int> -- binarized label
);

set hivevar:k=2;

WITH stats AS (
SELECT
transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
array_sum(X) AS feature_count, -- n_features col vector, shape = (1, array<double>)
array_avg(Y) AS class_prob -- n_class col vector, shape = (1, array<double>)
FROM
input
),
test AS (
SELECT
transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features)
FROM
stats
),
chi2 AS (
SELECT
chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features)
FROM
test l
CROSS JOIN stats r
)
SELECT
select_k_best(l.X, r.v.chi2, ${k}) as features -- top-k feature selection based on chi2 score
FROM
input l
CROSS JOIN chi2 r;

## Feature Selection based on Signal Noise Ratio (SNR)

CREATE TABLE input (
X array<double>, -- features
Y array<int> -- binarized label
);

set hivevar:k=2;

WITH snr AS (
SELECT
snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, #features)
FROM
input
)
SELECT
select_k_best(X, snr, ${k}) as features
FROM
input
CROSS JOIN snr;


# Function signatures

### [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>

##### Input
| array<number> X | array<number> Y |
| :-: | :-: |
| a row of matrix | a row of matrix |
##### Output
| array<array<double>> dot product |
| :-: |
| dot(X.T, Y) of shape = (X.#cols, Y.#cols) |
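
To make the aggregation concrete, here is a small Python sketch of what `dot(X.T, Y)` accumulates when each input row contributes one row of each matrix. It only illustrates the arithmetic and is not the UDAF's actual code. The function and the toy data are hypothetical.

```python
def transpose_and_dot(rows_x, rows_y):
    """Accumulate dot(X.T, Y) row by row; result shape = (X.#cols, Y.#cols)."""
    n_cols_x, n_cols_y = len(rows_x[0]), len(rows_y[0])
    result = [[0.0] * n_cols_y for _ in range(n_cols_x)]
    for x, y in zip(rows_x, rows_y):      # one (x, y) pair per table row
        for i, xi in enumerate(x):
            for j, yj in enumerate(y):
                result[i][j] += xi * yj   # outer product of the row pair
    return result


# Hypothetical toy data: one-hot labels (2 classes) and 2 features per row.
labels = [[1, 0], [0, 1], [1, 0]]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Passing labels first gives per-class feature sums, shape (n_classes, n_features).
observed = transpose_and_dot(labels, features)
print(observed)
```

With one-hot labels as the first argument, row `c` of the result is the sum of the feature vectors belonging to class `c`, which matches how `observed` is built in the chi-square usage query.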

### [UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>

##### Input
| array<number> X | array<number> importance_list | int k |
| :-: | :-: | :-: |
| feature vector | importance of each feature | the number of features to be selected |
##### Output
| array<double> k-best features |
| :-: |
| top-k elements from feature vector X based on importance list |
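
The selection step can be sketched in a few lines of Python. This is an illustration of the idea rather than the UDF's implementation. In particular, returning the selected elements in their original column order is an assumption, since the description above does not specify the output order. The importance values are made up.

```python
def select_k_best(x, importance_list, k):
    """Keep the k elements of x whose importance scores are highest."""
    ranked = sorted(range(len(x)), key=lambda i: importance_list[i], reverse=True)
    top_indices = sorted(ranked[:k])  # assumption: preserve original column order
    return [float(x[i]) for i in top_indices]


# Hypothetical feature vector and importance scores (one score per feature).
row = [5.1, 3.5, 1.4, 0.2]
importance = [10.8, 3.6, 116.2, 67.0]

selected = select_k_best(row, importance, 2)
print(selected)
```

Because the importance list is the same for every row, applying the function to each row keeps the same columns throughout the table.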

### [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>

##### Input
| array<array<number>> observed | array<array<number>> expected |
| :-: | :-: |
| observed features | expected features dot(class_prob.T, feature_count) |

Both observed and expected have shape (#classes, #features).

##### Output
| struct<array<double>, array<double>> importance_list |
| :-: |
| chi2-value and p-value for each feature |
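
As a rough illustration of the chi2-value part of the output, the sketch below computes the standard Pearson statistic, the sum over classes of `(observed - expected)^2 / expected`, for each feature column. That the UDF uses exactly this formula is an assumption. The p-value is omitted because it requires the chi-square distribution function. All names and numbers are hypothetical.

```python
def chi2_scores(observed, expected):
    """Per-feature Pearson chi-square statistic.

    Both inputs have shape (#classes, #features); expected must be non-zero.
    """
    n_features = len(observed[0])
    scores = [0.0] * n_features
    for obs_row, exp_row in zip(observed, expected):  # one row per class
        for j in range(n_features):
            diff = obs_row[j] - exp_row[j]
            scores[j] += diff * diff / exp_row[j]
    return scores


# Hypothetical inputs: 2 balanced classes, 2 features that each sum to 10.
class_prob = [0.5, 0.5]
feature_count = [10.0, 10.0]
expected = [[p * c for c in feature_count] for p in class_prob]  # dot(class_prob.T, feature_count)
observed = [[9.0, 5.0], [1.0, 5.0]]  # feature 0 is concentrated in class 0

scores = chi2_scores(observed, expected)
print(scores)
```

A feature whose per-class totals match what the class proportions alone predict scores zero, while a feature concentrated in one class scores high and is ranked first by `select_k_best`.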

### [UDAF] snr(X::array<number>, Y::array<int>)::array<double>

##### Input
| array<number> X | array<int> Y |
| :-: | :-: |
| feature vector | one hot label |
##### Output
| array<double> importance_list |
| :-: |
| Signal Noise Ratio for each feature |