Tokenizer for English Texts
Hivemall provides a simple English text tokenizer UDF with the following syntax:
tokenize(text input, optional boolean toLowerCase = false)
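As a minimal sketch of how the signature above might be called, assuming the `tokenize` function has already been registered in the current Hive session (the sample sentence is made up for illustration):

```sql
-- Tokenize an English sentence; toLowerCase defaults to false,
-- so the tokens keep their original case.
select tokenize('Hello Hivemall World');

-- Pass true as the second argument to lowercase every token.
select tokenize('Hello Hivemall World', true);
```

Lowercasing is usually worth enabling when the tokens are later used as features, so that "Hello" and "hello" are not treated as different words.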
Tokenizer for Japanese Texts
The Hivemall-NLP module provides a Japanese text tokenizer UDF based on Kuromoji.
To use the NLP module, first issue the following DDLs. Note that the NLP module is not included in hivemall-with-dependencies.jar.
add jar /tmp/hivemall-nlp-xxx-with-dependencies.jar;
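After the jar is added, the UDF still has to be registered under a function name. The statement below is a sketch of that step; the class name `hivemall.nlp.tokenizer.KuromojiUDF` is an assumption about where the UDF lives in the NLP module and should be checked against the jar you are using.

```sql
-- Register the Japanese tokenizer for the current session.
-- Assumption: the UDF class is hivemall.nlp.tokenizer.KuromojiUDF.
create temporary function tokenize_ja as 'hivemall.nlp.tokenizer.KuromojiUDF';
```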
The signature of the UDF is as follows:
tokenize_ja(text input, optional const text mode = "normal", optional const array<string> stopWords, optional const array<string> stopTags)
tokenize_ja is supported in Hivemall v0.4.1 and later.
Its basic usage is as follows:
For details of the API, please also refer to the Javadoc of JapaneseAnalyzer.