Tokenizer for English Texts

Hivemall provides a simple English text tokenizer UDF with the following syntax:

tokenize(text input, optional boolean toLowerCase = false)
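
As a minimal sketch of the syntax above (the input sentence is made up for illustration), the second argument controls whether the tokens are lowercased:

```sql
-- Keep the original casing (toLowerCase defaults to false)
select tokenize("Hello Hivemall World");

-- Lowercase every token
select tokenize("Hello Hivemall World", true);
```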

Tokenizer for Non-English Texts

The Hivemall-NLP module provides tokenizer UDFs for non-English texts, as described below.

To use the NLP module, first issue the following DDL statements. Note that the NLP module is not included in hivemall-with-dependencies.jar.

add jar /path/to/hivemall-nlp-xxx-with-dependencies.jar;

source /path/to/define-additional.hive;

Japanese Tokenizer

The Japanese text tokenizer UDF uses Kuromoji.

The signature of the UDF is as follows:

tokenize_ja(text input, optional const text mode = "normal", optional const array<string> stopWords, const array<string> stopTags, const array<string> userDict)


tokenize_ja is supported since Hivemall v0.4.1, and the fifth argument is supported since v0.5-rc.1.

Its basic usage is as follows:

select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
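
The Japanese sample sentence above states that the second argument accepts normal, search, or extended, and that normal is the default. A minimal sketch of selecting a different mode, using a shortened input, might look like this:

```sql
-- Tokenize in search mode instead of the default normal mode
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "search");
```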


In addition, the third and fourth arguments let you supply your own lists of stop words and stop tags, respectively. For example, the following query ignores "kuromoji" (as a stop word) and the noun "分かち書き" (matched by the stop tag "名詞-一般"):

select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), array("名詞-一般"));


Moreover, the fifth argument, userDict, lets you register a user-defined custom dictionary in Kuromoji's official format:

select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null,
                   array(
                     "日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞",
                     "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞"
                   ));


Note that you can pass null for the third and fourth arguments to explicitly use Kuromoji's default stop words and stop tags.

If you have a large custom dictionary stored as an external file, userDict can instead be a const string userDictURL that gives the URL of that file, hosted somewhere such as Amazon S3:

select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null,


For details of the API, please refer to the Javadoc of JapaneseAnalyzer as well.

Chinese Tokenizer

The Chinese text tokenizer UDF uses SmartChineseAnalyzer.

The signature of the UDF is as follows:

tokenize_cn(string line, optional const array<string> stopWords)

Its basic usage is as follows:

select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统,Java语言编写,修改的中科院计算所ICTCLAS分词系统。");

[smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]
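
The signature also accepts an optional stop word list as the second argument. The following sketch assumes it behaves like the stopWords argument of tokenize_ja; the stop word chosen here is only for illustration:

```sql
-- Assumption: tokens that match an entry in the array are dropped from the result
select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统", array("的"));
```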

For details of the API, please refer to the Javadoc of SmartChineseAnalyzer as well.

