This page describes a list of useful Hivemall generic functions.

Array functions

Array UDFs

  • array_concat(array<ANY> x1, array<ANY> x2, ..) - Returns a concatenated array

      select array_concat(array(1),array(2,3));
      > [1,2,3]
    
  • array_intersect(array<ANY> x1, array<ANY> x2, ..) - Returns an intersect of given arrays

      select array_intersect(array(1,3,4),array(2,3,4),array(3,5));
      > [3]
    
  • array_remove(array<int|text> original, int|text|array<int> target) - Returns an array that the target is removed from the original array

      select array_remove(array(1,null,3),array(null));
      > [3]
    
      select array_remove(array("aaa","bbb"),"bbb");
      > ["aaa"]
    
  • sort_and_uniq_array(array<int>) - Takes an array of type INT and returns a sorted array in a natural order with duplicate elements eliminated

      select sort_and_uniq_array(array(3,1,1,-2,10));
      > [-2,1,3,10]
    
  • subarray_endwith(array<int|text> original, int|text key) - Returns an array that ends with the specified key

      select subarray_endwith(array(1,2,3,4), 3);
      > [1,2,3]
    
  • subarray_startwith(array<int|text> original, int|text key) - Returns an array that starts with the specified key

      select subarray_startwith(array(1,2,3,4), 2);
      > [2,3,4]
    
  • subarray(array<int> orignal, int fromIndex, int toIndex) - Returns a slice of the original array between the inclusive fromIndex and the exclusive toIndex

      select subarray(array(1,2,3,4,5,6), 2,4);
      > [3,4]
    

Array UDAFs

  • array_avg(array<NUMBER>) - Returns an array in which each element is the mean of a set of numbers

  • array_sum(array<NUMBER>) - Returns an array in which each element is summed up

List UDAF

  • to_ordered_list(PRIMITIVE value [, PRIMITIVE key, const string options]) or to_ordered_list(value, key [, const string options]) - Return list of values sorted by value itself or specific key

      with t as (
          select 5 as key, 'apple' as value
          union all
          select 3 as key, 'banana' as value
          union all
          select 4 as key, 'candy' as value
          union all
          select 2 as key, 'donut' as value
          union all
          select 3 as key, 'egg' as value
      )
      select                                             -- expected output
          to_ordered_list(value, key, '-reverse'),       -- [apple, candy, (banana, egg | egg, banana), donut] (reverse order)
          to_ordered_list(value, key, '-k 2'),           -- [apple, candy] (top-k)
          to_ordered_list(value, key, '-k 100'),         -- [apple, candy, (banana, egg | egg, banana), dunut]
          to_ordered_list(value, key, '-k 2 -reverse'),  -- [donut, (banana | egg)] (reverse top-k = tail-k)
          to_ordered_list(value, key),                   -- [donut, (banana, egg | egg, banana), candy, apple] (natural order)
          to_ordered_list(value, key, '-k -2'),          -- [donut, (banana | egg)] (tail-k)
          to_ordered_list(value, key, '-k -100'),        -- [donut, (banana, egg | egg, banana), candy, apple]
          to_ordered_list(value, key, '-k -2 -reverse'), -- [apple, candy] (reverse tail-k = top-k)
          to_ordered_list(value, '-k 2'),                -- [egg, donut] (alphabetically)    
          to_ordered_list(key, '-k -2 -reverse'),        -- [5, 4] (top-2 keys)
          to_ordered_list(key)                           -- [2, 3, 3, 4, 5] (natural ordered keys)
      from 
          t
      ;
    

Bitset functions

Bitset UDF

  • to_bits(int[] indexes) - Returns an bitset representation if the given indexes in long[]

      select to_bits(array(1,2,3,128));
      >[14,-9223372036854775808]
    
  • unbits(long[] bitset) - Returns an long array of the give bitset representation

      select unbits(to_bits(array(1,4,2,3)));
      > [1,2,3,4]
    
  • bits_or(array<long> b1, array<long> b2, ..) - Returns a logical OR given bitsets

      select unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3))));
      > [1,2,3,4]
    

Bitset UDAF

  • bits_collect(int|long x) - Returns a bitset in array

Compression functions

  • deflate(TEXT data [, const int compressionLevel]) - Returns a compressed BINARY object by using Deflater. The compression level must be in range [-1,9]

      select base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
      > AA+=kaIM|WTt!+wbGAA
    
  • inflate(BINARY compressedData) - Returns a decompressed STRING by using Inflater

      select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
      > aaaaaaaaaaaaaaaabbbbccc
    

Map functions

Map UDFs

  • map_get_sum(map<int,float> src, array<int> keys) - Returns sum of values that are retrieved by keys

  • map_tail_n(map SRC, int N) - Returns the last N elements from a sorted array of SRC

MAP UDAFs

  • to_map(key, value) - Convert two aggregated columns into a key-value map

  • to_ordered_map(key, value [, const int k|const boolean reverseOrder=false]) - Convert two aggregated columns into an ordered key-value map

      with t as (
          select 10 as key, 'apple' as value
          union all
          select 3 as key, 'banana' as value
          union all
          select 4 as key, 'candy' as value
      )
      select
          to_ordered_map(key, value, true),   -- {10:"apple",4:"candy",3:"banana"} (reverse)
          to_ordered_map(key, value, 1),      -- {10:"apple"} (top-1)
          to_ordered_map(key, value, 2),      -- {10:"apple",4:"candy"} (top-2)
          to_ordered_map(key, value, 3),      -- {10:"apple",4:"candy",3:"banana"} (top-3)
          to_ordered_map(key, value, 100),    -- {10:"apple",4:"candy",3:"banana"} (top-100)
          to_ordered_map(key, value),         -- {3:"banana",4:"candy",10:"apple"} (natural)
          to_ordered_map(key, value, -1),     -- {3:"banana"} (tail-1)
          to_ordered_map(key, value, -2),     -- {3:"banana",4:"candy"} (tail-2)
          to_ordered_map(key, value, -3),     -- {3:"banana",4:"candy",10:"apple"} (tail-3)
          to_ordered_map(key, value, -100)    -- {3:"banana",4:"candy",10:"apple"} (tail-100)
      from t
      ;
    

MapReduce functions

  • rowid() - Returns a generated row id of a form {TASK_ID}-{SEQUENCE_NUMBER}

  • taskid() - Returns the value of mapred.task.partition

Math functions

  • sigmoid(x) - Returns 1.0 / (1.0 + exp(-x))

Text processing functions

  • base91(binary) - Convert the argument from binary to a BASE91 string

      select base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
      > AA+=kaIM|WTt!+wbGAA
    
  • unbase91(string) - Convert a BASE91 string to a binary

      select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
      > aaaaaaaaaaaaaaaabbbbccc
    
  • normalize_unicode(string str [, string form]) - Transforms str with the specified normalization form. The form takes one of NFC (default), NFD, NFKC, or NFKD

      select normalize_unicode('ハンカクカナ','NFKC');
      > ハンカクカナ
    
      select normalize_unicode('㈱㌧㌦Ⅲ','NFKC');
      > (株)トンドルIII
    
  • split_words(string query [, string regex]) - Returns an array containing splitted strings

  • is_stopword(string word) - Returns whether English stopword or not

  • singularize(string word) - Returns singular form of a given English word

      select singularize(lower("Apples"));
    
      > "apple"
    
  • tokenize(string englishText [, boolean toLowerCase]) - Returns words in array

  • tokenize_ja(String line [, const string mode = "normal", const list<string> stopWords, const list<string> stopTags]) - returns tokenized strings in array. Refer this article for detail.

      select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
    
      > ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"]
    
  • word_ngrams(array<string> words, int minSize, int maxSize) - Returns list of n-grams where minSize <= n <= maxSize

      select word_ngrams(tokenize('Machine learning is fun!', true), 1, 2);
    
      > ["machine","machine learning","learning","learning is","is","is fun","fun"]
    

Other functions

  • convert_label(const int|const float) - Convert from -1|1 to 0.0f|1.0f, or from 0.0f|1.0f to -1|1

  • each_top_k(int K, Object group, double cmpKey, *) - Returns top-K values (or tail-K values when k is less than 0). Refer this article for detail.

  • generate_series(const int|bigint start, const int|bigint end) - Generate a series of values, from start to end

      select generate_series(1,9);
    
      1
      2
      3
      4
      5
      6
      7
      8
      9
    

    A similar function to PostgreSQL's generate_serics. http://www.postgresql.org/docs/current/static/functions-srf.html

  • x_rank(KEY) - Generates a pseudo sequence number starting from 1 for each key

results matching ""

    No results matching ""