elasticsearch. guide

Word Delimiter Token Filter

Named word_delimiter, it Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:

  • split on intra-word delimiters (by default, all non alpha-numeric characters).
  • “Wi-Fi” → “Wi”, “Fi”
  • split on case transitions: “PowerShot” → “Power”, “Shot”
  • split on letter-number transitions: “SD500” → “SD”, “500”
  • leading and trailing intra-word delimiters on each subword are ignored: “//hello—-there, ‘dude’” → “hello”, “there”, “dude”
  • trailing “’s” are removed for each subword: “O’Neil’s” → “O”, “Neil”

Parameters include:

  • generate_word_parts: If true causes parts of words to be generated: “PowerShot” => “Power” “Shot”. Defaults to true.
  • generate_number_parts: If true causes number subwords to be generated: “500-42” => “500” “42”. Defaults to true.
  • catenate_words: If true causes maximum runs of word parts to be catenated: “wi-fi” => “wifi”. Defaults to false.
  • catenate_numbers: If true causes maximum runs of number parts to be catenated: “500-42” => “50042”. Defaults to false.
  • catenate_all: If true causes all subword parts to be catenated: “wi-fi-4000” => “wifi4000”. Defaults to false.
  • split_on_case_change: If true causes “PowerShot” to be two tokens; (“Power-Shot” remains two parts regards). Defaults to true.
  • preserve_original: If true includes original words in subwords: “500-42” => “500” “42” “500-42”. Defaults to false.
  • split_on_numerics: If true causes “j2se” to be three tokens; “j” “2” “se”. Defaults to true.
  • stem_english_possessive: If true causes trailing “’s” to be removed for each subword: “O’Neil’s” => “O”, “Neil”. Defaults to true.

Advance settings include:

protected_words: A list of protected words from being delimiter. Either an array, or also can set protected_words_path which resolved to a file configured with protected words (one on each line). Automatically resolves to config/ based location if exists.

type_table: A custom type mapping table, for example (when configured using type_table_path):

    # Map the $, %, '.', and ',' characters to DIGIT 
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \u200D => ALPHANUM
 
Fork me on GitHub