Named word_delimiter, it splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
- split on intra-word delimiters (by default, all non-alphanumeric characters): “Wi-Fi” → “Wi”, “Fi”
- split on case transitions: “PowerShot” → “Power”, “Shot”
- split on letter-number transitions: “SD500” → “SD”, “500”
- leading and trailing intra-word delimiters on each subword are ignored: “//hello---there, ‘dude’” → “hello”, “there”, “dude”
- trailing “’s” are removed for each subword: “O’Neil’s” → “O”, “Neil”
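The rules above can be approximated in a few lines of Python. This is only an illustrative sketch, not the filter's actual implementation: it is limited to ASCII letters and digits, and the function name split_subwords is hypothetical.

```python
import re

def split_subwords(token: str) -> list[str]:
    """Approximate the default splitting rules for one token (ASCII-only sketch)."""
    # Drop a possessive 's (straight or curly apostrophe) that ends a subword,
    # i.e. one that is not followed by another letter or digit.
    token = re.sub(r"['’][sS](?![A-Za-z0-9])", "", token)
    # Collect runs of letters or digits. Anything else acts as a delimiter.
    # "[A-Z]+[a-z]*" keeps an uppercase run together with the lowercase letters
    # that follow it, so a new subword starts at each lower-to-upper transition
    # and at each letter/digit boundary.
    return re.findall(r"[A-Z]+[a-z]*|[a-z]+|[0-9]+", token)

for example in ["Wi-Fi", "PowerShot", "SD500", "O’Neil’s"]:
    print(example, "->", split_subwords(example))
```

Under these assumptions the sketch reproduces the examples in the list, such as “SD500” → “SD”, “500”.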
Parameters include:
generate_word_parts
: If true, causes parts of words to be generated: “PowerShot” => “Power” “Shot”. Defaults to true.
generate_number_parts
: If true, causes number subwords to be generated: “500-42” => “500” “42”. Defaults to true.
catenate_words
: If true, causes maximum runs of word parts to be catenated: “wi-fi” => “wifi”. Defaults to false.
catenate_numbers
: If true, causes maximum runs of number parts to be catenated: “500-42” => “50042”. Defaults to false.
catenate_all
: If true, causes all subword parts to be catenated: “wi-fi-4000” => “wifi4000”. Defaults to false.
split_on_case_change
: If true, causes “PowerShot” to be two tokens (“Power-Shot” remains two parts regardless). Defaults to true.
preserve_original
: If true, includes original words in subwords: “500-42” => “500” “42” “500-42”. Defaults to false.
split_on_numerics
: If true, causes “j2se” to be three tokens: “j” “2” “se”. Defaults to true.
stem_english_possessive
: If true, causes trailing “’s” to be removed for each subword: “O’Neil’s” => “O”, “Neil”. Defaults to true.
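As a rough illustration of how these parameters might be set, the sketch below builds an index settings body as a Python dictionary and prints it as JSON. The names my_word_delimiter and my_analyzer are hypothetical, and pairing the filter with the whitespace tokenizer is an assumption made for the example.

```python
import json

# Filter definition: only the parameters that differ from their defaults are listed.
word_delimiter_filter = {
    "type": "word_delimiter",
    "catenate_words": True,      # "wi-fi" also yields "wifi"
    "preserve_original": True,   # the original token is kept alongside its subwords
}

# Hypothetical index settings that register the filter and use it in a custom analyzer.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {"my_word_delimiter": word_delimiter_filter},
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",  # assumed choice for this example
                    "filter": ["my_word_delimiter"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The printed JSON could serve as the request body when creating an index. Parameters left out of the filter definition keep the defaults listed above.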
Advanced settings include:
protected_words
: A list of words protected from being delimited. Either an array, or set protected_words_path, which resolves to a file containing the protected words (one per line). The path automatically resolves to a config/-based location if it exists.
type_table
: A custom type mapping table, for example (when configured using type_table_path):

    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \u200D => ALPHANUM
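To make the mapping format concrete, here is a small Python sketch that reads such a table into a dictionary. It is not how the filter itself parses the file; the function name parse_type_table is hypothetical, and the sketch assumes one "character => TYPE" rule per line, "#" comment lines, and \uXXXX escapes as in the example.

```python
import re

def parse_type_table(text: str) -> dict[str, str]:
    """Parse 'char => TYPE' lines into a {character: type name} dictionary."""
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        left, arrow, right = line.partition("=>")
        if not arrow:
            raise ValueError(f"not a mapping line: {line!r}")
        # Decode \uXXXX escapes such as \u002C (comma) into the actual character.
        char = re.sub(
            r"\\u([0-9A-Fa-f]{4})",
            lambda m: chr(int(m.group(1), 16)),
            left.strip(),
        )
        mapping[char] = right.strip()
    return mapping

SAMPLE = r"""
# Map the $ and ',' characters to DIGIT
$ => DIGIT
\u002C => DIGIT
\u200D => ALPHANUM
"""

table = parse_type_table(SAMPLE)
print(table)
```

The sketch does not check the type names against the set the filter accepts. It only shows how each line pairs a character with a type.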