The ICU analysis plugin allows for unicode normalization, collation and folding. The plugin is called analysis-icu
and can be installed by running:
bin/plugin install analysis-icu
The plugin includes the following analysis components:
ICU Normalization
Normalizes characters as explained here. It registers itself by default under icu_normalizer
or icuNormalizer
using the default settings. Allows for the name parameter to be provided which can include the following values: nfc
, nfkc
, and nfkc_cf
. Here is a sample settings:
{ "index" : { "analysis" : { "analyzer" : { "collation" : { "tokenizer" : "keyword", "filter" : ["icu_normalizer"] } } } } }
ICU Folding
Folding of unicode characters based on UTR#30
. It registers itself under icu_folding
and icuFolding
names. Sample setting:
{ "index" : { "analysis" : { "analyzer" : { "collation" : { "tokenizer" : "keyword", "filter" : ["icu_folding"] } } } } }
ICU Collation
Uses collation token filter. Allows to either specify the rules for collation (defined here) using the rules
parameter (can point to a location or expressed in the settings, location can be relative to config location), or using the language
parameter (further specialized by country and variant). By default registers under icu_collation
or icuCollation
and uses the default locale.
Here is a sample settings:
{ "index" : { "analysis" : { "analyzer" : { "collation" : { "tokenizer" : "keyword", "filter" : ["icu_collation"] } } } } }
And here is a sample of custom collation:
{ "index" : { "analysis" : { "analyzer" : { "collation" : { "tokenizer" : "keyword", "filter" : ["myCollator"] } }, "filter" : { "myCollator" : { "type" : "icu_collation", "language" : "en" } } } } }