An analyzer of type pattern
that can flexibly separate text into terms via a regular expression. Accepts the following settings:
The following are settings that can be set for a pattern
analyzer type:
Setting | Description |
---|---|
lowercase |
Should terms be lowercased or not. Defaults to true . |
pattern |
The regular expression pattern, defaults to \W+ . |
flags |
The regular expression flags. |
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS"
. Check Java Pattern API for more details about flags
options.
Pattern Analyzer Examples
In order to try out these examples, you should delete the test
index before running each example:
curl -XDELETE localhost:9200/test
Whitespace tokenizer
curl -XPUT 'localhost:9200/test' -d ' { "settings":{ "analysis": { "analyzer": { "whitespace":{ "type": "pattern", "pattern":"\\s+" } } } } }' curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz' # "foo,bar", "baz"
Non-word character tokenizer
curl -XPUT 'localhost:9200/test' -d ' { "settings":{ "analysis": { "analyzer": { "nonword":{ "type": "pattern", "pattern":"[^\\w]+" } } } } }' curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz' # "foo,bar baz" becomes "foo", "bar", "baz" curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4' # "type_1","type_4"
CamelCase tokenizer
curl -XPUT 'localhost:9200/test?pretty=1' -d ' { "settings":{ "analysis": { "analyzer": { "camel":{ "type": "pattern", "pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])" } } } } }' curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d ' MooseX::FTPClass2_beta ' # "moose","x","ftp","class","2","beta"
The regex above is easier to understand as:
([^\\p{L}\\d]+) # swallow non letters and numbers, | (?<=\\D)(?=\\d) # or non-number followed by number, | (?<=\\d)(?=\\D) # or number followed by non-number, | (?<=[ \\p{L} && [^\\p{Lu}]]) # or lower case (?=\\p{Lu}) # followed by upper case, | (?<=\\p{Lu}) # or upper case (?=\\p{Lu} # followed by upper case [\\p{L}&&[^\\p{Lu}]] # then lower case )