Sunday, 15 March 2015

elasticsearch - How to perform "lowercase filter" along with "char_filter"?




As far as I have read in the ES documentation:

"Character filters are used to “tidy up” a string before it is tokenized." "After tokenization, the resulting token stream is passed through any specified token filters."

(Source: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html)

From these two statements, I understand that the following steps are executed in this order:

char_filter; tokenization; filter.

Problem:

I may have a char_filter that replaces multiple letters at once.

Example: ph -> f.

However, "Ph" won't be turned into "f", because "Ph" is not part of the mapping.

So the analysis of "philipp" retrieves "filipp", whereas "Philipp" retrieves "Philipp".
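The effect can be reproduced outside Elasticsearch with a small Python simulation of the three stages. This is only a sketch: the function names are hypothetical, and `str.replace`, `str.split` and `str.lower` merely stand in for the mapping char filter, the whitespace tokenizer and the lowercase token filter.

```python
def char_filter(text, mappings):
    # Stand-in for a "mapping" char filter: case-sensitive literal replacement.
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def tokenize(text):
    # Stand-in for the whitespace tokenizer.
    return text.split()


def token_filter(tokens):
    # Stand-in for the lowercase token filter.
    return [token.lower() for token in tokens]


def analyze(text, mappings):
    # Same order as in the analyzer: char_filter, tokenizer, token filter.
    return token_filter(tokenize(char_filter(text, mappings)))


mappings = {"ph": "f"}
print(analyze("philipp", mappings))  # ['filipp']
print(analyze("Philipp", mappings))  # ['philipp']
```

The capitalised input reaches the lowercase step with its "Ph" untouched, so by the time it is lowercased the replacement stage has already passed.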

To make this work for both upper and lowercase (so that the result is the same in both cases), the number of mappings in the char_filter becomes (number of characters)².

Example: ph -> f; Ph -> f; pH -> f; PH -> f.

I wouldn't have a problem if I had only 4 mappings, but if I need more, at x² mappings each, the char_filter tends to become a big mess.
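If the case variants really had to live in the char_filter, the list could at least be generated instead of typed by hand. The helper below is a hypothetical sketch (its name and signature are not part of Elasticsearch); it only builds the `source=>target` strings that a mapping char filter expects.

```python
from itertools import product


def case_variant_mappings(source, target):
    # For every character, offer its lowercase and uppercase form,
    # then combine the choices to enumerate each case variant of `source`.
    choices = [(ch.lower(), ch.upper()) for ch in source]
    return [f"{''.join(combo)}=>{target}" for combo in product(*choices)]


print(case_variant_mappings("ph", "f"))
# ['ph=>f', 'pH=>f', 'Ph=>f', 'PH=>f']
```

This keeps the settings file manageable, but it does not change the underlying issue that every variant has to be listed explicitly.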

Example of an index:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default_index": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [ "lowercase" ],
            "char_filter": [ "misc_simplifications" ]
          }
        },
        "char_filter": {
          "misc_simplifications": {
            "type": "mapping",
            "mappings": [ "ph=>f", "Ph=>f", "pH=>f", "PH=>f" ]
          }
        }
      }
    }
  }
}

Philosophical question:

I understand that someone may want to treat "ph" and "Ph" equally, while "pH" could mean something totally different. Is there a way of turning characters to lowercase before the char_filter phase? Would that make sense?

I ask because this big mapping gives me the feeling that I am doing something wrong, or that there is an easier (more elegant) solution.

You're right about the sequence of steps:

CharFilter, Tokenizer, TokenFilter

However, the main purpose of a CharFilter is to clean up the data to make tokenisation easier, for example by stripping out XML tags or replacing a delimiter with a space character.

So I'd set up misc_simplifications as a TokenFilter to be applied after the lowercase filter.

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default_index": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [ "lowercase", "misc_simplifications" ]
          }
        },
        "filter": {
          "misc_simplifications": {
            "type": "pattern_replace",
            "pattern": "ph",
            "replacement": "f"
          }
        }
      }
    }
  }
}

Note that I've used pattern_replace instead of mappings. You could modify the regexp to replace "ph" only at the start of a token.
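To see why the order matters, here is a rough Python simulation of that analyzer: whitespace tokenization, then lowercasing, then a regular-expression replacement on each token. It is a sketch under assumptions, not Elasticsearch itself: the function name is hypothetical, and Python's `re.sub` stands in for the pattern_replace token filter (which uses Java regular expressions). The `^ph` pattern illustrates the "start of a token" variant mentioned above.

```python
import re


def analyze_with_token_filter(text, pattern="ph", replacement="f"):
    tokens = text.split()                         # whitespace tokenizer stand-in
    tokens = [token.lower() for token in tokens]  # lowercase filter stand-in
    # pattern_replace stand-in: applied per token, after lowercasing
    return [re.sub(pattern, replacement, token) for token in tokens]


print(analyze_with_token_filter("Philipp PHILIPP philipp"))
# ['filipp', 'filipp', 'filipp']
print(analyze_with_token_filter("Philipp Stephen"))
# ['filipp', 'stefen']
print(analyze_with_token_filter("Philipp Stephen", pattern="^ph"))
# ['filipp', 'stephen']
```

Because the replacement now runs on already lowercased tokens, a single pattern covers every case variant, and anchoring it with `^` leaves a "ph" in the middle of a token alone.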

Also, your mappings look like phonetic replacements. I'm not sure of your requirements, but it looks like the phonetic token filter could perhaps help you.
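As a starting point, index settings using the phonetic token filter might look roughly like the following, written here as a Python dictionary that is serialised to JSON. This is a sketch under assumptions: the names "phonetic_index" and "my_phonetic" are placeholders, the choice of the metaphone encoder is arbitrary, and the filter is only available when the phonetic analysis plugin is installed.

```python
import json

# Placeholder names: "phonetic_index" and "my_phonetic" are illustrative only.
settings = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "phonetic_index": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "my_phonetic"],
                    }
                },
                "filter": {
                    "my_phonetic": {
                        "type": "phonetic",      # requires the phonetic analysis plugin
                        "encoder": "metaphone",  # assumed choice; other encoders exist
                        "replace": False,        # keep the original token as well
                    }
                },
            }
        }
    }
}

# Body that could be sent when creating the index.
print(json.dumps(settings, indent=2))
```

Whether a phonetic encoder fits depends on the requirements: it matches by sound in general rather than applying only the specific replacements listed in the question.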

