My Blog: elasticsearch - How to perform "lowercase filter" along with "char

Sunday, 15 March 2015

elasticsearch - How to perform "lowercase filter" along with "char_filter"?

As far as I have read in the ES documentation:

"Character filters are used to “tidy up” a string before it is tokenized." "After tokenization, the resulting token stream is passed through any specified token filters"

(Source: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html)

From these 2 statements, I understand that the following steps are executed in this order:

char_filter; tokenization; filter.

Problem:

I may have a char_filter that replaces multiple letters at once.

Example: ph -> f.

However, "PH" won't be turned into "f", because "PH" is not part of the mapping.

So the analysis of "philipp" returns "filipp", whereas "PHILIPP" returns "philipp".
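To see why the uppercase input slips through, here is a minimal Python sketch that mimics the three stages in order. It is not Elasticsearch code: `analyze_original` and its helpers are hypothetical names, and the mapping char filter is approximated with a plain case-sensitive string replacement.

```python
# Rough simulation of the analyzer from the question (not Elasticsearch itself).
# Stage order: char_filter -> tokenizer -> token filter.

def mapping_char_filter(text, mappings):
    # Approximates a "mapping" char filter: case-sensitive literal replacement.
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text

def whitespace_tokenizer(text):
    # Split on runs of whitespace.
    return text.split()

def lowercase_filter(tokens):
    # Lowercase every token after tokenization.
    return [token.lower() for token in tokens]

def analyze_original(text):
    # The char filter runs first, so it only ever sees the raw input casing.
    filtered = mapping_char_filter(text, {"ph": "f"})
    return lowercase_filter(whitespace_tokenizer(filtered))

print(analyze_original("philipp"))  # ['filipp']
print(analyze_original("PHILIPP"))  # ['philipp']
```

By the time the lowercase filter runs, the mapping has already had its one chance to match.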

To make this work for both upper and lower case (to get the same result in both cases), the number of mappings in the char_filter becomes 2^(number of characters).

Example: ph -> f; Ph -> f; pH -> f; PH -> f.

It wouldn't be a problem if I had only these 4 mappings, but if I need more patterns, each with its own 2^x mappings, the char_filter tends to become a big mess.
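The growth in the number of mappings can be made concrete with a short Python sketch that enumerates every upper/lower combination of a pattern. The helper names `case_variants` and `build_mappings` are hypothetical, and the "sch" pattern is only an illustrative three-letter input, not something from the original index.

```python
from itertools import product

def case_variants(pattern):
    # Every upper/lower combination of the pattern's letters:
    # 2 ** len(pattern) variants when all characters are letters.
    choices = [(ch.lower(), ch.upper()) for ch in pattern]
    return sorted({"".join(combo) for combo in product(*choices)})

def build_mappings(pattern, replacement):
    # Strings in the "source=>target" form used by the mapping char filter.
    return [f"{variant}=>{replacement}" for variant in case_variants(pattern)]

print(build_mappings("ph", "f"))   # ['PH=>f', 'Ph=>f', 'pH=>f', 'ph=>f']
print(len(case_variants("sch")))   # 8
```

Each extra letter in a pattern doubles the number of entries, which is why the list gets unwieldy so quickly.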

Example of the index:

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase"
                        ],
                        "char_filter" : [
                            "misc_simplifications"
                        ]
                    }
                },
                "char_filter" : {
                    "misc_simplifications" : {
                        "type" : "mapping",
                        "mappings" : [
                            "ph=>f","Ph=>f","pH=>f","PH=>f"
                        ]
                    }
                }
            }
        }
    }
}

Philosophical question:

I understand that one may want to treat "ph" and "PH" equally, while "pH" may mean something totally different. Still, is there a way of turning the characters into lowercase before the char_filter phase? Does that make sense?

I ask because the big mapping gives me the feeling that I am doing something wrong, or that I could find an easier (more elegant) solution.

You're right about the sequence of steps:

CharFilter; Tokenizer; TokenFilter.

However, the main purpose of a CharFilter is to clean up the data to make tokenisation easier, for example by stripping out XML tags or replacing a delimiter with a space character.

So I'd set up misc_simplifications as a TokenFilter, to be applied after the lowercase filter.

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase",
                            "misc_simplifications"
                        ]
                    }
                },
                "filter" : {
                    "misc_simplifications" : {
                        "type" : "pattern_replace",
                        "pattern": "ph",
                        "replacement":"f"
                    }
                }
            }
        }
    }
}
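The effect of moving the replacement behind the lowercase filter can be sketched in a few lines of Python. This is only a simulation under simplifying assumptions (Python's `re` standing in for the Java regex engine, `str.split` standing in for the whitespace tokenizer), and `analyze_fixed` is a hypothetical name.

```python
import re

def analyze_fixed(text):
    # Stage order: tokenizer -> lowercase filter -> pattern replacement.
    tokens = text.split()                          # whitespace tokenizer
    tokens = [t.lower() for t in tokens]           # lowercase token filter
    return [re.sub("ph", "f", t) for t in tokens]  # replace every "ph" with "f"

print(analyze_fixed("philipp PHILIPP Philipp"))  # ['filipp', 'filipp', 'filipp']
```

Because every token is already lowercase when the replacement runs, a single "ph" rule covers all case variants.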

Note that I've used pattern_replace instead of mappings. You could modify the regexp to replace "ph" only at the start of a token.
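As a rough illustration of restricting the replacement to the start of a token, the Python sketch below anchors the pattern with `^`. It assumes the same anchor would be written as "pattern": "^ph" in the filter settings; `replace_leading_ph` is a hypothetical helper and Python's `re` is only a stand-in for the regex engine Elasticsearch uses.

```python
import re

def replace_leading_ph(token):
    # "^" anchors the match to the start of the (already lowercased) token.
    return re.sub(r"^ph", "f", token)

print(replace_leading_ph("philipp"))  # filipp
print(replace_leading_ph("alpha"))    # alpha
```

With the anchor in place, a "ph" in the middle of a token is left untouched.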

Also, your mappings look like phonetic replacements. I'm not sure of your requirements, but it looks like the phonetic token filter might help you.
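For orientation, here is a hedged sketch of what settings using a phonetic token filter might look like, built as a Python dict and printed as JSON. It assumes the phonetic analysis plugin is installed; the filter name `my_phonetic`, the choice of the "metaphone" encoder, and `"replace": False` (keep the original token next to the phonetic one) are illustrative assumptions, not part of the original answer.

```python
import json

# Hypothetical index settings; "my_phonetic" is an illustrative filter name.
settings = {
    "settings": {"index": {"analysis": {
        "analyzer": {"default_index": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "my_phonetic"],
        }},
        "filter": {"my_phonetic": {
            "type": "phonetic",      # provided by the phonetic analysis plugin
            "encoder": "metaphone",  # assumed encoder choice
            "replace": False,        # keep the original token as well
        }},
    }}}
}

print(json.dumps(settings, indent=2))
```

Whether a phonetic encoder fits depends on the requirements, since it normalises far more than the single "ph" to "f" rule.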
