elasticsearch - How to perform "lowercase filter" along with "char_filter"?
As far as I have read in the ES documentation:
"Character filters are used to “tidy up” a string before it is tokenized." "After tokenization, the resulting token stream is passed through any specified token filters" (source: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html)
From these two statements, I understand that the following steps are executed, in order:
char_filter; tokenization; filter.
Problem:
I may have a char_filter that replaces several letters at once.
Example: ph -> f.
However, "Ph" won't be turned into "f", because "Ph" is not part of the mapping.
So the analysis of "philipp" yields "filipp", whereas "Philipp" yields "philipp".
To make this work for both upper and lower case (to get the same result in both cases), the number of mappings in the char_filter becomes (number of characters)².
Example: ph -> f; Ph -> f; pH -> f; PH -> f.
It wouldn't be a problem if I only had 4 mappings, but if I need x² mappings for every replacement, the char_filter tends to become a big mess.
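To make the problem concrete, here is a rough Python simulation of the three stages. It is only a sketch: the function names `char_filter` and `analyze` are made up for illustration, and the plain substring replacement is a simplification of what the real mapping char filter does.

```python
def char_filter(text, mappings):
    """Simplified stand-in for a mapping char filter:
    case-sensitive substring replacement on the raw input text."""
    for src, dst in mappings.items():
        text = text.replace(src, dst)
    return text


def analyze(text, mappings):
    """Run the three stages in the order described above."""
    filtered = char_filter(text, mappings)  # 1. char_filter
    tokens = filtered.split()               # 2. whitespace tokenizer
    return [t.lower() for t in tokens]      # 3. lowercase token filter


mappings = {"ph": "f"}
print(analyze("philipp", mappings))  # ['filipp']
print(analyze("Philipp", mappings))  # ['philipp'] - "Ph" never matched "ph"
```

Because lowercasing only happens in stage 3, the char_filter in stage 1 still sees the original "Ph" and skips it.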
Example of index:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default_index": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [ "lowercase" ],
            "char_filter": [ "misc_simplifications" ]
          }
        },
        "char_filter": {
          "misc_simplifications": {
            "type": "mapping",
            "mappings": [ "ph=>f", "Ph=>f", "pH=>f", "PH=>f" ]
          }
        }
      }
    }
  }
}
Philosophical question:
I understand that one may want to treat "ph" and "PH" equally, while "pH" may mean something totally different. Is there a way of turning characters to lowercase before the char_filter phase? Does that make sense?
I ask because the big mapping gives me the feeling that I am doing something wrong, or that there is an easier (more elegant) solution.
You're right about the sequence of steps:
CharFilter, Tokenizer, TokenFilter
However, the main purpose of a CharFilter is to clean up the data to make tokenisation easier, for example by stripping out XML tags or replacing a delimiter with a space character.
So, I would set up misc_simplifications as a TokenFilter, to be applied after the lowercase filter.
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default_index": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [ "lowercase", "misc_simplifications" ]
          }
        },
        "filter": {
          "misc_simplifications": {
            "type": "pattern_replace",
            "pattern": "ph",
            "replacement": "f"
          }
        }
      }
    }
  }
}
Note that I've used pattern_replace instead of mappings. You could modify the regexp to only replace "ph" at the start of a token.
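As a rough illustration of why the order matters, here is a small Python simulation of that analyzer. It is a sketch under assumptions: `analyze` is a made-up helper, and Python's `re` module stands in for the Java regular expressions that Elasticsearch actually uses (for a pattern as simple as this the two behave alike).

```python
import re


def analyze(text, pattern="ph", replacement="f"):
    """Simulate: whitespace tokenizer -> lowercase -> pattern replace."""
    tokens = text.split()                 # whitespace tokenizer
    tokens = [t.lower() for t in tokens]  # lowercase token filter
    # Replacement token filter: runs on already-lowercased tokens,
    # so one lowercase pattern covers every case variant.
    return [re.sub(pattern, replacement, t) for t in tokens]


print(analyze("Philipp PHILIPP philipp"))      # ['filipp', 'filipp', 'filipp']
print(analyze("Philipp Stephan"))              # ['filipp', 'stefan']
# Anchoring the pattern restricts the replacement to the start of a token.
print(analyze("Philipp Stephan", pattern="^ph"))  # ['filipp', 'stephan']
```

A single lowercase pattern is enough here because every token has already been lowercased by the time the replacement runs.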
Also, your mappings look like phonetic replacements. I'm not sure of your requirements, but it looks like a phonetic token filter might help you.