tokenize - Google Sites API full-text search does not work for non-Western languages
In a Java EE application, I'm using the Atom-based Google Sites API to retrieve content from a non-public Google Site. In essence, we're using Google Sites as a lightweight CMS, and from within the application we use the API to retrieve the site's contents and feed our online help system. I've had this setup for a while, and it's working without a hitch.
The issue
In the application, I now need to add full-text search functionality to the online help system. I knew this feature request would come along at some point, so when deciding on Google Sites to host the content, I checked whether the Sites API supports full-text search. It does. For example, the following URL searches the entire site my-site for pages containing the keyword user:
https://sites.google.com/feeds/content/my.doma.in/my-site?q=user
This works and gives me the expected result pages, but only for content written in Western languages or, more specifically, languages in which tokens/words are separated by whitespace and punctuation. When I run a similar search on Japanese content, searching for the keyword ユーザー:
https://sites.google.com/feeds/content/my.doma.in/my-site?q=%e3%83%a6%e3%83%bc%e3%82%b6%e3%83%bc
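For reference, a query URL like the two above could be assembled as in the following minimal sketch. The class and method names (SitesSearchUrl, build) are hypothetical and not part of any Google client library; only the feed base and the q parameter come from the URLs in this question. The keyword is written with Unicode escapes so the source file stays ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the Google Sites API client library.
class SitesSearchUrl {
    // Feed base as it appears in the example URLs of the question.
    static final String FEED_BASE = "https://sites.google.com/feeds/content/";

    // Builds a content-feed search URL with a UTF-8 percent-encoded keyword.
    static String build(String domain, String site, String keyword) {
        try {
            return FEED_BASE + domain + "/" + site
                    + "?q=" + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u30E6\u30FC\u30B6\u30FC" is the Japanese keyword used in the question.
        System.out.println(build("my.doma.in", "my-site", "\u30E6\u30FC\u30B6\u30FC"));
    }
}
```

URLEncoder emits upper-case hex digits, so the result differs from the URL above only in letter case, which is equivalent for percent-encoding.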
For the Japanese query, I only get result pages in which the search term appears as a bare string, i.e. delimited by either whitespace or punctuation. Since Japanese is written in scriptio continua, this is not sufficient. Pages that contain, for example:
ご自身のユーザー基本情報の確認
will not show up in the results. It seems that the search index used behind the scenes is built on "Western" lexical rules, so Japanese content is not tokenized correctly. However, when I search for the same keyword in the Google Site's own site-search field, I get the right results. I conclude that a correctly tokenized index exists, but it seems impossible to use it through the API-based search.
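The suspected behaviour can be reproduced with a toy model. This is only an illustration of the hypothesis, assuming an index that splits text on whitespace and ASCII punctuation and matches whole tokens; it says nothing about how Google's index is actually built. The class name WhitespaceTokenDemo is made up for the example, and the Japanese strings are the keyword and the sample sentence from above, written as Unicode escapes.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a "Western" tokenizer; an assumption for illustration only.
class WhitespaceTokenDemo {
    // Splits on runs of whitespace or ASCII punctuation.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("[\\s\\p{Punct}]+"));
    }

    // A page matches only if the keyword equals one whole token.
    static boolean matches(String text, String keyword) {
        return tokenize(text).contains(keyword);
    }

    public static void main(String[] args) {
        String keyword = "\u30E6\u30FC\u30B6\u30FC"; // the Japanese keyword
        // The sample sentence from the question: one unbroken run of characters.
        String sentence = "\u3054\u81EA\u8EAB\u306E\u30E6\u30FC\u30B6\u30FC"
                + "\u57FA\u672C\u60C5\u5831\u306E\u78BA\u8A8D";
        System.out.println(matches("the user profile", "user")); // true
        System.out.println(matches(sentence, keyword));          // false
        System.out.println(sentence.contains(keyword));          // true
    }
}
```

Under this model the whole sentence is a single token, so the keyword never matches even though it is clearly contained in the text.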
What I've tried so far
To remedy the situation, these are the avenues I've explored so far:
I've looked for language settings in Google Sites itself. There is only a general UI language setting; it is set to Japanese and has no impact on the API query results. There are no per-page or per-template language settings that would force the indexer/tokenizer's hand.
I've tried quoting the search string with double quotes ("ユーザー").
I've tried including wildcards (*ユーザー*).
I've tried the additional language parameters that are common in other Google APIs: lang, hl (interface language), rl (results language), ...
I've tried creating a Google Custom Search Engine, but that seems impossible to make work on a non-public Google Site.
So... I'm running out of ideas here. In the worst case, I'll end up having to retrieve, tokenize, and index all of the content myself and make it searchable that way. Since that would require substantial effort, I'd like to know whether anyone has encountered the same issue and has found an acceptable workaround or solution.
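To give an idea of what the worst-case fallback might involve, here is a minimal sketch of an in-memory index built on character bigrams, a common way to index text without word delimiters. It assumes the page texts have already been retrieved through the API; the class BigramIndex and its methods are hypothetical, and a real system would more likely rely on an existing search library with a CJK-aware analyzer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical self-built index; a sketch, not a production design.
class BigramIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();
    private final Map<String, String> pages = new HashMap<>();

    // Indexes every pair of adjacent UTF-16 code units of the page text.
    void add(String pageId, String text) {
        pages.put(pageId, text);
        for (int i = 0; i + 2 <= text.length(); i++) {
            postings.computeIfAbsent(text.substring(i, i + 2), k -> new HashSet<>())
                    .add(pageId);
        }
    }

    // Returns the ids of all pages whose text contains the keyword.
    Set<String> search(String keyword) {
        Set<String> candidates;
        if (keyword.length() < 2) {
            // Too short for a bigram lookup: every page is a candidate.
            candidates = new HashSet<>(pages.keySet());
        } else {
            candidates = null;
            for (int i = 0; i + 2 <= keyword.length(); i++) {
                Set<String> ids = postings.getOrDefault(
                        keyword.substring(i, i + 2), Collections.<String>emptySet());
                if (candidates == null) {
                    candidates = new HashSet<>(ids);
                } else {
                    candidates.retainAll(ids); // keep pages that have all bigrams
                }
            }
        }
        // Bigrams may occur apart from each other, so verify each candidate.
        Set<String> result = new TreeSet<>();
        for (String id : candidates) {
            if (pages.get(id).contains(keyword)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

The bigram lookup only narrows down the candidate pages; the final contains check keeps the results exact. Because the index works on UTF-16 code units, characters outside the Basic Multilingual Plane are split into surrogate halves, which is harmless for substring matching but worth knowing.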
Update 1
I have yet to find an elegant solution to this issue, so I raised a defect on the Google Apps APIs issue tracker: https://code.google.com/a/google.com/p/apps-api-issues/issues/detail?id=3780
Update 2
After some going back and forth, Google's engineers have acknowledged that the problem indeed exists as described, and have "filed an issue internally". The defect ticket has been stuck in the Triaged state ever since. If you, like me, are interested in seeing this issue resolved, please take a moment to star/vote for it on Google's issue tracker.
I know how it feels to wait for somebody's support team to handle an API bug while your application is about to miss its deadlines. The issue you describe sounds like a bug, so the "clean" solution is to wait until the Google Sites team resolves it (I upvoted it :) ) and you can use the search API.
However, in the meantime, I think you should look for workarounds. Let me suggest a different approach that may not meet your needs 100% but may be useful. For example, configure the site to expose its data as a feed and pass that feed to a processor with a rich search API. This could be an RSS feed of the articles on the Google Site, fed into Feedly, which has good multi-language search API support (search in the content of a stream) along with strong authentication to protect data privacy.
As an architect I know this is not a proper solution to the issue, but it once helped me build a searchable application that aggregated data from 100+ different data sources using Russian and Ukrainian locales.
Good luck with your application development, and let me know if this solution helped you! :)
full-text-search tokenize google-sites google-data-api