Multilingual free-text search in an app with normalized data?

Developer https://www.devze.com 2023-02-27 18:38 Source: web

We have enums, free-text, and referenced fields etc. in our DB.

Each enum has its own translations, and free text could be in any language. We'd like to do efficient large-scale free-text searching and enum-value-based searching.

I know of solutions like Solr which are nice, but that would mean we'd have to index entire de-normalized records with all the text of all the languages in the system. This seems a bit excessive.

What are some recommended approaches for searching multilingual normalized data? Has anyone tackled this before?

ETL. Extract, Transform, Load. In other words, get the data out of your existing databases, transform it (which is more than merely denormalizing it) and load it into SOLR. The SOLR db will be a lot smaller than the existing databases because there is no relational overhead. And SOLR search takes most of the load off of your existing database servers.
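As a rough illustration of the extract and transform steps, here is a minimal Python sketch. The schema (`records`, `enum_labels`), the sample rows, and the document field names are all hypothetical, and an in-memory SQLite database stands in for the real source db. Note that each document only carries the enum label in the record's own language, not every translation in the system.

```python
import json
import sqlite3

# Hypothetical normalized schema: records reference an enum code,
# and the enum labels live in a separate translation table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, lang TEXT, body TEXT, status_code TEXT);
    CREATE TABLE enum_labels (code TEXT, lang TEXT, label TEXT);
    INSERT INTO records VALUES (1, 'en', 'Printer jams on startup', 'OPEN');
    INSERT INTO records VALUES (2, 'de', 'Drucker startet nicht', 'CLOSED');
    INSERT INTO enum_labels VALUES ('OPEN', 'en', 'open'), ('OPEN', 'de', 'offen');
    INSERT INTO enum_labels VALUES ('CLOSED', 'en', 'closed'), ('CLOSED', 'de', 'geschlossen');
""")

def extract(conn):
    """Extract: join each record with the enum label in the record's own language."""
    sql = """
        SELECT r.id, r.lang, r.body, r.status_code, e.label
        FROM records r
        JOIN enum_labels e ON e.code = r.status_code AND e.lang = r.lang
        ORDER BY r.id
    """
    return conn.execute(sql).fetchall()

def transform(row):
    """Transform: flatten one joined row into a search document."""
    rec_id, lang, body, status_code, label = row
    return {
        "id": str(rec_id),                # key back into the source db
        "lang": lang,
        "status_code": status_code,       # language-neutral enum value
        f"status_label_{lang}": label,    # enum label in the record's language
        f"body_{lang}": body,             # free text, in a per-language field
    }

docs = [transform(row) for row in extract(conn)]
payload = json.dumps(docs, ensure_ascii=False)
print(payload)
```

The load step is left out on purpose: the resulting JSON array is the kind of payload you would POST to a core's update handler, but the core name and field definitions depend on your own setup.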

Take a good look at how to configure and use SOLR and learn about SOLR cores. You may want to put some languages in separate cores because that way you can more effectively use the various stemming algorithms in SOLR. But even with multilingual data you can still use bigrams (such as are used with Chinese language analysis).
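To make the bigram idea concrete, here is a small Python sketch of what character bigram tokenization produces. It is only an illustration of the technique, not how SOLR implements it internally; in practice you would let SOLR's own analysis chain do this at index and query time. The function name `char_bigrams` is made up for this example.

```python
def char_bigrams(text):
    """Split text on whitespace, then emit overlapping two-character tokens per word."""
    tokens = []
    for word in text.split():
        if len(word) < 2:
            tokens.append(word)  # a single character is kept as its own token
        else:
            tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens

print(char_bigrams("多语言搜索"))  # ['多语', '语言', '言搜', '搜索']
print(char_bigrams("free text"))   # ['fr', 're', 'ee', 'te', 'ex', 'xt']
```

Because the tokens do not depend on knowing word boundaries or the language, the same field can match text in any language, at the cost of less precise matching than a properly stemmed language-specific field.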

Having multiple cores makes searching a bit more complex, since a query can target either a single-language index or an all-languages index. But it is much more effective to group language data and apply language-specific stopwords, protected words, stemming and language analysis tools.
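On the client side, the extra complexity mostly comes down to choosing which core to query. A minimal sketch in Python, assuming one core per language plus a catch-all core; the base URL, the core names, and the `select_url` helper are all hypothetical:

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed address, adjust to your install
LANGUAGE_CORES = {"en": "records_en", "de": "records_de", "zh": "records_zh"}  # hypothetical cores
FALLBACK_CORE = "records_all"  # hypothetical all-languages core

def select_url(query, lang=None):
    """Build a select URL against the core for `lang`, or the all-languages core."""
    core = LANGUAGE_CORES.get(lang, FALLBACK_CORE)
    params = urlencode({"q": query, "wt": "json"})
    return f"{SOLR_BASE}/{core}/select?{params}"

print(select_url("drucker", "de"))
# http://localhost:8983/solr/records_de/select?q=drucker&wt=json
print(select_url("drucker"))
# http://localhost:8983/solr/records_all/select?q=drucker&wt=json
```

If the user's language is known, query that core first; if it is unknown or has no dedicated core, fall back to the all-languages index.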

Normally you would include some key data in the index so that when you find a record via SOLR search, you can then reference directly into the source db. Also, you can have normalised and non-normalised data together, for instance an enum could be recorded in a normalised field in English as well as a non-normalised field in the same language as the free-text. A field can be duplicated in order to apply two different analysis and filtering treatments.
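A small Python sketch of the key lookup, assuming the index stores the source table's primary key in an `id` field. The hit below is hand-written to show the shape (including the enum stored both as a language-neutral code and as a localized label); the table, field names, and `fetch_source` helper are hypothetical, and SQLite stands in for the source db.

```python
import sqlite3

# Stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT, status_code TEXT)")
conn.execute("INSERT INTO records VALUES (2, 'Drucker startet nicht', 'CLOSED')")

# Hand-written example of a search hit: the enum appears once as a
# normalised code and once as a label in the free-text language.
hit = {"id": "2", "status_code": "CLOSED", "status_label_de": "geschlossen"}

def fetch_source(conn, hit):
    """Use the key stored in the index to load the authoritative row from the source db."""
    return conn.execute(
        "SELECT id, body, status_code FROM records WHERE id = ?",
        (int(hit["id"]),),
    ).fetchone()

print(fetch_source(conn, hit))  # (2, 'Drucker startet nicht', 'CLOSED')
```

The index only needs enough fields to search and filter on; everything else can be read from the source row once you have the key.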

It would be worth your while to trial this with a subset of your data in order to learn how SOLR works and how best to configure it.