Designing an optimal multi-language search engine with Elasticsearch
All things Elastic series. Four different approaches to designing a multi-language Elasticsearch index.
Written for Elasticsearch 7.x.
When I was designing the Elasticsearch index for NewsCatcherAPI, one of the biggest problems I had was handling multi-language news articles.
I knew that Elasticsearch has pre-built analyzers for the most popular languages. The question was: “How do I manage to have documents in different languages that I can search all together (if needed)?”
Important: in our case, each document was already labeled with the correct language. Still, that is not necessary for all of the approaches described in this post.
Also, for this post's setup, let's assume that each document (news article) has only two fields: title and language, where language is the language of the title. For simplicity, assume that there can be only two different languages: English (en) and French (fr).
Why care about language at all?
Each language is different in many ways (I speak 4 languages, so give me some credit). Lemmatization, stemming, stopwords: all of these are unique on a per-language basis.
So, if you want Elasticsearch to understand that “dogs” is just a plural form of “dog”, or that “different” and “differ” share the same root — you have to use language-specific analyzers. (even for English!)
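You can see this for yourself with the _analyze API. For example, running the built-in english analyzer on a short sentence:

```
GET _analyze
{
  "analyzer": "english",
  "text": "The dogs are different"
}
```

The response should contain roughly the tokens dog and differ: the plural and the suffix are stemmed away, and common stopwords like “the” and “are” are dropped.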
First, I will describe two approaches that I found on the web and explain why I did not like them. Then, I will propose the solution we used for NewsCatcherAPI. And, finally, I will leave a link to a very advanced approach that auto-detects the language.
Approach #1. Multi-field
The idea of this approach is to index your text field multiple times using the fields parameter.
For example, if we want to make an index that would index the same text field with standard, English, and French analyzers:
PUT your_index_name_here
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "en": {
            "type": "text",
            "analyzer": "english"
          },
          "fr": {
            "type": "text",
            "analyzer": "french"
          }
        }
      }
    }
  }
}
The title field is, therefore, indexed 3 times. Now, to search through all the languages, you have to perform a multi-match search. For example:
GET your_index_name_here/_search
{
  "query": {
    "multi_match": {
      "query": "ce n'est pas une méthode optimale",
      "fields": [
        "title",
        "title.en",
        "title.fr"
      ],
      "type": "most_fields"
    }
  }
}
Advantages of the multi-fields approach
Easy to implement
Works even if data is not labeled with language
Drawbacks of the multi-fields approach
When you have just 2 languages it is acceptable, but assume you have 10 (as we do):
Slower indexing time. The field is analyzed once per language
More storage. Each language's sub-field takes extra storage space
Expensive querying. Read more about it in my other post
The first two points may not be that bad; the third one is. Assume you have 10 languages. To search through your entire database, you would have to compose a multi-match query that searches 10 differently indexed fields simultaneously (multiplied by the number of the index's shards).
To sum up, this approach may be an acceptable option for an index with 2-3 languages (and a relaxed budget).
Approach #2. Multi-index
This is the most popular answer you will get on Stack Overflow (it assumes the language of each document is known before indexing).
Create a separate index for each language. For example, the index with English text we call index_en, and the French one index_fr.
Then, if you know the language of the search query, you can direct it to the correct index.
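For example, a search that is known to be in French goes straight to the French index (the index names here just follow the naming above):

```
GET index_fr/_search
{
  "query": {
    "match": {
      "title": "méthode optimale"
    }
  }
}
```

Since index_fr is mapped with the french analyzer, the query text is analyzed the same way the documents were.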
Advantages of the multi-index approach
Not storing the same information multiple times
Drawbacks of the multi-index approach
Managing multiple indices. Plus, documents of different languages will not be uniformly distributed, so the indices will vary widely in size
Searches on fields common to all languages have to be run across all the indices
Regarding the last point: assume we have a timestamp field, and we want to retrieve all the articles published over the last week. For that, you have to filter by the published datetime field over all of the indices. Technically, not a problem at all; just search your field over multiple indices via a wildcard.
For example, if we want to search through both index_en and index_fr, just use index_*.
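As a sketch (the published_date field name is just an illustration, since our simplified documents only have title and language), a last-week filter over all the language indices could look like:

```
GET index_*/_search
{
  "query": {
    "range": {
      "published_date": {
        "gte": "now-1w/d"
      }
    }
  }
}
```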
But that is hardly better than a multi-match search: it is a multi-index search now.
Approach #3. Using an ingest processor to fill the correct field
My strategy is as follows:
Create one index
For each language, create its own separate field (not subfields)
Set up an ingest pipeline that fills the title_{lang} field based on the value of the language field
Index with a separate field for each language
PUT your_index_name_here
{
  "mappings": {
    "properties": {
      "language": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "title_en": {
        "type": "text",
        "analyzer": "english"
      },
      "title_fr": {
        "type": "text",
        "analyzer": "french"
      }
    }
  }
}
Our source data has neither title_en nor title_fr fields. So, we have to set up an ingest node pipeline to fill them.
Ingest node
According to the official documentation:
Use an ingest node to pre-process documents before the actual document indexing happens. The ingest node intercepts bulk and index requests, it applies transformations, and it then passes the documents back to the index or bulk APIs.
We will use a set processor to “copy” the title value into title_en or title_fr, depending on the language value.
We have to write a simple Painless condition to make each set processor conditional.
We create an ingest pipeline called “langdetect”
PUT _ingest/pipeline/langdetect
{
  "description": "copies the text data into a specific field depending on the language field",
  "processors": [
    {
      "set": {
        "if": "ctx.language == 'en'",
        "field": "title_en",
        "value": "{{title}}"
      }
    },
    {
      "set": {
        "if": "ctx.language == 'fr'",
        "field": "title_fr",
        "value": "{{title}}"
      }
    }
  ]
}
It has 2 processors that set the title_en and title_fr fields depending on the value of the language field.
According to our pipeline, if the language field's value is equal to “en”, then the title_en field is set to whatever is in the title field. Therefore, English text is analyzed by the standard analyzer (the title field) and also by the English analyzer (the title_en field).
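Before attaching the pipeline to the index, you can sanity-check it with the _simulate endpoint:

```
POST _ingest/pipeline/langdetect/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "ce n'est pas une méthode optimale",
        "language": "fr"
      }
    }
  ]
}
```

In the simulated output, the document should now carry a title_fr field with the same value as title, while title_en is absent.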
Now, when the pipeline is created, we have to “attach” it to the index. So, let’s update our index settings:
PUT /your_index_name_here/_settings
{
  "default_pipeline": "langdetect"
}
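From now on, every document indexed into this index runs through the pipeline. For example, this document ends up with an extra title_en field holding a copy of title, which is then analyzed by the English analyzer:

```
PUT your_index_name_here/_doc/1
{
  "title": "dogs are the best",
  "language": "en"
}
```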
Advantages
Single index with multilanguage support
Drawbacks
Slower indexing time
Works only when language is known
For the NewsCatcherAPI example, when a user wants to search in English, she sets the language parameter in our API to en. On the back end, we search through title_ + {lang}, which is title_en in the case of en. It is a bit more complicated than that, but this should be enough for this blog post.
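A sketch of such a back-end query for language=en, hitting only the English-analyzed field:

```
GET your_index_name_here/_search
{
  "query": {
    "match": {
      "title_en": "dogs"
    }
  }
}
```

Because title_en uses the english analyzer, this query would also match documents whose title contains “dog”.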
Approach #4. Detect language within Elasticsearch then index appropriately
Multilingual search using language identification in Elasticsearch
Conclusion
It came as a bit of a surprise when I realized that managing a multi-language search index with Elasticsearch is neither easy nor obvious. I had to spend a lot of time figuring out the best option for our use case.
Hope this “cheat-sheet” will save you a bit of time!
Ask me questions in the comments if something is not clear.
By the way, if you need help with your Elasticsearch cluster/index/set up, I do consult.
artem [at] newscatcherapi [dot] com