Designing an optimal multi-language search engine with Elasticsearch

All things Elastic series. Four different approaches to designing a multi-language Elasticsearch index

Elasticsearch version 7.x is used throughout this post.


When I was designing the Elasticsearch index for NewsCatcherAPI, one of the biggest problems I had was handling multi-language news articles.

I knew that Elasticsearch has pre-built analyzers for the most popular languages. The question was: “How do I store documents in different languages so that they can be searched all together (if needed)?”

Important: in our case, each document was already labeled with the correct language. Still, this is not required for all of the approaches described in this post.

Also, for the setup of this post, let’s assume that each document (news article) has only two fields: title and language, where language is the language of the title. For simplicity, assume that there can be only two different languages: English (en) and French (fr).
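For example, an incoming English document would look like this (the values are made up):

{
  "title": "The dogs are different",
  "language": "en"
}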

Why care about language at all?

Each language is different in many ways (I speak 4 languages, so give me some credit). Lemmatization, stemming, stopwords: all of these are unique on a per-language basis.

So, if you want Elasticsearch to understand that “dogs” is just the plural form of “dog”, or that “different” and “differ” share the same root, you have to use language-specific analyzers (even for English!).
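You can check what a language-specific analyzer actually does with the _analyze API. For the sample title above, the built-in english analyzer removes the stopwords and stems what is left:

GET _analyze
{
  "analyzer": "english",
  "text": "The dogs are different"
}

It returns just two tokens, dog and differ, which is why a search for “dog” will match a document containing “dogs”.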

First, I will describe two approaches that I found on the web and explain why I did not like them. Then, I will propose the solution we used for NewsCatcherAPI. And, finally, I will leave a link to a more advanced approach that auto-detects the language.


Approach #1. Multi-field

The idea of this approach is to index your text field multiple times using the fields parameter.

For example, here is an index that indexes the same text field with the standard, English, and French analyzers:

PUT your_index_name_here
{
  "mappings": {
    "properties": {
      "title": { 
        "type": "text",
        "fields": {
          "en": { 
            "type":     "text",
            "analyzer": "english"
          },
          "fr": { 
            "type":     "text",
            "analyzer": "french"
          }
        }
      }
    }
  }
}

The title field is therefore indexed 3 times: by the standard analyzer (title), the English analyzer (title.en), and the French analyzer (title.fr). Now, to search through all the languages, you have to perform a multi_match search. For example:

GET your_index_name_here/_search
{
  "query": {
    "multi_match": {
      "query": "ce n'est pas une méthode optimale",
      "fields": [ 
        "title",
        "title.en",
        "title.fr"
      ],
      "type": "most_fields" 
    }
  }
}

Advantages of the multi-fields approach

  1. Easy to implement

  2. Works even if data is not labeled with language

Drawbacks of the multi-fields approach

When you have just 2 languages it is acceptable, but assume you have 10 (as we do).

  1. Slower indexing time. The field is analyzed once per language

  2. More storage. Each per-language field takes additional disk space

  3. Expensive querying. Read more about it in my other post

The first two points may not be that bad; however, the third one is. Assume you have 10 languages. To search through your entire database, you would have to compose a multi_match query that searches through 10 differently indexed fields simultaneously (and that multiplies with the number of shards in the index).

To sum up: this approach may be an acceptable option for an index with 2-3 languages (and a relaxed budget).

Approach #2. Multi-index

This is the most popular answer you can get on Stack Overflow (it assumes the language of each document is known before indexing).

Create a separate index for each language. For example, we would call the index with English text index_en, and the French one index_fr.
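A minimal sketch of the two indices, reusing the title field from our setup:

PUT index_en
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

PUT index_fr
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "french"
      }
    }
  }
}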

Then, if you know the language of the search query, you can direct it to the correct index.

Advantages of the multi-index approach

  1. Not storing the same information multiple times

Drawbacks of the multi-index approach

  1. Managing multiple indices. Plus, documents of different languages will not be evenly distributed across the indices

  2. Indices are not free from a cluster perspective, as there is some level of resource overhead for each index.

  3. Searches on fields common to all documents will need to be done across all the indices.

Regarding the last point: assume we have a timestamp field, and we want to retrieve all the articles published this week. For that, you have to filter on the published datetime field across all of the indices. Technically, this is not a problem at all; just search your field over multiple indices via a wildcard.

For example, if we want to search through both index_en and index_fr, just use index_*.
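For example, assuming both indices have a published date field (the field name here is made up), retrieving this week’s articles in both languages could look like this:

GET index_*/_search
{
  "query": {
    "range": {
      "published": {
        "gte": "now-7d/d"
      }
    }
  }
}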

But that is no better than using a multi-match search. It is just a multi-index search now.

Approach #3. Using an ingest processor to fill the correct field

My strategy is as follows:

  1. Create one index

  2. For each language, create its own separate field (not subfields)

  3. Set up an ingest processor that fills the title_{lang} field based on the value of the language field

Index with a separate field for each language

PUT your_index_name_here
{
  "mappings": {
    "properties": {
      "language": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "title_en": {
        "type": "text",
        "analyzer": "english"
      },
      "title_fr": {
        "type": "text",
        "analyzer": "french"
      }
    }
  }
}

Our source data has neither a title_en nor a title_fr field. So, we have to set up an ingest node’s pipeline to fill them in.

Ingest node

According to the official documentation:

Use an ingest node to pre-process documents before the actual document indexing happens. The ingest node intercepts bulk and index requests, it applies transformations, and it then passes the documents back to the index or bulk APIs.

We will use the set processor to “copy” the title value into title_en or title_fr, depending on the language value.

We have to write a simple Painless condition to make each set processor conditional.

We create an ingest pipeline called “langdetect”:

PUT _ingest/pipeline/langdetect
{
  "description": "copies the text data into a specific field depending on the language field",
  "processors": [
    {
      "set": {
        "if": "ctx.language == 'en'",
        "field": "title_en",
        "value": "{{title}}"
      }
    },
    {
      "set": {
        "if": "ctx.language == 'fr'",
        "field": "title_fr",
        "value": "{{title}}"
      }
    }
  ]
}

It has 2 processors that set the title_en or title_fr field depending on the value of the language field.

According to our pipeline, if the language field’s value is equal to “en”, then the title_en field is set to whatever is in the title field. Therefore, English text will be analyzed by the standard analyzer (the title field) and also by the English analyzer (the title_en field).
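You can verify that the pipeline behaves as expected with the _simulate API, without indexing anything:

POST _ingest/pipeline/langdetect/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "ce n'est pas une méthode optimale",
        "language": "fr"
      }
    }
  ]
}

The simulated result should contain a title_fr field with the same value as title.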

Now that the pipeline is created, we have to “attach” it to the index. So, let’s update our index settings:

PUT /your_index_name_here/_settings
{
    "default_pipeline" : "langdetect"
}
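From this point on, every document indexed without an explicit pipeline parameter goes through langdetect. For example (the document ID is made up):

PUT your_index_name_here/_doc/1
{
  "title": "ce n'est pas une méthode optimale",
  "language": "fr"
}

GET your_index_name_here/_doc/1

The stored _source of document 1 now also contains title_fr, because the pipeline ran before the document was indexed.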

Advantages

  1. Single index with multilanguage support

Drawbacks

  1. Slower indexing time

  2. Works only when the language of each document is known

In the NewsCatcherAPI case, for example, when a user wants to search in English, she has to set the language parameter of our API to en. At the back end, we then search through title_ + {lang}, which is title_en in the case of en. It is a bit more complicated than that, but this should be enough for the purposes of this post.
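A simplified sketch of what such a back-end query could look like for en (the search terms are made up, and the real query has more clauses, as mentioned above):

GET your_index_name_here/_search
{
  "query": {
    "match": {
      "title_en": "different dogs"
    }
  }
}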

Approach #4. Detect language within Elasticsearch then index appropriately

For this one, I will just leave a link to the official Elastic blog post: Multilingual search using language identification in Elasticsearch


Conclusion

It was a bit of a surprise when I realized that managing a multi-language search index with Elasticsearch is neither easy nor obvious. I had to spend a lot of time figuring out the best approach for our use case.

Hope this “cheat-sheet” will save you a bit of time!

Ask me questions in the comments if something is not clear.


By the way, if you need help with your Elasticsearch cluster/index/setup, I do consulting.

artem [at] newscatcherapi [dot] com