Optimizing Elasticsearch performance. Avoiding multi-match queries
All things Elastic series. Using the copy_to parameter to optimize the search speed of your index
TL;DR Try to avoid querying multiple text fields at once. Instead, use the copy_to parameter to copy the values of multiple fields into one. Then, search only that field.
Elasticsearch version used: 7.x
“Elasticsearch is elastic” in many ways. If you do not specify anything, it will still index all your data, and the result will not be bad at all. However, to get the most value out of Elasticsearch (and decrease your bill), you will have to do some configuration.
Elasticsearch multi-match and why you should avoid using it
Full-text searches are expensive per se. Searching through multiple fields at once is even more expensive. Expensive in terms of computing power, not storage.
Queries that have to hit multiple fields are slow.
The optimization described in this post improves search speed; however, it takes (slightly) more disk space.
Good news: storage is cheap, while computing power is still expensive.
You may also find my other article interesting; it discusses how optimization differs between SQL and NoSQL.
Real-world example
At NewsCatcherAPI, we collect up to 300,000 news articles in different languages from all over the web, then store them in our Elasticsearch cluster.
When users call our search endpoint, we match their input from the q parameter against both the title and the text of the news articles in our Elasticsearch cluster (by default).
We do not use a multi-match query for that. Instead, we use the copy_to parameter to index both values in one field, which is then searched.
Assume that each document passed to your index has only 2 data points: title and body_text.
Doing it with the multi-match search:
# define index structure
PUT news_index
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"body_text": {
"type": "text"
}
}
}
}
# make a search call
GET news_index/_search
{
"query": {
"multi_match" : {
"query": "Elon Musk Grimes",
"fields": [ "title", "body_text" ]
}
}
}
That will be 2 operations: by default, multi_match generates one match query per field and then combines the results. Slow.
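To make the "2 operations" concrete, here is a minimal Python sketch of what a default (best_fields) multi_match is documented to be equivalent to: one match query per field, wrapped in a dis_max query. The helper name expand_best_fields is made up for this illustration, and the sketch only builds the request body as a dict; it does not talk to a cluster.

```python
import json


def expand_best_fields(query_text, fields, tie_breaker=0.0):
    """Build the dis_max body equivalent to a default multi_match.

    Hypothetical helper for illustration: one match clause is created
    for every field, so the number of clauses grows with the field list.
    """
    return {
        "dis_max": {
            "queries": [{"match": {field: query_text}} for field in fields],
            "tie_breaker": tie_breaker,
        }
    }


expanded = expand_best_fields("Elon Musk Grimes", ["title", "body_text"])
print(json.dumps(expanded, indent=2))
```

Every extra field in the list adds one more match clause that has to be evaluated, which is the cost the rest of this post tries to avoid.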
If we want to optimize for the search speed:
# define index structure with copy_to
PUT news_index
{
"mappings": {
"properties": {
"title": {
"type": "text",
"copy_to": "title_body_text"
},
"body_text": {
"type": "text",
"copy_to": "title_body_text"
},
"title_body_text": {
"type": "text"
}
}
}
}
# make a search call
GET news_index/_search
{
"query": {
"match" : {
"title_body_text" : {
"query" : "Elon Musk Grimes"
}
}
}
}
That will be just 1 operation: a single match query against a single field. Fast.
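If you have more than two fields to combine, writing the mapping by hand gets repetitive. Below is a small Python sketch that generates the same kind of mapping and search body for any list of text fields. The function names build_copy_to_mapping and build_match_query are hypothetical (they are not part of any Elasticsearch client), and the output is just JSON that you could paste into the requests shown above.

```python
import json


def build_copy_to_mapping(source_fields, combined_field):
    """Return an index body where every source field is copied into combined_field."""
    if combined_field in source_fields:
        raise ValueError("combined_field must differ from the source fields")
    # Each source field stays a normal text field and also feeds the combined one.
    properties = {
        name: {"type": "text", "copy_to": combined_field}
        for name in source_fields
    }
    # The combined field itself is a plain text field.
    properties[combined_field] = {"type": "text"}
    return {"mappings": {"properties": properties}}


def build_match_query(combined_field, query_text):
    """Return a search body with a single match query on the combined field."""
    return {"query": {"match": {combined_field: {"query": query_text}}}}


mapping = build_copy_to_mapping(["title", "body_text"], "title_body_text")
search = build_match_query("title_body_text", "Elon Musk Grimes")
print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

No matter how many source fields you pass in, the search body still contains exactly one match clause.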
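Important to understand: copy_to will not create a new field in your source document. The combined field exists only in the index, so you can search it, but it is not returned in _source.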
In case you wonder how to create a new field in your source document based on some predefined condition, an ingest node is your answer.
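As a rough illustration of the ingest approach, the Python sketch below builds the body of an ingest pipeline with a single set processor that writes a combined value into a real _source field, but only when both inputs are present. The pipeline ID add_title_and_body and the target field title_and_body are names invented for this example; the target deliberately differs from title_body_text so it does not collide with the copy_to field.

```python
import json

# Hypothetical names, chosen only for this example.
PIPELINE_ID = "add_title_and_body"
TARGET_FIELD = "title_and_body"

pipeline = {
    "description": "Store title and body_text together in a real _source field",
    "processors": [
        {
            "set": {
                # Painless condition: run only when both inputs exist.
                "if": "ctx.title != null && ctx.body_text != null",
                "field": TARGET_FIELD,
                # Template snippets are filled in from the incoming document.
                "value": "{{{title}}} {{{body_text}}}",
            }
        }
    ],
}

# Print the request in the same console style used in this post.
print("PUT _ingest/pipeline/" + PIPELINE_ID)
print(json.dumps(pipeline, indent=2))
```

Once the pipeline exists, you would index documents through it, for example with PUT news_index/_doc/1?pipeline=add_title_and_body. Unlike copy_to, this stores the combined value in every document, so it costs more disk space.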
Know your use case in advance
This is a golden rule for working with NoSQL databases, and Elasticsearch is no exception.
Know all the access patterns to your data in advance (if possible). That will help you optimize your cluster/index design.
Conclusion
“It is not a big deal,” you might say, and I may agree. But there are many such small details that, all together, play a significant role in keeping your cluster from crashing under heavy load.
If you need help with your Elasticsearch cluster/index/setup, I do consulting.
artem {at} newscatcherapi [dot] com