Deep Dive into Querying Elasticsearch. Filter vs Query. Full text search
Or how to understand what official documentation is missing
If I had to describe Elasticsearch in one phrase I would say something like:
When search meets analytics at scale (in near real time)
Elasticsearch is among the top 10 most popular open-source technologies at the moment. Fair enough, it unites many crucial features that are not unique on their own; combined, however, they can make the best search engine/analytics platform.
More precisely, Elasticsearch has become so popular due to a combination of the following features:
Search with relevance scoring
Full-text search
Analytics (aggregations)
Schemaless (no limitations on data schema), NoSQL, document-oriented
Rich choice of data types
Horizontally scalable
Fault-tolerant
Working with Elasticsearch for my side project, I quickly realized that the official documentation looks more like a “squeeze” of what documentation should be. I had to google around and dig through Stack Overflow a lot, so I decided to compile all of that information in this post.
In this article, I will write mostly about querying/searching an Elasticsearch cluster. There are many different ways to accomplish more or less the same result; therefore, I will try to explain the pros and cons of each method.
More importantly, I will introduce you to two important concepts — query context and filter context — which are not well explained in the documentation. I will give you a set of rules for when it is better to use which method.
If there were just one thing I would like you to remember after reading this article, it would be this question:
Do you really need to score your documents while querying?
Query context vs Filter context
There is always a relevance score when we talk about Elasticsearch. The relevance score is a strictly positive float that indicates how well each document satisfies the search criteria. This score is relative to the highest score assigned; therefore, the higher the score, the better the document's relevance to the search criteria.
However, filter and query are two different concepts that you should be able to understand before writing your query.
Generally speaking, filter context is a yes/no option where each document either matches the query or not. A good example is the SQL WHERE clause followed by some conditions. SQL queries always return the rows that strictly match the criteria. There is no way for an SQL query to return an ambiguous result.
Filters are automatically cached and do not contribute to the relevance score.
Elasticsearch query context, on the other hand, shows you how well each document matches your requirements. To do so, the query uses an analyzer to find the best matches.
The rule of thumb would be to use filters for:
yes/no search
search on exact values (numeric, range and keyword)
Use queries for:
ambiguous result (some documents suit more than others)
full-text search
Unless you need a relevance score or full-text search, always try to use filters. Filters are “cheaper”.
In addition, Elasticsearch will automatically cache the results of filters.
In parts 1 and 2 I will speak about queries (that can be transformed into filters). Please do not confuse structured vs full-text with query vs filter — those are two different things.
1. Structured querying
Also called term-level queries, structured queries are a group of querying methods that check whether a document should be selected or not. Therefore, in many cases there is no real need for a relevance score — a document either matches or it does not (especially for numerics).
Term-level queries are still queries, so they will return a score.
Term query
Returns the documents where the value of a field exactly matches the criteria. The term query is somewhat an alternative to SQL select * from table_name where column_name = ...
The term query goes directly to the inverted index, which makes it fast. When working with text data, it is preferred to use term only for keyword fields.
GET /_search
{
  "query": {
    "term": {
      "<field_name>": {
        "value": "<your_value>"
      }
    }
  }
}
The term query runs in the query context by default; therefore, it will calculate the score. Even though the score will be identical for all documents returned, additional computing power will be involved.
Term query with a filter
If we want to speed up the term query and get it cached, it should be wrapped in a constant_score filter.
Remember the rule of thumb? Use this method if you do not care about the relevance score.
GET /_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "<field_name>": "<your_value>" }
      }
    }
  }
}
Now, the query is not calculating any relevance score, therefore, it is faster. Moreover, it is automatically cached.
Quick advice — use match instead of term for text fields.
Remember, the term query goes directly to the inverted index. The term query takes the value you provide and searches for it as it is, which is why it suits keyword fields well, as they are stored without any transformations.
Terms query
As you could have guessed, the terms query allows you to return documents that match at least one exact term.
The terms query is somewhat an alternative to SQL select * from table_name where column_name in (...)
It is important to understand that the queried field in Elasticsearch might be a list, for example { "name" : ["Odin", "Woden", "Wodan"] }. If you perform a terms query that contains one of those names, then this record will be matched — it does not have to match all the values in the field, only one.
GET /_search
{
  "query": {
    "terms": {
      "name": ["Frigg", "Odin", "Baldr"]
    }
  }
}
Terms set query
Same as the terms query, but this time you can specify how many exact terms should be present in the queried field.
You specify how many have to match — one, two, three, or all of them. However, this number comes from another numeric field, so each document should contain this number (specific to that particular document).
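As a sketch, here is a terms_set query body built in Python. The field names used here ("name", "required_matches") are hypothetical examples, not from this article:

```python
# Sketch of a terms_set query body. The field names ("name",
# "required_matches") are made-up examples.
def terms_set_query(field, terms, min_match_field):
    """Match documents whose `field` contains at least as many of `terms`
    as the number stored in that document's `min_match_field`."""
    return {
        "query": {
            "terms_set": {
                field: {
                    "terms": terms,
                    "minimum_should_match_field": min_match_field,
                }
            }
        }
    }

body = terms_set_query("name", ["Frigg", "Odin", "Baldr"], "required_matches")
```

A document like { "name": ["Odin", "Frigg"], "required_matches": 2 } would match this query, while one with "required_matches": 3 would not.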
Range query
Returns documents in which queried field’s value is within the defined range.
Equivalent of SQL select * from table_name where column_name between ... and ...
Range query has its own syntax:
gt — greater than
gte — greater than or equal to
lt — less than
lte — less than or equal to
An example where the field’s value should be ≥ 4 and ≤ 17:
GET /_search
{
  "query": {
    "range": {
      "<field_name>": {
        "gte": 4,
        "lte": 17
      }
    }
  }
}
The range query also works well with dates.
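For instance, a range query on a date field can use Elasticsearch date math expressions such as "now-7d/d". The field name published_at below is a made-up example:

```python
# Range query over a date field using Elasticsearch date math.
# "now-7d/d" means "7 days ago, rounded down to the start of the day".
# The field name "published_at" is hypothetical.
date_range = {
    "query": {
        "range": {
            "published_at": {
                "gte": "now-7d/d",  # from the start of 7 days ago
                "lt": "now/d",      # up to (not including) today
            }
        }
    }
}
```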
Regexp, wildcard and prefix queries
Regexp query returns the documents in which fields match your regular expression.
If you have never used regular expression then I highly advise you to get at least some understanding about what it is and when you could apply it.
Elasticsearch’s regexp is Lucene’s. It has standard reserved characters and operators. If you have already worked with Python’s re package, using it here should not be a problem. The only difference is that Lucene’s engine does not support anchor operators such as ^ and $.
You may find the entire list for the regexp in the official documentation.
In addition to the regexp query, Elasticsearch has wildcard and prefix queries. Logically, those two are just special cases of regexp.
Unfortunately, I could not find any information regarding the performance of those 3 queries, so I decided to test it myself to see if I could find any significant difference.
I could not find any difference in performance when comparing a wildcard search expressed with a regexp query and with a wildcard query. If you know what the difference is, please tweet me.
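To make the comparison concrete, here is the same "starts with wod" search expressed all three ways. The field name "name" follows the earlier example in this article:

```python
# The same "starts with wod" search expressed three ways.
# The field name "name" reuses the earlier Odin/Woden/Wodan example.
queries = {
    "regexp":   {"query": {"regexp":   {"name": "wod.*"}}},
    "wildcard": {"query": {"wildcard": {"name": "wod*"}}},
    "prefix":   {"query": {"prefix":   {"name": "wod"}}},
}
```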
Exists query
Because Elasticsearch is schemaless (or has no strict schema limitation), it is a fairly common situation for different documents to have different fields. As a result, it is very useful to know whether a document has a certain field or not.
The exists query returns documents that contain an indexed value for a field.
GET /_search
{
  "query": {
    "exists": {
      "field": "<your_field_name>"
    }
  }
}
2. Full-text querying
Full-text queries work well with unstructured text data. Full-text queries take advantage of the analyzer; therefore, I will briefly outline Elasticsearch's analyzer so that we can better understand full-text querying.
Elasticsearch’s analyzer pipe
Every time text-type data is inserted into an Elasticsearch index, it is analyzed and then stored in the inverted index. How you configure the analyzer will impact your searching capabilities, because the analyzer is also applied to full-text searches.
The analyzer pipe consists of three stages:
Character filter (0+) → Tokenizer (1) → Token filter (0+)
There is always one tokenizer and zero or more character & token filters.
1) The character filter receives the text data as it is; it may preprocess the data before it gets tokenized. Character filters are used to:
Replace characters matching given regular expression
Replace characters matching given strings
Clean HTML text
2) The tokenizer breaks the text data received from the character filter (if any) into tokens. For example, the whitespace tokenizer simply breaks text by whitespace (it is not the standard one). Therefore, Wednesday is called after Woden. will be split into [Wednesday, is, called, after, Woden.]. There are many built-in tokenizers that can be used to create custom analyzers.
The standard tokenizer breaks text by whitespace after removing the punctuation. It is the most neutral option for the vast majority of languages.
In addition to tokenization, the tokenizer does the following:
keeps track of the order of tokens,
notes the start and end of each word,
defines the type of token.
3) The token filter applies some transformation to the tokens. There are many different token filters you might choose to add to your analyzer. Some of the most popular are:
lowercase
stemmer (exists for many languages!)
duplicate removal
transformation to the ASCII equivalent
pattern replacement
limit on token count
stop list of tokens (removes tokens that appear in the stop list)
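To make the three stages concrete, here is a toy Python simulation of an analyzer pipe: a tiny character filter, a rough tokenizer, and two token filters. It only illustrates the flow of data; real Elasticsearch analyzers are far more sophisticated, and the stop list here is deliberately tiny:

```python
import re

STOP_WORDS = {"is", "the", "after"}  # tiny illustrative stop list

def strip_html(text):
    # Character filter stage: remove HTML tags before tokenization.
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    # Tokenizer stage: split on word characters (punctuation dropped),
    # roughly like the standard tokenizer.
    return re.findall(r"\w+", text)

def lowercase(tokens):
    # Token filter: lowercase every token.
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    # Token filter: drop tokens found in the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    # Character filter (0+) -> tokenizer (1) -> token filters (0+)
    return remove_stop_words(lowercase(tokenize(strip_html(text))))

print(analyze("<b>Wednesday</b> is called after Woden."))
# ['wednesday', 'called', 'woden']
```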
Now, when we know what the analyzer consists of we might think about how we are going to work with our data. Then, we might compose an analyzer that fits our case the most by choosing proper components. The analyzer can be specified on a per-field basis.
Enough theory, let’s see how the default analyzer works.
The standard analyzer is the default. It has zero character filters, the standard tokenizer, the lowercase token filter, and the stop token filter (disabled by default). You can compose your custom analyzer as you wish, but there are also a few built-in analyzers.
Some of the most effective out-of-the-box analyzers are the language analyzers, which take the specifics of each language into account to make more advanced transformations. So, if you know the language of your data in advance, I would recommend switching from the standard analyzer to the analyzer for that language.
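As a sketch, a custom analyzer is defined in the index settings and then attached to a field in the mappings. The analyzer and field names below (my_english, body) are made up; html_strip, standard, lowercase, stop, and porter_stem are standard built-in components:

```python
# Index-creation body defining a custom analyzer and applying it to a
# text field. "my_english" and "body" are hypothetical names.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type": "custom",
                    "char_filter": ["html_strip"],  # character filter (0+)
                    "tokenizer": "standard",        # exactly one tokenizer
                    "filter": ["lowercase", "stop", "porter_stem"],  # token filters (0+)
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "my_english"}
        }
    },
}
```

This body would be sent with a PUT request when creating the index; afterwards, both indexing and full-text queries on the body field go through my_english.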
The full-text query will use the same analyzer that was used while indexing the data. More precisely, the text of your query will go through the same transformations as the text data in the field being searched, so that both are at the same level.
Match query
The match query is the standard query for querying text fields.
We might call the match query an equivalent of the term query but for text-type fields (while term should be used solely for keyword-type fields when working with text data).
GET /_search
{
  "query": {
    "match": {
      "<text_field>": {
        "query": "<your_value>"
      }
    }
  }
}
The string passed into the query parameter (the required one) will, by default, be processed by the same analyzer as the one applied to the searched field, unless you specify the analyzer yourself using the analyzer parameter.
When you specify the phrase to be searched for, it is analyzed, and the result is always a set of tokens. By default, Elasticsearch will use the OR operator between all of those tokens. That means at least one of them should match — more matches will result in a higher score, though. You can switch this to AND with the operator parameter. In that case, all of the tokens will have to be found in the document for it to be returned.
If you want something in between OR and AND, you can specify the minimum_should_match parameter, which defines the number of clauses that should match. It can be specified as either a number or a percentage.
The fuzziness parameter (optional) allows you to tolerate typos. The Levenshtein distance is used for the calculation.
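Putting the optional parameters together, a match query with operator, minimum_should_match, and fuzziness might look like this (the field name title is a made-up example):

```python
# A match query combining the optional parameters discussed above.
# The field name "title" is hypothetical.
match_body = {
    "query": {
        "match": {
            "title": {
                "query": "wednesday woden odin",
                "operator": "or",             # default; "and" requires all tokens
                "minimum_should_match": "2",  # at least 2 of the 3 tokens must match
                "fuzziness": "AUTO",          # tolerate typos (Levenshtein distance)
            }
        }
    }
}
```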
If you apply a match query to a keyword field, it will perform the same as a term query. More interestingly, if you pass the exact value of a token stored in the inverted index to the term query, it will return exactly the same result as the match query, but faster, as it goes straight to the inverted index.
Match phrase query
Same as match, but sequence order and proximity are important. The match query is not aware of sequence and proximity; therefore, a phrase match can only be achieved with a different type of query.
GET /_search
{
  "query": {
    "match_phrase": {
      "<text_field>": {
        "query": "<your_value>",
        "slop": 0
      }
    }
  }
}
The match_phrase query has a slop parameter (default value 0), which is responsible for skipping terms. Therefore, if you specify a slop equal to 1, one word out of the phrase might be omitted.
Multi-match query
The multi-match query does the same job as match, with the only difference being that it is applied to more than one field.
GET /_search
{
  "query": {
    "multi_match": {
      "query": "<your_value>",
      "fields": ["<text_field1>", "<text_field2>"]
    }
  }
}
field names can be specified using wildcards
each field is equally weighted by default
each field's contribution to the score can be boosted
if no fields are specified in the fields parameter, then all eligible fields will be searched
There are different types of multi_match. I am not going to describe them all in this post, but I will explain the most popular:
The best_fields type (default) prefers results where tokens from the searched value are found in one field over results where the searched tokens are split among different fields.
most_fields is somewhat the opposite of the best_fields type.
The phrase type behaves like best_fields but searches for the entire phrase, similar to match_phrase.
I highly recommend going through the official documentation to check how exactly the score is calculated for each of those types.
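For instance, selecting the type and boosting a field can be sketched like this (title and body are hypothetical field names; ^3 boosts title's contribution to the score):

```python
# multi_match with an explicit type and a per-field boost.
# "title" and "body" are made-up field names.
multi_body = {
    "query": {
        "multi_match": {
            "query": "norse mythology",
            "type": "best_fields",          # the default type
            "fields": ["title^3", "body"],  # title weighted 3x in the score
        }
    }
}
```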
3. Compound queries
Compound queries wrap together other queries. Compound queries:
combine the score
change the behavior of wrapped queries
switch from query context to filter context
any of the above, combined
Boolean query
Boolean query combines together other queries. It is the most important compound query.
Boolean query allows you to combine searches in query context with filter context searches.
The boolean query has four occurrence types (clauses) that can be combined together:
must — "has to satisfy the clause"
should — "additional points to the relevance score if the clause is satisfied"
filter — "has to satisfy the clause, but the relevance score is not calculated"
must_not — "inverse of must; does not contribute to the relevance score"
must and should → query context
filter and must_not → filter context
For those familiar with SQL, must is the AND operator while should is OR. Therefore, every query inside the must clause has to be satisfied.
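Here is a sketch of a bool query combining all four clauses (every field name below is a made-up example):

```python
# A bool query combining all four occurrence types.
# Field names ("title", "body", "year", "status") are hypothetical.
bool_body = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "odin"}}],          # required, scored
            "should": [{"match": {"body": "mythology"}}],    # optional score bonus
            "filter": [{"range": {"year": {"gte": 1990}}}],  # required, not scored, cacheable
            "must_not": [{"term": {"status": "draft"}}],     # excluded, not scored
        }
    }
}
```

Note how the filter and must_not clauses restrict the result set without spending computing power on scoring, while must and should shape the relevance score.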
Boosting query
Boosting query is alike with boost
parameter for most queries but is not the same. Boosting query returns documents that match positive
clause and reduces the score for the documents that match negative
clause.
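A sketch of a boosting query (the field name body is hypothetical): documents matching the negative clause still match overall, but their score is multiplied by negative_boost.

```python
# Boosting query: keep documents that match the negative clause,
# but multiply their score by negative_boost (here 0.3).
# The field name "body" is a made-up example.
boosting_body = {
    "query": {
        "boosting": {
            "positive": {"match": {"body": "apple"}},
            "negative": {"match": {"body": "pie"}},
            "negative_boost": 0.3,
        }
    }
}
```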
Constant score query
As we previously saw in the term query example, the constant_score query converts any query into filter context, with a relevance score equal to the boost parameter (default 1).
To sum up, Elasticsearch fits many purposes nowadays, and sometimes it is difficult to understand which tool is best to use.
The main thing that I would like you to remember is that you do not always need to use the most advanced features to resolve easy problems.
If you do not need a relevance score to retrieve your data try to switch to the filter context.
Also, understanding how Elasticsearch works under the hood is crucial, so I recommend always knowing what your analyzer does.
There are many more query types in Elasticsearch. I tried to describe the most used ones. I hope you liked it.
Let me know if you would like to read another post where I give real examples of all queries.
I plan to publish a few more posts on Elasticsearch, so do not miss them.
That was quite a long one, so if you made it all the way here — thank you for reading!
About me
My name is Artem, I build newscatcherapi.com - ultra-fast API to find news articles by any topic, country, language, website, or keyword.
I write about Python, cloud architecture, Elasticsearch, data engineering, and entrepreneurship.