Elasticsearch — Basic and Advanced Concepts

Apr 10, 2020

What is Elasticsearch?

As we saw in our previous blog, Elasticsearch is a highly scalable, open-source full-text search and analytics engine built on top of Apache Lucene. It allows you to store, search, and analyze large volumes of data quickly and in near real time.

Basic Concepts -

Index — A large collection of JSON documents. It can be compared to a database in relational databases. Every document must reside in an index.
Shards — Since there is no limit on the number of documents that can reside in an index, indices are often horizontally partitioned into shards that reside on nodes in the cluster.
Max documents allowed in a shard = 2,147,483,519 (as of now)
Type — Logical partition of an index. Similar to a table in relational databases. Note that mapping types are deprecated as of Elasticsearch 6.0 and removed in later versions.
Fields — Similar to a column in relational databases.
Analyzers — Used while indexing and searching documents. They contain “tokenizers” that split text into tokens and “token filters” that filter or modify tokens during indexing and searching.
Mappings — Combination of fields and analyzers. A mapping defines how your fields are stored and indexed.

Inverted Index

ES uses inverted indexes under the hood. An inverted index is an index that maps terms to the documents containing them.

Let’s say we have 3 documents:

An inverted index for these documents can be constructed as -

The terms in the dictionary are stored in sorted order so that they can be found quickly.
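To make the idea concrete, here is a minimal Python sketch of an inverted index with a sorted term dictionary. The three sample documents are hypothetical stand-ins, and the whitespace tokenization is a simplification; this only illustrates the data structure and is not how Lucene stores its index on disk.

```python
import bisect
from collections import defaultdict

# Hypothetical sample documents, used only for illustration.
docs = {
    1: "Elasticsearch is a search engine",
    2: "Lucene is a search library",
    3: "Elasticsearch is built on Lucene",
}

def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Insert terms in sorted order so the term dictionary itself is sorted.
    return {term: sorted(postings[term]) for term in sorted(postings)}

inverted_index = build_inverted_index(docs)
sorted_terms = list(inverted_index)

def lookup(term):
    """Find a term with binary search over the sorted term dictionary."""
    i = bisect.bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return inverted_index[term]
    return []

for term, doc_ids in inverted_index.items():
    print(f"{term:15} -> {doc_ids}")
```

Because the terms are sorted, `lookup` can use binary search instead of scanning every term.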

Searching for multiple terms is done by looking up each term in the index. ES then performs either a UNION or an INTERSECTION on the resulting document lists and fetches the matching documents.
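A multi-term lookup can be sketched in a few lines of Python. The index contents and the `search` helper below are hypothetical; the sketch only shows how OR maps to a set union and AND to a set intersection over posting lists.

```python
# Hypothetical inverted index: term -> set of matching document IDs.
inverted_index = {
    "elasticsearch": {1, 3},
    "lucene": {2, 3},
    "search": {1, 2},
}

def search(terms, operator="OR"):
    """Look up each term, then combine the posting sets.

    OR  -> UNION of the posting sets
    AND -> INTERSECTION of the posting sets
    """
    postings = [inverted_index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

print(sorted(search(["elasticsearch", "lucene"], "OR")))   # documents containing either term
print(sorted(search(["elasticsearch", "lucene"], "AND")))  # documents containing both terms
```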

An ES index spans multiple shards. While indexing, each document is routed to a shard based on a hash of its routing value, which defaults to the document's _id. We can customize which shard a document is routed to, and which shards search requests are sent to.
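The Elasticsearch documentation describes the target shard, in its simplest form, as a hash of the document's routing value (the `_id` by default) modulo the number of primary shards. The sketch below mimics that idea in Python. It assumes a hypothetical index with 5 primary shards and uses `zlib.crc32` purely as a stand-in hash; it is not the hash function Elasticsearch uses internally, so the shard numbers it prints will not match a real cluster.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # hypothetical index setting

def pick_shard(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    """Return a shard number for a routing value.

    zlib.crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

# Default behaviour: the routing value is the document _id.
print(pick_shard("doc-42"))

# Custom routing: using a shared value (here a hypothetical user ID)
# sends all of that user's documents to the same shard.
print(pick_shard("user-7") == pick_shard("user-7"))
```

Since the same routing value always hashes to the same shard, a search that supplies that routing value only needs to query one shard instead of all of them.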

An ES index is made up of multiple Lucene indexes (each shard is a Lucene index), which in turn are made up of index segments. Segments are write-once, read-many structures, i.e. the index files Lucene writes are immutable (except for deletions).

Analyzers -

Analysis is the process of converting text into tokens or terms, which are added to the inverted index for searching. Analysis is performed by an analyzer, which can be either built-in or custom.

We can define a single analyzer for both indexing and searching, or a separate index analyzer and search analyzer for a mapping.
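As an illustration, the Python snippet below builds an index-creation request body that uses the `analyzer` and `search_analyzer` mapping parameters. The field names `title` and `description` are hypothetical, and the body assumes the typeless mapping format of Elasticsearch 7 and later; older versions nest the properties under a type name.

```python
import json

# Request body for creating an index. Field names are hypothetical.
index_body = {
    "mappings": {
        "properties": {
            # One analyzer used at both index time and search time.
            "title": {"type": "text", "analyzer": "standard"},
            # Different analyzers for indexing and for searching.
            "description": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "simple",
            },
        }
    }
}

print(json.dumps(index_body, indent=2))
```

When `search_analyzer` is omitted, the field's `analyzer` is used for searching as well.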

Building blocks of an analyzer -

Character filters — receive the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
Tokenizers — receive a stream of characters and break it up into individual tokens.
Token filters — receive the token stream and may add, remove, or change tokens.
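The three building blocks chain together as a pipeline. The following Python sketch is a toy model of that flow, not Elasticsearch's implementation: the function names, the regex-based tokenizer, and the tiny stopword list are all assumptions made for illustration.

```python
import re

def html_strip_char_filter(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def word_tokenizer(text):
    """Tokenizer: break the character stream into word tokens."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    """Token filter: change each token to lowercase."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "is"})):
    """Token filter: remove tokens that appear in the stopword list."""
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    """Run character filter -> tokenizer -> token filters, in that order."""
    text = html_strip_char_filter(text)
    tokens = word_tokenizer(text)
    tokens = lowercase_filter(tokens)
    return stop_filter(tokens)

print(analyze("<p>The QUICK brown fox is a Fox!</p>"))
# ['quick', 'brown', 'fox', 'fox']
```

The order matters: the stop filter runs after lowercasing, so "The" is removed even though the stopword list only contains "the".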

Some commonly used built-in analyzers -

1. Standard -

Divides text into terms on word boundaries. Lower-cases all terms. Removes punctuation, and removes stopwords only if a stopword list is specified (none by default).