Elasticsearch
Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene and developed in Java. It lets you store, search, and analyze huge volumes of data quickly and in near real time, returning answers in milliseconds.
Purpose-built for full-text search, it excels at searching and analyzing unstructured or semi-structured data, using inverted indices and scoring algorithms to return highly relevant search results.
Comes with extensive REST APIs for storing and searching data.
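As a minimal sketch of those REST APIs using Python's `requests` library, assuming a local, unsecured node at `http://localhost:9200` and a hypothetical `products` index (the document fields are made up for illustration):

```python
import requests

ES = "http://localhost:9200"  # assumed local single-node cluster

# Store (index) a JSON document; refresh=true makes it immediately searchable
doc = {"name": "Wireless Mouse", "description": "Ergonomic wireless mouse", "price": 29.99}
requests.post(f"{ES}/products/_doc?refresh=true", json=doc)

# Full-text search against the same index
query = {"query": {"match": {"description": "wireless"}}}
resp = requests.post(f"{ES}/products/_search", json=query)
print(resp.json()["hits"]["hits"])
```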
Supports horizontal scaling.
Schemaless: documents can be indexed without defining a mapping up front.
Powerful aggregation capabilities that allow you to perform complex analytics on your data. It supports aggregation functions such as sum, average, min, max, and more, enabling you to derive insights from your data in real time.
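A hedged sketch of an aggregation request, reusing the hypothetical `products` index and `price` field from the example above:

```python
import requests

agg_query = {
    "size": 0,  # skip individual hits; return only aggregation results
    "aggs": {
        "avg_price": {"avg": {"field": "price"}},
        "max_price": {"max": {"field": "price"}},
    },
}
resp = requests.post("http://localhost:9200/products/_search", json=agg_query)
print(resp.json()["aggregations"])
```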
Part of the Elastic Stack, which includes complementary tools like Logstash for data ingestion and Kibana for data visualization and monitoring. This ecosystem integration provides a seamless end-to-end solution for search, analytics, and monitoring needs.
When we index a document, Elasticsearch takes the document's full-text fields and runs them through an analysis process.
The result of this analysis is what actually gets stored in the index the document is added to. More specifically, the analyzed terms are stored in a structure called the inverted index.
When new documents are inserted, a character filter receives a text field's original text and can transform the value by adding, removing, or changing characters.
Afterwards, a tokenizer splits the text into individual tokens, which will usually be words. So if we have a sentence with ten words, we get an array of ten tokens. An analyzer may have only one tokenizer. By default, the standard tokenizer is used, which applies a Unicode Text Segmentation algorithm. Without going into details, it basically splits on whitespace and removes most symbols, such as commas, periods, and semicolons, because most symbols are not useful when it comes to searching.
After the text has been split into tokens, the tokens are run through zero or more token filters. A token filter may add, remove, or change tokens.
There are many token filters, the simplest being the lowercase token filter, which just converts all characters to lowercase. Another token filter you can make use of is named stop. It removes common words, referred to as stop words, such as "the," "a," "and," "at," etc.
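As a sketch of this character filter, tokenizer, and token filter pipeline, the `_analyze` API lets you run arbitrary text through each stage and inspect the resulting terms. The `html_strip` character filter and the sample sentence below are just illustrative choices:

```python
import requests

body = {
    "char_filter": ["html_strip"],     # character filter: strips HTML tags
    "tokenizer": "standard",           # the default tokenizer
    "filter": ["lowercase", "stop"],   # token filters applied in order
    "text": "The <b>QUICK</b> Brown Foxes jumped over the lazy dog!",
}
resp = requests.post("http://localhost:9200/_analyze", json=body)

# Prints the analyzed terms, e.g. ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog']
print([t["token"] for t in resp.json()["tokens"]])
```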
The purpose of an inverted index is to store text in a structure that allows for very efficient and fast full-text searches. When performing full-text searches, we are actually querying an inverted index, not the JSON documents we defined when indexing them.
An inverted index consists of all the unique terms that appear in any document covered by the index. For each term, the list of documents in which the term appears is stored. Essentially, an inverted index is a mapping between terms and the documents that contain those terms.
For any term, we can see which documents contain it, which enables Elasticsearch to efficiently match documents containing specific terms.
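As a toy illustration (plain Python, not how Elasticsearch stores it internally), the term-to-documents mapping looks roughly like this:

```python
from collections import defaultdict

# Three tiny "documents" keyed by document id
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}

# Build the term -> set-of-document-ids mapping
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["fox"]))    # [1, 3]
print(sorted(inverted_index["brown"]))  # [1, 2]
```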
A shard is a self-contained subset of an index. Each shard is an independent index in itself and contains a portion of the data. An index can be divided into multiple primary shards and their replicas.
When an index is created, Elasticsearch automatically assigns a configurable number of primary shards to distribute the data across the cluster. Each primary shard is responsible for a specific subset of the data.
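A sketch of setting the shard counts at index creation time (the index name `logs` and the counts below are arbitrary choices for illustration):

```python
import requests

settings = {
    "settings": {
        "number_of_shards": 3,     # primary shards; fixed once the index is created
        "number_of_replicas": 1,   # one replica copy of each primary shard
    }
}

# PUT creates the index with the given settings
resp = requests.put("http://localhost:9200/logs", json=settings)
print(resp.json())
```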
Sharding enables parallel processing of data. Because the data is distributed across multiple shards, Elasticsearch can perform search and indexing operations simultaneously on different shards, allowing for increased throughput and faster response times.
Sharding allows Elasticsearch to scale horizontally by adding more nodes to the cluster. As new nodes join the cluster, Elasticsearch automatically rebalances the shards, redistributing the data across the new nodes. This distributed nature of sharding allows for efficient utilization of resources and better performance as the cluster grows.
When a search query is executed, Elasticsearch routes the query to the appropriate shards based on the shard allocation and the query scope. The search results from each shard are then combined and returned to the user. This distributed search and retrieval process allows Elasticsearch to efficiently process large volumes of data.
When a new search is received, it is transformed into a set of searches, one per shard. Each shard returns its matching documents, and the lists are then merged, ranked, and sorted.
Logging and log analytics: As we've discussed, Elasticsearch is commonly used for ingesting and analyzing log data in near real time and in a scalable manner. It also provides important operational insights on log metrics to drive actions.
Business analytics: Many of the built-in features of the ELK Stack make it a good option as a business analytics tool. However, there is a steep learning curve for implementing it in most organizations. This is especially true where companies have multiple data sources besides Elasticsearch, since Kibana only works with Elasticsearch data.
Application search: For applications that rely heavily on a search platform for the access, retrieval, and reporting of data.
Better performance for text searching; more suitable for data that is searched frequently.
Schemaless (no fixed table schema required).
Easier to integrate with third-party tools, such as alerting, data ETL, ...
Databases, by contrast, provide ACID properties to sustain data integrity.
They handle structured data and express complex relationships more easily, such as with table joins.