Elasticsearch

Last updated 6 months ago

Introduction

  • Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene and developed in Java. It lets you store, search, and analyze large volumes of data in near real time, typically returning answers in milliseconds.

  • It is purpose-built for full-text search and excels at searching and analyzing unstructured or semi-structured data. It uses inverted indices and relevance-scoring algorithms to rank search results.

  • It comes with extensive REST APIs for storing and searching data

  • Supports horizontal scaling

  • Schemaless: documents can be indexed without defining a mapping up front, because field types are inferred through dynamic mapping

  • Powerful aggregation capabilities let you run complex analytics on your data. Supported aggregation functions include sum, average, min, max, and more, so you can derive insights from your data in near real time.

  • Part of the Elastic Stack, which includes complementary tools like Logstash for data ingestion and Kibana for data visualization and monitoring. This ecosystem integration provides a seamless end-to-end solution for search, analytics, and monitoring needs.
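
As a rough illustration of the REST API and aggregation points above, the sketch below builds (but does not send) a `_search` request that combines a full-text `match` query with an `avg` aggregation. The node address, the `products` index, and the `name` and `price` fields are assumptions made up for this example.

```python
import json
import urllib.request

# Assumptions for illustration only: a local node on the default port 9200
# and a hypothetical "products" index with "name" and "price" fields.
ES_URL = "http://localhost:9200"
INDEX = "products"


def build_search_request(text: str) -> urllib.request.Request:
    """Build (without sending) a _search request: match query plus avg aggregation."""
    body = {
        # Full-text match against the hypothetical "name" field
        "query": {"match": {"name": text}},
        # Average of the hypothetical "price" field over the matching documents
        "aggs": {"avg_price": {"avg": {"field": "price"}}},
    }
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("wireless keyboard")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it, which requires a reachable cluster; the sketch stops short of that so it can be read and run without one.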

Analysis & Inverted Index

Overview

  • When we index a document, Elasticsearch takes the full-text fields of the document and runs them through an analysis process

  • The result of the analysis is what is actually stored in the index that the document is added to. More specifically, the analyzed terms are stored in a structure called the inverted index

Analysis Process

  • When a new document is indexed, a character filter first receives a text field’s original text and can transform the value by adding, removing, or changing characters. An analyzer may have zero or more character filters.

  • Afterwards, a tokenizer splits the text into individual tokens, which will usually be words. So if we have a sentence with ten words, we would get an array of ten tokens. An analyzer must have exactly one tokenizer. By default, the tokenizer named standard is used, which implements the Unicode Text Segmentation algorithm. Without going into details, it splits text on word boundaries (in practice, mostly whitespace) and removes most symbols, such as commas, periods, and semicolons. That’s because most symbols are not useful when it comes to searching.

  • After the text is split into tokens, the tokens are run through zero or more token filters. A token filter may add, remove, or change tokens.

  • There are many different token filters. The simplest is the lowercase token filter, which converts all characters to lowercase. Another token filter, named stop, removes common words, which are referred to as stop words. These are words such as “the,” “a,” “and,” and “at.”
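
The three stages above can be sketched in plain Python. This is a simplified stand-in, not Elasticsearch's implementation: the HTML-stripping character filter, the regex tokenizer, and the four-word stop list are assumptions chosen to mirror the description.

```python
import re

# Stop words taken from the examples in the text; not the full built-in list.
STOP_WORDS = {"the", "a", "and", "at"}


def strip_html(text: str) -> str:
    """Character filter: replace HTML-like tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text: str) -> list:
    """Tokenizer: rough approximation of word-boundary splitting; drops punctuation."""
    return re.findall(r"\w+", text)


def lowercase(tokens: list) -> list:
    """Token filter: convert every token to lowercase."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens: list) -> list:
    """Token filter: drop common words that carry little search value."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text: str) -> list:
    """Run character filter -> tokenizer -> token filters, in that order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The QUICK fox, and a lazy dog!</p>"))
# ['quick', 'fox', 'lazy', 'dog']
```

The order of the token filters matters here: lowercasing runs first so that “The” matches the lowercase stop list. Note that this chain is a custom combination for illustration; the built-in standard analyzer lowercases tokens but leaves stop-word removal disabled by default.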

Inverted Index

  • The purpose of an inverted index is to store text in a structure that allows for very efficient and fast full-text searches. When performing full-text searches, we are actually querying an inverted index and not the JSON documents that we defined when indexing the documents.

  • An inverted index consists of all of the unique terms that appear in any document covered by the index. For each term, it stores the list of documents in which the term appears. So essentially an inverted index is a mapping from terms to the documents that contain them.

  • By looking up a term, we can see which documents contain it, which enables Elasticsearch to efficiently match documents containing specific terms
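
A minimal sketch of the term-to-documents mapping described above, using a dictionary of sets. The three sample documents and the whitespace-only analysis are assumptions for illustration; a real index also stores data such as term frequencies and positions.

```python
from collections import defaultdict

# Hypothetical sample documents: document ID -> text
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox jumps",
}


def build_inverted_index(documents: dict) -> dict:
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index: dict, query: str) -> set:
    """Return the IDs of documents containing every term in the query (AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(docs)
print(sorted(index["brown"]))                  # [1, 2]
print(sorted(search_all(index, "quick fox")))  # [1, 3]
```

A lookup touches only the entries for the query terms, never the full document texts, which is what makes this structure fast for full-text search.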

Index Sharding

Overview

  • A shard is a self-contained subset of the index. Each shard is an independent index in itself and contains a portion of the data. An index can be divided into multiple primary shards, each of which can have replica shards

  • When an index is created, Elasticsearch automatically assigns a configurable number of primary shards to distribute the data across the cluster. Each primary shard is responsible for a specific subset of the data.

  • Sharding enables parallel processing of data. As the data is distributed across multiple shards, Elasticsearch can perform search and indexing operations simultaneously on different shards, allowing for increased throughput and faster response times.

  • Sharding allows Elasticsearch to scale horizontally by adding more nodes to the cluster. As new nodes join the cluster, Elasticsearch automatically rebalances the shards, redistributing the data across the new nodes. This distributed nature of sharding allows for efficient utilization of resources and better performance as the cluster grows.

  • When a search query is executed, Elasticsearch routes the query to the appropriate shards based on the shard allocation and the query scope. The search results from each shard are then combined and returned to the user. This distributed search and retrieval process allows Elasticsearch to efficiently process large volumes of data.
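
To make "each primary shard is responsible for a specific subset of the data" concrete, the sketch below assigns documents to shards with the simplified rule `shard = hash(routing) % number_of_primary_shards`, where the routing value defaults to the document ID. It is an approximation under stated assumptions: `zlib.crc32` stands in for the hash function Elasticsearch actually uses, and the shard count and document IDs are made up.

```python
import zlib
from collections import Counter

NUM_PRIMARY_SHARDS = 3  # hypothetical value, fixed when the index is created


def shard_for(routing: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard from the routing value (by default, the document ID).

    zlib.crc32 is only a deterministic stand-in hash for this sketch.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


# Hypothetical document IDs
doc_ids = [f"doc-{i}" for i in range(1000)]
placement = Counter(shard_for(doc_id) for doc_id in doc_ids)

# Number of documents that landed on each shard
print(dict(placement))
```

Because the shard number depends on the shard count, the same document would map to a different shard if the count changed. That is why the number of primary shards cannot simply be edited on an existing index, while replicas can be added freely.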

Distributed Search

  • When a new search is received, it is transformed into a set of searches, one on each shard. Each shard returns its matching documents, and the lists are then merged, ranked, and sorted
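
The scatter-and-gather flow above can be sketched as follows. The per-shard hits, their scores, and the `query_shard` helper are hypothetical; the point is only that each shard returns its own best matches and the coordinating step merges them into one ranked list.

```python
import heapq
from itertools import islice

# Hypothetical per-shard hits as (score, doc_id), sorted by score descending
shard_results = {
    0: [(0.92, "a"), (0.40, "d")],
    1: [(0.88, "b"), (0.75, "c")],
    2: [(0.10, "e")],
}


def query_shard(shard_id: int, size: int) -> list:
    """Scatter step (hypothetical helper): a shard returns its top `size` hits."""
    return shard_results[shard_id][:size]


def distributed_search(size: int) -> list:
    """Gather step: merge the sorted per-shard lists and keep the global top `size`."""
    partials = [query_shard(shard_id, size) for shard_id in shard_results]
    # heapq.merge lazily merges already-sorted inputs; reverse=True means
    # every input must be sorted from highest to lowest score.
    merged = heapq.merge(*partials, key=lambda hit: hit[0], reverse=True)
    return [doc_id for _, doc_id in islice(merged, size)]


print(distributed_search(3))  # ['a', 'b', 'c']
```

Each shard only needs to return `size` hits, because no document outside a shard's own top `size` can appear in the global top `size`.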

Usage

  • Logging and log analytics: Elasticsearch is commonly used for ingesting and analyzing log data in near real time and in a scalable manner. It also provides important operational insights on log metrics to drive actions.

  • Business analytics: Many of the built-in features available within the ELK Stack make it a good option as a business analytics tool. However, there is a steep learning curve to implementing this product in most organizations. This is especially true where companies have multiple data sources besides Elasticsearch, since Kibana only works with Elasticsearch data.

  • Application search: For applications that rely heavily on a search platform for the access, retrieval, and reporting of data.

Vs Database

Pros

  • Better performance for text searching; more suitable for data that is searched frequently

  • Schemaless

  • Easier to integrate with third-party tools, such as alerting and data ETL

Cons

  • Databases provide ACID properties to maintain data integrity

  • Databases hold structured data and make complex relationships easier to represent, for example through table joins