Chunking


Introduction

  • A critical component of retrieval-augmented generation (RAG) is chunking: dividing large documents into smaller, more manageable pieces called “chunks.”

Chunking serves multiple purposes in RAG:

  • Efficiency: Smaller chunks reduce computational overhead during retrieval.

  • Relevance: Precise chunks increase the likelihood of retrieving relevant information.

  • Context Preservation: Proper chunking maintains the integrity of the information, ensuring coherent responses.

However, inappropriate chunking can lead to:

  • Loss of Context: Breaking information at arbitrary points can disrupt meaning.

  • Redundancy: Overlapping chunks may introduce repetitive information.

  • Inconsistency: Variable chunk sizes can complicate retrieval and indexing.

Strategies

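Fixed Length

How it works: Splits the text into chunks of a fixed size (measured in characters or tokens), regardless of content.

Advantages: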

  • Simplicity: Easy to implement without complex algorithms.

  • Uniformity: Produces consistent chunk sizes, simplifying indexing.

Challenges:

  • Context Loss: May split sentences or ideas, leading to incomplete information.

  • Relevance Issues: Critical information might span multiple chunks, reducing retrieval effectiveness.

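import { CharacterTextSplitter } from "@langchain/textsplitters";

// `text` is assumed to hold the raw document as a string, loaded elsewhere.
// (Renamed from `document` to avoid confusion with the browser's DOM global.)
const textSplitter = new CharacterTextSplitter({
  chunkSize: 100, // maximum number of characters per chunk
  chunkOverlap: 0, // no characters shared between neighbouring chunks
});
const texts = await textSplitter.splitText(text);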
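To make the trade-off concrete, here is a minimal, library-free sketch of fixed-length chunking with an optional overlap. The helper name `fixedLengthChunks` and the sample sentence are hypothetical and only for illustration; this is not how LangChain implements its splitter.

```typescript
// Hypothetical helper: cut `text` into windows of `chunkSize` characters.
// Each window starts `chunkSize - chunkOverlap` characters after the previous one.
function fixedLengthChunks(text: string, chunkSize: number, chunkOverlap = 0): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in [0, chunkSize)");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}

const sample = "Chunking splits long documents into pieces.";
const noOverlap = fixedLengthChunks(sample, 16);
const withOverlap = fixedLengthChunks(sample, 16, 4);
console.log(noOverlap);
console.log(withOverlap);
```

With these inputs the word "into" is cut in two between the second and third chunks of `noOverlap`, which illustrates the context-loss problem. In `withOverlap`, the last 4 characters of each full chunk are repeated at the start of the next one, which softens the cut at the cost of some redundancy.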

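Text-structured / Recursive based

How it works: Splits on a prioritized list of separators, moving on to the next separator whenever a piece is still larger than the chunk size.

Advantages:

  • Richer Context: Provides more information than sentence-based chunks.

  • Logical Division: Aligns with the natural structure of the text by splitting on the separator characters you define.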

Challenges:

  • Inconsistent Sizes: Paragraph lengths can vary widely.

  • Token Limits: Large paragraphs may exceed token limitations of the model.

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

const text = `Some other considerations include:

- Do you deploy your backend and frontend together, or separately?
- Do you deploy your backend co-located with your database, or separately?

**Production Support:** As you move your LangChains into production, we'd love to offer more hands-on support.
Fill out [this form](https://airtable.com/appwQzlErAS2qiP0L/shrGtGaVBVAz7NcV2) to share more about what you're building, and our team will get in touch.

## Deployment Options

See below for a list of deployment options for your LangChain app. If you don't see your preferred option, please get in touch and we can add it to this list.`;

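const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 50, // maximum number of characters per chunk
  chunkOverlap: 1, // characters shared between neighbouring chunks
  separators: ["|", "##", ">", "-"], // tried in order, from first to last
});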

const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);

console.log(docOutput.slice(0, 3));

Output:

[
  Document {
    pageContent: "Some other considerations include:",
    metadata: { loc: { lines: { from: 1, to: 1 } } }
  },
  Document {
    pageContent: "- Do you deploy your backend and frontend together",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  },
  Document {
    pageContent: "r, or separately?",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  }
]

Markdown-aware separators are available through fromLanguage:

const markdownText = `
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

\`\`\`bash
# Hopefully this code block isn't split
pip install langchain
\`\`\`

As an open-source project in a rapidly developing field, we are extremely open to contributions.
`;

const mdSplitter = RecursiveCharacterTextSplitter.fromLanguage("markdown", {
  chunkSize: 60,
  chunkOverlap: 0,
});
const mdDocs = await mdSplitter.createDocuments([markdownText]);

console.log(mdDocs);

Output:

[
  Document {
    pageContent: "# 🦜️🔗 LangChain",
    metadata: { loc: { lines: { from: 2, to: 2 } } }
  },
  Document {
    pageContent: "⚡ Building applications with LLMs through composability ⚡",
    metadata: { loc: { lines: { from: 4, to: 4 } } }
  },
  Document {
    pageContent: "## Quick Install",
    metadata: { loc: { lines: { from: 6, to: 6 } } }
  },
  Document {
    pageContent: "```bash\n# Hopefully this code block isn't split",
    metadata: { loc: { lines: { from: 8, to: 9 } } }
  },
  Document {
    pageContent: "pip install langchain",
    metadata: { loc: { lines: { from: 10, to: 10 } } }
  },
  Document {
    pageContent: "```",
    metadata: { loc: { lines: { from: 11, to: 11 } } }
  },
  Document {
    pageContent: "As an open-source project in a rapidly developing field, we",
    metadata: { loc: { lines: { from: 13, to: 13 } } }
  },
  Document {
    pageContent: "are extremely open to contributions.",
    metadata: { loc: { lines: { from: 13, to: 13 } } }
  }
]
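
The recursive idea can be sketched without any library. The function below is a simplified, hypothetical illustration (the name `recursiveSplit` and the sample text are assumptions, and it ignores overlap); it is not the LangChain implementation. It splits on the first separator, merges small neighbouring pieces while they fit, and falls back to the next separator only for pieces that are still too long.

```typescript
// Hypothetical, simplified sketch of recursive splitting (no overlap handling).
function recursiveSplit(text: string, chunkSize: number, separators: string[]): string[] {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  if (separators.length === 0) {
    // No separators left: fall back to hard fixed-length cuts.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) cuts.push(text.slice(i, i + chunkSize));
    return cuts;
  }
  const sep = separators[0];
  const rest = separators.slice(1);
  const chunks: string[] = [];
  let current = ""; // buffer of small pieces being merged back together
  for (const piece of text.split(sep)) {
    if (piece.length === 0) continue; // skip empty pieces from repeated separators
    if (piece.length > chunkSize) {
      // Too long even on its own: flush the buffer, then retry with finer separators.
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(...recursiveSplit(piece, chunkSize, rest));
      continue;
    }
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= chunkSize) {
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const doc =
  "Intro paragraph.\n\nSecond paragraph that is quite a bit longer than the limit.\n\nEnd.";
const chunks = recursiveSplit(doc, 30, ["\n\n", " "]);
console.log(chunks);
// [ "Intro paragraph.", "Second paragraph that is quite",
//   "a bit longer than the limit.", "End." ]
```

Short paragraphs stay whole, and only the oversized middle paragraph is broken further at word boundaries. This is why recursive splitting tends to preserve more structure than fixed-length cuts while still respecting a size limit.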

Semantic Chunking

How it works: Utilizes embeddings or machine learning models to split text based on semantic meaning, ensuring each chunk is cohesive in topic or idea.

Best for: Dense documents such as technical manuals or academic papers, where queries require deep understanding of the content.

Advantages:

  • Contextual Relevance: Chunks are meaningfully grouped, improving retrieval accuracy.

  • Flexibility: Adapts to the text’s inherent structure and content.

Challenges:

  • Complexity: Requires advanced NLP models and computational resources.

  • Processing Time: Semantic analysis can be time-consuming.

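from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Assumes a local text file with the document and an OPENAI_API_KEY in the environment.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

# Sentences are embedded and the text is split where neighbouring sentences differ most.
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)  # first chunk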
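
The core idea can be sketched in a few lines: embed each sentence, compare neighbours, and start a new chunk when similarity drops. In the hypothetical example below, a toy word-count vector stands in for a real embedding model, and the fixed `threshold` is an assumption chosen for the sample data. A real semantic chunker would call an embedding model and may pick breakpoints differently.

```typescript
type Vector = Record<string, number>;

// Toy stand-in for an embedding model (assumption): a word-count vector.
function embed(sentence: string): Vector {
  const counts: Vector = Object.create(null);
  const words = sentence.toLowerCase().match(/[a-z]+/g) || [];
  for (const word of words) {
    counts[word] = (counts[word] || 0) + 1;
  }
  return counts;
}

// Cosine similarity between two sparse vectors (0 when either is empty).
function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const word of Object.keys(a)) {
    dot += a[word] * (b[word] || 0);
    normA += a[word] * a[word];
  }
  for (const word of Object.keys(b)) normB += b[word] * b[word];
  return normA > 0 && normB > 0 ? dot / Math.sqrt(normA * normB) : 0;
}

// Start a new chunk whenever a sentence is not similar enough to the previous one.
function semanticChunks(sentences: string[], threshold: number): string[] {
  const groups: string[][] = [];
  for (let i = 0; i < sentences.length; i++) {
    const similar =
      i > 0 && cosine(embed(sentences[i - 1]), embed(sentences[i])) >= threshold;
    if (similar) groups[groups.length - 1].push(sentences[i]);
    else groups.push([sentences[i]]);
  }
  return groups.map((group) => group.join(" "));
}

const sentences = [
  "Cats sleep most of the day.",
  "Cats also sleep at night.",
  "Interest rates rose again.",
  "Higher rates slow borrowing.",
];
const grouped = semanticChunks(sentences, 0.2);
console.log(grouped);
```

With this sample the two sentences about cats end up in one chunk and the two about interest rates in another, because the only sharp drop in similarity falls between the second and third sentences. Word counts only capture shared vocabulary, so a real embedding model is needed to group sentences that express related ideas in different words.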

References

  • Text splitters | 🦜️🔗 LangChain

  • 11 Chunking Strategies for RAG — Simplified & Visualized (Medium)