Chunking

Introduction

A critical component of Retrieval-Augmented Generation (RAG) is the chunking process, in which large documents are divided into smaller, more manageable pieces called "chunks."

Chunking serves multiple purposes in RAG:

  • Efficiency: Smaller chunks reduce computational overhead during retrieval.

  • Relevance: Precise chunks increase the likelihood of retrieving relevant information.

  • Context Preservation: Proper chunking maintains the integrity of the information, ensuring coherent responses.

However, inappropriate chunking can lead to:

  • Loss of Context: Breaking information at arbitrary points can disrupt meaning.

  • Redundancy: Overlapping chunks may introduce repetitive information.

  • Inconsistency: Variable chunk sizes can complicate retrieval and indexing.

Strategies

Fixed Length

How it works: Splits text into chunks of a predetermined size (in characters or tokens), regardless of the content's structure.

Advantages:

  • Simplicity: Easy to implement without complex algorithms.

  • Uniformity: Produces consistent chunk sizes, simplifying indexing.

Challenges:

  • Context Loss: May split sentences or ideas, leading to incomplete information.

  • Relevance Issues: Critical information might span multiple chunks, reducing retrieval effectiveness.
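
The example below uses LangChain's CharacterTextSplitter to produce fixed-size chunks: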

import { CharacterTextSplitter } from "@langchain/textsplitters";

// `document` is the raw text to chunk.
const textSplitter = new CharacterTextSplitter({
  chunkSize: 100, // target chunk size, in characters
  chunkOverlap: 0, // no characters shared between adjacent chunks
});
const texts = await textSplitter.splitText(document);
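
A common mitigation for context loss is chunk overlap, so that each chunk repeats the tail of the one before it. A minimal variation of the splitter above:

// Adjacent chunks now share 20 characters, trading some redundancy
// for better continuity at chunk boundaries.
const overlappingSplitter = new CharacterTextSplitter({
  chunkSize: 100,
  chunkOverlap: 20,
});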

Text-Structured / Recursive

How it works: Recursively splits on an ordered list of separators (for example, paragraphs, then sentences, then words) until each chunk fits within the size limit.

Advantages:

  • Richer Context: Provides more information than sentence-based chunks.

  • Logical Division: Aligns with the natural structure of the text by splitting on a defined list of separator characters.

Challenges:

  • Inconsistent Sizes: Paragraph lengths can vary widely.

  • Token Limits: Large paragraphs may exceed token limitations of the model.
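
The example below splits a short Markdown snippet with a custom separator list; when a chunk is still too large, the splitter recurses to finer separators and finally to individual characters: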

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

const text = `Some other considerations include:

- Do you deploy your backend and frontend together, or separately?
- Do you deploy your backend co-located with your database, or separately?

**Production Support:** As you move your LangChains into production, we'd love to offer more hands-on support.
Fill out [this form](https://airtable.com/appwQzlErAS2qiP0L/shrGtGaVBVAz7NcV2) to share more about what you're building, and our team will get in touch.

## Deployment Options

See below for a list of deployment options for your LangChain app. If you don't see your preferred option, please get in touch and we can add it to this list.`;

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 50, // deliberately small, to force splits
  chunkOverlap: 1,
  separators: ["|", "##", ">", "-"], // tried in order, coarsest first
});

const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);

console.log(docOutput.slice(0, 3));
[
  Document {
    pageContent: "Some other considerations include:",
    metadata: { loc: { lines: { from: 1, to: 1 } } }
  },
  Document {
    pageContent: "- Do you deploy your backend and frontend together",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  },
  Document {
    pageContent: "r, or separately?",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  }
]
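
Note how the small chunkSize forces a character-level fallback, producing the mid-word fragment "r, or separately?" above. For common formats, fromLanguage supplies a suitable separator list; the Markdown preset splits along structural markers such as headings and code fences:
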
const markdownText = `
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

\`\`\`bash
# Hopefully this code block isn't split
pip install langchain
\`\`\`

As an open-source project in a rapidly developing field, we are extremely open to contributions.
`;

const mdSplitter = RecursiveCharacterTextSplitter.fromLanguage("markdown", {
  chunkSize: 60,
  chunkOverlap: 0,
});
const mdDocs = await mdSplitter.createDocuments([markdownText]);

console.log(mdDocs);
[
  Document {
    pageContent: "# 🦜️🔗 LangChain",
    metadata: { loc: { lines: { from: 2, to: 2 } } }
  },
  Document {
    pageContent: "⚡ Building applications with LLMs through composability ⚡",
    metadata: { loc: { lines: { from: 4, to: 4 } } }
  },
  Document {
    pageContent: "## Quick Install",
    metadata: { loc: { lines: { from: 6, to: 6 } } }
  },
  Document {
    pageContent: "```bash\n# Hopefully this code block isn't split",
    metadata: { loc: { lines: { from: 8, to: 9 } } }
  },
  Document {
    pageContent: "pip install langchain",
    metadata: { loc: { lines: { from: 10, to: 10 } } }
  },
  Document {
    pageContent: "```",
    metadata: { loc: { lines: { from: 11, to: 11 } } }
  },
  Document {
    pageContent: "As an open-source project in a rapidly developing field, we",
    metadata: { loc: { lines: { from: 13, to: 13 } } }
  },
  Document {
    pageContent: "are extremely open to contributions.",
    metadata: { loc: { lines: { from: 13, to: 13 } } }
  }
]

Semantic Chunking

How it works: Utilizes embeddings or machine learning models to split text based on semantic meaning, ensuring each chunk is cohesive in topic or idea.

Best for: Complex queries requiring deep understanding, such as technical manuals or academic papers.

Advantages:

  • Contextual Relevance: Chunks are meaningfully grouped, improving retrieval accuracy.

  • Flexibility: Adapts to the text’s inherent structure and content.

Challenges:

  • Complexity: Requires advanced NLP models and computational resources.

  • Processing Time: Semantic analysis can be time-consuming.
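
LangChain's Python library ships an experimental SemanticChunker that implements this approach: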

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Load the raw text to split (the standard LangChain example uses the
# State of the Union address).
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
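
SemanticChunker supports configurable breakpoint thresholds (percentile, standard deviation, interquartile). A minimal TypeScript sketch of the underlying idea, assuming a hypothetical embed function that maps a sentence to a vector (for example, a call to an embeddings API):

// A sketch of embedding-based semantic chunking: embed sentences,
// measure cosine distance between neighbours, and start a new chunk
// wherever the distance spikes above a threshold.

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function semanticChunk(
  text: string,
  embed: (s: string) => Promise<number[]>, // assumed embedding function
  threshold = 0.3 // distance above which a new chunk starts; tune per corpus
): Promise<string[]> {
  // Naive sentence split; production code would use a proper tokenizer.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  if (sentences.length === 0) return [];

  const vectors = await Promise.all(sentences.map(embed));
  const chunks: string[] = [];
  let current = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    if (cosineDistance(vectors[i - 1], vectors[i]) > threshold) {
      chunks.push(current.join(" ")); // topic shift: close the chunk
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current.join(" "));
  return chunks;
}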
