Chunking
Introduction
A critical component of retrieval-augmented generation (RAG) is the chunking process, where large documents are divided into smaller, more manageable pieces called "chunks."
Chunking serves multiple purposes in RAG:
Efficiency: Smaller chunks reduce computational overhead during retrieval.
Relevance: Precise chunks increase the likelihood of retrieving relevant information.
Context Preservation: Proper chunking maintains the integrity of the information, ensuring coherent responses.
However, inappropriate chunking can lead to:
Loss of Context: Breaking information at arbitrary points can disrupt meaning.
Redundancy: Overlapping chunks may introduce repetitive information.
Inconsistency: Variable chunk sizes can complicate retrieval and indexing.
Strategies
Fixed Length

How it works: Splits text into chunks of a predetermined size (measured in characters or tokens), regardless of sentence or paragraph boundaries.
Advantages:
Simplicity: Easy to implement without complex algorithms.
Uniformity: Produces consistent chunk sizes, simplifying indexing.
Challenges:
Context Loss: May split sentences or ideas, leading to incomplete information.
Relevance Issues: Critical information might span multiple chunks, reducing retrieval effectiveness.
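A minimal example using LangChain's CharacterTextSplitter, which splits on a single separator and packs the pieces up to the chunk size: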
import { CharacterTextSplitter } from "@langchain/textsplitters";

// Split on the default separator ("\n\n") into chunks of up to 100 characters.
const textSplitter = new CharacterTextSplitter({
  chunkSize: 100,
  chunkOverlap: 0,
});
const texts = await textSplitter.splitText(documentText); // documentText: the input string
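Chunk size can also be measured in tokens rather than characters, which maps more directly onto model context limits. A minimal sketch using TokenTextSplitter from the same package (the parameter values and the documentText input are illustrative, not part of the original example):

import { TokenTextSplitter } from "@langchain/textsplitters";

// Count chunk size in tokens instead of characters.
const tokenSplitter = new TokenTextSplitter({
  encodingName: "gpt2", // tiktoken encoding assumed for this sketch
  chunkSize: 100,       // 100 tokens per chunk
  chunkOverlap: 0,
});
const tokenChunks = await tokenSplitter.splitText(documentText);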
Text-Structured / Recursive-Based

How it works: Recursively splits text on an ordered list of separators (e.g., paragraphs, then sentences, then words), falling back to the next separator whenever a piece is still larger than the chunk size.
Advantages:
Richer Context: Provides more information than sentence-based chunks.
Logical Division: Aligns with the natural structure of the text, since splits occur on the separator characters that define that structure.
Challenges:
Inconsistent Sizes: Paragraph lengths can vary widely.
Token Limits: Large paragraphs may exceed token limitations of the model.
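The example below passes a custom separator list; note in the output how the splits land on those characters: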
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

const text = `Some other considerations include:
- Do you deploy your backend and frontend together, or separately?
- Do you deploy your backend co-located with your database, or separately?
**Production Support:** As you move your LangChains into production, we'd love to offer more hands-on support.
Fill out [this form](https://airtable.com/appwQzlErAS2qiP0L/shrGtGaVBVAz7NcV2) to share more about what you're building, and our team will get in touch.
## Deployment Options
See below for a list of deployment options for your LangChain app. If you don't see your preferred option, please get in touch and we can add it to this list.`;

// Try each separator in order, falling back to the next one whenever a
// piece is still larger than chunkSize.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 50,
  chunkOverlap: 1,
  separators: ["|", "##", ">", "-"],
});
const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);
console.log(docOutput.slice(0, 3));
[
  Document {
    pageContent: "Some other considerations include:",
    metadata: { loc: { lines: { from: 1, to: 1 } } }
  },
  Document {
    pageContent: "- Do you deploy your backend and frontend together",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  },
  Document {
    pageContent: "r, or separately?",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  }
]
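RecursiveCharacterTextSplitter also provides language-aware presets. fromLanguage("markdown") swaps in separators keyed to Markdown syntax (headings, fenced code blocks, horizontal rules), so chunks tend to track the document's own sections: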
const markdownText = `
# 🦜️🔗 LangChain
⚡ Building applications with LLMs through composability ⚡
## Quick Install
\`\`\`bash
# Hopefully this code block isn't split
pip install langchain
\`\`\`
As an open-source project in a rapidly developing field, we are extremely open to contributions.
`;
const mdSplitter = RecursiveCharacterTextSplitter.fromLanguage("markdown", {
  chunkSize: 60,
  chunkOverlap: 0,
});
const mdDocs = await mdSplitter.createDocuments([markdownText]);
console.log(mdDocs);
[
  Document {
    pageContent: "# 🦜️🔗 LangChain",
    metadata: { loc: { lines: { from: 2, to: 2 } } }
  },
  Document {
    pageContent: "⚡ Building applications with LLMs through composability ⚡",
    metadata: { loc: { lines: { from: 4, to: 4 } } }
  },
  Document {
    pageContent: "## Quick Install",
    metadata: { loc: { lines: { from: 6, to: 6 } } }
  },
  Document {
    pageContent: "```bash\n# Hopefully this code block isn't split",
    metadata: { loc: { lines: { from: 8, to: 9 } } }
  },
  Document {
    pageContent: "pip install langchain",
    metadata: { loc: { lines: { from: 10, to: 10 } } }
  },
  Document {
    pageContent: "```",
    metadata: { loc: { lines: { from: 11, to: 11 } } }
  },
  Document {
    pageContent: "As an open-source project in a rapidly developing field, we",
    metadata: { loc: { lines: { from: 13, to: 13 } } }
  },
  Document {
    pageContent: "are extremely open to contributions.",
    metadata: { loc: { lines: { from: 13, to: 13 } } }
  }
]
Semantic Chunking

How it works: Utilizes embeddings or machine learning models to split text based on semantic meaning, ensuring each chunk is cohesive in topic or idea.
Best for: Complex queries requiring deep understanding, such as technical manuals or academic papers.
Advantages:
Contextual Relevance: Chunks are meaningfully grouped, improving retrieval accuracy.
Flexibility: Adapts to the text’s inherent structure and content.
Challenges:
Complexity: Requires advanced NLP models and computational resources.
Processing Time: Semantic analysis can be time-consuming.
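LangChain exposes this strategy through the experimental SemanticChunker (Python), which embeds the text and looks for breakpoints where the meaning shifts: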
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Load the text to be chunked; the original example assumes this file exists.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
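Under the hood, chunkers like this embed adjacent sentences and place a breakpoint wherever the distance between consecutive embeddings spikes. Below is a simplified TypeScript sketch of that idea, assuming a hypothetical embed() helper; it illustrates the technique and is not the library's actual implementation:

// Simplified sketch of percentile-based semantic splitting.
// `embed` is a hypothetical helper returning one vector per sentence.
async function semanticChunks(
  sentences: string[],
  embed: (texts: string[]) => Promise<number[][]>,
  percentile = 95,
): Promise<string[]> {
  if (sentences.length === 0) return [];
  const vectors = await embed(sentences);

  const cosineDistance = (a: number[], b: number[]): number => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
  };

  // Distance between each pair of adjacent sentences.
  const distances = vectors.slice(1).map((v, i) => cosineDistance(vectors[i], v));

  // Breakpoints: adjacency gaps whose distance exceeds the chosen percentile,
  // i.e., places where the topic most likely changed.
  const sorted = [...distances].sort((x, y) => x - y);
  const threshold = sorted[Math.floor((percentile / 100) * (sorted.length - 1))];

  const chunks: string[] = [];
  let current: string[] = [sentences[0]];
  distances.forEach((d, i) => {
    if (d > threshold) {
      chunks.push(current.join(" "));
      current = [];
    }
    current.push(sentences[i + 1]);
  });
  chunks.push(current.join(" "));
  return chunks;
}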