Richer Context: Provides more information than sentence-based chunks.
Logical Division: Aligns with the natural structure of the text, since chunks are split on the delimiter that defines paragraph boundaries (typically a blank line or newline character).
Challenges:
Inconsistent Sizes: Paragraph lengths can vary widely.
Token Limits: Large paragraphs may exceed token limitations of the model.
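In LangChain, RecursiveCharacterTextSplitter's default separator list already starts with the paragraph break ("\n\n"), so paragraph-based chunking mostly comes down to choosing a chunk size large enough to hold a typical paragraph. A minimal sketch (the variable names and sizes below are illustrative choices, not values from the example that follows):

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// Paragraph-oriented splitting: prefer blank lines, then single newlines,
// then spaces, so whole paragraphs stay together whenever they fit.
// chunkSize and chunkOverlap here are illustrative, not library defaults.
const paragraphSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
  separators: ["\n\n", "\n", " "],
});
const paragraphChunks = await paragraphSplitter.splitText(
  "First paragraph...\n\nSecond paragraph...\n\nThird paragraph..."
);
console.log(paragraphChunks);

The example below goes the other way on purpose: it uses a tiny chunkSize of 50, an overlap of 1, and a custom separators list so the cut points and the overlap are easy to see in the output.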
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { Document } from "@langchain/core/documents";
const text = `Some other considerations include:
- Do you deploy your backend and frontend together, or separately?
- Do you deploy your backend co-located with your database, or separately?
**Production Support:** As you move your LangChains into production, we'd love to offer more hands-on support.
Fill out [this form](https://airtable.com/appwQzlErAS2qiP0L/shrGtGaVBVAz7NcV2) to share more about what you're building, and our team will get in touch.
## Deployment Options
See below for a list of deployment options for your LangChain app. If you don't see your preferred option, please get in touch and we can add it to this list.`;
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 50,
  chunkOverlap: 1,
  separators: ["|", "##", ">", "-"],
});
const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);
console.log(docOutput.slice(0, 3));
[
  Document {
    pageContent: "Some other considerations include:",
    metadata: { loc: { lines: { from: 1, to: 1 } } }
  },
  Document {
    pageContent: "- Do you deploy your backend and frontend together",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  },
  Document {
    pageContent: "r, or separately?",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  }
]
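The third chunk shows the mechanics: the 50-character limit cuts the second line right after "together", and the chunkOverlap of 1 repeats that chunk's last character, which is why the next chunk begins with "r, or separately?".

RecursiveCharacterTextSplitter also provides language-aware presets via fromLanguage. The Markdown preset below prefers Markdown-specific boundaries such as headings, fenced code blocks, and horizontal rules over arbitrary characters: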
const markdownText = `
# 🦜️🔗 LangChain
⚡ Building applications with LLMs through composability ⚡
## Quick Install
\`\`\`bash
# Hopefully this code block isn't split
pip install langchain
\`\`\`
As an open-source project in a rapidly developing field, we are extremely open to contributions.
`;
const mdSplitter = RecursiveCharacterTextSplitter.fromLanguage("markdown", {
  chunkSize: 60,
  chunkOverlap: 0,
});
const mdDocs = await mdSplitter.createDocuments([markdownText]);
console.log(mdDocs);
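fromLanguage ships presets for other formats and languages as well (for example html, latex, js, and python).

Another strategy is semantic chunking.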
How it works: Utilizes embeddings or machine learning models to split text based on semantic meaning, ensuring each chunk is cohesive in topic or idea.
Best for: Complex queries requiring deep understanding, such as technical manuals or academic papers.
Advantages:
Contextual Relevance: Chunks are meaningfully grouped, improving retrieval accuracy.
Flexibility: Adapts to the text’s inherent structure and content.
Challenges:
Complexity: Requires advanced NLP models and computational resources.
Processing Time: Semantic analysis can be time-consuming.
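LangChain's Python package offers an experimental implementation of this idea, SemanticChunker. The snippet below assumes a longer source text has already been loaded into the state_of_the_union variable and that OpenAI credentials are configured for the embedding model.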
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Use OpenAI embeddings to find semantic breakpoints between sentences
text_splitter = SemanticChunker(OpenAIEmbeddings())
# state_of_the_union holds the full source text, loaded elsewhere
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
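Under the hood, SemanticChunker embeds groups of adjacent sentences and starts a new chunk wherever the embedding distance between neighbouring groups spikes; how aggressively it splits can be tuned, for example via the breakpoint_threshold_type parameter.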