Text splitting (chunking) for RAG applications
How to split text for your RAG applications in a way that preserves semantics
Context
Text splitting, or chunking, is usually the first step in a RAG (Retrieval-Augmented Generation) workflow. It means breaking long text documents into smaller chunks that are embedded, indexed, and stored, then later used for information retrieval.
A typical RAG system
Some “naive chunking strategies” include:
Size-based chunking
You just split the document into chunks of a specific size, regardless of semantics.
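As a minimal sketch of size-based chunking, assuming size is measured in characters (it could equally be tokens), the function below cuts the text into fixed-width windows. The name `split_by_size` is ours, chosen for illustration.

```python
def split_by_size(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive windows of chunk_size characters.

    Boundaries fall wherever the count lands, so a chunk may start or
    end in the middle of a sentence or even a word.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


if __name__ == "__main__":
    for chunk in split_by_size("Chunking ignores meaning entirely.", 12):
        print(repr(chunk))
```

Note how the window boundaries ignore word and sentence limits, which is exactly the weakness of this strategy.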
Paragraph-based chunking
You split your document on “end of paragraph” separators such as “\n\n” or “\n”, or on weaker delimiters such as “;”. This is an approximation of semantic chunking: you assume (or hope) that each paragraph holds semantically distinct information.
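A minimal sketch of paragraph-based chunking, assuming paragraphs are separated by blank lines. The function name `split_by_paragraph` and the default separator pattern are our own choices; swap in whatever delimiter fits your documents.

```python
import re


def split_by_paragraph(text: str, separator: str = r"\n\s*\n") -> list[str]:
    """Split text on a separator regex (blank lines by default).

    Empty pieces and surrounding whitespace are dropped.
    """
    pieces = re.split(separator, text)
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    doc = "First paragraph.\n\nSecond paragraph.\n \nThird paragraph."
    print(split_by_paragraph(doc))
```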
Problem
The chunking approaches mentioned above are purely syntactic. However, you would prefer to split your document into semantically distinct chunks.
Why?
Because at query time you want retrieval to return the chunk or chunks that are semantically closest to your query. If your chunks are not semantically distinct enough, you may return information the query did not ask about, which lowers result quality and can increase the LLM hallucination rate.
Solution: Semantic Chunking
The following are some strategies for more semantics-aware text splitting.
Sentence clustering-based chunking (needs a better name!):
The idea is to build your semantic chunks from the ground up.
Start by splitting your document into sentences. A sentence is usually a semantic unit, as it contains a single idea about a single topic.
Embed the sentences.
Cluster semantically close sentences together to form chunks, while preserving sentence order.
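The three steps above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, the regex sentence splitter is naive, and the "clustering" is the simplest order-preserving variant we could think of (merge a sentence into the current chunk when it is similar enough to the previous sentence). The `threshold` value is arbitrary.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding (assumption): lowercase word counts.
    Replace with a real sentence-embedding model in practice."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(
    text: str,
    embed: Callable[[str], Counter] = toy_embed,
    threshold: float = 0.2,
) -> list[str]:
    """Group adjacent sentences into chunks, keeping document order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) >= threshold:
            chunks[-1].append(sentence)   # close to previous sentence: same chunk
        else:
            chunks.append([sentence])     # topic shift: start a new chunk
    return [" ".join(chunk) for chunk in chunks]


if __name__ == "__main__":
    text = "Cats purr loudly. Cats sleep often. Stocks fell sharply. Stocks rose later."
    for chunk in semantic_chunks(text):
        print(chunk)
```

Comparing each sentence only with its predecessor keeps the sketch short. Comparing against the running chunk centroid, or using a proper clustering algorithm with an ordering constraint, are reasonable alternatives.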
Semantic chunking — semantic sentence clustering
Propositional chunking
The idea is to iteratively build chunks with the help of an external LLM.
Start with a syntactic chunking pass, for example paragraph-based.
For each paragraph:
Generate standalone statements (propositions) using an LLM, with a prompt like “What are topics discussed in this text?” Propositions must be semantically self-contained and distinct statements.
Remove redundant propositions.
Index and store the generated propositions.
At query time, retrieve from the propositions corpus instead of the original documents corpus.
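A minimal sketch of the proposition-building steps, under several assumptions: `llm` is a hypothetical callable that takes a prompt string and returns the model's text reply, the reply is assumed to contain one proposition per line, and "redundant" is approximated here by exact match after normalizing case and punctuation (a real system would more likely compare embeddings). Indexing and retrieval are left out.

```python
import re
from typing import Callable

# Prompt from the text above; the "one per line" instruction is our assumption.
PROMPT = (
    "What are topics discussed in this text? "
    "Answer with standalone statements, one per line.\n\n{paragraph}"
)


def build_propositions(document: str, llm: Callable[[str], str]) -> list[str]:
    """Return deduplicated propositions generated paragraph by paragraph.

    `llm` is a hypothetical prompt -> reply function supplied by the caller.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    seen: set[str] = set()
    propositions: list[str] = []
    for paragraph in paragraphs:
        reply = llm(PROMPT.format(paragraph=paragraph))
        for line in reply.splitlines():
            proposition = line.strip(" -*\t")  # drop bullet markers
            key = re.sub(r"\W+", " ", proposition).strip().lower()
            if key and key not in seen:        # skip blanks and repeats
                seen.add(key)
                propositions.append(proposition)
    return propositions


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Canned reply standing in for a real model call.
        return "- RAG retrieves chunks.\n- rag retrieves chunks\n- Chunks are embedded."

    print(build_propositions("Paragraph one.\n\nParagraph two.", fake_llm))
```

The resulting list is what you would embed and index in place of the original chunks.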
Semantic chunking — propositional chunking
This paper proposes a propositional chunking algorithm that is similar to the one we describe.
Conclusion
Naively splitting documents into chunks may result in suboptimal performance in downstream tasks like Q&A. We discussed two semantic chunking approaches, sentence clustering and propositional chunking, that can improve the quality of your RAG system by keeping each chunk semantically distinct.