HeyCloud
Text splitting (chunking) for RAG applications

How to split text for your RAG applications in a way that preserves semantics

January 31, 2024

Context

Text splitting, or chunking, is usually the first step in a RAG (Retrieval-Augmented Generation) workflow. It means breaking long text documents into smaller chunks that are embedded, indexed, and stored, then used later for information retrieval.

Figure: A typical RAG system

Some “naive chunking strategies” include:

Size-based chunking

You just split the document into chunks of a specific size, regardless of semantics.
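
As a rough illustration, size-based chunking can be sketched in a few lines of Python. The function name `split_by_size` and the character-based `chunk_size` parameter are assumptions made for this example; real pipelines often count tokens rather than characters.

```python
def split_by_size(text: str, chunk_size: int = 200) -> list[str]:
    """Cut text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Slice at fixed offsets; the content is never inspected, so a cut
    # can land in the middle of a sentence or even a word.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


print(split_by_size("RAG systems retrieve chunks.", chunk_size=10))
# → ['RAG system', 's retrieve', ' chunks.']
```

Notice how the word "systems" is cut in two, which is exactly the kind of damage that splitting "regardless of semantics" can cause.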

Paragraph-based chunking

You split your document on separator characters that usually mark the end of a paragraph or clause, like “\n\n”, “\n”, “;”, etc. This is only an approximation of semantic chunking: you assume (or hope) that each paragraph holds semantically distinct information.
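
A minimal sketch of paragraph-based chunking, assuming paragraphs are separated by blank lines. The function name `split_by_paragraph` and the default separator pattern are choices made for this example; you can pass a different pattern to split on other separators.

```python
import re


def split_by_paragraph(text: str, separator_pattern: str = r"\n\s*\n") -> list[str]:
    """Split text on blank lines (by default) and drop empty chunks."""
    parts = re.split(separator_pattern, text)
    return [part.strip() for part in parts if part.strip()]


document = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
print(len(split_by_paragraph(document)))
# → 3
```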

Problem

The chunking approaches mentioned above are purely syntactic. However, you would prefer to split your document into semantically distinct chunks.

Why?

Because when you do retrieval (at query time), you want to return the chunk or chunks that are semantically closest to your query. If your chunks are not semantically distinct enough, you may return information that the query did not ask about, leading to lower-quality results and a higher LLM hallucination rate.

Solution: Semantic Chunking

The following are some strategies for more semantics-aware text splitting.

Sentence clustering-based chunking (needs a better name!):

The idea is to build your semantic chunks from the ground up.

Start with splitting your document into sentences. A sentence is usually a semantic unit as it contains a single idea about a single topic.
Embed the sentences.
Cluster semantically close sentences together to form chunks, while respecting sentence order.
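
The steps above can be sketched roughly as follows. Several things here are assumptions made for the example, not part of the approach itself: the naive regex sentence splitter, the bag-of-words `toy_embed` function (a stand-in for a real sentence embedding model), and the clustering rule, which starts a new chunk whenever the cosine similarity between two consecutive sentences drops below a `threshold`. Comparing only neighbouring sentences is one simple way to respect sentence order.

```python
import math
import re
from collections import Counter
from typing import Callable


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(sentence: str) -> Counter:
    # Stand-in for a real embedding model: sparse word-count vector.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    text: str,
    embed: Callable[[str], Counter] = toy_embed,
    threshold: float = 0.3,
) -> list[str]:
    sentences = split_sentences(text)
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) >= threshold:
            chunks[-1].append(sentence)  # close to its neighbour: same chunk
        else:
            chunks.append([sentence])    # topic shift: start a new chunk
        previous = current
    return [" ".join(chunk) for chunk in chunks]


text = (
    "Cats are small pets. Cats are playful animals. "
    "Python is a programming language. Python code is easy to read."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

With the toy embedding, the two sentences about cats end up in one chunk and the two about Python in another. A real embedding model and a tuned threshold would be needed for anything beyond a demo.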

Figure: Semantic chunking (semantic sentence clustering)

Propositional chunking

The idea is to iteratively build chunks with the help of an external LLM.

Start with a syntactic chunking pass, paragraph-based for example.
For each paragraph:
- Generate standalone statements (propositions) using an LLM, with a prompt like “What are topics discussed in this text?” Propositions must be semantically self-contained and distinct statements.
Remove redundant propositions.
Index and store the generated propositions.
At query time, retrieve from the proposition corpus instead of the original document corpus.

Figure: Semantic chunking (propositional chunking)
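
Here is a minimal sketch of the propositional chunking steps listed above. It assumes the LLM is available as a plain callable that takes a prompt and returns a list of proposition strings; `fake_llm` below is a hypothetical stand-in that returns each sentence as a proposition, so the example runs without any external service. Redundancy removal is reduced to exact matching after normalization, and retrieval to simple word overlap. In a real system, both would more likely rely on embeddings.

```python
import re
from typing import Callable

PROMPT = "What are topics discussed in this text?"  # example prompt from this post


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_proposition_index(
    document: str, llm: Callable[[str], list[str]]
) -> list[dict]:
    # Step 1: syntactic pass (paragraph-based).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    index, seen = [], set()
    for number, paragraph in enumerate(paragraphs):
        # Step 2: ask the LLM for standalone propositions.
        for proposition in llm(f"{PROMPT}\n\n{paragraph}"):
            key = " ".join(proposition.lower().split())
            # Step 3: drop redundant propositions (exact duplicates only).
            if key and key not in seen:
                seen.add(key)
                # Step 4: store the proposition with its source paragraph.
                index.append({"proposition": proposition.strip(), "paragraph": number})
    return index


def retrieve(query: str, index: list[dict], top_k: int = 1) -> list[dict]:
    # Step 5: rank propositions (not raw documents) against the query.
    query_tokens = tokens(query)
    ranked = sorted(
        index,
        key=lambda entry: len(query_tokens & tokens(entry["proposition"])),
        reverse=True,
    )
    return ranked[:top_k]


def fake_llm(prompt: str) -> list[str]:
    # Hypothetical stand-in for an LLM call: one proposition per sentence.
    paragraph = prompt.split("\n\n", 1)[1]
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


document = (
    "Paris is the capital of France. Paris is the capital of France.\n\n"
    "Berlin is the capital of Germany."
)
index = build_proposition_index(document, fake_llm)
print(retrieve("What is the capital of Germany?", index)[0]["proposition"])
# → Berlin is the capital of Germany.
```

Keeping the source paragraph number next to each proposition is a design choice for this sketch: it lets you hand the original paragraph to the LLM as context once a proposition has been retrieved.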

This paper proposes a propositional chunking algorithm that is similar to the one we describe.

https://arxiv.org/pdf/2312.06648.pdf

Conclusion

Naively splitting documents into chunks may result in suboptimal performance in downstream tasks like Q&A. We discussed two semantic chunking approaches that can improve retrieval quality, and with it the overall quality of your RAG system.