Long Context vs RAG: The Final Take.

Will large context LLMs kill the need for RAG architecture?

Context

Google released Gemini 1.5 last week, and it spurred a large online debate about whether this is the end of RAG. The reason is that Gemini 1.5 has a very large context window (input size) of 1M multimodal tokens, and reportedly up to 10M text tokens in research testing. On the surface, one might think: if the LLM can take all my data at once, why bother building a RAG system?

Let’s analyse both sides of the debate:

Arguments for RAG's Continued Significance:

  • Efficiency and Cost-effectiveness: RAG remains attractive for scenarios where processing power (think inference GPUs) is limited.

  • Data freshness and dynamic data: If the data at hand changes often, it would be very costly to re-ingest everything into the LLM each time. For example, if only part of a website or codebase changes, why ingest the whole thing when you can retrieve only the updated part?

  • Deterministic Security and Access Control: The deterministic nature of RAG provides an edge in production-grade applications with strict security and access control requirements.

    For example, when you run RAG over data that comes from different departments, you can attach access control rules to individual chunks or parts of the data and enforce them before anything reaches the model. This is not feasible if you ingest all the data into the LLM at once.
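The access control point above can be sketched in a few lines. This is a minimal illustration, not a production design: the `Chunk` class, the role names, and the keyword-overlap ranking are all hypothetical stand-ins (a real system would rank with embeddings and take its ACLs from an identity provider). The key idea is that filtering happens deterministically, before retrieval, so a forbidden chunk can never reach the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    department: str
    allowed_roles: frozenset  # hypothetical per-chunk ACL


def visible_chunks(chunks, user_roles):
    """Keep only chunks whose ACL shares at least one role with the user."""
    roles = set(user_roles)
    return [c for c in chunks if c.allowed_roles & roles]


def retrieve(chunks, user_roles, query, top_k=2):
    """Filter by access first, then rank by naive keyword overlap."""
    query_words = set(query.lower().split())
    candidates = visible_chunks(chunks, user_roles)
    scored = [
        (len(query_words & set(c.text.lower().split())), c) for c in candidates
    ]
    # Sort on the score only, so Chunk objects are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]


# Made-up example data.
CHUNKS = [
    Chunk("Q3 salary bands for engineering", "hr", frozenset({"hr"})),
    Chunk("Engineering on-call rotation schedule", "engineering",
          frozenset({"engineering", "hr"})),
    Chunk("Q3 revenue forecast", "finance", frozenset({"finance"})),
]

if __name__ == "__main__":
    for chunk in retrieve(CHUNKS, ["engineering"], "engineering salary bands"):
        print(chunk.text)
```

An engineer asking about salary bands only gets back the on-call chunk, because the HR chunk was removed before ranking. With a single long-context prompt containing all three documents, there is no equivalent hard boundary.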

Arguments for Long Context:

  • LLM-native retrieval is multimodal by default: It works on any data format: code, text, audio, and video. Gemini 1.5 can retrieve a key frame in a two-hour video. Building an external multimodal RAG system would be very complex, as you need to deal with different data formats. While most RAG systems have a hard time dealing with different data formats, Gemini 1.5 has solved this at the model level.

  • Simplicity: For small scale retrieval tasks, like building an MVP or simple apps, there should be no need for a complex RAG architecture. You can easily feed all data to the LLM and retrieve information.

  • Diminishing costs: As Gemini evolves, its potential to handle larger datasets at lower cost could further diminish the need for RAG in many more use cases than simple apps.

How I think about it: it’s a tradeoff

I know it’s a boring conclusion, but like many things in life, it’s a tradeoff. RAG won’t be dead soon, but my estimate is that 90% of small-scale use cases won't need it anymore. Most datasets for such projects fit in 1M tokens, and even if inference on 1M tokens is expensive, building a RAG system for a small project is usually not worth the cost.
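To make the tradeoff concrete, here is a rough break-even sketch. The function name and every number in the example are placeholders I made up for illustration, not real prices. It simply asks: after how many queries does the one-off cost of building RAG pay for itself through cheaper per-query inference?

```python
import math


def break_even_queries(rag_build_cost, long_context_cost_per_query,
                       rag_cost_per_query):
    """Queries needed before RAG is cheaper overall; None if it never is.

    All inputs are in the same (arbitrary) currency unit.
    """
    saving_per_query = long_context_cost_per_query - rag_cost_per_query
    if saving_per_query <= 0:
        return None  # RAG never recovers its build cost
    return math.ceil(rag_build_cost / saving_per_query)


if __name__ == "__main__":
    # Hypothetical figures: 5000 to build RAG, 1.0 per long-context query,
    # 0.5 per RAG query.
    print(break_even_queries(5000, 1.0, 0.5))
```

For a small project that will only ever serve a few hundred queries, a break-even point in the thousands means the RAG build cost is never recovered, which is the intuition behind the claim above.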

In addition, LLM-native retrieval works much like an internal RAG. During inference, the model keeps the key and value vectors of every token it has already processed in a Key-Value cache (KV cache). Instead of using cosine similarity to "retrieve" the most relevant chunks, self-attention scores the current token's query against the cached keys and attends to the most relevant tokens. Both approaches reuse pre-computed representations rather than recomputing them for every query.

KV caching reduces the cost of inference, but we still don’t have a rigorous cost comparison of KV caching vs external RAG systems.
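The parallel between the two mechanisms can be sketched with toy vectors. This is a simplified, single-head illustration in plain Python, not how any particular model or vector database is implemented, and the function names are my own. External RAG makes a hard top-k selection by cosine similarity, while attention assigns a soft weight to every cached key.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_top_k(query, chunk_vectors, k=1):
    """External RAG style: hard selection of the k most similar chunks."""
    def cosine(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    ranked = sorted(
        range(len(chunk_vectors)),
        key=lambda i: cosine(query, chunk_vectors[i]),
        reverse=True,
    )
    return ranked[:k]


def attention_weights(query, cached_keys):
    """Attention style: softmax over scaled dot products with cached keys."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in cached_keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, cached_keys, cached_values):
    """Weighted sum of cached values, using the attention weights."""
    weights = attention_weights(query, cached_keys)
    dim = len(cached_values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, cached_values))
        for d in range(dim)
    ]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # toy embeddings / keys
    query = [1.0, 0.0]
    print(cosine_top_k(query, vectors, k=1))
    print(attention_weights(query, vectors))
```

Both functions score the same pre-computed vectors against a query. The difference is that top-k discards everything below the cutoff, whereas attention keeps a nonzero weight on every token in the context.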

For large-scale production use cases, I think RAG will stay dominant, primarily for security control and cost reasons.

The RAM, Hard-Drive Analogy

A good way to think about this tradeoff is the analogy to memory layers in a computer. RAM is the best place to keep the data the processor needs immediately. However, since RAM is expensive, we extend it with external storage (a hard drive) that is much larger but a bit more complex to manage. For small programmes, you can load the entire thing into RAM and execute it. However, once your programme needs external files and data, you will need a hard drive.

In conclusion

My quick takes:

  • RAG will still be used for complex production systems

  • Long context models will eat up simpler/pre-production use cases

  • We have yet to see a rigorous cost comparison between native retrieval and RAG

Resources