Towards Data Science

Finding the right anchors for RAG: keyword, embedding, and TOC signals in parallel

June 24, 2026

Level: Intermediate

For: ML Engineers

✦ TL;DR

This article proposes an anchor detection approach for Retrieval-Augmented Generation (RAG) pipelines: several detectors run in parallel, and a single Large Language Model (LLM) call at the end consumes their output. Running the detectors in parallel reduces the number of LLM calls required, which lowers inference latency. The method targets large-scale document intelligence applications such as enterprise document analysis. The tradeoff is between detector count and cost: as summarized here, adding detectors lowers inference latency but increases computational cost.
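The control flow described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the article's implementation: the `find_anchors` function, the `(query, document) -> list of anchors` detector signature, the prompt wording, and the `fake_llm` stand-in are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical detector signature: (query, document) -> candidate anchors.
Detector = Callable[[str, str], List[str]]


def find_anchors(
    query: str,
    document: str,
    detectors: Dict[str, Detector],
    call_llm: Callable[[str], str],
) -> str:
    """Run all detectors concurrently, then make exactly one LLM call."""
    # None of the detectors calls the LLM; they only propose candidates.
    with ThreadPoolExecutor(max_workers=max(1, len(detectors))) as pool:
        futures = {
            name: pool.submit(fn, query, document)
            for name, fn in detectors.items()
        }
        candidates = {name: fut.result() for name, fut in futures.items()}

    # Assumed prompt layout: one line of candidates per detector.
    lines = [f"Question: {query}", "Candidate anchors by detector:"]
    for name, anchors in candidates.items():
        lines.append(f"- {name}: {', '.join(anchors) if anchors else '(none)'}")
    lines.append("Pick the anchors that best locate the answer.")

    return call_llm("\n".join(lines))  # the only LLM call in the pipeline


# Stand-ins for illustration only: fixed detector outputs and a fake LLM
# that records every prompt it receives.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "Termination"


stub_detectors: Dict[str, Detector] = {
    "keyword": lambda q, d: ["Termination", "Payment Terms"],
    "toc": lambda q, d: ["Termination"],
    "embedding": lambda q, d: [],
}

answer = find_anchors("What is the notice period?", "...", stub_detectors, fake_llm)
```

Threads are used here on the assumption that real detectors spend most of their time on I/O (index lookups, embedding service requests); CPU-bound detectors would need a process pool or a batched implementation instead.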

⚡ Key Takeaways

The approach uses 4 parallel detectors and makes 3.5x fewer LLM calls than a sequential detection method.
The detectors combine keyword-based, table-of-contents (TOC)-based, and embedding-based signals to filter out irrelevant documents.
Detector count affects latency: 4 detectors give a 2.1x reduction in latency compared to a single detector.
Integration requires a custom RAG pipeline whose detector module is modified to support parallel detection.
The approach assumes that input documents are stored in a structured format, such as a table or a database, and that a TOC is available for each document.
Tags: RAG, RETRIEVAL, INFERENCE, ENTERPRISE
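To make the three named detector types concrete, here is a toy sketch of each one operating on a document split into titled sections. Everything in it is an assumption for illustration: the stopword list, the thresholds, the bag-of-words cosine used as a stand-in for a real embedding model, and the vote-based pre-filter. The summary does not say how the article combines detector outputs before the LLM call.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Tiny illustrative stopword list (assumption, not from the article).
STOPWORDS = {"a", "the", "of", "in", "is", "for", "to", "what", "with",
             "may", "are", "shall", "how"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def keyword_detector(query: str, sections: Dict[str, str]) -> List[str]:
    """Flag sections whose body shares at least one term with the query."""
    terms = set(tokenize(query))
    return [title for title, body in sections.items() if terms & set(tokenize(body))]


def toc_detector(query: str, sections: Dict[str, str]) -> List[str]:
    """Flag sections whose TOC title shares at least one term with the query."""
    terms = set(tokenize(query))
    return [title for title in sections if terms & set(tokenize(title))]


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def embedding_detector(query: str, sections: Dict[str, str], threshold: float = 0.25) -> List[str]:
    """Bag-of-words cosine similarity, standing in for a real embedding model."""
    q = Counter(tokenize(query))
    return [t for t, body in sections.items() if cosine(q, Counter(tokenize(body))) >= threshold]


def merge_by_votes(hits: List[List[str]], min_votes: int = 2) -> List[str]:
    """Keep sections flagged by at least `min_votes` detectors (assumed rule)."""
    votes = Counter(title for detector_hits in hits for title in detector_hits)
    return [title for title, n in votes.items() if n >= min_votes]


sections = {
    "Termination": "Either party may terminate the agreement with thirty days written notice.",
    "Payment Terms": "Invoices are payable within sixty days of receipt.",
    "Confidentiality": "Each party shall keep the terms confidential.",
}
query = "What notice period in days is required for termination?"

hits = [
    keyword_detector(query, sections),
    toc_detector(query, sections),
    embedding_detector(query, sections),
]
anchors = merge_by_votes(hits)
```

In this example the keyword detector alone also flags "Payment Terms" because of the shared word "days"; requiring agreement from two detectors drops it. In a real pipeline the surviving candidates would go into the single LLM prompt.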

🔧 Tools & Libraries

PyTorch, TensorFlow

💡 Why It Matters

Large-scale document intelligence applications, such as enterprise document analysis, depend on low inference latency and high accuracy. Running detectors in parallel and making a single LLM call is a way to improve both in the RAG pipelines behind these applications.

✅ Practical Steps

Implement a custom detector module that supports parallel detection, using a library such as PyTorch or TensorFlow.
Modify the RAG pipeline to use the parallel detector module, integrating it with the existing LLM call.
Experiment with different numbers of detectors to optimize inference latency and accuracy for the specific use case.
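The last step, experimenting with detector count, can start from a small timing harness like the one below. It is a sketch under stated assumptions: the detectors are fakes that only sleep, `make_fake_detector` and `time_detection` are hypothetical helper names, and the harness measures the parallel detection stage only, not the LLM call or answer accuracy.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Detector = Callable[[str], List[str]]


def make_fake_detector(delay_s: float) -> Detector:
    """Build a placeholder detector that sleeps to simulate work."""
    def detector(query: str) -> List[str]:
        time.sleep(delay_s)
        return []
    return detector


def time_detection(detectors: List[Detector], query: str, repeats: int = 3) -> float:
    """Return the best wall-clock time for running all detectors in parallel."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
            # list() forces every detector to finish before the clock stops.
            list(pool.map(lambda fn: fn(query), detectors))
        best = min(best, time.perf_counter() - start)
    return best


# Simulated per-detector latencies in seconds (made-up values).
all_detectors = [make_fake_detector(d) for d in (0.01, 0.02, 0.015)]

# Time the detection stage with the first 1, 2, ... detectors enabled.
results: Dict[int, float] = {
    k: time_detection(all_detectors[:k], "example query")
    for k in range(1, len(all_detectors) + 1)
}

for k, seconds in results.items():
    print(f"{k} detector(s): {seconds * 1000:.1f} ms")
```

Replacing the fake detectors with real ones and recording accuracy alongside latency on a held-out question set would give the comparison this step calls for.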

Want the full story? Read the original article.

Read on Towards Data Science ↗
