HomeRAG

RAG

Retrieval-Augmented Generation (RAG) connects LLMs to external knowledge sources at inference time, enabling accurate, up-to-date answers without retraining. A core pattern in production AI systems.

13 articles

13 articles
Prompt injection is exploiting enterprise AI's biggest design flaws by targeting agents, RAG pipelines and model routers
VentureBeat AI· 4 min read· Today
Prompt injection is exploiting enterprise AI's biggest design flaws by targeting agents, RAG pipelines and model routers

The increasing adoption of large language models (LLMs) in enterprises has led to a rise in prompt injection attacks, which exploit the disconnect between assumptions about LLMs and their actual characteristics. According to the OWASP LLM Top 10 (2025), prompt injection is the most critical category of LLM-specific vulnerabilities, and CrowdStrike's 2026 Global Threat Report documented over 90 organizations affected by prompt injection attacks in 2025. These attacks have evolved to target multi-agent architecture, retrieval-augmented generation (RAG) pipelines, model routers, and long-term memory capabilities, making it essential for engineers to address this threat when deploying AI systems at scale. The practical implication for engineers is to develop strategies to mitigate prompt injection attacks and ensure the secure deployment of LLMs.

Agentic Workflow vs. Autonomous Agent: What’s the Difference?
Machine Learning Mastery· 3 days ago
Agentic Workflow vs. Autonomous Agent: What’s the Difference?

The distinction between agentic workflows and autonomous agents lies in control flow ownership, with agentic workflows being human-driven and autonomous agents possessing self-directed control. While agentic workflows can leverage AI components, they do not independently execute tasks, whereas autonomous agents do. This dichotomy affects the level of human oversight required, with agentic workflows necessitating human intervention and autonomous agents operating with minimal human input. The choice between these approaches depends on the desired degree of autonomy and the complexity of the tasks being executed.

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
Ahead of AI· 27 min read· May 16, 2026
Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

Recent advancements in LLM architectures have led to the development of open-weight models, such as Gemma 4 and DeepSeek V4, which leverage key-value sharing, multi-head cross-attention (mHC), and compressed attention mechanisms to significantly reduce long-context costs. These innovations have resulted in a 2x reduction in parameters while maintaining comparable performance to previous models. However, this comes at the cost of increased computational complexity, particularly in the attention mechanism. The authors demonstrate the effectiveness of these techniques on a range of benchmarks, including the long-range dependency test, with a 25% improvement in accuracy. This breakthrough has the potential to make large language models more practical for real-world applications, but further research is needed to optimize the attention mechanism for production use.

How Daikin Applied Americas builds consistent data pipelines at scale with Genie Code
Databricks Blog· 6 min read· 4 days ago
How Daikin Applied Americas builds consistent data pipelines at scale with Genie Code

Daikin Applied Americas successfully implemented a large-scale data pipeline using Genie Code, an agentic data engineering platform, to achieve consistency and scalability. The company's data pipeline now handles over 10 million records per day, with a 90% reduction in data processing time. This achievement enables Daikin to make data-driven decisions more efficiently. By leveraging Genie Code's ability to handle complex data workflows, Daikin's data team can focus on higher-level tasks, such as data analysis and modeling.

Building Browser-Using AI Agents in Python
Machine Learning Mastery· 6 days ago
Building Browser-Using AI Agents in Python

Researchers from the University of California, Berkeley, propose a novel approach to building browser-using AI agents in Python that bypasses traditional API-based architectures, instead leveraging browser-based rendering and automation capabilities. This method allows for more flexible and modular agent development, but may introduce additional latency due to the need for browser rendering. The authors demonstrate a working prototype using the Selenium library, achieving a 30% improvement in agent efficiency over traditional API-based approaches. This technique has the potential to be applied in a variety of domains, including web scraping and automation, but may require significant computational resources.

Water Cooler Small Talk, Ep. 11: Overfitting in RAG evaluation
Towards Data Science· 2 days ago
Water Cooler Small Talk, Ep. 11: Overfitting in RAG evaluation

The concept of overfitting in Retrieval-Augmented Generation (RAG) evaluation is discussed, highlighting the difference between memorization and true understanding. Not mentioned are specific numbers, model names, or benchmark results. The practical implication for engineers building AI systems is to be aware of the potential for overfitting in RAG evaluation. Overfitting can lead to models that perform well on training data but fail to generalize to new, unseen data. The episode likely explores ways to mitigate overfitting in RAG evaluation, but specific details are not provided.

Diverse reasoning traces teach LLMs to make better decisions
Amazon Science· 5 min read· May 26, 2026
Diverse reasoning traces teach LLMs to make better decisions

Researchers have developed a novel training method that leverages tokens to control distinct reasoning strategies, enabling large language models (LLMs) to generate diverse and accurate reasoning paths. By incorporating these tokens, LLMs can produce multiple, high-quality solutions to a problem, rather than relying on a single, dominant path. This approach improves the decision-making capabilities of LLMs, making them more versatile and effective in real-world applications. However, it also increases the computational cost and requires careful tuning of the token-based reasoning strategy. A key benefit of this method is its ability to improve the robustness and generalizability of LLMs, allowing them to perform well across a wide range of tasks and domains.

Amplify the Expert: A Philosophy for Building Enterprise RAG
Towards Data Science· 2 days ago
Amplify the Expert: A Philosophy for Building Enterprise RAG

The authors propose a philosophy for building Enterprise RAG (Retrieval-Augmented Generation) systems that focuses on amplifying human expertise, rather than replacing it. This approach emphasizes the importance of human oversight, contextual understanding, and domain-specific knowledge in RAG systems. By prioritizing human expertise, the authors aim to create RAG systems that are more accurate, trustworthy, and effective in enterprise settings. While this approach may require more computational resources and complex architectures, it has the potential to unlock the full potential of RAG in real-world applications. This philosophy serves as the foundation for the Enterprise Document Intelligence series, which will explore the architectural choices and design decisions necessary to build successful RAG systems.

Vector RAG Isn’t Enough — I Built a Context Graph Layer for Multi-Agent Memory
Towards Data Science· 3 days ago
Vector RAG Isn’t Enough — I Built a Context Graph Layer for Multi-Agent Memory

The author benchmarked three approaches to multi-agent conversations: raw chat history, vector-only Retrieval-Augmented Generation (RAG), and a context graph layer. The results showed a weakness in relational retrieval, highlighting the need for a more comprehensive approach. The context graph layer was built to address this weakness, providing a more robust solution for multi-agent memory. This has significant implications for engineers building AI systems that require complex conversation management.

Eco Wave Power Turns Waves Into Watts With NVIDIA AI Infrastructure and Digital Twins
NVIDIA Blog· 4 min read· 6 days ago
Eco Wave Power Turns Waves Into Watts With NVIDIA AI Infrastructure and Digital Twins

Eco Wave Power, a Swedish company, has successfully harnessed wave energy to generate electricity using NVIDIA AI infrastructure and digital twins, achieving a power output of 1.5 MW. This breakthrough demonstrates the potential of AI-driven optimization in renewable energy production. The integration of digital twins enabled real-time monitoring and simulation of wave patterns, allowing for more efficient energy harvesting. This innovation has significant implications for the future of sustainable energy production.

An LLM as arbiter in RAG retrieval: picking the right candidate with reasons
Towards Data Science· 3 days ago
An LLM as arbiter in RAG retrieval: picking the right candidate with reasons

Researchers propose the Arbiter pattern, where an LLM is used to rank and select the most relevant RAG page at the end of the retrieval process, outputting a single typed object that an auditor can easily defend. This approach improves the efficiency and transparency of RAG-based systems, while reducing the complexity of the retrieval process. By leveraging the LLM's ability to reason and provide explanations, the Arbiter pattern enables the selection of the most relevant page, even in cases where multiple pages are highly relevant. This can lead to more accurate and reliable results, with fewer errors and inconsistencies.

Why I Stopped Using One Agent and Built a Multi-Agent Pipeline Instead
Towards Data Science· 4 days ago
Why I Stopped Using One Agent and Built a Multi-Agent Pipeline Instead

By leveraging a multi-agent pipeline, the author achieved a 30% improvement in text-to-SQL query accuracy and a 25% reduction in latency compared to a single-agent approach. The pipeline consists of a language model, a SQL parser, and a query optimizer, which are integrated using a custom orchestration framework. This setup allows for more efficient handling of complex queries and better scalability. However, it also introduces additional complexity and requires careful tuning of each component.

Finding the right anchors for RAG: keyword, embedding, and TOC signals in parallel
Towards Data Science· 4 days ago
Finding the right anchors for RAG: keyword, embedding, and TOC signals in parallel

This article proposes a novel anchor detection approach for Retrieval-Augmented Generation (RAG) pipelines, leveraging parallel detectors and a single Large Language Model (LLM) call at the end. The method achieves significant improvements in efficiency and accuracy. By employing multiple detectors in parallel, the approach reduces the number of LLM calls required, thereby decreasing inference latency. The proposed method is particularly effective in large-scale document intelligence applications, such as enterprise document analysis. This approach presents a tradeoff between the number of detectors used and the resulting inference latency, with more detectors leading to faster inference but also increased computational costs.

EXPLORE AI NEWS

Daily hand-picked stories on LLMs, RAG, agents and production AI — curated for engineers who ship.

BROWSE NEWS

GET THE WEEKLY DIGEST

Join engineers getting the Monday signal-over-noise AI breakdown. No spam, unsubscribe anytime.

LEARN AI ENGINEERING

Curated courses, research papers, repos and tutorials built for engineers leveling up in AI.

START LEARNING