Deployment

Covering production AI deployment: inference infrastructure, latency optimization, cost management, monitoring, and best practices for shipping AI systems at scale.

15 articles

The Protocol That Cleaned Up Our Agent Architecture
Towards Data Science· Today

The authors integrated the Model Context Protocol (MCP) into their agent architecture and report a 30% reduction in code complexity and a 25% decrease in server latency. They achieved this by consolidating scattered tool definitions into a single, discoverable server that speaks MCP's standardized protocol. The result, by their account, is a system that is easier to maintain and scale.
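As a rough illustration of what "consolidating scattered tool definitions into a single, discoverable server" can look like, here is a minimal sketch in plain Python. It does not use the MCP SDK or the authors' code; `ToolRegistry`, `register`, `list_tools`, and `call_tool` are hypothetical names chosen for this example, and the registry runs in-process rather than over a network protocol.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., Any]


class ToolRegistry:
    """Hypothetical in-process registry standing in for a single tool server."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, name: str, description: str):
        """Decorator that adds a function to the registry under a unique name."""
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._tools:
                raise ValueError(f"duplicate tool name: {name}")
            self._tools[name] = Tool(name, description, fn)
            return fn
        return decorator

    def list_tools(self) -> List[Dict[str, str]]:
        """Discovery: return every tool's name and description, sorted by name."""
        ordered = sorted(self._tools.values(), key=lambda t: t.name)
        return [{"name": t.name, "description": t.description} for t in ordered]

    def call_tool(self, name: str, **arguments: Any) -> Any:
        """Invocation: look the tool up by name and call it with keyword arguments."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name].handler(**arguments)


registry = ToolRegistry()


@registry.register("add", "Add two integers.")
def add(a: int, b: int) -> int:
    return a + b


@registry.register("echo", "Return the input text unchanged.")
def echo(text: str) -> str:
    return text
```

An agent that only knows the registry can discover tools with `registry.list_tools()` and invoke one with `registry.call_tool("add", a=2, b=3)`, instead of importing each tool from wherever it happens to be defined.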

AI Agent Failure Detection and Root Cause Analysis with Strands Evals
AWS ML Blog· 12 min read· Today

The Strands Evals SDK introduces detectors that automate AI agent failure detection and root cause analysis, reducing diagnosis time from hours to minutes. Detectors analyze execution traces using large language model (LLM)-based analysis and return structured output: categorized failures, causal chains, and fix recommendations. This complements the evaluation framework by answering not only "how well did the agent do?" but also "why did it fail and how do I fix it?" The detector pipeline operates in two phases; in Phase 1 it scans each span in a session against a comprehensive failure taxonomy. For engineers building AI systems, this shortens the path from a failed run to a fix and improves overall reliability.

Build context-rich research agents with Deep Agents and Bedrock AgentCore
AWS ML Blog· 11 min read· Today

The authors demonstrate building a competitive research agent with Deep Agents and Bedrock AgentCore for isolated execution environments in multi-step AI workflows. This walkthrough showcases a pattern end to end, utilizing Bedrock AgentCore for deployment. The resulting agent achieves state-of-the-art performance on a specific dataset, outperforming baseline models by 15% in terms of accuracy. This approach enables developers to seamlessly integrate and deploy AI agents in production environments. By leveraging Bedrock AgentCore, developers can isolate and manage complex AI workflows with ease, ensuring reproducibility and scalability.

GPU Time-Slicing for Concurrent LLM Agents on Kubernetes
Towards Data Science· Yesterday

The article examines the microarchitectural costs of Kubernetes GPU time-slicing for concurrent large language model (LLM) agents. It does not cite specific numbers, model names, or benchmark results; instead it explores the systems-level implications of co-locating agentic AI workloads on Kubernetes. For engineers building AI systems, the takeaway is a clearer view of the hidden costs of GPU time-slicing, which should inform resource allocation and workload management when deploying LLM agents.

Three insights you may have missed from theCUBE’s coverage of Snowflake Summit 2026
SiliconANGLE AI· 4 days ago

The next wave of enterprise AI is shifting focus from compute and foundation models to software and data infrastructure, enabling real-world business applications. This transition involves integrating AI with existing data systems and leveraging new tools for data management and analytics. As a result, companies can now focus on developing practical AI solutions that drive business outcomes, rather than just building complex models. This shift requires a new set of skills and expertise, including data engineering, software development, and domain-specific knowledge. Key challenges include integrating AI with existing infrastructure, managing complex data pipelines, and ensuring data quality and governance.

Build a meeting prep and follow-up assistant with Amazon Quick and Cisco Webex MCP servers
AWS ML Blog· 15 min read· 3 days ago

This article demonstrates the integration of Amazon Quick and Cisco Webex MCP servers to build a custom meeting prep and follow-up assistant. The assistant uses a single prompt to gather information from prior meeting summaries, transcripts, and Vidcast highlights, providing a comprehensive review of upcoming meetings. This solution leverages the strengths of both Amazon Quick and Webex MCP to streamline meeting preparation and follow-up. However, the complexity of integrating multiple services may lead to increased development time and potential compatibility issues.

Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload
Towards Data Science· 2 days ago

The Docling tool allows for parsing PDFs locally, enabling Retrieval-Augmented Generation (RAG) without the need for cloud uploads. This approach provides cloud-grade structure for table cells, OCR, captions, and headings, all while running on the user's own machine. The practical implication for engineers building AI systems is the ability to maintain data privacy and avoid per-page billing.

Extract Data with On-demand and Batch Pipelines Dynamically
AWS ML Blog· 13 min read· 4 days ago

This article presents an intelligent document processing pipeline that utilizes both on-demand and batch inference options on Amazon Bedrock, enabling flexible document processing in terms of time and cost. The pipeline can dynamically specify large language models and prompts at the document level, allowing for the extraction of data from multiple types of documents. The on-demand pipeline processes documents one-by-one, returning results within seconds, while the batch pipeline processes multiple documents asynchronously. The pipeline uses AWS SQS FIFO queues, AWS Lambda functions, and Amazon Bedrock Prompt Management to manage prompts and extract data from documents. The practical implication for engineers building AI systems is the ability to design flexible and cost-effective document processing pipelines that can handle large volumes of documents.
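The split between the on-demand and batch paths, with model and prompt chosen per document, can be sketched as a small routing step. This is an illustrative sketch only, not the article's implementation: it makes no AWS calls, and `DocumentJob`, `urgent`, `route_jobs`, and `group_batch_by_model` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DocumentJob:
    document_id: str
    model_id: str   # hypothetical per-document model identifier
    prompt_id: str  # hypothetical per-document prompt identifier
    urgent: bool    # True when a result is needed within seconds


def route_jobs(jobs: List[DocumentJob]) -> Dict[str, List[DocumentJob]]:
    """Send urgent documents to the on-demand path and the rest to batch."""
    routed: Dict[str, List[DocumentJob]] = {"on_demand": [], "batch": []}
    for job in jobs:
        routed["on_demand" if job.urgent else "batch"].append(job)
    return routed


def group_batch_by_model(
    jobs: List[DocumentJob],
) -> Dict[Tuple[str, str], List[DocumentJob]]:
    """Group batch documents that share a model and prompt into one submission."""
    groups: Dict[Tuple[str, str], List[DocumentJob]] = {}
    for job in jobs:
        groups.setdefault((job.model_id, job.prompt_id), []).append(job)
    return groups


jobs = [
    DocumentJob("invoice-1", "model-a", "prompt-invoice", urgent=True),
    DocumentJob("contract-1", "model-b", "prompt-contract", urgent=False),
    DocumentJob("contract-2", "model-b", "prompt-contract", urgent=False),
    DocumentJob("invoice-2", "model-a", "prompt-invoice", urgent=False),
]
routed = route_jobs(jobs)
batch_groups = group_batch_by_model(routed["batch"])
```

Grouping by `(model_id, prompt_id)` is one plausible design choice: it keeps each asynchronous submission homogeneous, while on-demand documents are handled one at a time.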

How frontier teams are reinventing AI-native development
AWS ML Blog· 8 min read· 5 days ago

Frontier teams are revolutionizing AI-native development by treating AI as the foundation of how software is built, resulting in 4.5x to 10x productivity gains. At Amazon, three paths to AI-native development have been identified, including a pathfinder initiative, structured sprint, and in-situ experiment, which have led to significant increases in developer productivity and code quality. The pathfinder initiative, for example, achieved a 20x increase in individual developer productivity and delivered a project in 76 days that was originally estimated to take 30 developers 12 to 18 months. This approach has significant implications for engineers building AI systems, as it enables them to focus on high-level goals and outcomes rather than discrete tasks.

The intelligence layer emerges as the control plane for enterprise AI
SiliconANGLE AI· 5 days ago

The emergence of an "intelligence layer" as the control plane for enterprise AI enables organizations to manage the organizational context necessary for models to act reliably, addressing challenges in cost governance, data security, and accountability. This new layer integrates model management, data governance, and organizational processes, providing a unified framework for AI decision-making. By doing so, it enables enterprises to scale AI adoption while maintaining control and oversight. This shift is critical for large-scale AI deployment, where the complexity of organizational context can no longer be ignored.

Scale Robot Reinforcement Learning with NVIDIA Isaac Lab on Amazon SageMaker AI
AWS ML Blog· 24 min read· 6 days ago

NVIDIA Isaac Lab on Amazon SageMaker AI enables the scaling of robot reinforcement learning by providing a managed infrastructure for distributed training and inference. This allows robotics teams to iterate quickly during research and run production-grade training jobs without the operational burden of maintaining compute clusters. With Amazon SageMaker HyperPod, teams can achieve cluster resiliency and control, while SageMaker Training Jobs provide a flexible compute option for shorter iterative experiments. The practical implication for engineers building AI systems is that they can focus on developing robot policies rather than managing infrastructure.

Build an agentic incident triage assistant with Amazon Quick and New Relic
AWS ML Blog· 10 min read· 6 days ago

Engineers can now build an agentic incident triage assistant using Amazon Quick and New Relic, leveraging the Model Context Protocol (MCP) Server to orchestrate a response. This assistant can be integrated with existing incident triage workflows, reducing mean time to detect (MTTD) and mean time to resolve (MTTR) by 30%. The assistant can be trained on New Relic's MCP Server to learn from historical data and adapt to new patterns, enabling more accurate and efficient incident triage.

Unlocking AI flexibility in Europe: A guide to cross-region inference for EU data processing and model access
AWS ML Blog· 11 min read· Jun 8, 2026

Amazon Bedrock's Cross-Region Inference (CRIS) capability allows customers to automatically route model inference requests across multiple AWS Regions within predefined geographic boundaries, enabling more resilient generative AI applications. CRIS offers system-defined inference profiles with global or geographic scopes, optimizing model throughput at low latency overhead. For EU customers, CRIS helps meet local data protection and processing requirements, including GDPR compliance. By using CRIS, customers can take advantage of model availability and capacity across multiple Regions while ensuring security and privacy.

NVIDIA Enables the Next Era Of Physical AI Research With Agent Skills For Autonomous Vehicles, Robotics And Vision AI
NVIDIA Blog· 7 min read· Jun 3, 2026

NVIDIA is introducing Agent Skills for autonomous vehicles, robotics, and vision AI, enabling researchers to accelerate development by providing a complete workflow for physical AI research. This includes a set of pre-trained models, a simulation environment, and a suite of tools for data collection and training. By streamlining the development process, researchers can focus on higher-level tasks such as system integration and testing. This marks a significant step towards more efficient physical AI research, potentially leading to breakthroughs in autonomous vehicles and robotics.

NVIDIA Partners With Microsoft on Unified Stack for Agentic AI Deployment, From Windows Devices to Cloud to Local
NVIDIA Blog· 6 min read· Jun 2, 2026

NVIDIA and Microsoft are collaborating on a unified stack for agentic AI deployment, integrating AI models with fast hardware, secure runtimes, and a responsive data layer across Windows devices, cloud, and local environments. This stack is designed to support long-running reasoning and real-time decision-making in AI applications. The partnership aims to accelerate the development and deployment of agentic AI systems, enabling developers to build more sophisticated and responsive AI experiences. The unified stack is expected to bridge the gap between model development and deployment, reducing the complexity and increasing the efficiency of AI development.
