AWS ML Blog

Amazon SageMaker AI Async Inference now supports inline request payloads

6 min read
#deployment #inference #amazon
Level: Intermediate
For: AI Engineers
TL;DR

Amazon SageMaker AI Async Inference now accepts inline request payloads: you can send the inference input directly in the request body of the InvokeEndpointAsync API instead of uploading it to Amazon S3 before each invocation. Inline payloads can be up to 128,000 bytes. The new Body parameter is mutually exclusive with the InputLocation parameter, and the API rejects requests that set both. The change is designed to work with existing async endpoints, with no model or container changes expected. For engineers building AI systems, the practical effect is simpler client-side code and a smaller operational surface area for asynchronous inference workloads with small inputs.

⚡ Key Takeaways

  • The new Body parameter in the InvokeEndpointAsync API allows for inline payloads up to 128,000 bytes.
  • The Body and InputLocation parameters are mutually exclusive, and the API rejects requests that set both.
  • The output behavior remains unchanged, with output written to the S3 OutputLocation.
  • The feature is available in 31 commercial AWS Regions.
  • The InvokeEndpointAsync API returns synchronous ValidationError responses for size and mutual-exclusivity violations.
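
The size and mutual-exclusivity rules above can also be checked on the client before a request is sent. The following is a minimal sketch, not the AWS SDK's own validation: `build_async_request` and `MAX_INLINE_BYTES` are hypothetical names, the limit is assumed to be inclusive, and rejecting a request that sets neither field is this sketch's own choice (the summary only says the API rejects requests that set both).

```python
from typing import Optional

# Inline limit stated in the announcement; treated as inclusive here (assumption).
MAX_INLINE_BYTES = 128_000


def build_async_request(
    endpoint_name: str,
    body: Optional[bytes] = None,
    input_location: Optional[str] = None,
    content_type: str = "application/json",
) -> dict:
    """Return keyword arguments for InvokeEndpointAsync, or raise ValueError.

    Hypothetical helper: it mirrors the documented constraints locally so
    that obvious mistakes fail before a network call is made.
    """
    # Exactly one input source must be given. Body and InputLocation are
    # mutually exclusive per the announcement; requiring at least one of
    # them is an extra local check added by this sketch.
    if (body is None) == (input_location is None):
        raise ValueError("Set exactly one of body or input_location")

    request = {"EndpointName": endpoint_name, "ContentType": content_type}
    if body is not None:
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"Inline payload is {len(body)} bytes; limit is {MAX_INLINE_BYTES}"
            )
        request["Body"] = body
    else:
        request["InputLocation"] = input_location
    return request


if __name__ == "__main__":
    request = build_async_request("my-endpoint", body=b'{"inputs": "hello"}')
    print(sorted(request))  # ['Body', 'ContentType', 'EndpointName']
```

The returned dictionary is meant to be unpacked into the SDK call, for example `client.invoke_endpoint_async(**request)` with a boto3 `sagemaker-runtime` client, assuming the installed SDK version already exposes the new Body parameter. The service's own ValidationError responses remain the source of truth.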
💡 Why It Matters

For workloads with small input payloads, this feature removes the S3 upload from the request path, reducing both the operational surface area and the latency that the upload adds. Engineers building AI systems get a simpler async inference client with one less step to run and monitor.

✅ Practical Steps

  1. Send inference payloads of up to 128,000 bytes directly in the Body parameter of the InvokeEndpointAsync API.
  2. Remove the S3 upload step from your async inference workflow for those payloads.
  3. Update your client-side code to set either Body or InputLocation on each request, never both.
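
The steps above can be sketched as a small routing function that sends small payloads inline and falls back to the S3 path for larger ones. This is an illustration under stated assumptions, not the original article's code: `invoke_async`, the `async-inputs` key prefix, and the bucket argument are hypothetical, and the `Body` keyword on `invoke_endpoint_async` is taken from the announcement, so it requires an SDK version that includes it. The clients are passed in as arguments so the function can be exercised without AWS credentials.

```python
import uuid

# Inline limit stated in the announcement; treated as inclusive here (assumption).
MAX_INLINE_BYTES = 128_000


def invoke_async(
    runtime,
    s3,
    endpoint_name: str,
    payload: bytes,
    bucket: str,
    prefix: str = "async-inputs",
    content_type: str = "application/json",
) -> dict:
    """Invoke an async endpoint, inlining the payload when it is small enough.

    `runtime` is expected to behave like boto3.client("sagemaker-runtime")
    and `s3` like boto3.client("s3"). `bucket` and `prefix` are only used
    for payloads above the inline limit.
    """
    if len(payload) <= MAX_INLINE_BYTES:
        # Inline path: no S3 upload. `Body` is the new parameter described
        # in the announcement (assumed to be available in the installed SDK).
        return runtime.invoke_endpoint_async(
            EndpointName=endpoint_name,
            ContentType=content_type,
            Body=payload,
        )

    # Fallback path: upload the input first, then reference it by S3 URI.
    key = f"{prefix}/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=payload, ContentType=content_type)
    return runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        ContentType=content_type,
        InputLocation=f"s3://{bucket}/{key}",
    )
```

In both branches the result is still written to S3, so callers read the response's OutputLocation the same way as before. Only one of Body or InputLocation is ever set on a request, which keeps the call within the mutual-exclusivity rule.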

