Optimized Autonomous Inference

This article introduces autonomous inferencing on Outerbounds - a way to process large volumes of prompts or run compute-intensive agentic systems with performance that can surpass typical off-the-shelf inference solutions, as shown in the benchmarks below. When handling high token volumes, the approach also delivers greater cost-efficiency. For a tl;dr, scroll down to the summary at the end.

As of 2025, most AI applications are built on top of existing LLM APIs, whether through proprietary models like GPT and Claude, or open-weight models served by services such as AWS Bedrock and specialized providers like Together.AI. Remarkably, the cost of LLM inference has plummeted, dropping by a factor of 1,000 over the past three years.

As with any technical product, choosing a provider depends on your application's specific needs. Some use cases require the most capable reasoning models, even if that means waiting minutes for a response, while others are best served by small, fast models.

Real-time and batch inference

Interactive use cases, like chatting with a customer support agent, tend to dominate the conversation around inference performance. The primary concern in these scenarios is maintaining a natural dialogue by minimizing response latency, necessitating fast real-time inference. At the same time, human cognitive bandwidth is extremely limited, so the number of input and output tokens exchanged in each interaction is necessarily small.

One of the great promises of AI is its ability to operate without a human in the loop. Rather than engaging in dialogue, the AI can crunch through arbitrary amounts of material autonomously. Since the machine doesn’t care about latency, the focus shifts to total task completion time - for example, how long it takes to analyze every S-1 filing from the past decade or review all research papers related to a specific gene.

Historically, use cases like these have been served by batch inference APIs - for instance, the batch inference provided by AWS Bedrock, the Together.AI Batch API, or OpenAI's Batch API. These APIs are quite limited: they don't optimize for total task completion time, but instead return results within 24 hours on a best-effort basis for a fixed set of input prompts. In exchange for this relaxed level of service, you typically receive a 50% pricing discount.

Autonomous inference

Autonomous inference refers to a form of inference that operates without a human in the loop. It overlaps with batch inference but has key differences:

  1. In contrast to batch inference, the focus is on the end-to-end task completion latency without a human in the loop. This is relevant for many products, such as a deep research service that processes millions of documents on the fly, or a system that generates videos on demand. Task completion latency is also critical during development, where you want to quickly evaluate different versions of your models, data, and code.
  2. Also in contrast to batch inference, prompts may evolve during execution - a defining characteristic of an agentic system.
  3. And unlike today’s batch inference APIs, the primary concern isn't cost minimization. While cost-per-completion still matters, the rapidly declining cost of inference and GPUs, combined with the potential for tremendous value creation, makes squeezing out the last few cents of inference cost a secondary concern.

In short, if your use case resembles batch inference, it's likely better developed as an autonomous inference system to enable faster iteration and more flexibility.

Autonomous inference on Outerbounds

You can build systems leveraging autonomous inference using any existing real-time LLM API. All it takes is a simple loop that calls the API repeatedly to process prompts at scale. To improve throughput and reduce total processing time, it's often beneficial to run multiple clients concurrently.
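
For illustration, the naive approach can be as simple as the following sketch, which assumes an OpenAI-compatible endpoint; the base URL, model name, and prompts are placeholders rather than any particular provider's values:

```python
# A minimal sketch of autonomous inference over a real-time LLM API:
# iterate over a fixed set of prompts with a few concurrent clients.
# The endpoint, model name, and prompts below are placeholders.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="https://example-llm-provider/v1", api_key="...")

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="example-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    return resp.choices[0].message.content

prompts = [f"Prompt number {i}" for i in range(1000)]  # stand-in workload

# A handful of concurrent clients improves throughput over a serial loop.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(complete, prompts))
```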

This brings us to the key question and the focus of this article: can you do better than simply looping over a real-time LLM API? Many of these APIs are optimized for interactive use cases, prioritizing metrics like time-to-first-token and inter-token latency. Serverless endpoints are incredibly easy to use and offer attractive per-token pricing, but it is less clear whether they provide optimal throughput.

A clear limitation of off-the-shelf LLM APIs is that they are stateless and unaware of your workload. They simply try to handle whatever prompts you - and any concurrent users - send their way. This is suboptimal for autonomous inference, where you typically know the scale and type of prompts ahead of time and want to avoid noisy neighbors that can delay task completion.

To better serve the needs of autonomous inference on Outerbounds, we have been optimizing a solution for workload-aware inference. In contrast to workload-agnostic models with decoupled clients, we allow you to co-locate and optimize the models for your specific workloads - as depicted below:

Workload-aware autonomous inference requires two key components:

  1. Robust orchestration of tasks and models, which is something that we have been perfecting with Metaflow over the past seven years.
  2. Flexible, on-demand access to cost-effective GPU resources, which is a core feature of Outerbounds, thanks to our integration with hyperscalers as well as GPU providers such as Nebius, CoreWeave, and NVIDIA’s DGX Cloud.

A key advantage of this approach is the ability to reduce task completion latency almost arbitrarily through horizontal scaling. Unlike workload-agnostic systems that reactively scale - often hitting rate limits and scaling constraints - workload-aware scheduling provisions resources in advance and tears them down immediately after the task completes, which also contributes to cost-effectiveness.

While this may sound promising in theory, how much of a difference does it actually make? Inference providers have spent years refining their offerings, so it's fair to question whether this approach can be cost-competitive. The benchmarks below illustrate the impact.

Benchmarking autonomous inference

In this context, we're not trying to identify the highest-throughput model available today or the inference provider with the best micro-optimizations. New models are released weekly, and popular inference engines - along with commercial providers - are under constant co-evolutionary pressure to match each other’s performance.

Rather, we are interested in general patterns. Hence we chose to benchmark a representative set of providers, each with particular structural strengths and weaknesses that are likely to generalize to other providers in each category:

  • AWS Bedrock is a hyperscaler default and one of the most popular inference providers.
  • Nebius is a top-ranked neocloud with an inference service called Nebius AI Studio that benefits from the company’s vertically integrated GPU infrastructure.
  • Together.AI is a leading inference provider with a track record of solid R&D around inference.

For workload-aware inference on Outerbounds, we leverage our integration with compute pools on Nebius, giving us seamless access to on-demand H100s. While off-the-shelf inference providers benefit from economies of scale, they're constrained by the cost of keeping GPUs idle behind their serverless offerings, forcing them to tightly pack models on shared hardware.

In contrast, Outerbounds lets you spin up dedicated GPU instances tailored to your workload. We've optimized the entire startup path - from instance provisioning to container and model loading - allowing us to minimize task completion latency and maximize throughput. Perhaps surprisingly, this can result in better cost-performance than shared, token-priced offerings, as demonstrated below.
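
To make this concrete, here is a rough sketch of the pattern using standard Metaflow constructs: shard the prompts, fan out over on-demand GPU tasks, and serve the model locally in each task with an open-source engine such as vLLM. This is an illustrative sketch rather than the exact setup used in the benchmarks below; the model, shard count, and resource sizes are arbitrary examples.

```python
# Illustrative sketch: shard a prompt workload and fan it out over
# on-demand GPU tasks, each hosting its own copy of the model with vLLM.
# Not the exact benchmark setup; model, shard count, and sizes are examples.
from metaflow import FlowSpec, step, resources


class AutonomousInferenceFlow(FlowSpec):

    @step
    def start(self):
        prompts = [f"Prompt {i}" for i in range(10_000)]  # placeholder workload
        n_shards = 8  # more shards -> lower task completion latency
        self.shards = [prompts[i::n_shards] for i in range(n_shards)]
        self.next(self.infer, foreach="shards")

    @resources(gpu=1, memory=64_000)
    @step
    def infer(self):
        from vllm import LLM, SamplingParams

        # Load the model on the task's dedicated GPU and process the shard.
        llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
        outputs = llm.generate(self.input, SamplingParams(max_tokens=500))
        self.results = [o.outputs[0].text for o in outputs]
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [r for inp in inputs for r in inp.results]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    AutonomousInferenceFlow()
```

Because the GPU tasks exist only for the duration of the run, they are torn down automatically once the fan-out completes.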

Test workloads

When it comes to models and test prompts, we highlight performance across three scenarios that we commonly see in real-world AI applications:

  1. Small models and small prompts - suitable for simple use cases, such as sentiment analysis, speech-to-text, or guardrails.
  2. Medium open-weight LLMs with medium prompts - suitable for general-purpose tasks, such as document understanding.
  3. Large reasoning models with a large context, increasingly common in agentic RAG, coding, reasoning, and elaborate tasks with a long system prompt.

Our focus is on measuring the end-to-end latency of completing a non-trivial volume of prompts - at least thousands - to get a realistic sense of model throughput. We're also interested in the spread of response latencies, which reveals how predictable that throughput is. A service with decent average performance can still be problematic if it exhibits high variability. Imagine developing a system where one iteration takes 15 minutes and the next, four hours.

Since the system is autonomous, occasional latency spikes aren’t a major concern (as they might be in interactive use cases). Instead, we define latency spread as the difference between the p90 and p10 latencies across all prompts sent to the endpoint, providing a robust measure of overall consistency.
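
In code, the metric is nothing more than the gap between two percentiles of the per-prompt latencies:

```python
# Latency spread as defined above: the gap between the p90 and p10
# per-prompt completion latencies across all prompts sent to an endpoint.
import numpy as np

# Stand-in for measured per-prompt latencies, in seconds.
latencies = np.random.lognormal(mean=0.0, sigma=0.5, size=1000)

p10, p90 = np.percentile(latencies, [10, 90])
spread = p90 - p10
print(f"p10={p10:.2f}s  p90={p90:.2f}s  spread={spread:.2f}s")
```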

Results: Small models, small prompts

We start with Meta Llama 3.1 8B as a representative small model. Our prompt is

Read the title of a wikipedia article below. Is the title related to the history of Europe? Answer with one word, YES or NO

...which we process across 1000 distinct titles, using 4 concurrent clients hitting an endpoint. Note that while the prompt requires only one output token, we allow the model to output up to 500 tokens.

In the charts below, the bars represent the total task completion time - the lower the better - and the red line shows the latency spread, i.e., the difference between the p90 and p10 latencies.

Small prompt + Meta Llama 3.1 8B small model

AWS Bedrock shines in this case. We expect AWS to have spent a considerable amount of effort optimizing this common model on their proprietary hardware, Inferentia, which may contribute to the excellent performance in this particular case.

Workload-aware inference on Outerbounds handles the task effortlessly. For a small workload like this, the solution is arguably overkill - the 5-10 minute instance startup time incurred by the cloud backend introduces overhead, and the per-token cost of using a small model with minimal output is negligible. On the bright side, the p90 latency on Outerbounds is just 21 milliseconds, making it feasible to process even billions of small prompts in a reasonable timeframe (with sufficient parallelism), which can be highly valuable for use cases needing massive scale.

To double-check the performance of small models, we ran the same set of prompts with another small model, Qwen3 4B, which is available as a marketplace deployment on AWS Bedrock (but not available on Together.AI). The results show that AWS Bedrock performs consistently well with small models:

Small prompt + Qwen3 4B small model

Results: A medium model, medium prompt

Next, we up the game to a dense Qwen 2.5 72B model and increase the number of input and output tokens with the following prompt:

Summarize the article below in two paragraphs. Use at most 400 words

…followed by a Wikipedia article with around 1000 tokens. 

While the model is deployable on AWS Bedrock, the recommended instance - g5.12xlarge, equipped with four A10G GPUs - is severely underpowered for the task. AWS does offer larger GPU instances, but they come at an extremely steep cost, and, as is often the case, we were unable to get our quota increased to access them due to insufficient capacity in the region. Correspondingly, the performance on AWS Bedrock is abysmal:

Summarization task + Qwen 2.5 72B dense model

This highlights a key limitation of AWS Bedrock and likely other similar hyperscaler services. As of today, they aren’t well-positioned to offer top-tier GPUs at competitive prices, limiting their ability to run larger models cost-effectively.

Results: A large context

In recent years, LLMs have expanded to support much longer context windows, reaching up to millions of tokens. The progress has unlocked a wide range of sophisticated new applications, from code analysis and generation to in-depth research. It has also enabled new RAG patterns, where models can be prompted with a broad set of documents - as well as reasoning agents with plenty of intermediate context.

Being able to serve sophisticated agents and applications with large amounts of context is a key use case for autonomous inference. To test this pattern, we prepared a prompt with around 20k-30k tokens - far from extreme by today’s standards - and asked the model to explain the meaning of the numbers from 1 to 1000 appearing in the documents, simulating a case where the model needs to comb through a non-trivial amount of context.
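
Roughly speaking, the test prompt is assembled along these lines; the documents and exact wording here are placeholders, not the actual benchmark prompt:

```python
# Rough sketch of assembling a large-context test prompt; the documents
# and wording are placeholders, not the exact benchmark prompt.
documents = ["Placeholder document text. " * 2000 for _ in range(3)]  # ~20k-30k tokens in practice
prompt = (
    "Explain the meaning of each number from 1 to 1000 that appears "
    "in the documents below.\n\n" + "\n\n".join(documents)
)
```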

Encouraged by the performance of AWS Bedrock with small models, we ran the first test with a large context using Llama 3.1 8B:

Large context + Meta Llama 3.1 8B small model

AWS Bedrock started throttling our clients with the message "Too many tokens, please wait before trying again", failing about 30% of the requests and also producing a much higher latency spread compared to small prompts.

Highlighting the cost-performance benefits

Next, we tested the prompt with the 72B dense model - representing a realistic use case with a reasonable-sized model and a non-trivial prompt. We excluded AWS Bedrock from the benchmark due to the lack of suitable hardware to host the model reasonably.

Large context + Qwen 2.5 72B dense model

We used a serverless Qwen2.5 72B Instruct Turbo endpoint on Together.AI, which presumably should be the highest-performing variant of the model. This scenario highlights the cost benefits that are achievable when dealing with a large number of tokens: the workload includes 18.1M input tokens and produces 0.4M output tokens, for a total cost of $22.20 at Together.AI.

The same set of prompts consumed 480 H100-minutes on Outerbounds, costing $22.80 in total - based on the list prices of Nebius. In other words, at this scale, both approaches are cost-equivalent while Outerbounds delivers over 7x faster completion. For any larger workload, Outerbounds would be strictly more cost-efficient.
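
As a back-of-the-envelope check of these figures, using the per-unit rates implied by the totals above (roughly $1.20 per million tokens on the serverless endpoint and about $2.85 per H100-hour on demand):

```python
# Back-of-the-envelope cost check using rates implied by the totals above:
# roughly $1.20 per million tokens (serverless) and ~$2.85 per H100-hour.
tokens = 18.1e6 + 0.4e6                 # input + output tokens
serverless_cost = tokens / 1e6 * 1.20   # ~= $22.20
gpu_hours = 480 / 60                    # 480 H100-minutes
dedicated_cost = gpu_hours * 2.85       # ~= $22.80
print(f"serverless: ${serverless_cost:.2f}, dedicated: ${dedicated_cost:.2f}")
```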

Notably, the same model is much cheaper to run on Nebius AI Studio’s endpoints, bringing the total cost down to just $4.80. This highlights surprising discontinuities in inference pricing across providers.

One could argue for renting a dedicated GPU server or two from Together.AI to host the model. The drawback is taking on the familiar burden of managing expensive GPU infrastructure yourself. Moreover, the cost-performance ratio of workload-agnostic dedicated instances falls short of automatically scaling, workload-aware inferencing.

A reasoning agent

In the long run, a key application of autonomous inference will be AI agents processing large volumes of information without human involvement. To explore this scenario, we benchmarked a state-of-the-art open-weight reasoning model: Qwen QwQ 32B (not yet available on AWS Bedrock).

Nebius AI Studio and Together.AI perform neck and neck, although there's an over-2x price difference between the services and a significant difference in the latency spread.

Large context + Qwen QwQ 32B reasoning model

Workload-aware scheduling on Outerbounds shines with large contexts and models, delivering over 2x speedups compared to alternatives while maintaining minimal latency variance. As a rule of thumb, the more stateful your models are, whether due to large system prompts, extended context, or other characteristics, the more they benefit from workload-aware scheduling. This is exactly what autonomous inference systems of the future will require.

Summary

There's no free lunch in inference. Depending on your use case, you may want to optimize for task completion latency, cost, predictable performance, model availability, or other factors like geographic locality, privacy, or compliance.

Based on the benchmarks presented above, a few clear patterns emerge:

  • The price and performance of small models with small prompts are excellent across providers.
  • Hyperscalers like AWS have limited availability of reasonably-priced large GPUs, limiting the options with larger models.
  • Inference providers like Nebius AI Studio and Together.AI offer top-tier performance for many workloads. However, serverless offerings can show significant latency variability and surprising pricing discontinuities.
  • In all cases, Outerbounds’ workload-aware inference can significantly reduce task completion latency - accelerating development and enabling use cases involving autonomous inferencing.
  • When the workloads involve a non-trivial number of tokens, this approach can be highly cost-effective compared to per-token pricing.
  • As an added benefit, you get to own and control the models, avoiding frequent and disruptive deprecations, as well as having full control over privacy and compliance.

To see if this approach fits your use case, get started with our free trial - complete with complimentary GPU credits.

Start building today

Join our office hours for a live demo! Whether you're curious about Outerbounds or have specific questions - nothing is off limits.

