AI Gateway’s next evolution: an inference layer designed for agents
The landscape of AI models is changing rapidly: the optimal model for agentic coding today may be outdated in just a few months. This pace poses significant challenges for developers, particularly when building AI-powered agents that require multiple model calls for a single task.
The Need for Flexibility in AI Models
In real-world applications, a customer support agent might use a variety of models to handle user queries effectively: a fast, cost-effective model to classify user messages, a larger reasoning model to plan the agent’s actions, and a lightweight model to execute specific tasks. This calls for access to a diverse range of models without being financially or operationally tied to a single provider.
Moreover, the challenges intensify when constructing agents. A simple chatbot may make one inference call per user prompt, whereas an agent could chain together multiple calls to complete a single task. A delay from one slow provider can significantly increase response times, leading to a frustrating user experience. Additionally, a single failed request can trigger a cascade of downstream failures, complicating the development process.
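A back-of-the-envelope sketch (with made-up latency and reliability numbers) shows why chaining calls amplifies both delay and failure risk:

```python
# Hypothetical per-step latencies (ms) for an agent pipeline;
# the step names and numbers are illustrative, not measured.
step_latencies = {"classify": 80, "plan": 900, "execute": 120}

# Sequential calls add up, so one slow provider dominates the total.
total_ms = sum(step_latencies.values())
slowest = max(step_latencies, key=step_latencies.get)
print(total_ms, slowest)  # 1100 plan

# Failures compound the same way: at 99% per-call success,
# a 10-call chain succeeds only about 90% of the time.
chain_success = 0.99 ** 10
print(round(chain_success, 3))  # 0.904
```

The arithmetic is trivial, but it is the core of the problem: a chatbot pays these costs once per prompt, while an agent pays them once per step.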
Introducing Cloudflare’s AI Gateway
To address these challenges, Cloudflare has launched AI Gateway and Workers AI, creating a unified inference layer designed for speed and reliability. This new architecture allows developers to access various AI models through one API, streamlining the process of integrating different models into applications.
One Catalog, One Unified Endpoint
Starting today, developers can call third-party models using the same AI.run() binding already available for Workers AI. For example, switching from a Cloudflare-hosted model to one from OpenAI or Anthropic requires only a single line of code:
const response = await env.AI.run(
  'anthropic/claude-opus-4-6',
  { input: 'What is Cloudflare?' },
  { gateway: { id: "default" } }
);

For those not using Workers, REST API support will be available soon, allowing access to the full model catalog from any environment. Developers now have access to over 70 models from more than 12 providers, all through one API, simplifying the process of switching between models and managing costs.
Expanding Model Offerings
Cloudflare is excited to expand its model offerings, which now include contributions from major providers such as Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu. The platform will also incorporate image, video, and speech models, enabling the development of multimodal applications.
With AI Gateway, developers can manage their AI spending in one centralized location. Companies use an average of 3.5 models from different providers, which makes it hard to get a comprehensive view of AI usage. AI Gateway addresses this by letting users attach custom metadata to their requests, yielding a detailed cost breakdown by attributes such as user type or workflow.
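As a sketch of how metadata-based cost breakdowns work, the snippet below groups logged requests by a custom-metadata attribute and sums their cost. The log schema (`metadata`, `cost_usd`) and the example values are hypothetical, not AI Gateway’s actual log format:

```python
from collections import defaultdict

def cost_by(requests, attribute):
    """Group logged gateway requests by a custom-metadata attribute and sum cost."""
    totals = defaultdict(float)
    for req in requests:
        key = req["metadata"].get(attribute, "unknown")
        totals[key] += req["cost_usd"]
    return dict(totals)

# Hypothetical request logs with custom metadata attached at call time.
logs = [
    {"metadata": {"user_type": "free", "workflow": "support"}, "cost_usd": 0.002},
    {"metadata": {"user_type": "pro", "workflow": "support"}, "cost_usd": 0.011},
    {"metadata": {"user_type": "pro", "workflow": "coding"}, "cost_usd": 0.030},
]

print({k: round(v, 3) for k, v in cost_by(logs, "user_type").items()})
# {'free': 0.002, 'pro': 0.041}
```

The same logs can be sliced along any attribute (here, `workflow` instead of `user_type`), which is the point of tagging requests rather than maintaining separate accounts per provider.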
Bring Your Own Model
AI Gateway not only provides access to various models but also allows users to run their fine-tuned models optimized for specific use cases. By leveraging Replicate’s Cog technology, users can easily containerize their machine learning models. Cog simplifies the packaging process by requiring only a cog.yaml file that outlines dependencies and a Python file for inference code.
Example of a Cog Configuration
Here is a brief example of what a cog.yaml file might look like:
build:
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"

The accompanying predict.py file includes functions to set up the model and handle inference requests:
from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.net = torch.load("weights.pth")

    def predict(
        self,
        image: Path = Input(description="Image to enlarge"),
        scale: float = Input(description="Factor to scale image by", default=1.5),
    ) -> Path:
        """Run a single prediction on the model"""
        # ... pre-processing: convert image into input_tensor ...
        output = self.net(input_tensor)
        # ... post-processing ...
        return output

Once the model is containerized, developers can deploy it to Workers AI, where it will be served and accessed through the usual APIs.
Optimizing for Speed and Reliability
When building live agents, the speed at which the first token is delivered is crucial. Even if the total inference time is longer, reducing the time to the first token can significantly enhance the user experience. Cloudflare’s extensive network of data centers ensures that AI Gateway is positioned close to both users and inference endpoints, minimizing latency.
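The difference between time to first token and total inference time is easy to see with a stream. The sketch below uses a plain generator as a stand-in for a streaming inference response; the delays are invented for illustration:

```python
import time

def stream_tokens():
    """Stand-in for a streaming inference response (delays are made up)."""
    time.sleep(0.05)          # time to first token
    yield "Hello"
    for tok in [",", " world"]:
        time.sleep(0.02)      # subsequent tokens arrive faster
        yield tok

start = time.monotonic()
first = None
chunks = []
for tok in stream_tokens():
    if first is None:
        first = time.monotonic() - start  # the user sees output here
    chunks.append(tok)
total = time.monotonic() - start

print(f"first token after {first*1000:.0f} ms; "
      f"full reply after {total*1000:.0f} ms: {''.join(chunks)!r}")
```

A user interface that renders each chunk as it arrives feels responsive after `first`, not `total`, which is why shaving network hops off the path to the first token matters even when total generation time is unchanged.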
Additionally, Workers AI hosts open-source models specifically designed for agents, including Kimi K2.5 and real-time voice models. By calling these models through AI Gateway, developers can achieve low latency since the code and inference run on the same global network, eliminating unnecessary delays.
Built for Reliability with Automatic Failover
Reliability is just as critical. Every step in an agent’s workflow depends on the steps before it, so the system must keep working even when an individual provider fails. AI Gateway is designed with automatic failover: when a request to one provider fails or times out, it can be retried against a fallback, preventing a single outage from cascading through the agent.
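Conceptually, failover is an ordered list of providers tried until one succeeds. The sketch below is a client-side illustration of that idea with mock providers, not AI Gateway’s actual implementation or API:

```python
def run_with_failover(providers, prompt):
    """Try each (name, callable) provider in order; return the first success.

    A sketch of what a gateway's automatic failover does server-side.
    """
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            last_error = exc  # provider down or rate-limited; try the next one
    raise RuntimeError("all providers failed") from last_error

# Mock providers: the first always times out, the second answers.
def flaky(prompt):
    raise TimeoutError("provider timed out")

def healthy(prompt):
    return f"answer to {prompt!r}"

used, result = run_with_failover(
    [("primary", flaky), ("backup", healthy)], "What is Cloudflare?"
)
print(used, result)  # backup answer to 'What is Cloudflare?'
```

Doing this at the gateway rather than in each client means the retry logic, health tracking, and provider ordering live in one place instead of being reimplemented in every agent.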
Conclusion
Cloudflare’s AI Gateway represents a significant advancement in the development of AI agents. By providing a unified inference layer, access to a diverse range of models, and the ability to bring custom models, Cloudflare is empowering developers to create more efficient and reliable AI applications. The focus on speed, reliability, and cost management ensures that developers can build agents that meet the evolving demands of users.

