How to Build Redundancy for AI Models: A Practical Guide

The SLA problem nobody is talking about

Before we get into solutions, it's worth understanding just how exposed most organisations are right now. Anthropic's uptime commitments vary dramatically by plan:

Plan	SLA Commitment	Max downtime/month	Reality (90-day actual)
Standard (API default)	None	Unlimited	98.97% = ~7.5hrs
Priority Tier	99.5%	~3.6 hrs	98.97% — already breached
Enterprise (~$50K/yr)	99.99%	~4.3 mins (~52 mins/year)	Not publicly disclosed
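The downtime column follows directly from the uptime percentage. Here is a minimal sketch of the arithmetic, assuming a 30-day month (720 hours); the function name is illustrative rather than part of any provider's tooling:

```python
def max_downtime_minutes(uptime_percent: float, period_hours: float = 30 * 24) -> float:
    """Return the downtime (in minutes) that an uptime percentage permits."""
    downtime_fraction = 1 - uptime_percent / 100
    return downtime_fraction * period_hours * 60


for label, uptime in [("Priority SLA", 99.5), ("Enterprise SLA", 99.99), ("90-day actual", 98.97)]:
    minutes = max_downtime_minutes(uptime)
    print(f"{label}: {uptime}% allows {minutes:.1f} min/month ({minutes / 60:.2f} hrs)")
```

On these assumptions, 99.99% works out to about 4.3 minutes per month, or roughly 52 minutes over a full year.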

The uncomfortable reality: Most organisations building on the Claude API are on the Standard tier — which means they have no contractual uptime protection whatsoever. When Claude goes down, they have no recourse and no compensation. The only answer is to build your own resilience.

The good news: building AI redundancy is entirely achievable and does not require a massive engineering effort. The key is understanding the architectural patterns available and choosing the right approach for your organisation.

What is an AI proxy server?

Before diving into the tools, it helps to see the problem visually. Here's the difference between a non-redundant and redundant AI architecture:

⚠ Diagram 1: Current Non-Redundant Setup — Single point of failure

✓ Diagram 2: Redundant Setup with AI Proxy — Automatic failover, zero visible impact

Diagram 3: Cross-Cloud Architecture — Maximum resilience for mission-critical AI

An AI proxy server (also called an AI gateway) sits between your application and the AI providers it calls. Instead of your application talking directly to Claude's API, it talks to the proxy — which then routes the request to the appropriate AI provider based on availability, cost, or performance.

Think of it like a load balancer, but for AI models. When your primary provider goes down, the proxy automatically routes requests to your fallback provider — transparently, without your application code needing to change.

Key insight: An AI proxy decouples your application from any specific AI provider. This is the single most important architectural decision you can make to protect your organisation from AI downtime.

A proxy typically handles:

Routing — directing requests to the right provider based on rules you define
Failover — automatically switching to a backup provider when the primary fails
Load balancing — distributing requests across providers to avoid rate limits
Caching — storing responses for repeated queries to reduce cost and latency
Logging and monitoring — tracking usage, costs, and performance across all providers
Authentication — managing API keys centrally rather than scattered across applications
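To make the routing idea concrete, here is a minimal sketch of a rule that sends traffic to the cheapest provider currently marked healthy. The `Provider` class, the provider names, and the cost figures are all hypothetical; a real gateway would populate the health flag from its own checks.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    healthy: bool               # would be set by the gateway's health checks
    cost_per_1k_tokens: float   # illustrative figure, not real pricing


def choose_provider(providers: list[Provider]) -> Provider:
    """Pick the cheapest provider that is currently marked healthy."""
    available = [p for p in providers if p.healthy]
    if not available:
        raise RuntimeError("no healthy provider available")
    return min(available, key=lambda p: p.cost_per_1k_tokens)


providers = [
    Provider("primary", healthy=False, cost_per_1k_tokens=0.8),
    Provider("fallback-a", healthy=True, cost_per_1k_tokens=1.2),
    Provider("fallback-b", healthy=True, cost_per_1k_tokens=1.0),
]
print(choose_provider(providers).name)  # fallback-b
```

The application only ever calls `choose_provider`, so which vendor serves the request becomes a routing decision rather than something hard-coded.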

Open source and commercial AI proxy tools

Rather than building a proxy from scratch, the vast majority of organisations are better served by an existing open-source or commercial AI gateway. These tools are production-hardened, actively maintained, and can be deployed in days rather than months. A good proxy will handle the key capabilities you need:

Automatic failover — switching to a backup provider when the primary fails, without your application noticing
Health checks — proactively detecting provider issues before your users do
Request queuing — holding requests during outages and processing them when service resumes
Response caching — serving cached responses for repeated queries to reduce cost and latency
Unified logging — tracking usage, cost, and performance across all providers in one place
Circuit breaker pattern — stopping requests to a failed provider during a cool-down period rather than hammering it repeatedly
API key management — centralising credentials rather than scattering them across applications
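Of these capabilities, the circuit breaker is the one that is easiest to get subtly wrong. Here is a minimal sketch, assuming a threshold of three failures and a 30-second cool-down (both illustrative defaults, not values taken from any particular gateway):

```python
import time


class CircuitBreaker:
    """Stop calling a failing provider until a cool-down period has passed."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock       # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None    # None means the circuit is closed (traffic allowed)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open the circuit, or re-open after a failed trial
```

The caller checks `allow_request()` before each call to a provider and reports the outcome with `record_success()` or `record_failure()`. While the circuit is open, traffic goes straight to the fallback instead of waiting on a provider that is known to be down.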

Here are the leading options available today:

Open Source

LiteLLM

The most widely adopted open-source AI proxy. Supports 100+ LLM providers through a single unified API. Exposes an OpenAI-compatible interface, so you can switch providers by changing the model parameter. Includes load balancing, fallbacks, spend tracking, and a management UI. Can be self-hosted or used as a managed service.

Open Source

Portkey

Production-grade AI gateway with automatic failover, semantic caching, and detailed observability. Strong on governance features — per-team rate limits, audit logs, and policy enforcement. Particularly well-suited for regulated industries needing full control over what goes in and out of their AI systems.

Managed Service

AWS Bedrock

Amazon's managed AI service gives you access to Claude, Llama, Mistral, and others through a single AWS API. Native AWS integration means IAM, CloudWatch, and VPC all work out of the box. If you're already on AWS, Bedrock can remove the need for a separate proxy layer for many workloads, with regional failover available through Cross-Region Inference.

Managed Service

Azure AI Foundry

Microsoft's equivalent to Bedrock, offering access to GPT-4o, Claude, Llama, and others through a single Azure endpoint. Tight integration with Microsoft 365 and Microsoft Entra ID (formerly Azure Active Directory). Strong compliance posture for regulated industries. Well-suited for organisations already standardised on Microsoft's cloud.

Managed Service

Google Vertex AI

Google's AI platform providing access to Gemini models alongside third-party models including Claude. Native integration with Google Cloud's IAM, logging, and monitoring. Good choice for organisations standardised on GCP or using Google Workspace extensively.

Commercial SaaS

Requesty / Bifrost

Purpose-built AI gateways designed specifically for production reliability. Automatic failover in milliseconds, semantic caching, and detailed cost analytics. Lower operational overhead than self-hosting LiteLLM. Good choice for teams without dedicated platform engineering resource.

Choosing the right redundancy strategy

The right approach depends on your organisation's size, technical capability, and risk tolerance. Here's a framework for choosing:

Scenario	Recommended approach	Effort
Already on AWS	AWS Bedrock with Cross-Region Inference enabled	Low
Already on Azure	Azure AI Foundry	Low
Already on GCP	Google Vertex AI	Low
Multi-cloud or cloud-agnostic	LiteLLM self-hosted or Portkey	Medium
No platform engineering resource	Requesty or Portkey managed	Low
Regulated industry (finance, health)	Self-hosted LiteLLM or Portkey with full audit logging	Medium
Maximum control required	LiteLLM + direct API fallback bypassing cloud platform	High
Truly mission-critical AI	Cross-cloud: Bedrock (primary) + Azure AI Foundry (secondary) + direct API (tertiary)	Very High

This isn't just an Anthropic problem — it's industry-wide

To be clear: AI provider outages are not unique to Anthropic. Every major LLM provider has experienced significant downtime in 2026. This is a structural characteristic of the AI industry at its current maturity level — not a failing of any single company.

Provider	Recent incident	Impact
Claude (Anthropic)	Multiple incidents May 19, 2026. API uptime 98.97% over 90 days	Priority SLA of 99.5% already breached
ChatGPT (OpenAI)	Major outage Feb 3, 2026 — over 15,000 user reports, all services affected. Further outage April 20, 2026 affecting ChatGPT and Codex globally	52 incidents in 90 days, median duration 1hr 47mins
Gemini (Google)	Elevated error rates Feb 18, 2026 — chat history lost for users. Gemini API degraded performance April 17-18, 2026 for over 34 hours combined	Multiple incidents tracked since September 2025

The pattern is clear: No AI provider has achieved the reliability of traditional cloud infrastructure. OpenAI had 52 incidents in 90 days. Google's Gemini API was in a degraded state for roughly 34 hours across two consecutive days in April. Claude's API has not met its own Priority tier SLA commitment. The risk is not provider-specific; it is inherent to the current state of the industry. The only rational response is to build redundancy into your architecture.

This is precisely why the choice of which provider to use as your primary is less important than ensuring you have a tested fallback to a different provider on different infrastructure. When outages are this frequent across the entire industry, single-provider dependency is simply not a defensible architectural decision for any business-critical AI workload.

What happens if the cloud platform itself goes down? It is a question every architect should ask, and most don't. Routing all your AI traffic through AWS Bedrock, Azure AI Foundry, or Google Vertex AI solves the single-provider problem, but potentially introduces a new one: single-cloud platform dependency.

The good news is that AWS Bedrock is not simply a single-region service. AWS offers a feature called Cross-Region Inference — which automatically routes Bedrock requests to an available region if your primary region is degraded or unavailable. This provides meaningful resilience against the most common failure mode: a single AWS region going down.

Bedrock with Cross-Region Inference is significantly more resilient than a standard single-region setup and is sufficient for the vast majority of enterprise use cases. If you are using Bedrock, enabling Cross-Region Inference should be a baseline requirement, not an optional extra.

However, Cross-Region Inference has limits. It routes across AWS regions, but it does not protect against a broader AWS-wide incident. These are rare, but they do happen. In December 2021, a major AWS us-east-1 outage disrupted not only that region but also a number of global AWS services that depend on it. In a scenario like that, Bedrock endpoints could be affected regardless of cross-region routing.

The cascading failure scenario: Your application calls AWS Bedrock → Cross-Region Inference tries alternative regions → a broader AWS network incident affects all regions simultaneously → everything fails at once. Your application, your gateway, and your underlying models are all on the same infrastructure.

Matching your redundancy to your risk appetite

The right level of redundancy depends on how critical AI is to your operations:

Important but not critical — Bedrock with Cross-Region Inference is sufficient. The probability of an AWS-wide incident is very low.
Critical — Bedrock with Cross-Region Inference as primary, plus a direct API fallback to a provider bypassing Bedrock, giving you a path that doesn't depend on AWS at all.
Mission-critical — Full cross-cloud diversity across two or more cloud platforms, each with independent AI gateways.

True cross-cloud redundancy

For organisations where AI is genuinely mission-critical, the architecture that provides the highest level of protection looks like this:

Layer	Primary	Secondary	Tertiary
Cloud Platform	AWS	Azure	GCP
AI Gateway	AWS Bedrock	Azure AI Foundry	Vertex AI
Model	Claude via Bedrock	GPT-4o via Azure	Gemini via Vertex

Each layer sits on completely independent infrastructure, so a single cloud outage cannot cascade through the entire stack.

The pragmatic middle ground for most organisations

Full cross-cloud redundancy is complex and expensive to build and maintain — realistic only for large enterprises with significant engineering resource. For most organisations, the pragmatic approach that balances resilience with manageability is:

Primary: AWS Bedrock with Cross-Region Inference enabled — provides strong resilience against regional failures, which are the most common cause of outages
Secondary: Direct API calls to one or two providers, bypassing the cloud platform entirely — so a broader AWS incident doesn't take your fallback with it
Abstraction layer: LiteLLM or similar sits in front of both, so switching between Bedrock and a direct API call is a configuration change, not a code change

The key principle: Your primary and secondary paths must not share a single point of failure. If both paths go through the same cloud platform, the same region, or the same network provider — you don't have redundancy, you have an illusion of it.
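One way to apply this principle is to describe each request path by its failure domains and check for overlap. The sketch below assumes hypothetical path descriptions; the keys and values are illustrative and should be replaced with whatever dependencies matter in your environment (cloud, region, network provider, DNS, and so on).

```python
def shared_failure_domains(primary: dict, secondary: dict) -> dict:
    """Return the attributes on which two request paths have the same value."""
    return {
        key: primary[key]
        for key in primary.keys() & secondary.keys()
        if primary[key] == secondary[key]
    }


# Hypothetical path descriptions, for illustration only.
bedrock_path = {"cloud": "aws", "region": "us-east-1", "model_vendor": "anthropic"}
direct_path = {"cloud": "none", "region": "n/a", "model_vendor": "openai"}
same_cloud_path = {"cloud": "aws", "region": "eu-west-1", "model_vendor": "meta"}

print(shared_failure_domains(bedrock_path, direct_path))      # {}
print(shared_failure_domains(bedrock_path, same_cloud_path))  # {'cloud': 'aws'}
```

An empty result means the two paths share none of the listed dependencies. Any non-empty result names a dependency that could take both paths down at once.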

Building a multi-provider failover strategy

Regardless of which tools you choose, a sound multi-provider failover strategy follows these principles:

Define your provider hierarchy

Choose a primary provider, one or more secondary providers, and optionally a tertiary. For most organisations: Claude (primary) → GPT-4o (secondary) → Gemini (tertiary). Document this hierarchy and the criteria for switching.

Normalise your prompts

Different models respond differently to the same prompt. Test your prompts against all providers in your hierarchy and adjust so outputs are acceptable from any of them. Avoid provider-specific features in your critical path.

Set failure thresholds

Define what constitutes a failure — e.g. three consecutive 5xx errors, or error rate above 10% over 60 seconds. Don't switch providers on a single failed request, but don't wait too long either.
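The two example thresholds can be combined in a small detector. This is a sketch under stated assumptions: the class name is hypothetical, and the `min_samples` guard is an addition of ours so that a single failed request in a quiet minute does not count as a 100% error rate.

```python
import time
from collections import deque


class FailureDetector:
    """Flag a provider on N consecutive errors or a high error rate in a time window."""

    def __init__(self, max_consecutive=3, max_error_rate=0.10, window_seconds=60.0,
                 min_samples=20, clock=time.monotonic):
        self.max_consecutive = max_consecutive
        self.max_error_rate = max_error_rate
        self.window_seconds = window_seconds
        self.min_samples = min_samples   # assumption: ignore the rate on tiny samples
        self.clock = clock
        self.consecutive_errors = 0
        self.events = deque()            # (timestamp, was_error) pairs

    def _prune(self) -> None:
        # Drop events that have aged out of the sliding window.
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def record(self, was_error: bool) -> None:
        self.events.append((self.clock(), was_error))
        self.consecutive_errors = self.consecutive_errors + 1 if was_error else 0
        self._prune()

    def should_fail_over(self) -> bool:
        if self.consecutive_errors >= self.max_consecutive:
            return True
        self._prune()
        if len(self.events) < self.min_samples:
            return False
        errors = sum(1 for _, was_error in self.events if was_error)
        return errors / len(self.events) > self.max_error_rate
```

Call `record(True)` for each 5xx response or timeout and `record(False)` for each success, then consult `should_fail_over()` before routing the next request.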

Implement graceful degradation

If all providers fail, your application should degrade gracefully — queuing requests, showing a helpful message, or falling back to a non-AI alternative — rather than returning an error to the user.
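As a rough illustration, the wrapper below queues the request and returns a friendly message when the AI call fails. It assumes `call_ai` is whatever function your application already uses to reach its providers (including any failover); the queue, the message text, and the function names are hypothetical.

```python
from collections import deque

# Requests held here would be replayed by a background worker once service resumes.
pending_requests = deque()

FALLBACK_MESSAGE = (
    "Our AI assistant is temporarily unavailable. "
    "Your request has been saved and will be processed shortly."
)


def answer_or_degrade(prompt: str, call_ai) -> dict:
    """Return the AI response, or queue the request and degrade if every provider failed."""
    try:
        return {"status": "ok", "text": call_ai(prompt)}
    except Exception:
        pending_requests.append(prompt)
        return {"status": "degraded", "text": FALLBACK_MESSAGE}


def all_providers_down(prompt: str) -> str:
    """Stand-in for a call path where every provider has failed."""
    raise ConnectionError("every provider failed")


result = answer_or_degrade("Summarise this ticket", all_providers_down)
print(result["status"])  # degraded
```

The user sees a clear message rather than a stack trace, and the request is not lost.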

Monitor independently

Don't rely on provider status pages — they often lag behind actual incidents. Set up independent health checks that ping your providers directly every 30 seconds. Tools like Better Stack or Checkly make this straightforward.
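If you would rather start with something home-grown, the core of an independent health check is small. In this sketch each check is a function you supply (for example, a tiny test completion against the provider) that raises an exception on failure; the function names and the five-second budget are assumptions.

```python
import time


def probe(name: str, check, max_seconds: float = 5.0) -> dict:
    """Run one health check, recording whether it passed and how long it took."""
    started = time.perf_counter()
    try:
        check()  # expected to raise on failure
        passed = True
    except Exception:
        passed = False
    elapsed = time.perf_counter() - started
    # A response that arrives but is too slow is treated as unhealthy.
    return {"provider": name, "healthy": passed and elapsed <= max_seconds, "seconds": elapsed}


def run_health_checks(checks: dict) -> dict:
    """Probe every provider once; schedule this to run every 30 seconds."""
    return {name: probe(name, check) for name, check in checks.items()}


def ok_check() -> None:
    """Stand-in for a check that succeeds."""


def failing_check() -> None:
    """Stand-in for a check that fails."""
    raise TimeoutError("no response from provider")


results = run_health_checks({"primary": failing_check, "secondary": ok_check})
print({name: r["healthy"] for name, r in results.items()})
```

Note that `probe` measures duration but does not enforce a timeout, so the check function itself should set one on its network call.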

Test your failover regularly

An untested failover is not a failover. Run quarterly "failover drills" where you deliberately disable your primary provider and verify that traffic routes correctly to the secondary. Time the switchover and document it.

Cost considerations

Multi-provider redundancy does add cost — but less than you might think, and far less than the cost of downtime.

Dual API keys — maintaining accounts with two providers costs nothing if you're only paying per token used
Proxy infrastructure — LiteLLM self-hosted runs comfortably on a small VM costing ~$20-50/month
Managed gateways — Portkey and similar services typically cost $50-200/month for most usage levels
Cloud platforms — AWS Bedrock, Azure AI Foundry, and Vertex AI add a small markup (typically 10-20%) on top of base model costs in exchange for the managed reliability

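The maths: If your business generates £10,000/day in AI-assisted revenue and experiences 5 hours of downtime per month (a conservative figure, given that 98.97% uptime implies closer to 7.5 hours), that's approximately £2,000/month in lost revenue, assuming revenue is spread evenly across the day (£10,000 ÷ 24 × 5 ≈ £2,083). A £100/month proxy that avoids most of that downtime pays for itself roughly 20 times over.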
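To run the same calculation with your own figures, a minimal sketch follows. It assumes revenue is spread evenly over 24 hours, which will understate the cost if your traffic is concentrated in business hours; the function name is illustrative.

```python
def monthly_downtime_cost(daily_revenue: float, downtime_hours: float) -> float:
    """Estimate the revenue at risk from a given number of downtime hours per month."""
    hourly_revenue = daily_revenue / 24
    return hourly_revenue * downtime_hours


cost = monthly_downtime_cost(daily_revenue=10_000, downtime_hours=5)
print(f"Estimated cost of downtime: £{cost:,.0f}/month")
```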

What about open-source models as a fallback?

For organisations that want the ultimate fallback — one that doesn't depend on any external provider — running an open-source model locally or on your own infrastructure is worth considering.

Models like Llama 3.1, Mistral, and Qwen can run on commodity GPU hardware and provide a genuine zero-dependency fallback. They won't match frontier models on complex tasks, but for many operational workflows they're entirely sufficient.

This approach makes particular sense for:

Organisations with strict data residency requirements
Financial services and healthcare where external API calls raise compliance concerns
High-volume workloads where the economics of per-token pricing become significant
Organisations that classify AI as genuinely mission-critical and need maximum resilience

The governance layer

Redundancy isn't just a technical problem — it's a governance one. Your proxy or gateway is also the right place to enforce organisational AI policies:

Data classification — block requests containing PII or confidential data from being sent to external providers
Content filtering — enforce acceptable use policies at the gateway layer rather than relying on each application to implement them
Cost controls — set spend limits per team, per application, or per time period
Audit logging — log every request and response for compliance and incident investigation
Access control — manage which teams and applications can call which models
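A sketch of how two of these policies might be enforced at the gateway is shown below. The patterns are deliberately simplistic and only illustrate where the check sits; real PII detection needs far more than two regular expressions, and the function and team names are hypothetical.

```python
import re

# Simplistic patterns for illustration only.
PII_PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card-like number": re.compile(r"\b\d{16}\b"),
}


def check_request(prompt: str, team: str, spend_so_far: dict, spend_limits: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for a request passing through the gateway."""
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            return False, f"blocked: prompt appears to contain {label}"
    # Teams without a configured limit are treated as unlimited in this sketch.
    if spend_so_far.get(team, 0.0) >= spend_limits.get(team, float("inf")):
        return False, f"blocked: {team} has reached its spend limit"
    return True, "allowed"


print(check_request("Email jane@example.com the report", "ops", {}, {}))
print(check_request("Summarise Q3 results", "ops", {"ops": 20.0}, {"ops": 100.0}))
```

Because every request passes through the same function, the decision and its reason can also be written to the audit log in one place.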

A well-configured AI gateway isn't just a reliability tool — it's a central point of control for your entire AI estate. This is what mature AI governance looks like in practice.

Getting started today

If you're currently calling an AI provider's API directly with no proxy or fallback in place, here's the minimum viable action plan:

This week — sign up for a second AI provider (OpenAI if you're using Claude, or vice versa). Cost: zero until you use it.
This month — implement a basic failover wrapper around your most critical AI calls. Even a simple try/except that switches providers is infinitely better than nothing.
This quarter — evaluate LiteLLM or a managed gateway for your use case and migrate your AI calls through it. Add independent monitoring.
Ongoing — run quarterly failover drills. Review your provider hierarchy as the model landscape evolves.
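For the "this month" step, the simple try/except wrapper can be as small as the sketch below. The `primary` and `secondary` arguments are assumed to be functions you write around each vendor's SDK; the stand-in functions here exist only so the example runs on its own.

```python
import logging

logger = logging.getLogger("ai_failover")


def call_with_fallback(prompt: str, primary, secondary) -> str:
    """Call the primary provider; on any error, try the secondary once."""
    try:
        return primary(prompt)
    except Exception as exc:
        logger.warning("Primary provider failed (%s); switching to secondary", exc)
        return secondary(prompt)


def flaky_primary(prompt: str) -> str:
    """Stand-in for a provider call that is currently failing."""
    raise ConnectionError("503 Service Unavailable")


def working_secondary(prompt: str) -> str:
    """Stand-in for a provider call that succeeds."""
    return f"[secondary] response to: {prompt}"


print(call_with_fallback("Draft a reply to this customer", flaky_primary, working_secondary))
```

If the secondary also fails, its exception propagates to the caller, so the application should still handle that case.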

Need help building your AI resilience strategy?

AI Bods helps organisations design and implement AI architectures that are built to last — with redundancy, governance, and human oversight at their core. Get in touch to discuss your situation.

Also download our free guide — AI-First Without the Risk — for the complete 8-point framework for AI business continuity, including how to classify your AI dependencies and build your incident response plan.
