Building Resilient AI Systems: LLM Fallback Strategies

How I built multi-layer fallback mechanisms to handle LLM failures in production AI systems.

The Problem

When you're running AI agents in production, LLM failures are inevitable. Rate limits, gateway timeouts, service unavailability—these things happen. The question is: does your system gracefully recover, or does it crash and burn?

This weekend, I faced this head-on. My telemetry showed 13 failed tasks on February 15th alone. Some were transient failures that a simple retry would have recovered; others needed a fallback to a different model entirely.

Real Data from Production

Here's what my telemetry captured over the weekend:

Metric               Feb 15    Feb 16
Total Tasks          169       8
Completed            156       6
Failed               13        1
Success Rate         92.3%     75%
LLM Calls            ~180      84
LLM-Level Failures   Several   0

The interesting part: after implementing fallback mechanisms, Feb 16 showed 0 LLM-level failures. The one failed task was due to an application-level error (AttributeError), not an LLM issue.

The Solution: Multi-Layer Fallback

I implemented fallback at two levels:

1. Provider-Level Retry (LiteLLM)

The first line of defense is retrying transient errors within the same model:

max_attempts = 3  # initial try + up to 2 retries
last_error: Exception | None = None
llm_timeout = 600  # 10 minutes max per LLM call

for attempt in range(1, max_attempts + 1):
    try:
        response = await asyncio.wait_for(acompletion(**kwargs), timeout=llm_timeout)
        parsed = self._parse_response(response)
        parsed.meta = {
            "resolved_model": model,
            "attempts": attempt,
            "retries": attempt - 1,
        }
        return parsed
    except asyncio.TimeoutError:
        last_error = TimeoutError(f"LLM call timed out after {llm_timeout}s")
        break  # Don't retry timeouts
    except Exception as e:
        last_error = e
        if attempt >= max_attempts or not self._is_retryable_error(e):
            break
        backoff = min(4.0, 0.5 * (2 ** (attempt - 1)))  # 0.5s, 1s, 2s, ... capped at 4s
        await asyncio.sleep(backoff)

Key decisions here:

  • Exponential backoff: 0.5s, then 1s between retries, doubling each time (see the sketch after this list)
  • Capped at 4 seconds: Prevents runaway delays
  • Timeouts don't retry: A 10-minute timeout means something is fundamentally wrong
  • Selective retry: Only retry transient errors (429 rate limits, 5xx server errors, connection issues)
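
For reference, here is the delay schedule the capped formula produces; with max_attempts = 3, only the first two delays (0.5s and 1s) are ever actually slept:

# Capped exponential backoff: min(4.0, 0.5 * 2 ** (attempt - 1))
schedule = [min(4.0, 0.5 * (2 ** (attempt - 1))) for attempt in range(1, 6)]
print(schedule)  # [0.5, 1.0, 2.0, 4.0, 4.0]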

2. Model-Level Fallback (Agent Loop)

If the primary model fails completely, fall back to alternatives:

def _next_fallback_model(self, already_tried: set[str]) -> str | None:
    """Return the next untried model from [primary, *fallback_models], or None."""
    primary = self.model_routing.default_chat_model.strip() or self.model
    if primary not in already_tried:
        return primary
    for m in self.model_routing.fallback_models:
        if m.strip() and m.strip() not in already_tried:
            return m.strip()
    return None
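
The loop that consumes this helper isn't shown above; a minimal sketch of how it can drive model fallback looks roughly like this (call_llm is a hypothetical stand-in for the retry-wrapped provider call, and the snippet assumes it runs inside the async agent loop):

# Sketch: keep asking for the next untried model until one succeeds.
already_tried: set[str] = set()
last_error: Exception | None = None

while (model := self._next_fallback_model(already_tried)) is not None:
    already_tried.add(model)
    try:
        result = await call_llm(model=model)  # hypothetical call; retries happen inside
        break  # success: stop falling back
    except Exception as e:
        last_error = e  # record and move on to the next configured model
else:
    # Every model in [primary, *fallback_models] failed.
    raise RuntimeError("All configured models failed") from last_error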

This is configured in my routing config:

{
  "default_chat_model": "anthropic/claude-opus-4-5",
  "fallback_models": [
    "openai/gpt-4o",
    "google/gemini-2.0-flash"
  ]
}
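
On the Python side this maps onto a small routing object; the following is a simplified sketch, not the actual nanobot/config/schema.py:

from dataclasses import dataclass, field

@dataclass
class ModelRouting:
    """Simplified sketch: a primary chat model plus an ordered list of fallbacks."""
    default_chat_model: str = ""
    fallback_models: list[str] = field(default_factory=list)

routing = ModelRouting(
    default_chat_model="anthropic/claude-opus-4-5",
    fallback_models=["openai/gpt-4o", "google/gemini-2.0-flash"],
)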

What Errors Are Retryable?

Not all errors should be retried. Here's my detection logic:

@staticmethod
def _is_retryable_error(err: Exception) -> bool:
    text = str(err).lower()
    if any(
        token in text
        for token in (
            "server disconnected",
            "connection reset",
            "connection aborted",
            "connection closed",
            "timeout",
            "timed out",
            "temporarily unavailable",
            "internalservererror",
            "service unavailable",
            "bad gateway",
            "gateway timeout",
            "hosted_vllmexception",
            "429",
            "502",
            "503",
            "504",
        )
    ):
        return True
    status = getattr(err, "status_code", None)
    if isinstance(status, int) and status in {429, 500, 502, 503, 504}:
        return True
    return False

The key insight: authentication errors, validation errors, and content policy violations should NOT be retried. They'll fail again. Only transient infrastructure issues benefit from retries.
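
A quick sanity check of the classifier, using Provider as a hypothetical name for the class that hosts the staticmethod above:

# Provider is a placeholder name for the class defining _is_retryable_error.
assert Provider._is_retryable_error(Exception("503 Service Unavailable")) is True
assert Provider._is_retryable_error(Exception("Connection reset by peer")) is True
assert Provider._is_retryable_error(Exception("Invalid API key provided")) is False

class RateLimitError(Exception):
    status_code = 429  # picked up by the status_code branch

assert Provider._is_retryable_error(RateLimitError("slow down")) is True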

The Results

After implementing these fallbacks:

  • Feb 15 (before): 13 failed tasks, several due to LLM issues
  • Feb 16 (after): 1 failed task, 0 LLM-level failures

The remaining failure was an application bug (AttributeError on _email_context_until), which is now tracked in PR #4 with a fix ready.

Commit Reference

This was implemented in commit 72d4bb7:

feat: Implement LLM fallback mechanisms for agents and subagents
to retry with alternative models upon LLM errors.

 nanobot/agent/loop.py     | 102 +++++++++++++++++++++++++++++++----
 nanobot/agent/subagent.py |  43 ++++++++++++---
 nanobot/config/schema.py  |   1 +
 nanobot/providers/base.py |   5 ++
 4 files changed, 135 insertions(+), 16 deletions(-)

Lessons Learned

  1. Layer your fallbacks: Provider retry + model fallback = comprehensive coverage
  2. Be selective about retries: Not all errors are transient
  3. Track everything: Telemetry proved the fix worked
  4. Exponential backoff with caps: Aggressive enough to matter, bounded enough to not hang

The system is now resilient enough that when an LLM provider has issues, it automatically tries alternatives. Users don't notice the difference.