Building Resilient AI Systems: LLM Fallback Strategies

How I built multi-layer fallback mechanisms to handle LLM failures in production AI systems.

The Problem

When you're running AI agents in production, LLM failures are inevitable. Rate limits, gateway timeouts, service unavailability—these things happen. The question is: does your system gracefully recover, or does it crash and burn?

This weekend, I faced this head-on. My telemetry showed 13 failed tasks on February 15th alone. Some were transient failures that a simple retry would have recovered; others needed a fallback to a different model entirely.

Real Data from Production

Here's what my telemetry captured over the weekend:

Metric               Feb 15    Feb 16
Total Tasks          169       8
Completed            156       6
Failed               13        1
Success Rate         92.3%     75%
LLM Calls            ~180      84
LLM-Level Failures   Several   0

The interesting part: after implementing fallback mechanisms, Feb 16 showed 0 LLM-level failures. The one failed task was due to an application-level error (AttributeError), not an LLM issue.

The Solution: Multi-Layer Fallback

I implemented fallback at two levels:

1. Provider-Level Retry (LiteLLM)

The first line of defense is retrying transient errors within the same model:

max_attempts = 3  # initial try + up to 2 retries
last_error: Exception | None = None
llm_timeout = 600  # 10 minutes max per LLM call

for attempt in range(1, max_attempts + 1):
    try:
        response = await asyncio.wait_for(acompletion(**kwargs), timeout=llm_timeout)
        parsed = self._parse_response(response)
        parsed.meta = {
            "resolved_model": model,
            "attempts": attempt,
            "retries": attempt - 1,
        }
        return parsed
    except asyncio.TimeoutError:
        last_error = TimeoutError(f"LLM call timed out after {llm_timeout}s")
        break  # Don't retry timeouts
    except Exception as e:
        last_error = e
        if attempt >= max_attempts or not self._is_retryable_error(e):
            break
        backoff = min(4.0, 0.5 * (2 ** (attempt - 1)))  # 0.5s, 1s, 2s, ... capped at 4s
        await asyncio.sleep(backoff)

Key decisions here:

  • Exponential backoff: 0.5s, then 1s between retries, doubling each time (see the sketch after this list)
  • Capped at 4 seconds: Prevents runaway delays
  • Timeouts don't retry: A 10-minute timeout means something is fundamentally wrong
  • Selective retry: Only retry transient errors (429 rate limits, 5xx server errors, connection issues)
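
For reference, here is the delay schedule the capped formula produces; with max_attempts = 3, only the first two delays (0.5s and 1s) are ever actually slept:

# Capped exponential backoff: min(4.0, 0.5 * 2 ** (attempt - 1))
schedule = [min(4.0, 0.5 * (2 ** (attempt - 1))) for attempt in range(1, 6)]
print(schedule)  # [0.5, 1.0, 2.0, 4.0, 4.0]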

2. Model-Level Fallback (Agent Loop)

If the primary model fails completely, fall back to alternatives:

def _next_fallback_model(self, already_tried: set[str]) -> str | None:
    """Return the next untried model from [primary, *fallback_models], or None."""
    primary = self.model_routing.default_chat_model.strip() or self.model
    if primary not in already_tried:
        return primary
    for m in self.model_routing.fallback_models:
        if m.strip() and m.strip() not in already_tried:
            return m.strip()
    return None
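
The loop that consumes this helper isn't shown above; a minimal sketch of how it can drive model fallback looks roughly like this (call_llm is a hypothetical stand-in for the retry-wrapped provider call, and the snippet assumes it runs inside the async agent loop):

# Sketch: keep asking for the next untried model until one succeeds.
already_tried: set[str] = set()
last_error: Exception | None = None

while (model := self._next_fallback_model(already_tried)) is not None:
    already_tried.add(model)
    try:
        result = await call_llm(model=model)  # hypothetical call; retries happen inside
        break  # success: stop falling back
    except Exception as e:
        last_error = e  # record and move on to the next configured model
else:
    # Every model in [primary, *fallback_models] failed.
    raise RuntimeError("All configured models failed") from last_error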

This is configured in my routing config:

{
  "default_chat_model": "anthropic/claude-opus-4-5",
  "fallback_models": [
    "openai/gpt-4o",
    "google/gemini-2.0-flash"
  ]
}
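
On the Python side this maps onto a small routing object; the following is a simplified sketch, not the actual nanobot/config/schema.py:

from dataclasses import dataclass, field

@dataclass
class ModelRouting:
    """Simplified sketch: a primary chat model plus an ordered list of fallbacks."""
    default_chat_model: str = ""
    fallback_models: list[str] = field(default_factory=list)

routing = ModelRouting(
    default_chat_model="anthropic/claude-opus-4-5",
    fallback_models=["openai/gpt-4o", "google/gemini-2.0-flash"],
)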

What Errors Are Retryable?

Not all errors should be retried. Here's my detection logic:

@staticmethod
def _is_retryable_error(err: Exception) -> bool:
    text = str(err).lower()
    if any(
        token in text
        for token in (
            "server disconnected",
            "connection reset",
            "connection aborted",
            "connection closed",
            "timeout",
            "timed out",
            "temporarily unavailable",
            "internalservererror",
            "service unavailable",
            "bad gateway",
            "gateway timeout",
            "hosted_vllmexception",
            "429",
            "502",
            "503",
            "504",
        )
    ):
        return True
    status = getattr(err, "status_code", None)
    if isinstance(status, int) and status in {429, 500, 502, 503, 504}:
        return True
    return False

The key insight: authentication errors, validation errors, and content policy violations should NOT be retried. They'll fail again. Only transient infrastructure issues benefit from retries.
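
A quick sanity check of the classifier, using Provider as a hypothetical name for the class that hosts the staticmethod above:

# Provider is a placeholder name for the class defining _is_retryable_error.
assert Provider._is_retryable_error(Exception("503 Service Unavailable")) is True
assert Provider._is_retryable_error(Exception("Connection reset by peer")) is True
assert Provider._is_retryable_error(Exception("Invalid API key provided")) is False

class RateLimitError(Exception):
    status_code = 429  # picked up by the status_code branch

assert Provider._is_retryable_error(RateLimitError("slow down")) is True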

The Results

After implementing these fallbacks:

  • Feb 15 (before): 13 failed tasks, several due to LLM issues
  • Feb 16 (after): 1 failed task, 0 LLM-level failures

The remaining failure was an application bug (AttributeError on _email_context_until), which is now tracked in PR #4 with a fix ready.

Commit Reference

This was implemented in commit 72d4bb7:

feat: Implement LLM fallback mechanisms for agents and subagents
to retry with alternative models upon LLM errors.

 nanobot/agent/loop.py     | 102 +++++++++++++++++++++++++++++++----
 nanobot/agent/subagent.py |  43 ++++++++++++---
 nanobot/config/schema.py  |   1 +
 nanobot/providers/base.py |   5 ++
 4 files changed, 135 insertions(+), 16 deletions(-)

Lessons Learned

  1. Layer your fallbacks: Provider retry + model fallback = comprehensive coverage
  2. Be selective about retries: Not all errors are transient
  3. Track everything: Telemetry proved the fix worked
  4. Exponential backoff with caps: Aggressive enough to matter, bounded enough to not hang

The system is now resilient enough that when an LLM provider has issues, it automatically tries alternatives. Users don't notice the difference.