
Gemini API: Why Rate Limits Do Not Guarantee Execution
RESOURCE_EXHAUSTED and MODEL_OVERLOADED errors happen even under quota. Learn why Gemini uses best-effort execution and how to handle it with BullMQ retries.
You're under your rate limits. Your dashboard shows plenty of quota remaining. Yet your Gemini API calls fail with RESOURCE_EXHAUSTED or MODEL_OVERLOADED. What's happening?
The uncomfortable truth: staying within rate limits doesn't guarantee your requests will succeed. Google's standard Gemini API tiers operate on a best-effort basis, meaning capacity is shared across all users and your requests compete for resources in real-time.
The Two Errors You'll See
429 RESOURCE_EXHAUSTED
This error indicates your project exceeded a rate limit or quota — but here's the catch: limits are applied at the project level, not per API key. Multiple applications sharing a project all count toward the same limits. More frustratingly, this error can occur even when your metrics show available quota, because capacity is allocated dynamically.
503 MODEL_OVERLOADED
This is a server-side error meaning Google's infrastructure is under heavy load. Users report hitting it even while staying well within documented limits (for example, the free tier's 15 requests/minute and 1M tokens/minute). During peak periods, some developers see 30-80% of requests returning 503.
The error message is straightforward: "The model is overloaded. Please try again later." But "later" isn't defined, and there's no retry-after header to guide you.
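For reference, both errors arrive in Google's standard API error envelope. The exact fields can vary by endpoint and SDK version, so treat the shape below as an assumption to verify against real responses rather than a contract:
// Approximate shape of the Gemini API error body (assumed, not guaranteed)
interface GeminiApiErrorBody {
  error: {
    code: number; // HTTP status, e.g. 429 or 503
    message: string; // e.g. "The model is overloaded. Please try again later."
    status: string; // e.g. "RESOURCE_EXHAUSTED" or "UNAVAILABLE"
  };
}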
Why This Happens: Best-Effort Execution
Google's Standard Pay-As-You-Go documentation states it plainly:
"Specified rate limits are not guaranteed and actual capacity may vary."
Standard tiers use dynamic shared quota — capacity is distributed among all users in real-time based on demand. When Gemini experiences high traffic globally, your requests compete with everyone else's. Your quota represents a ceiling, not a guarantee.
This is fundamentally different from Azure OpenAI's provisioned capacity or AWS Bedrock's on-demand pricing, where staying within limits typically ensures execution.
The Enterprise Alternative: Provisioned Throughput
For workloads that cannot tolerate failures, Google offers Provisioned Throughput:
- Dedicated capacity measured in GSUs (generative AI scale units)
- Strict SLAs with guaranteed tokens per second
- No competition with other users for resources
- Monthly commitment with pre-purchased capacity
The trade-off is cost. Provisioned Throughput requires upfront commitment and makes sense only for high-volume, mission-critical workloads. For many projects — especially early-stage SaaS products or variable workloads — the economics don't justify it.
The Practical Solution: Retry at the Queue Level
If enterprise pricing isn't justified, you need infrastructure that treats these errors as expected behavior rather than exceptions. Here's how we handle it with BullMQ.
Option 1: Simple Exponential Backoff
The straightforward approach uses BullMQ's built-in exponential backoff with carefully chosen delays:
// workers/gemini-worker.ts
import { Worker, Job } from 'bullmq';
import IORedis from 'ioredis';
import { GoogleGenAI } from '@google/genai';
// Shared Redis connection (in a real project, export this from its own module).
// BullMQ workers require maxRetriesPerRequest: null on their connection.
const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null,
});
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const RETRYABLE_ERRORS = ['RESOURCE_EXHAUSTED', 'MODEL_OVERLOADED', 'The model is overloaded', '503', '429'];
function isRetryableGeminiError(error: Error): boolean {
const message = error.message || '';
return RETRYABLE_ERRORS.some((pattern) => message.includes(pattern));
}
const worker = new Worker(
'gemini-tasks',
async (job: Job) => {
try {
const response = await ai.models.generateContent({
model: 'gemini-2.0-flash',
contents: job.data.prompt,
});
return { success: true, result: response.text };
} catch (error) {
if (isRetryableGeminiError(error as Error)) {
// Let BullMQ handle the retry with exponential backoff
throw error;
}
// Non-retryable errors fail immediately
throw new Error(`Permanent failure: ${(error as Error).message}`);
}
},
{
connection: redis,
concurrency: 3, // Limit parallel requests to reduce contention
},
);
// queues/gemini-queue.ts
import { Queue } from 'bullmq';
export const geminiQueue = new Queue('gemini-tasks', {
connection: redis, // the same shared IORedis connection used by the worker
defaultJobOptions: {
attempts: 5,
backoff: {
type: 'exponential',
delay: 60_000, // Start with 60 seconds, not 10
},
removeOnComplete: { count: 1000 },
removeOnFail: { count: 5000 },
},
});
With five total attempts and exponential backoff starting at 60 seconds, retries happen after delays of roughly 1, 2, 4, and 8 minutes. This gives Gemini's infrastructure meaningful time to recover.
Why not 10 seconds? A 10-second retry makes no sense for capacity errors. When Google's infrastructure is saturated, it doesn't recover in 10 seconds — you're just adding noise to an already overwhelmed system and wasting your own rate limit budget on doomed requests.
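For completeness, here is a minimal producer sketch that fits the setup above; it enqueues a prompt and waits for the final result, so callers never see the intermediate failures. The file path and function name are illustrative:
// producer.ts (illustrative)
import { QueueEvents } from 'bullmq';
import { geminiQueue } from './queues/gemini-queue';
// Reuses the same shared `redis` connection as the worker and queue
const queueEvents = new QueueEvents('gemini-tasks', { connection: redis });
async function generateText(prompt: string): Promise<string> {
  const job = await geminiQueue.add('generate', { prompt });
  // Resolves once the job completes, including after any retries
  const completed = await job.waitUntilFinished(queueEvents);
  return completed.result;
}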
Option 2: Dynamic Error-Aware Backoff
For more sophisticated handling, BullMQ's custom backoff strategies let you define error-aware retry logic. Different error types warrant different strategies:
- MODEL_OVERLOADED (503): Server capacity issue — clears faster as load balancers redistribute traffic
- RESOURCE_EXHAUSTED (429): Rate limit hit — needs longer delays for quota windows to reset
First, define typed errors:
// errors/gemini-errors.ts
export class GeminiResourceExhaustedError extends Error {
constructor(message: string) {
super(message);
this.name = 'GeminiResourceExhaustedError';
}
}
export class GeminiModelOverloadedError extends Error {
constructor(message: string) {
super(message);
this.name = 'GeminiModelOverloadedError';
}
}
export function classifyGeminiError(error: Error): Error {
const message = error.message || '';
if (message.includes('RESOURCE_EXHAUSTED') || message.includes('429')) {
return new GeminiResourceExhaustedError(message);
}
if (message.includes('MODEL_OVERLOADED') || message.includes('503') || message.includes('overloaded')) {
return new GeminiModelOverloadedError(message);
}
return error; // Non-retryable
}
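A quick sanity check of the classifier (the error messages below are made up for illustration):
classifyGeminiError(new Error('429 RESOURCE_EXHAUSTED: quota exceeded')).name;
// => 'GeminiResourceExhaustedError'
classifyGeminiError(new Error('503 The model is overloaded. Please try again later.')).name;
// => 'GeminiModelOverloadedError'
classifyGeminiError(new Error('400 INVALID_ARGUMENT: request contains an invalid argument')).name;
// => 'Error' (passed through unchanged, so it will not be retried)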
Then implement the custom backoff:
// workers/gemini-worker.ts
import { Worker, Job } from 'bullmq';
import { GeminiResourceExhaustedError, GeminiModelOverloadedError, classifyGeminiError } from '../errors/gemini-errors';
// `redis` is the shared connection from Option 1; `callGeminiAPI` wraps the
// ai.models.generateContent call shown there.
const worker = new Worker(
'gemini-tasks',
async (job: Job) => {
try {
const result = await callGeminiAPI(job.data.prompt);
return { success: true, result };
} catch (error) {
// Classify the error so backoff strategy can handle it appropriately
throw classifyGeminiError(error as Error);
}
},
{
connection: redis,
concurrency: 3,
settings: {
// Custom backoff that adapts to error type
backoffStrategy: (attemptsMade: number, type: string, err: Error, job: Job) => {
// MODEL_OVERLOADED (503): Server is overwhelmed
// These clear faster — use shorter delays with jitter
if (err instanceof GeminiModelOverloadedError) {
const baseDelay = 30_000; // 30 seconds base
const jitter = Math.random() * 10_000; // 0-10s jitter
return baseDelay * attemptsMade + jitter;
// Retry delays: ~30s, ~60s, ~90s, ~120s (plus 0-10s jitter)
}
// RESOURCE_EXHAUSTED (429): Quota/rate limit hit
// These take longer to reset — use aggressive backoff
if (err instanceof GeminiResourceExhaustedError) {
const baseDelay = 60_000; // 60 seconds base
return Math.min(baseDelay * Math.pow(2, attemptsMade - 1), 600_000);
// Retry delays: ~1m, ~2m, ~4m, ~8m (capped at 10 minutes)
}
// Unknown error — fail immediately, do not retry
return -1;
},
},
},
);
// queues/gemini-queue.ts
import { Queue } from 'bullmq';
export const geminiQueue = new Queue('gemini-tasks', {
connection: redis,
defaultJobOptions: {
attempts: 5,
backoff: {
type: 'custom', // Uses the backoffStrategy defined in worker settings
},
removeOnComplete: { count: 1000 },
removeOnFail: { count: 5000 },
},
});
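Individual jobs can still override these defaults when a particular task deserves more patience or should jump the queue; the values here are illustrative:
// Give a critical job more attempts and a higher priority (1 is the highest in BullMQ)
await geminiQueue.add('generate-report', { prompt }, { attempts: 8, priority: 1 });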
Why Different Strategies for Different Errors?
Google's official documentation recommends truncated exponential backoff for 429 errors but doesn't specify exact wait times. Based on the rate limits documentation, we know:
- RPM (Requests Per Minute) limits reset every minute
- RPD (Requests Per Day) limits reset at midnight Pacific Time
- Rate limits apply per project, not per API key
For MODEL_OVERLOADED (503), community experience suggests these are transient server-side issues that clear as load balancers redistribute traffic. Shorter delays with jitter work well because you're waiting for infrastructure, not quota resets.
For RESOURCE_EXHAUSTED (429), you've hit a rate window. If the error message includes generate_content_free_tier_input_token_count, you know it's a per-minute token limit. Starting at 60 seconds ensures the minute window has time to reset.
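If you want to branch on that case explicitly, a plain string check against the quota metric name is enough; this helper is hypothetical, not part of the SDK:
// Hypothetical helper: does this 429 point at the per-minute free-tier token quota?
function isFreeTierTokenQuota(error: Error): boolean {
  return (error.message || '').includes('generate_content_free_tier_input_token_count');
}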
The custom approach gives you:
- Appropriate delays matched to the underlying cause
- Patient waiting for 429 errors tied to rate windows
- Immediate failure for non-retryable errors (no wasted attempts)
- Jitter to prevent thundering herd when multiple jobs retry simultaneously
Peak Hours: A Hidden Optimization
Through production experience, we've observed that Gemini capacity errors spike dramatically during US business hours (9 AM - 6 PM Pacific / 5 PM - 2 AM UTC). This makes sense — the majority of Gemini's user base operates in US time zones, and Google's shared infrastructure strains under peak load.
If your workload can tolerate scheduling flexibility, consider:
- Off-peak processing: Schedule batch jobs for 2 AM - 8 AM Pacific (10 AM - 4 PM UTC)
- Timezone-aware queuing: Delay non-urgent jobs to run during US night hours
- Weekend processing: Saturdays and Sundays see significantly less traffic
// Example: Delay job to US off-peak hours if not urgent
function getOptimalDelay(priority: 'urgent' | 'normal' | 'low'): number {
if (priority === 'urgent') return 0;
const now = new Date();
const pacificHour = new Date(now.toLocaleString('en-US', { timeZone: 'America/Los_Angeles' })).getHours();
// If during US business hours (9-18 Pacific), delay to 2 AM Pacific
if (pacificHour >= 9 && pacificHour < 18 && priority === 'low') {
const hoursUntilOffPeak = (26 - pacificHour) % 24; // Hours until 2 AM
return hoursUntilOffPeak * 60 * 60 * 1000;
}
return 0;
}
// Add job with optimal timing
await geminiQueue.add('process', { prompt }, { delay: getOptimalDelay('low') });
This isn't a workaround — it's capacity planning. When you're competing for shared resources, timing matters.
Why This Works
The key insight is that RESOURCE_EXHAUSTED and MODEL_OVERLOADED are transient errors — they indicate temporary capacity constraints, not permanent failures. Unlike a 400 Bad Request or authentication error, these will likely succeed on retry.
By handling retries at the queue level rather than inline:
- Jobs persist across application restarts
- Backoff is automatic without manual sleep/retry loops
- Observability is built-in — you can monitor retry counts and failure rates
- Concurrency is controlled — limiting parallel workers reduces the chance of triggering limits
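To make the observability point above concrete, here is a minimal sketch using the worker's event listeners; where the logs go and what you count is up to you:
// Basic visibility into retries and terminal failures
worker.on('failed', (job, err) => {
  if (!job) return;
  const maxAttempts = job.opts.attempts ?? 1;
  if (job.attemptsMade >= maxAttempts) {
    console.error(`Job ${job.id} exhausted ${job.attemptsMade} attempts: ${err.message}`);
  } else {
    console.warn(`Job ${job.id} failed attempt ${job.attemptsMade}/${maxAttempts}, will retry`);
  }
});
worker.on('completed', (job) => {
  console.info(`Job ${job.id} completed after ${job.attemptsMade} attempt(s)`);
});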
When to Reconsider
This approach works for background processing, batch jobs, and workloads tolerant of latency. It doesn't work for:
- Real-time user-facing requests where retry delays of 30+ seconds are unacceptable
- High-volume production traffic where retry storms compound the problem
- SLA-bound services where you need guaranteed response times
In those cases, Provisioned Throughput or a multi-provider fallback strategy (Gemini → OpenAI → Anthropic) becomes necessary.
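A provider-agnostic fallback can be sketched without committing to specific SDKs; each provider is simply a function that returns text or throws:
// Hypothetical fallback chain: try providers in order, surface the last error
type Provider = (prompt: string) => Promise<string>;
async function generateWithFallback(prompt: string, providers: Provider[]): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider(prompt);
    } catch (error) {
      lastError = error; // capacity or quota error: move on to the next provider
    }
  }
  throw lastError instanceof Error ? lastError : new Error('All providers failed');
}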
Key Takeaways
- Gemini's standard tiers are best-effort — rate limits are ceilings, not guarantees
- RESOURCE_EXHAUSTED and MODEL_OVERLOADED can occur even under quota
- Provisioned Throughput offers guaranteed capacity but requires commitment
- For variable workloads, queue-based retries with exponential backoff handle transient failures gracefully
- Treat these errors as expected behavior, not edge cases
The reality of shared infrastructure is that capacity fluctuates. Design your systems accordingly.