
Gemini API: Why Rate Limits Do Not Guarantee Execution
RESOURCE_EXHAUSTED and MODEL_OVERLOADED errors happen even under quota. Learn why Gemini uses best-effort execution and how to handle it with BullMQ retries.
You're under your rate limits. Your dashboard shows plenty of quota remaining. Yet your Gemini API calls fail with RESOURCE_EXHAUSTED or MODEL_OVERLOADED. What's happening?
The uncomfortable truth: staying within rate limits doesn't guarantee your requests will succeed. Google's standard Gemini API tiers operate on a best-effort basis, meaning capacity is shared across all users and your requests compete for resources in real-time.
The Two Errors You'll See
429 RESOURCE_EXHAUSTED
This error indicates your project exceeded a rate limit or quota — but here's the catch: limits are applied at the project level, not per API key. Multiple applications sharing a project all count toward the same limits. More frustratingly, this error can occur even when your metrics show available quota, because capacity is allocated dynamically.
503 MODEL_OVERLOADED
This is a server-side error meaning Google's infrastructure is under heavy load. Users report hitting it even while staying well within documented limits (for example, the free tier's 15 requests/minute and 1M tokens/minute). During peak periods, some developers see 30-80% of requests returning 503.
The error message is straightforward: "The model is overloaded. Please try again later." But "later" isn't defined, and there's no retry-after header to guide you.
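For reference, both errors arrive in Google's standard API error envelope. The exact fields can vary by endpoint and SDK version, so treat the shape below as an assumption to verify against real responses rather than a contract:
// Approximate shape of the Gemini API error body (assumed, not guaranteed)
interface GeminiApiErrorBody {
  error: {
    code: number; // HTTP status, e.g. 429 or 503
    message: string; // e.g. "The model is overloaded. Please try again later."
    status: string; // e.g. "RESOURCE_EXHAUSTED" or "UNAVAILABLE"
  };
}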
Why This Happens: Best-Effort Execution
Google's Standard Pay-As-You-Go documentation states it plainly:
"Specified rate limits are not guaranteed and actual capacity may vary."
Standard tiers use dynamic shared quota — capacity is distributed among all users in real-time based on demand. When Gemini experiences high traffic globally, your requests compete with everyone else's. Your quota represents a ceiling, not a guarantee.
This is fundamentally different from Azure OpenAI's provisioned capacity or AWS Bedrock's on-demand pricing, where staying within limits typically ensures execution.
The Enterprise Alternative: Provisioned Throughput
For workloads that cannot tolerate failures, Google offers Provisioned Throughput:
- Dedicated capacity measured in GSUs (generative AI scale units)
- Strict SLAs with guaranteed tokens per second
- No competition with other users for resources
- Monthly commitment with pre-purchased capacity
The trade-off is cost. Provisioned Throughput requires upfront commitment and makes sense only for high-volume, mission-critical workloads. For many projects — especially early-stage SaaS products or variable workloads — the economics don't justify it.
The Practical Solution: Retry at the Queue Level
If enterprise pricing isn't justified, you need infrastructure that treats these errors as expected behavior rather than exceptions. Here's how we handle it with BullMQ.
Option 1: Simple Exponential Backoff
The straightforward approach uses BullMQ's built-in exponential backoff with carefully chosen delays:
// workers/gemini-worker.ts
import { Worker, Job } from 'bullmq';
import IORedis from 'ioredis';
import { GoogleGenAI } from '@google/genai';
// Shared Redis connection (in a real project, export this from its own module).
// BullMQ workers require maxRetriesPerRequest: null on their connection.
const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null,
});
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const RETRYABLE_ERRORS = ['RESOURCE_EXHAUSTED', 'MODEL_OVERLOADED', 'The model is overloaded', '503', '429'];
function isRetryableGeminiError(error: Error): boolean {
const message = error.message || '';
return RETRYABLE_ERRORS.some((pattern) => message.includes(pattern));
}
const worker = new Worker(
'gemini-tasks',
async (job: Job) => {
try {
const response = await ai.models.generateContent({
model: 'gemini-2.0-flash',
contents: job.data.prompt,
});
return { success: true, result: response.text };
} catch (error) {
if (isRetryableGeminiError(error as Error)) {
// Let BullMQ handle the retry with exponential backoff
throw error;
}
// Non-retryable errors fail immediately
throw new Error(`Permanent failure: ${(error as Error).message}`);
}
},
{
connection: redis,
concurrency: 3, // Limit parallel requests to reduce contention
},
);
// queues/gemini-queue.ts
import { Queue } from 'bullmq';
export const geminiQueue = new Queue('gemini-tasks', {
connection: redis, // the same shared IORedis connection used by the worker
defaultJobOptions: {
attempts: 5,
backoff: {
type: 'exponential',
delay: 60_000, // Start with 60 seconds, not 10
},
removeOnComplete: { count: 1000 },
removeOnFail: { count: 5000 },
},
});
With five total attempts and exponential backoff starting at 60 seconds, retries happen after delays of roughly 1, 2, 4, and 8 minutes. This gives Gemini's infrastructure meaningful time to recover.
Why not 10 seconds? A 10-second retry makes no sense for capacity errors. When Google's infrastructure is saturated, it doesn't recover in 10 seconds — you're just adding noise to an already overwhelmed system and wasting your own rate limit budget on doomed requests.
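For completeness, here is a minimal producer sketch that fits the setup above; it enqueues a prompt and waits for the final result, so callers never see the intermediate failures. The file path and function name are illustrative:
// producer.ts (illustrative)
import { QueueEvents } from 'bullmq';
import { geminiQueue } from './queues/gemini-queue';
// Reuses the same shared `redis` connection as the worker and queue
const queueEvents = new QueueEvents('gemini-tasks', { connection: redis });
async function generateText(prompt: string): Promise<string> {
  const job = await geminiQueue.add('generate', { prompt });
  // Resolves once the job completes, including after any retries
  const completed = await job.waitUntilFinished(queueEvents);
  return completed.result;
}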
Option 2: Dynamic Error-Aware Backoff
For more sophisticated handling, BullMQ's custom backoff strategies let you define error-aware retry logic. Different error types warrant different strategies:
- MODEL_OVERLOADED (503): Server capacity issue — clears faster as load balancers redistribute traffic
- RESOURCE_EXHAUSTED (429): Rate limit hit — needs longer delays for quota windows to reset
First, define typed errors:
// errors/gemini-errors.ts
export class GeminiResourceExhaustedError extends Error {
constructor(message: string) {
super(message);
this.name = 'GeminiResourceExhaustedError';
}
}
export class GeminiModelOverloadedError extends Error {
constructor(message: string) {
super(message);
this.name = 'GeminiModelOverloadedError';
}
}
export function classifyGeminiError(error: Error): Error {
const message = error.message || '';
if (message.includes('RESOURCE_EXHAUSTED') || message.includes('429')) {
return new GeminiResourceExhaustedError(message);
}
if (message.includes('MODEL_OVERLOADED') || message.includes('503') || message.includes('overloaded')) {
return new GeminiModelOverloadedError(message);
}
return error; // Non-retryable
}
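A quick sanity check of the classifier (the error messages below are made up for illustration):
classifyGeminiError(new Error('429 RESOURCE_EXHAUSTED: quota exceeded')).name;
// => 'GeminiResourceExhaustedError'
classifyGeminiError(new Error('503 The model is overloaded. Please try again later.')).name;
// => 'GeminiModelOverloadedError'
classifyGeminiError(new Error('400 INVALID_ARGUMENT: request contains an invalid argument')).name;
// => 'Error' (passed through unchanged, so it will not be retried)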
Then implement the custom backoff:
// workers/gemini-worker.ts
import { Worker, Job } from 'bullmq';
import { GeminiResourceExhaustedError, GeminiModelOverloadedError, classifyGeminiError } from '../errors/gemini-errors';
// `redis` is the shared connection from Option 1; `callGeminiAPI` wraps the
// ai.models.generateContent call shown there.
const worker = new Worker(
'gemini-tasks',
async (job: Job) => {
try {
const result = await callGeminiAPI(job.data.prompt);
return { success: true, result };
} catch (error) {
// Classify the error so backoff strategy can handle it appropriately
throw classifyGeminiError(error as Error);
}
},
{
connection: redis,
concurrency: 3,
settings: {
// Custom backoff that adapts to error type
backoffStrategy: (attemptsMade: number, type: string, err: Error, job: Job) => {
// MODEL_OVERLOADED (503): Server is overwhelmed
// These clear faster — use shorter delays with jitter
if (err instanceof GeminiModelOverloadedError) {
const baseDelay = 30_000; // 30 seconds base
const jitter = Math.random() * 10_000; // 0-10s jitter
return baseDelay * attemptsMade + jitter;
// Retry delays: ~30s, ~60s, ~90s, ~120s (plus 0-10s jitter)
}
// RESOURCE_EXHAUSTED (429): Quota/rate limit hit
// These take longer to reset — use aggressive backoff
if (err instanceof GeminiResourceExhaustedError) {
const baseDelay = 60_000; // 60 seconds base
return Math.min(baseDelay * Math.pow(2, attemptsMade - 1), 600_000);
// Retry delays: ~1m, ~2m, ~4m, ~8m (capped at 10 minutes)
}
// Unknown error — fail immediately, do not retry
return -1;
},
},
},
);
// queues/gemini-queue.ts
import { Queue } from 'bullmq';
export const geminiQueue = new Queue('gemini-tasks', {
connection: redis,
defaultJobOptions: {
attempts: 5,
backoff: {
type: 'custom', // Uses the backoffStrategy defined in worker settings
},
removeOnComplete: { count: 1000 },
removeOnFail: { count: 5000 },
},
});
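Individual jobs can still override these defaults when a particular task deserves more patience or should jump the queue; the values here are illustrative:
// Give a critical job more attempts and a higher priority (1 is the highest in BullMQ)
await geminiQueue.add('generate-report', { prompt }, { attempts: 8, priority: 1 });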
Why Different Strategies for Different Errors?
Google's official documentation recommends truncated exponential backoff for 429 errors but doesn't specify exact wait times. Based on the rate limits documentation, we know:
- RPM (Requests Per Minute) limits reset every minute
- RPD (Requests Per Day) limits reset at midnight Pacific Time
- Rate limits apply per project, not per API key
For MODEL_OVERLOADED (503), community experience suggests these are transient server-side issues that clear as load balancers redistribute traffic. Shorter delays with jitter work well because you're waiting for infrastructure, not quota resets.
For RESOURCE_EXHAUSTED (429), you've hit a rate window. If the error message includes generate_content_free_tier_input_token_count, you know it's a per-minute token limit. Starting at 60 seconds ensures the minute window has time to reset.
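If you want to branch on that case explicitly, a plain string check against the quota metric name is enough; this helper is hypothetical, not part of the SDK:
// Hypothetical helper: does this 429 point at the per-minute free-tier token quota?
function isFreeTierTokenQuota(error: Error): boolean {
  return (error.message || '').includes('generate_content_free_tier_input_token_count');
}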
The custom approach gives you:
- Appropriate delays matched to the underlying cause
- Patient waiting for 429 errors tied to rate windows
- Immediate failure for non-retryable errors (no wasted attempts)
- Jitter to prevent thundering herd when multiple jobs retry simultaneously
Peak Hours: A Hidden Optimization
Through production experience, we've observed that Gemini capacity errors spike dramatically during US business hours (9 AM - 6 PM Pacific / 5 PM - 2 AM UTC). This makes sense — the majority of Gemini's user base operates in US time zones, and Google's shared infrastructure strains under peak load.
If your workload can tolerate scheduling flexibility, consider:
- Off-peak processing: Schedule batch jobs for 2 AM - 8 AM Pacific (10 AM - 4 PM UTC)
- Timezone-aware queuing: Delay non-urgent jobs to run during US night hours
- Weekend processing: Saturdays and Sundays see significantly less traffic
// Example: Delay job to US off-peak hours if not urgent
function getOptimalDelay(priority: 'urgent' | 'normal' | 'low'): number {
if (priority === 'urgent') return 0;
const now = new Date();
const pacificHour = new Date(now.toLocaleString('en-US', { timeZone: 'America/Los_Angeles' })).getHours();
// If during US business hours (9-18 Pacific), delay to 2 AM Pacific
if (pacificHour >= 9 && pacificHour < 18 && priority === 'low') {
const hoursUntilOffPeak = (26 - pacificHour) % 24; // Hours until 2 AM
return hoursUntilOffPeak * 60 * 60 * 1000;
}
return 0;
}
// Add job with optimal timing
await geminiQueue.add('process', { prompt }, { delay: getOptimalDelay('low') });
This isn't a workaround — it's capacity planning. When you're competing for shared resources, timing matters.
Why This Works
The key insight is that RESOURCE_EXHAUSTED and MODEL_OVERLOADED are transient errors — they indicate temporary capacity constraints, not permanent failures. Unlike a 400 Bad Request or authentication error, these will likely succeed on retry.
By handling retries at the queue level rather than inline:
- Jobs persist across application restarts
- Backoff is automatic without manual sleep/retry loops
- Observability is built-in — you can monitor retry counts and failure rates
- Concurrency is controlled — limiting parallel workers reduces the chance of triggering limits
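To make the observability point above concrete, here is a minimal sketch using the worker's event listeners; where the logs go and what you count is up to you:
// Basic visibility into retries and terminal failures
worker.on('failed', (job, err) => {
  if (!job) return;
  const maxAttempts = job.opts.attempts ?? 1;
  if (job.attemptsMade >= maxAttempts) {
    console.error(`Job ${job.id} exhausted ${job.attemptsMade} attempts: ${err.message}`);
  } else {
    console.warn(`Job ${job.id} failed attempt ${job.attemptsMade}/${maxAttempts}, will retry`);
  }
});
worker.on('completed', (job) => {
  console.info(`Job ${job.id} completed after ${job.attemptsMade} attempt(s)`);
});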
When to Reconsider
This approach works for background processing, batch jobs, and workloads tolerant of latency. It doesn't work for:
- Real-time user-facing requests where retry delays of 30+ seconds are unacceptable
- High-volume production traffic where retry storms compound the problem
- SLA-bound services where you need guaranteed response times
In those cases, Provisioned Throughput or a multi-provider fallback strategy (Gemini → OpenAI → Anthropic) becomes necessary.
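A provider-agnostic fallback can be sketched without committing to specific SDKs; each provider is simply a function that returns text or throws:
// Hypothetical fallback chain: try providers in order, surface the last error
type Provider = (prompt: string) => Promise<string>;
async function generateWithFallback(prompt: string, providers: Provider[]): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider(prompt);
    } catch (error) {
      lastError = error; // capacity or quota error: move on to the next provider
    }
  }
  throw lastError instanceof Error ? lastError : new Error('All providers failed');
}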
Key Takeaways
- Gemini's standard tiers are best-effort — rate limits are ceilings, not guarantees
- RESOURCE_EXHAUSTED and MODEL_OVERLOADED can occur even under quota
- Provisioned Throughput offers guaranteed capacity but requires commitment
- For variable workloads, queue-based retries with exponential backoff handle transient failures gracefully
- Treat these errors as expected behavior, not edge cases
The reality of shared infrastructure is that capacity fluctuates. Design your systems accordingly.