Gemini API: Why Rate Limits Do Not Guarantee Execution

January 3, 2026
Stefan Mentović
gemini, api, rate-limiting, bullmq, error-handling

RESOURCE_EXHAUSTED and MODEL_OVERLOADED errors happen even under quota. Learn why Gemini uses best-effort execution and how to handle it with BullMQ retries.

#Gemini API: Why Rate Limits Do Not Guarantee Execution

You're under your rate limits. Your dashboard shows plenty of quota remaining. Yet your Gemini API calls fail with RESOURCE_EXHAUSTED or MODEL_OVERLOADED. What's happening?

The uncomfortable truth: staying within rate limits doesn't guarantee your requests will succeed. Google's standard Gemini API tiers operate on a best-effort basis, meaning capacity is shared across all users and your requests compete for resources in real-time.

#The Two Errors You'll See

#429 RESOURCE_EXHAUSTED

This error indicates your project exceeded a rate limit or quota — but here's the catch: limits are applied at the project level, not per API key. Multiple applications sharing a project all count toward the same limits. More frustratingly, this error can occur even when your metrics show available quota, because capacity is dynamically allocated.

#503 MODEL_OVERLOADED

This is a server-side error meaning Google's infrastructure is under heavy load. Users report experiencing this even while staying well within documented limits — 15 requests/minute, 1M tokens/minute on free tier. During peak periods, some developers see 30-80% of requests returning 503.

The error message is straightforward: "The model is overloaded. Please try again later." But "later" isn't defined, and there's no retry-after header to guide you.
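
A practical note: error-message strings vary across SDK versions, so if the thrown error exposes the HTTP status code, matching on that is more robust than matching on message text. A minimal sketch, where the status field is an assumed error shape rather than a documented @google/genai guarantee:

// Sketch: classify by HTTP status when the error exposes one.
// The `status` property is an assumption — inspect your SDK's error shape.
function isTransientStatus(error: unknown): boolean {
	const status = (error as { status?: number }).status;
	return status === 429 || status === 503;
}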

#Why This Happens: Best-Effort Execution

Google's Standard Pay-As-You-Go documentation states it plainly:

"Specified rate limits are not guaranteed and actual capacity may vary."

Standard tiers use dynamic shared quota — capacity is distributed among all users in real-time based on demand. When Gemini experiences high traffic globally, your requests compete with everyone else's. Your quota represents a ceiling, not a guarantee.

This is fundamentally different from Azure OpenAI's provisioned capacity or AWS Bedrock's on-demand pricing, where staying within limits typically ensures execution.

#The Enterprise Alternative: Provisioned Throughput

For workloads that cannot tolerate failures, Google offers Provisioned Throughput:

  • Dedicated capacity measured in GSUs (Generative AI Scale Units)
  • Strict SLAs with guaranteed tokens per second
  • No competition with other users for resources
  • Monthly commitment with pre-purchased capacity

The trade-off is cost. Provisioned Throughput requires upfront commitment and makes sense only for high-volume, mission-critical workloads. For many projects — especially early-stage SaaS products or variable workloads — the economics don't justify it.

#The Practical Solution: Retry at the Queue Level

If enterprise pricing isn't justified, you need infrastructure that treats these errors as expected behavior rather than exceptions. Here's how we handle it with BullMQ.

#Option 1: Simple Exponential Backoff

The straightforward approach uses BullMQ's built-in exponential backoff with carefully chosen delays:

// workers/gemini-worker.ts
import { Worker, Job } from 'bullmq';
import IORedis from 'ioredis';
import { GoogleGenAI } from '@google/genai';

// BullMQ workers require a Redis connection with maxRetriesPerRequest: null
const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
	maxRetriesPerRequest: null,
});

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const RETRYABLE_ERRORS = ['RESOURCE_EXHAUSTED', 'MODEL_OVERLOADED', 'The model is overloaded', '503', '429'];

function isRetryableGeminiError(error: Error): boolean {
	const message = error.message || '';
	return RETRYABLE_ERRORS.some((pattern) => message.includes(pattern));
}

const worker = new Worker(
	'gemini-tasks',
	async (job: Job) => {
		try {
			const response = await ai.models.generateContent({
				model: 'gemini-2.0-flash',
				contents: job.data.prompt,
			});
			return { success: true, result: response.text };
		} catch (error) {
			if (isRetryableGeminiError(error as Error)) {
				// Let BullMQ handle the retry with exponential backoff
				throw error;
			}
			// Non-retryable errors fail immediately
			throw new Error(`Permanent failure: ${(error as Error).message}`);
		}
	},
	{
		connection: redis,
		concurrency: 3, // Limit parallel requests to reduce contention
	},
);

// queues/gemini-queue.ts
import { Queue } from 'bullmq';
import IORedis from 'ioredis';

const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379');

export const geminiQueue = new Queue('gemini-tasks', {
	connection: redis,
	defaultJobOptions: {
		attempts: 5,
		backoff: {
			type: 'exponential',
			delay: 60_000, // Start with 60 seconds, not 10
		},
		removeOnComplete: { count: 1000 },
		removeOnFail: { count: 5000 },
	},
});

With exponential backoff starting at 60 seconds and attempts: 5, the four retries wait roughly 1m, 2m, 4m, and 8m after each failure. This gives Gemini's infrastructure meaningful time to recover.

Why not 10 seconds? A 10-second retry makes no sense for capacity errors. When Google's infrastructure is saturated, it doesn't recover in 10 seconds — you're just adding noise to an already overwhelmed system and wasting your own rate limit budget on doomed requests.
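
For reference, BullMQ's built-in exponential strategy computes each retry delay as delay * 2^(attemptsMade - 1), which is where the schedule above comes from:

// How BullMQ's built-in 'exponential' backoff derives each retry delay
function exponentialDelay(attemptsMade: number, baseDelay = 60_000): number {
	return baseDelay * Math.pow(2, attemptsMade - 1);
}

// With attempts: 5 and a 60s base, the four retries wait 60s, 120s, 240s, 480s
for (let attempt = 1; attempt <= 4; attempt++) {
	console.log(`retry ${attempt}: ${exponentialDelay(attempt) / 1000}s`);
}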

#Option 2: Dynamic Error-Aware Backoff

For more sophisticated handling, BullMQ's custom backoff strategies let you define error-aware retry logic. Different error types warrant different strategies:

  • MODEL_OVERLOADED (503): Server capacity issue — clears faster as load balancers redistribute
  • RESOURCE_EXHAUSTED (429): Rate limit hit — needs longer delays for quota windows to reset

First, define typed errors:

// errors/gemini-errors.ts
export class GeminiResourceExhaustedError extends Error {
	constructor(message: string) {
		super(message);
		this.name = 'GeminiResourceExhaustedError';
	}
}

export class GeminiModelOverloadedError extends Error {
	constructor(message: string) {
		super(message);
		this.name = 'GeminiModelOverloadedError';
	}
}

export function classifyGeminiError(error: Error): Error {
	const message = error.message || '';

	if (message.includes('RESOURCE_EXHAUSTED') || message.includes('429')) {
		return new GeminiResourceExhaustedError(message);
	}

	if (message.includes('MODEL_OVERLOADED') || message.includes('503') || message.includes('overloaded')) {
		return new GeminiModelOverloadedError(message);
	}

	return error; // Non-retryable
}

Then implement the custom backoff:

// workers/gemini-worker.ts
import { Worker, Job } from 'bullmq';
import { GeminiResourceExhaustedError, GeminiModelOverloadedError, classifyGeminiError } from '../errors/gemini-errors';
// `redis` and `callGeminiAPI` are the connection and Gemini call from Option 1

const worker = new Worker(
	'gemini-tasks',
	async (job: Job) => {
		try {
			const result = await callGeminiAPI(job.data.prompt);
			return { success: true, result };
		} catch (error) {
			// Classify the error so backoff strategy can handle it appropriately
			throw classifyGeminiError(error as Error);
		}
	},
	{
		connection: redis,
		concurrency: 3,
		settings: {
			// Custom backoff that adapts to error type
			backoffStrategy: (attemptsMade: number, type: string, err: Error, job: Job) => {
				// MODEL_OVERLOADED (503): Server is overwhelmed
				// These clear faster — use shorter delays with jitter
				if (err instanceof GeminiModelOverloadedError) {
					const baseDelay = 30_000; // 30 seconds base
					const jitter = Math.random() * 10_000; // 0-10s jitter
				return baseDelay * attemptsMade + jitter;
				// Delays of ~30s, ~60s, ~90s, ~120s (plus jitter)
				}

				// RESOURCE_EXHAUSTED (429): Quota/rate limit hit
				// These take longer to reset — use aggressive backoff
				if (err instanceof GeminiResourceExhaustedError) {
					const baseDelay = 60_000; // 60 seconds base
				return Math.min(baseDelay * Math.pow(2, attemptsMade - 1), 600_000);
				// Delays of ~1m, ~2m, ~4m, ~8m, capped at 10m
				}

				// Unknown error — fail immediately, do not retry
				return -1;
			},
		},
	},
);

// queues/gemini-queue.ts
import { Queue } from 'bullmq';
// `redis` is the same IORedis connection from Option 1
export const geminiQueue = new Queue('gemini-tasks', {
	connection: redis,
	defaultJobOptions: {
		attempts: 5,
		backoff: {
			type: 'custom', // Uses the backoffStrategy defined in worker settings
		},
		removeOnComplete: { count: 1000 },
		removeOnFail: { count: 5000 },
	},
});

#Why Different Strategies for Different Errors?

Google's official documentation recommends truncated exponential backoff for 429 errors but doesn't specify exact wait times. Based on the rate limits documentation, we know:

  • RPM (Requests Per Minute) limits reset every minute
  • RPD (Requests Per Day) limits reset at midnight Pacific Time
  • Rate limits apply per project, not per API key

For MODEL_OVERLOADED (503), community experience suggests these are transient server-side issues that clear as load balancers redistribute traffic. Shorter delays with jitter work well because you're waiting for infrastructure, not quota resets.

For RESOURCE_EXHAUSTED (429), you've hit a rate window. If the error message includes generate_content_free_tier_input_token_count, you know it's a per-minute token limit. Starting at 60 seconds ensures the minute window has time to reset.
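
You can take this further by keying the starting delay off the quota named in the message. A rough sketch, where the substring checks are assumptions to verify against the 429 payloads you actually receive:

// Sketch: pick a starting delay based on which quota the 429 message names.
// The substrings below are assumptions — confirm them against real payloads.
function initialDelayFor429(message: string): number {
	if (message.includes('PerDay')) {
		return 60 * 60 * 1000; // daily quota: back off much longer, or park the job
	}
	if (message.includes('PerMinute') || message.includes('input_token_count')) {
		return 60_000; // per-minute window: one minute is enough for a reset
	}
	return 60_000; // unknown quota: default to the per-minute assumption
}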

The custom approach gives you:

  • Appropriate delays matched to the underlying cause
  • Patient waiting for 429 errors tied to rate windows
  • Immediate failure for non-retryable errors (no wasted attempts)
  • Jitter to prevent thundering herd when multiple jobs retry simultaneously

#Peak Hours: A Hidden Optimization

Through production experience, we've observed that Gemini capacity errors spike dramatically during US business hours (9 AM - 6 PM Pacific / 5 PM - 2 AM UTC). This makes sense — the majority of Gemini's user base operates in US time zones, and Google's shared infrastructure strains under peak load.

If your workload can tolerate scheduling flexibility, consider:

  • Off-peak processing: Schedule batch jobs for 2 AM - 8 AM Pacific (roughly 10 AM - 4 PM UTC)
  • Timezone-aware queuing: Delay non-urgent jobs to run during US night hours
  • Weekend processing: Saturdays and Sundays see significantly less traffic

// Example: Delay job to US off-peak hours if not urgent
function getOptimalDelay(priority: 'urgent' | 'normal' | 'low'): number {
	if (priority === 'urgent') return 0;

	const now = new Date();
	const pacificHour = new Date(now.toLocaleString('en-US', { timeZone: 'America/Los_Angeles' })).getHours();

	// If during US business hours (9-18 Pacific), delay to 2 AM Pacific
	if (pacificHour >= 9 && pacificHour < 18 && priority === 'low') {
		const hoursUntilOffPeak = (26 - pacificHour) % 24; // Hours until 2 AM
		return hoursUntilOffPeak * 60 * 60 * 1000;
	}

	return 0;
}

// Add job with optimal timing
await geminiQueue.add('process', { prompt }, { delay: getOptimalDelay('low') });

This isn't a workaround — it's capacity planning. When you're competing for shared resources, timing matters.

#Why This Works

The key insight is that RESOURCE_EXHAUSTED and MODEL_OVERLOADED are transient errors — they indicate temporary capacity constraints, not permanent failures. Unlike a 400 Bad Request or authentication error, these will likely succeed on retry.

By handling retries at the queue level rather than inline:

  • Jobs persist across application restarts
  • Backoff is automatic without manual sleep/retry loops
  • Observability is built-in — you can monitor retry counts and failure rates (see the sketch after this list)
  • Concurrency is controlled — limiting parallel workers reduces the chance of triggering limits
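
On the observability point, BullMQ's QueueEvents class surfaces retry and failure events without extra plumbing. A minimal sketch:

// Sketch: monitor failures and exhausted retries with QueueEvents
import { QueueEvents } from 'bullmq';

// Reuses the same IORedis connection as the queue and worker
const events = new QueueEvents('gemini-tasks', { connection: redis });

events.on('failed', ({ jobId, failedReason }) => {
	console.warn(`Job ${jobId} failed: ${failedReason}`);
});

events.on('retries-exhausted', ({ jobId, attemptsMade }) => {
	console.error(`Job ${jobId} gave up after ${attemptsMade} attempts`);
});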

#When to Reconsider

This approach works for background processing, batch jobs, and workloads tolerant of latency. It doesn't work for:

  • Real-time user-facing requests where 30+ second retry delays are unacceptable
  • High-volume production traffic where retry storms compound the problem
  • SLA-bound services where you need guaranteed response times

In those cases, Provisioned Throughput or a multi-provider fallback strategy (Gemini → OpenAI → Anthropic) becomes necessary.
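
A multi-provider fallback can start as a simple ordered chain. A minimal sketch, where callGemini, callOpenAI, and callAnthropic are hypothetical wrappers around the respective SDKs:

// Sketch: try providers in order until one succeeds.
// The provider functions passed in are hypothetical SDK wrappers,
// not real library functions.
type ProviderCall = (prompt: string) => Promise<string>;

async function generateWithFallback(prompt: string, providers: ProviderCall[]): Promise<string> {
	let lastError: Error | undefined;
	for (const provider of providers) {
		try {
			return await provider(prompt);
		} catch (error) {
			lastError = error as Error; // fall through to the next provider
		}
	}
	throw lastError ?? new Error('All providers failed');
}

// Usage: await generateWithFallback(prompt, [callGemini, callOpenAI, callAnthropic]);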

#Key Takeaways

  • Gemini's standard tiers are best-effort — rate limits are ceilings, not guarantees
  • RESOURCE_EXHAUSTED and MODEL_OVERLOADED can occur even under quota
  • Provisioned Throughput offers guaranteed capacity but requires commitment
  • For variable workloads, queue-based retries with exponential backoff handle transient failures gracefully
  • Treat these errors as expected behavior, not edge cases

The reality of shared infrastructure is that capacity fluctuates. Design your systems accordingly.
