
Event-Driven Architecture with Hono and BullMQ: A Production Guide
Build resilient, scalable APIs using the step orchestrator pattern with BullMQ. Learn partial retries, data integrity, and why Redis queues beat cloud-native alternatives.
Your API receives a request to process an order. This order requires checking inventory across multiple warehouses, processing payment through a gateway, calculating shipping with three different carriers, and sending confirmation notifications. Each step can fail independently. Some steps take 30 seconds, others take 5 minutes. The user expects a response in under 2 seconds.
Traditional synchronous request-response architecture breaks down here. Event-driven architecture with job queues solves these problems elegantly — but getting it right for production requires understanding patterns that aren't obvious from documentation.
This guide covers implementing event-driven architecture with Hono (a lightweight TypeScript API framework) and BullMQ (a Redis-based job queue), focusing on the step orchestrator pattern, partial retries, data integrity, and the tradeoffs involved.
Why Event-Driven Architecture?
Synchronous APIs hit limits quickly when operations are:
- Long-running — LLM calls, file processing, or external API integrations
- Composed of multiple steps — Workflows where each step depends on previous results
- Failure-prone — External services that may be temporarily unavailable
- Resource-intensive — Operations that would block your API from handling other requests
Event-driven architecture decouples the request from the processing:
// Synchronous: blocks until every step completes
app.post('/api/orders/process', async (c) => {
  const { items, order, address, user } = await c.req.json();
  const inventory = await checkInventory(items); // 10s
  const payment = await processPayment(order); // 15s
  const shipping = await calculateShipping(address); // 5s
  const confirmation = await sendConfirmation(user); // 5s
  return c.json(confirmation);
});

// Event-driven: returns immediately
app.post('/api/orders/process', async (c) => {
  const { orderId, userId } = await c.req.json();
  const job = await orderQueue.add('process-order', { orderId, userId });
  return c.json({ jobId: job.id, status: 'processing' });
});
The first approach ties up a connection for the entire workflow: 35 seconds in the best case, and far longer when any step is slow or retries. The second returns in milliseconds and processes the work asynchronously.
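Once the API hands back a jobId, clients need a way to check progress. A minimal status endpoint, sketched below (the route path is an assumption, not from the original), reads the job's state directly from the queue:

// Sketch: poll job status by ID (route path is illustrative)
app.get('/api/orders/status/:jobId', async (c) => {
  const job = await orderQueue.getJob(c.req.param('jobId'));
  if (!job) {
    return c.json({ error: 'Job not found' }, 404);
  }
  const state = await job.getState(); // 'waiting' | 'active' | 'completed' | 'failed' | ...
  return c.json({
    jobId: job.id,
    state,
    result: job.returnvalue ?? null, // set once the job completes
    failedReason: job.failedReason ?? null,
  });
});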
Why BullMQ Over Cloud-Native Queues?
Before diving into implementation, consider why you might choose BullMQ over AWS SQS, Azure Service Bus, or Google Cloud Pub/Sub.
Advantages of BullMQ
Cost predictability — Redis pricing is straightforward. Cloud queues charge per message, per API call, and sometimes per GB transferred. High-throughput applications can see surprising bills.
Local development — Run Redis locally with zero configuration. Cloud queues require either cloud credentials in development or local emulators that don't perfectly replicate production behavior.
Rich job semantics — BullMQ provides features cloud queues lack: job priorities, rate limiting, delayed jobs, repeatable jobs, and parent-child job relationships. Implementing these on SQS requires significant custom code.
Visibility and debugging — Bull Board provides real-time job monitoring. Cloud queue dashboards are often limited or require additional tooling.
Portability — Redis runs anywhere. Switching cloud providers doesn't require rewriting your queue logic.
When Cloud Queues Win
Massive scale — If you're processing millions of messages per second, managed cloud queues handle the infrastructure complexity.
Existing cloud investment — Deep integration with AWS Lambda, Azure Functions, or similar serverless platforms.
Zero Redis operations — If managing Redis (even managed Redis) is unacceptable overhead.
For most SaaS applications processing thousands to hundreds of thousands of jobs daily, BullMQ offers the better balance of features, cost, and operational simplicity.
Setting Up BullMQ with Hono
Installation
npm install bullmq ioredis hono
Redis Connection Configuration
BullMQ uses ioredis under the hood. Create a shared connection to minimize Redis connection count:
// lib/redis.ts
import { Redis } from 'ioredis';

let connection: Redis | null = null;

export function getRedisConnection(): Redis {
  if (!connection) {
    const redisUrl = process.env.REDIS_URL;
    if (!redisUrl) {
      throw new Error('REDIS_URL environment variable is required');
    }
    connection = new Redis(redisUrl, {
      maxRetriesPerRequest: null, // Required for BullMQ workers
      enableReadyCheck: false,
    });
    connection.on('error', (err) => {
      console.error('[Redis Error]', err.message);
    });
  }
  return connection;
}

export async function closeRedisConnection(): Promise<void> {
  if (connection) {
    await connection.quit();
    connection = null;
  }
}
Two critical settings for BullMQ:
- maxRetriesPerRequest: null — Workers need this to handle blocking operations
- enableReadyCheck: false — Prevents startup issues with some Redis configurations
Queue Definition
Define queues with typed job data:
// queues/order.queue.ts
import { Queue } from 'bullmq';
import { getRedisConnection } from '../lib/redis';

interface OrderJobData {
  orderId: string;
  userId: string;
}

export const orderQueue = new Queue<OrderJobData>('order-processing', {
  connection: getRedisConnection(),
  defaultJobOptions: {
    removeOnComplete: 10, // Keep last 10 completed jobs
    removeOnFail: 5, // Keep last 5 failed jobs
    attempts: 5, // Retry up to 5 times
    backoff: {
      type: 'exponential',
      delay: 60000, // 1 minute initial delay
    },
  },
});

orderQueue.on('error', (err) => {
  console.error('[Queue Error]', err.message);
});
API Integration with Hono
// routes/orders.ts
import { Hono } from 'hono';
import { orderQueue } from '../queues/order.queue';

const app = new Hono();

app.post('/orders/process', async (c) => {
  const { userId, items } = await c.req.json();
  const orderId = crypto.randomUUID();
  // Persist the order and its items before enqueueing (omitted for brevity)
  const job = await orderQueue.add(
    'process-order',
    { orderId, userId },
    {
      // Prevent duplicate jobs for the same order
      deduplication: { id: orderId },
      priority: 2,
    },
  );
  return c.json({
    orderId,
    jobId: job.id,
    status: 'processing',
  });
});

export default app;
The Step Orchestrator Pattern
Simple workflows process a single task per job. Complex workflows — like processing an order that requires inventory checks, payment processing, and shipping calculations running in parallel — need orchestration.
BullMQ's FlowProducer allows defining parent-child job relationships, but it has a limitation: child jobs must be known upfront. When children dynamically spawn grandchildren (like an orchestrator that spawns carrier-specific jobs based on runtime data), FlowProducer becomes unreliable.
The step orchestrator pattern solves this by moving the job into a waiting-children state (via moveToWaitingChildren) and throwing WaitingChildrenError to pause it while its children complete:
// workers/orchestrator.worker.ts
import { Worker, Job, WaitingChildrenError } from 'bullmq';
import { getRedisConnection } from '../lib/redis';
import { shippingQuoteQueue } from '../queues/shipping-quote.queue';

interface OrchestratorData {
  orderId: string;
  step?: 'spawn-children' | 'collect-results';
}

interface OrchestratorResult {
  success: boolean;
  results?: unknown[];
}

export const worker = new Worker<OrchestratorData, OrchestratorResult>(
  'orchestrator',
  async (job: Job<OrchestratorData>, token?: string) => {
    const step = job.data.step || 'spawn-children';
    if (step === 'spawn-children') {
      // Step 1: Spawn child jobs
      await spawnChildJobs(job);
      // Update job data for next step
      await job.updateData({ ...job.data, step: 'collect-results' });
      // Pause this job until children complete. moveToWaitingChildren
      // returns false if all children have already finished.
      if (await job.moveToWaitingChildren(token!)) {
        throw new WaitingChildrenError();
      }
    }
    // Step 2: Children are done, collect results
    const childResults = await job.getChildrenValues();
    const results = Object.values(childResults);
    return { success: true, results };
  },
  {
    connection: getRedisConnection(),
    concurrency: 5,
  },
);

async function spawnChildJobs(parentJob: Job) {
  const { orderId } = parentJob.data;
  // Spawn multiple child jobs that run in parallel
  const childPromises = ['fedex', 'ups', 'dhl'].map((carrier) =>
    shippingQuoteQueue.add(
      `quote-${carrier}`,
      { orderId, carrier },
      {
        parent: {
          id: parentJob.id!,
          queue: parentJob.queueQualifiedName,
        },
      },
    ),
  );
  await Promise.all(childPromises);
}
How WaitingChildrenError Works
- Parent job spawns children with the parent option linking them to it
- Parent calls moveToWaitingChildren and throws WaitingChildrenError
- BullMQ moves the parent to the "waiting-children" state
- Children execute in parallel on their respective workers (a child worker sketch follows below)
- When all children complete, the parent resumes from where it left off
- Parent accesses child results via getChildrenValues()
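The children themselves are ordinary jobs on their own queue. A minimal child worker, sketched here (fetchQuote is a hypothetical helper standing in for a real carrier API call), simply returns a value; that return value is what the parent later reads via getChildrenValues():

// workers/shipping-quote.worker.ts (sketch)
import { Worker } from 'bullmq';
import { getRedisConnection } from '../lib/redis';

const quoteWorker = new Worker(
  'shipping-quotes',
  async (job) => {
    const { orderId, carrier } = job.data;
    // fetchQuote is hypothetical: call the carrier's API for this order
    const quote = await fetchQuote(carrier, orderId);
    // The return value becomes this child's entry in getChildrenValues()
    return { carrier, amountCents: quote.amountCents };
  },
  { connection: getRedisConnection(), concurrency: 10 },
);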
Multi-Step Orchestration
For workflows with more than two steps, extend the pattern:
// Inside the processor: async (job, token) => { ... }
const step = job.data.step || 'validate-order';
switch (step) {
  case 'validate-order': {
    const validation = await validateOrder(job.data);
    await spawnShippingQuoteJobs(job, validation);
    await job.updateData({ ...job.data, step: 'wait-for-quotes' });
    if (await job.moveToWaitingChildren(token!)) {
      throw new WaitingChildrenError();
    }
  }
  // falls through when children already finished
  case 'wait-for-quotes': {
    const quoteResults = await job.getChildrenValues();
    await spawnPaymentJob(job, quoteResults);
    await job.updateData({ ...job.data, step: 'wait-for-payment' });
    if (await job.moveToWaitingChildren(token!)) {
      throw new WaitingChildrenError();
    }
  }
  // falls through when children already finished
  case 'wait-for-payment': {
    const paymentResults = await job.getChildrenValues();
    return { success: true, confirmation: paymentResults };
  }
}
Each step spawns children, updates the job's step marker, and pauses via moveToWaitingChildren plus WaitingChildrenError. When the children complete, the job resumes at the next step.
Partial Retries and Error Handling
One major advantage of the step orchestrator pattern is partial retries. If step 3 fails after steps 1 and 2 succeeded, only step 3 retries — not the entire workflow.
Configuring Retry Behavior
const queue = new Queue('shipping-quotes', {
  connection: getRedisConnection(),
  defaultJobOptions: {
    attempts: 5,
    backoff: {
      type: 'exponential',
      delay: 60000, // 1 minute, doubled on each retry
    },
  },
});
With exponential backoff starting at 60 seconds:
- Attempt 1: Immediate
- Attempt 2: 1 minute delay
- Attempt 3: 2 minutes delay
- Attempt 4: 4 minutes delay
- Attempt 5: 8 minutes delay
Total time to exhaust retries: ~15 minutes, giving transient failures time to resolve.
Adding Jitter to Prevent Thundering Herds
When many jobs fail simultaneously (like when an external API goes down), they all retry at the same time — creating a "thundering herd" that can overwhelm the recovering service.
Jitter adds randomness to retry delays, spreading retries over time instead of clustering them:
backoff: {
  type: 'exponential',
  delay: 60000,
  jitter: 0.5, // Randomize delay by up to 50%
}
Without jitter, 100 failed jobs retry exactly at the 1-minute mark. With jitter: 0.5, those same jobs retry randomly between 30 seconds and 1 minute, distributing the load.
The jitter value ranges from 0 to 1, representing the percentage of randomization applied to the delay. A value of 0 means no randomization — all retries happen at exactly the calculated delay. A value of 0.5 randomizes the delay by up to 50%, so a 60-second base delay becomes anywhere from 30 to 60 seconds. A value of 1.0 provides maximum randomization, spreading retries from 0 to the full delay duration.
Recommendation: A jitter value of 0.5 provides good distribution while keeping retries reasonably predictable. Higher values may cause some retries to happen too quickly, while lower values don't spread the load enough.
Custom Backoff Strategies
For specific error types, implement custom logic:
const worker = new Worker('shipping-quotes', processor, {
  connection: getRedisConnection(),
  settings: {
    backoffStrategy: (attemptsMade, type, err, job) => {
      // Rate limit errors: wait longer
      if (err?.message?.includes('rate limit')) {
        return 120000; // 2 minutes
      }
      // Timeout errors: retry quickly
      if (err?.message?.includes('timeout')) {
        return 10000; // 10 seconds
      }
      // Default exponential
      return Math.pow(2, attemptsMade - 1) * 60000;
    },
  },
});
Jobs opt into this strategy by setting backoff: { type: 'custom' } in their job options.
Handling Child Failures
When orchestrating parent-child workflows, check for child failures before proceeding:
async function checkChildFailures(job: Job): Promise<string[]> {
  const dependencies = await job.getDependencies({
    processed: {},
    unprocessed: {},
  });
  const failedChildren: string[] = [];
  if (dependencies.processed) {
    for (const [key, value] of Object.entries(dependencies.processed)) {
      if (value === null) {
        // null indicates child failed
        failedChildren.push(key);
      }
    }
  }
  return failedChildren;
}

// In orchestrator
const failedChildren = await checkChildFailures(job);
if (failedChildren.length > 0) {
  throw new Error(`Child jobs failed: ${failedChildren.join(', ')}`);
}
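Alternatively, if any child failure should abort the whole workflow, BullMQ supports this declaratively: setting failParentOnFailure: true in a child's job options fails the parent as soon as that child exhausts its retries, with no manual check needed. A sketch, reusing the spawn call from the orchestrator above:

await shippingQuoteQueue.add(
  `quote-${carrier}`,
  { orderId, carrier },
  {
    parent: {
      id: parentJob.id!,
      queue: parentJob.queueQualifiedName,
    },
    failParentOnFailure: true, // a failed child fails the parent automatically
  },
);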
Data Integrity and Redundancy
Job queues introduce eventual consistency. Ensure data integrity with these patterns.
Note: The following examples use PostgreSQL with Drizzle ORM for database operations. The patterns apply regardless of your database choice.
Idempotent Job Processing
Jobs may execute multiple times due to retries or stalled job recovery. Design processors to be idempotent:
async function processOrder(job: Job<OrderJobData>) {
  const { orderId } = job.data;
  // Check if already processed
  const existing = await db.query.orders.findFirst({
    where: eq(orders.id, orderId),
  });
  if (existing?.status === 'completed') {
    // Already done, return cached result
    return existing.result;
  }
  // Process and save atomically
  const result = await fulfillOrder(job.data);
  await db.update(orders).set({ status: 'completed', result }).where(eq(orders.id, orderId));
  return result;
}
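The check-then-act pattern above still leaves a small race window if two workers pick up duplicate jobs at the same moment. One way to close it, sketched against the same assumed orders table, is an atomic conditional update so that only one worker can claim the order:

import { and, eq } from 'drizzle-orm';

// Atomically flip 'pending' -> 'processing'; only one worker wins the claim
const claimed = await db
  .update(orders)
  .set({ status: 'processing' })
  .where(and(eq(orders.id, orderId), eq(orders.status, 'pending')))
  .returning({ id: orders.id });

if (claimed.length === 0) {
  return; // another worker already claimed or completed this order
}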
Deduplication
Prevent duplicate jobs for the same entity:
await orderQueue.add(
  'process-order',
  { customerId, orderId },
  {
    deduplication: {
      id: customerId, // Only one job per customer at a time
    },
  },
);
BullMQ's deduplication automatically clears when the job completes or fails, unlike custom job IDs which can cause "reserved ID" issues.
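Deduplication can also throttle bursts: adding a ttl keeps the deduplication key alive for that window regardless of whether the job has finished, so rapid repeated adds collapse into one job per window (the 60-second value below is illustrative):

await orderQueue.add(
  'process-order',
  { customerId, orderId },
  {
    deduplication: {
      id: customerId,
      ttl: 60000, // ignore duplicate adds for 60 seconds
    },
  },
);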
Database State Synchronization
Keep database state synchronized with job progress:
// Update status when job starts
worker.on('active', async (job) => {
  await db.update(orders).set({ status: 'processing', startedAt: new Date() }).where(eq(orders.id, job.data.orderId));
});

// Update on completion
worker.on('completed', async (job, result) => {
  await db
    .update(orders)
    .set({ status: 'completed', result, completedAt: new Date() })
    .where(eq(orders.id, job.data.orderId));
});

// Update on failure (job is undefined if it was removed before failing)
worker.on('failed', async (job, err) => {
  if (!job) return;
  await db
    .update(orders)
    .set({ status: 'failed', error: err.message, failedAt: new Date() })
    .where(eq(orders.id, job.data.orderId));
});
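These worker events only fire inside the worker process. To observe job transitions from another process, such as the API server, BullMQ provides QueueEvents, which relays lifecycle events through Redis. A minimal sketch:

import { QueueEvents } from 'bullmq';
import { getRedisConnection } from '../lib/redis';

const queueEvents = new QueueEvents('order-processing', {
  connection: getRedisConnection(),
});

queueEvents.on('completed', ({ jobId, returnvalue }) => {
  console.log(`[order-processing] job ${jobId} completed`, returnvalue);
});

queueEvents.on('failed', ({ jobId, failedReason }) => {
  console.error(`[order-processing] job ${jobId} failed: ${failedReason}`);
});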
Stalled Job Recovery
Jobs can become stalled if a worker crashes mid-processing. Configure stall detection:
const worker = new Worker('order-processing', processor, {
  connection: getRedisConnection(),
  stalledInterval: 30000, // Check every 30 seconds
  maxStalledCount: 2, // Retry stalled jobs twice before failing
});
Graceful Shutdown
Prevent data loss during deployments:
// server.ts
import { closeRedisConnection } from './lib/redis';
import { worker } from './workers/orchestrator.worker';

async function shutdown() {
  console.log('Shutting down gracefully...');
  // Stop accepting new jobs
  await worker.close();
  // Close Redis connection
  await closeRedisConnection();
  process.exit(0);
}

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
The worker will finish processing active jobs before closing.
Monitoring with Bull Board
Add visibility into your queues:
// routes/admin.ts
import { createBullBoard } from '@bull-board/api';
import { BullMQAdapter } from '@bull-board/api/bullMQAdapter';
import { HonoAdapter } from '@bull-board/hono';
import { serveStatic } from '@hono/node-server/serve-static';
import { orderQueue, shippingQuoteQueue } from '../queues';

const serverAdapter = new HonoAdapter(serveStatic);
serverAdapter.setBasePath('/admin/queues');

createBullBoard({
  queues: [new BullMQAdapter(orderQueue), new BullMQAdapter(shippingQuoteQueue)],
  serverAdapter,
});

export default serverAdapter.registerPlugin();
Bull Board provides real-time visibility into:
- Active, waiting, completed, and failed jobs
- Job data and return values
- Retry history and error messages
- Queue throughput metrics
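Mount the exported routes in the main app under the same base path (assuming the main Hono app lives in server.ts):

// server.ts (sketch)
import { Hono } from 'hono';
import adminRoutes from './routes/admin';

const app = new Hono();
app.route('/admin/queues', adminRoutes);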
Production Checklist
Before deploying event-driven architecture:
Redis Configuration
- Set maxmemory-policy=noeviction to prevent data loss
- Configure appropriate memory limits
- Enable persistence (RDB or AOF) for durability
- Use managed Redis (Azure Cache, AWS ElastiCache) for production
Queue Configuration
- Set appropriate removeOnComplete and removeOnFail limits
- Configure backoff strategies for each queue type
- Implement deduplication for critical jobs
- Add error event handlers to all queues
Worker Configuration
- Set maxRetriesPerRequest: null for worker connections
- Configure appropriate concurrency limits
- Implement graceful shutdown handlers
- Add comprehensive logging
Monitoring
- Deploy Bull Board for visibility
- Set up alerts for failed jobs
- Monitor Redis memory and connection counts
- Track job processing times
Key Takeaways
- Event-driven architecture decouples request handling from processing, enabling responsive APIs for long-running operations
- BullMQ provides richer job semantics than cloud queues at lower cost for most SaaS applications
- The step orchestrator pattern with WaitingChildrenError enables complex workflows with partial retries
- Idempotent processing and deduplication ensure data integrity despite retries
- Proper shutdown handling prevents data loss during deployments
- Bull Board provides essential visibility into queue health and job status
Event-driven architecture requires more upfront design than synchronous APIs, but the resilience, scalability, and user experience improvements justify the investment for any application with non-trivial background processing needs.
Ready to build with us? Check out our architecture consulting services or get in touch to discuss your project.