
Event-Driven Architecture with Hono and BullMQ: A Production Guide
Build resilient, scalable APIs using the step orchestrator pattern with BullMQ. Learn partial retries, data integrity, and why Redis queues beat cloud-native alternatives.
Your API receives a request to process an order. This order requires checking inventory across multiple warehouses, processing payment through a gateway, calculating shipping with three different carriers, and sending confirmation notifications. Each step can fail independently. Some steps take 30 seconds, others take 5 minutes. The user expects a response in under 2 seconds.
Traditional synchronous request-response architecture breaks down here. Event-driven architecture with job queues solves these problems elegantly — but getting it right for production requires understanding patterns that aren't obvious from documentation.
This guide covers implementing event-driven architecture with Hono (a lightweight TypeScript API framework) and BullMQ (a Redis-based job queue), focusing on the step orchestrator pattern, partial retries, data integrity, and the tradeoffs involved.
Why Event-Driven Architecture?
Synchronous APIs hit limits quickly when operations are:
- Long-running — LLM calls, file processing, or external API integrations
- Composed of multiple steps — Workflows where each step depends on previous results
- Failure-prone — External services that may be temporarily unavailable
- Resource-intensive — Operations that would block your API from handling other requests
Event-driven architecture decouples the request from the processing:
// Synchronous: blocks until every step completes
app.post('/api/orders/process', async (c) => {
  const { items, order, address, user } = await c.req.json();
  const inventory = await checkInventory(items); // 10s
  const payment = await processPayment(order); // 15s
  const shipping = await calculateShipping(address); // 5s
  const confirmation = await sendConfirmation(user); // 5s
  return c.json(confirmation);
});

// Event-driven: returns immediately
app.post('/api/orders/process', async (c) => {
  const { orderId, userId } = await c.req.json();
  const job = await orderQueue.add('process-order', { orderId, userId });
  return c.json({ jobId: job.id, status: 'processing' });
});
The first approach ties up a connection for the entire workflow: 35 seconds in the best case, and far longer when any step is slow or retries. The second returns in milliseconds and processes the work asynchronously.
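Once the API hands back a jobId, clients need a way to check progress. A minimal status endpoint, sketched below (the route path is an assumption, not from the original), reads the job's state directly from the queue:

// Sketch: poll job status by ID (route path is illustrative)
app.get('/api/orders/status/:jobId', async (c) => {
  const job = await orderQueue.getJob(c.req.param('jobId'));
  if (!job) {
    return c.json({ error: 'Job not found' }, 404);
  }
  const state = await job.getState(); // 'waiting' | 'active' | 'completed' | 'failed' | ...
  return c.json({
    jobId: job.id,
    state,
    result: job.returnvalue ?? null, // set once the job completes
    failedReason: job.failedReason ?? null,
  });
});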
Why BullMQ Over Cloud-Native Queues?
Before diving into implementation, consider why you might choose BullMQ over AWS SQS, Azure Service Bus, or Google Cloud Pub/Sub.
Advantages of BullMQ
Cost predictability — Redis pricing is straightforward. Cloud queues charge per message, per API call, and sometimes per GB transferred. High-throughput applications can see surprising bills.
Local development — Run Redis locally with zero configuration. Cloud queues require either cloud credentials in development or local emulators that don't perfectly replicate production behavior.
Rich job semantics — BullMQ provides features cloud queues lack: job priorities, rate limiting, delayed jobs, repeatable jobs, and parent-child job relationships. Implementing these on SQS requires significant custom code.
Visibility and debugging — Bull Board provides real-time job monitoring. Cloud queue dashboards are often limited or require additional tooling.
Portability — Redis runs anywhere. Switching cloud providers doesn't require rewriting your queue logic.
When Cloud Queues Win
Massive scale — If you're processing millions of messages per second, managed cloud queues handle the infrastructure complexity.
Existing cloud investment — Deep integration with AWS Lambda, Azure Functions, or similar serverless platforms.
Zero Redis operations — If managing Redis (even managed Redis) is unacceptable overhead.
For most SaaS applications processing thousands to hundreds of thousands of jobs daily, BullMQ offers the better balance of features, cost, and operational simplicity.
Setting Up BullMQ with Hono
Installation
npm install bullmq ioredis hono
Redis Connection Configuration
BullMQ uses ioredis under the hood. Create a shared connection to minimize Redis connection count:
// lib/redis.ts
import { Redis } from 'ioredis';

let connection: Redis | null = null;

export function getRedisConnection(): Redis {
  if (!connection) {
    const redisUrl = process.env.REDIS_URL;
    if (!redisUrl) {
      throw new Error('REDIS_URL environment variable is required');
    }
    connection = new Redis(redisUrl, {
      maxRetriesPerRequest: null, // Required for BullMQ workers
      enableReadyCheck: false,
    });
    connection.on('error', (err) => {
      console.error('[Redis Error]', err.message);
    });
  }
  return connection;
}

export async function closeRedisConnection(): Promise<void> {
  if (connection) {
    await connection.quit();
    connection = null;
  }
}
Two critical settings for BullMQ:
- maxRetriesPerRequest: null — Workers need this to handle blocking operations
- enableReadyCheck: false — Prevents startup issues with some Redis configurations
Queue Definition
Define queues with typed job data:
// queues/order.queue.ts
import { Queue } from 'bullmq';
import { getRedisConnection } from '../lib/redis';

interface OrderJobData {
  orderId: string;
  userId: string;
}

export const orderQueue = new Queue<OrderJobData>('order-processing', {
  connection: getRedisConnection(),
  defaultJobOptions: {
    removeOnComplete: 10, // Keep last 10 completed jobs
    removeOnFail: 5, // Keep last 5 failed jobs
    attempts: 5, // Retry up to 5 times
    backoff: {
      type: 'exponential',
      delay: 60000, // 1 minute initial delay
    },
  },
});

orderQueue.on('error', (err) => {
  console.error('[Queue Error]', err.message);
});
API Integration with Hono
// routes/orders.ts
import { Hono } from 'hono';
import { orderQueue } from '../queues/order.queue';

const app = new Hono();

app.post('/orders/process', async (c) => {
  const { userId, items } = await c.req.json();
  const orderId = crypto.randomUUID();
  // Persist the order and its items before enqueueing (omitted for brevity)
  const job = await orderQueue.add(
    'process-order',
    { orderId, userId },
    {
      // Prevent duplicate jobs for the same order
      deduplication: { id: orderId },
      priority: 2,
    },
  );
  return c.json({
    orderId,
    jobId: job.id,
    status: 'processing',
  });
});

export default app;
The Step Orchestrator Pattern
Simple workflows process a single task per job. Complex workflows — like processing an order that requires inventory checks, payment processing, and shipping calculations running in parallel — need orchestration.
BullMQ's FlowProducer allows defining parent-child job relationships, but it has a limitation: child jobs must be known upfront. When children dynamically spawn grandchildren (like an orchestrator that spawns carrier-specific jobs based on runtime data), FlowProducer becomes unreliable.
The step orchestrator pattern solves this by moving the job into a waiting-children state (via moveToWaitingChildren) and throwing WaitingChildrenError to pause it while its children complete:
// workers/orchestrator.worker.ts
import { Worker, Job, WaitingChildrenError } from 'bullmq';
import { getRedisConnection } from '../lib/redis';
import { shippingQuoteQueue } from '../queues/shipping-quote.queue';

interface OrchestratorData {
  orderId: string;
  step?: 'spawn-children' | 'collect-results';
}

interface OrchestratorResult {
  success: boolean;
  results?: unknown[];
}

export const worker = new Worker<OrchestratorData, OrchestratorResult>(
  'orchestrator',
  async (job: Job<OrchestratorData>, token?: string) => {
    const step = job.data.step || 'spawn-children';
    if (step === 'spawn-children') {
      // Step 1: Spawn child jobs
      await spawnChildJobs(job);
      // Update job data for next step
      await job.updateData({ ...job.data, step: 'collect-results' });
      // Pause this job until children complete. moveToWaitingChildren
      // returns false if all children have already finished.
      if (await job.moveToWaitingChildren(token!)) {
        throw new WaitingChildrenError();
      }
    }
    // Step 2: Children are done, collect results
    const childResults = await job.getChildrenValues();
    const results = Object.values(childResults);
    return { success: true, results };
  },
  {
    connection: getRedisConnection(),
    concurrency: 5,
  },
);

async function spawnChildJobs(parentJob: Job) {
  const { orderId } = parentJob.data;
  // Spawn multiple child jobs that run in parallel
  const childPromises = ['fedex', 'ups', 'dhl'].map((carrier) =>
    shippingQuoteQueue.add(
      `quote-${carrier}`,
      { orderId, carrier },
      {
        parent: {
          id: parentJob.id!,
          queue: parentJob.queueQualifiedName,
        },
      },
    ),
  );
  await Promise.all(childPromises);
}
How WaitingChildrenError Works
- Parent job spawns children with the parent option linking them to it
- Parent calls moveToWaitingChildren and throws WaitingChildrenError
- BullMQ moves the parent to the "waiting-children" state
- Children execute in parallel on their respective workers (a child worker sketch follows below)
- When all children complete, the parent resumes from where it left off
- Parent accesses child results via getChildrenValues()
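The children themselves are ordinary jobs on their own queue. A minimal child worker, sketched here (fetchQuote is a hypothetical helper standing in for a real carrier API call), simply returns a value; that return value is what the parent later reads via getChildrenValues():

// workers/shipping-quote.worker.ts (sketch)
import { Worker } from 'bullmq';
import { getRedisConnection } from '../lib/redis';

const quoteWorker = new Worker(
  'shipping-quotes',
  async (job) => {
    const { orderId, carrier } = job.data;
    // fetchQuote is hypothetical: call the carrier's API for this order
    const quote = await fetchQuote(carrier, orderId);
    // The return value becomes this child's entry in getChildrenValues()
    return { carrier, amountCents: quote.amountCents };
  },
  { connection: getRedisConnection(), concurrency: 10 },
);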
Multi-Step Orchestration
For workflows with more than two steps, extend the pattern:
// Inside the processor: async (job, token) => { ... }
const step = job.data.step || 'validate-order';
switch (step) {
  case 'validate-order': {
    const validation = await validateOrder(job.data);
    await spawnShippingQuoteJobs(job, validation);
    await job.updateData({ ...job.data, step: 'wait-for-quotes' });
    if (await job.moveToWaitingChildren(token!)) {
      throw new WaitingChildrenError();
    }
  }
  // falls through when children already finished
  case 'wait-for-quotes': {
    const quoteResults = await job.getChildrenValues();
    await spawnPaymentJob(job, quoteResults);
    await job.updateData({ ...job.data, step: 'wait-for-payment' });
    if (await job.moveToWaitingChildren(token!)) {
      throw new WaitingChildrenError();
    }
  }
  // falls through when children already finished
  case 'wait-for-payment': {
    const paymentResults = await job.getChildrenValues();
    return { success: true, confirmation: paymentResults };
  }
}
Each step spawns children, updates the job's step marker, and pauses via moveToWaitingChildren plus WaitingChildrenError. When the children complete, the job resumes at the next step.
Partial Retries and Error Handling
One major advantage of the step orchestrator pattern is partial retries. If step 3 fails after steps 1 and 2 succeeded, only step 3 retries — not the entire workflow.
Configuring Retry Behavior
const queue = new Queue('shipping-quotes', {
  connection: getRedisConnection(),
  defaultJobOptions: {
    attempts: 5,
    backoff: {
      type: 'exponential',
      delay: 60000, // 1 minute, doubled on each retry
    },
  },
});
With exponential backoff starting at 60 seconds:
- Attempt 1: Immediate
- Attempt 2: 1 minute delay
- Attempt 3: 2 minutes delay
- Attempt 4: 4 minutes delay
- Attempt 5: 8 minutes delay
Total time to exhaust retries: ~15 minutes, giving transient failures time to resolve.
Adding Jitter to Prevent Thundering Herds
When many jobs fail simultaneously (like when an external API goes down), they all retry at the same time — creating a "thundering herd" that can overwhelm the recovering service.
Jitter adds randomness to retry delays, spreading retries over time instead of clustering them:
backoff: {
  type: 'exponential',
  delay: 60000,
  jitter: 0.5, // Randomize delay by up to 50%
}
Without jitter, 100 failed jobs retry exactly at the 1-minute mark. With jitter: 0.5, those same jobs retry randomly between 30 seconds and 1 minute, distributing the load.
The jitter value ranges from 0 to 1, representing the percentage of randomization applied to the delay. A value of 0 means no randomization — all retries happen at exactly the calculated delay. A value of 0.5 randomizes the delay by up to 50%, so a 60-second base delay becomes anywhere from 30 to 60 seconds. A value of 1.0 provides maximum randomization, spreading retries from 0 to the full delay duration.
Recommendation: A jitter value of 0.5 provides good distribution while keeping retries reasonably predictable. Higher values may cause some retries to happen too quickly, while lower values don't spread the load enough.
Custom Backoff Strategies
For specific error types, implement custom logic:
const worker = new Worker('shipping-quotes', processor, {
  connection: getRedisConnection(),
  settings: {
    backoffStrategy: (attemptsMade, type, err, job) => {
      // Rate limit errors: wait longer
      if (err?.message?.includes('rate limit')) {
        return 120000; // 2 minutes
      }
      // Timeout errors: retry quickly
      if (err?.message?.includes('timeout')) {
        return 10000; // 10 seconds
      }
      // Default exponential
      return Math.pow(2, attemptsMade - 1) * 60000;
    },
  },
});
Jobs opt into this strategy by setting backoff: { type: 'custom' } in their job options.
Handling Child Failures
When orchestrating parent-child workflows, check for child failures before proceeding:
async function checkChildFailures(job: Job): Promise<string[]> {
  const dependencies = await job.getDependencies({
    processed: {},
    unprocessed: {},
  });
  const failedChildren: string[] = [];
  if (dependencies.processed) {
    for (const [key, value] of Object.entries(dependencies.processed)) {
      if (value === null) {
        // null indicates child failed
        failedChildren.push(key);
      }
    }
  }
  return failedChildren;
}

// In orchestrator
const failedChildren = await checkChildFailures(job);
if (failedChildren.length > 0) {
  throw new Error(`Child jobs failed: ${failedChildren.join(', ')}`);
}
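Alternatively, if any child failure should abort the whole workflow, BullMQ supports this declaratively: setting failParentOnFailure: true in a child's job options fails the parent as soon as that child exhausts its retries, with no manual check needed. A sketch, reusing the spawn call from the orchestrator above:

await shippingQuoteQueue.add(
  `quote-${carrier}`,
  { orderId, carrier },
  {
    parent: {
      id: parentJob.id!,
      queue: parentJob.queueQualifiedName,
    },
    failParentOnFailure: true, // a failed child fails the parent automatically
  },
);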
Data Integrity and Redundancy
Job queues introduce eventual consistency. Ensure data integrity with these patterns.
Note: The following examples use PostgreSQL with Drizzle ORM for database operations. The patterns apply regardless of your database choice.
Idempotent Job Processing
Jobs may execute multiple times due to retries or stalled job recovery. Design processors to be idempotent:
async function processOrder(job: Job<OrderJobData>) {
  const { orderId } = job.data;
  // Check if already processed
  const existing = await db.query.orders.findFirst({
    where: eq(orders.id, orderId),
  });
  if (existing?.status === 'completed') {
    // Already done, return cached result
    return existing.result;
  }
  // Process and save atomically
  const result = await fulfillOrder(job.data);
  await db.update(orders).set({ status: 'completed', result }).where(eq(orders.id, orderId));
  return result;
}
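The check-then-act pattern above still leaves a small race window if two workers pick up duplicate jobs at the same moment. One way to close it, sketched against the same assumed orders table, is an atomic conditional update so that only one worker can claim the order:

import { and, eq } from 'drizzle-orm';

// Atomically flip 'pending' -> 'processing'; only one worker wins the claim
const claimed = await db
  .update(orders)
  .set({ status: 'processing' })
  .where(and(eq(orders.id, orderId), eq(orders.status, 'pending')))
  .returning({ id: orders.id });

if (claimed.length === 0) {
  return; // another worker already claimed or completed this order
}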
Deduplication
Prevent duplicate jobs for the same entity:
await orderQueue.add(
  'process-order',
  { customerId, orderId },
  {
    deduplication: {
      id: customerId, // Only one job per customer at a time
    },
  },
);
BullMQ's deduplication automatically clears when the job completes or fails, unlike custom job IDs which can cause "reserved ID" issues.
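Deduplication can also throttle bursts: adding a ttl keeps the deduplication key alive for that window regardless of whether the job has finished, so rapid repeated adds collapse into one job per window (the 60-second value below is illustrative):

await orderQueue.add(
  'process-order',
  { customerId, orderId },
  {
    deduplication: {
      id: customerId,
      ttl: 60000, // ignore duplicate adds for 60 seconds
    },
  },
);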
Database State Synchronization
Keep database state synchronized with job progress:
// Update status when job starts
worker.on('active', async (job) => {
  await db.update(orders).set({ status: 'processing', startedAt: new Date() }).where(eq(orders.id, job.data.orderId));
});

// Update on completion
worker.on('completed', async (job, result) => {
  await db
    .update(orders)
    .set({ status: 'completed', result, completedAt: new Date() })
    .where(eq(orders.id, job.data.orderId));
});

// Update on failure (job is undefined if it was removed before failing)
worker.on('failed', async (job, err) => {
  if (!job) return;
  await db
    .update(orders)
    .set({ status: 'failed', error: err.message, failedAt: new Date() })
    .where(eq(orders.id, job.data.orderId));
});
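These worker events only fire inside the worker process. To observe job transitions from another process, such as the API server, BullMQ provides QueueEvents, which relays lifecycle events through Redis. A minimal sketch:

import { QueueEvents } from 'bullmq';
import { getRedisConnection } from '../lib/redis';

const queueEvents = new QueueEvents('order-processing', {
  connection: getRedisConnection(),
});

queueEvents.on('completed', ({ jobId, returnvalue }) => {
  console.log(`[order-processing] job ${jobId} completed`, returnvalue);
});

queueEvents.on('failed', ({ jobId, failedReason }) => {
  console.error(`[order-processing] job ${jobId} failed: ${failedReason}`);
});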
Stalled Job Recovery
Jobs can become stalled if a worker crashes mid-processing. Configure stall detection:
const worker = new Worker('order-processing', processor, {
  connection: getRedisConnection(),
  stalledInterval: 30000, // Check every 30 seconds
  maxStalledCount: 2, // Retry stalled jobs twice before failing
});
Graceful Shutdown
Prevent data loss during deployments:
// server.ts
import { closeRedisConnection } from './lib/redis';
import { worker } from './workers/orchestrator.worker';

async function shutdown() {
  console.log('Shutting down gracefully...');
  // Stop accepting new jobs
  await worker.close();
  // Close Redis connection
  await closeRedisConnection();
  process.exit(0);
}

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
The worker will finish processing active jobs before closing.
Monitoring with Bull Board
Add visibility into your queues:
// routes/admin.ts
import { createBullBoard } from '@bull-board/api';
import { BullMQAdapter } from '@bull-board/api/bullMQAdapter';
import { HonoAdapter } from '@bull-board/hono';
import { serveStatic } from '@hono/node-server/serve-static';
import { orderQueue, shippingQuoteQueue } from '../queues';

const serverAdapter = new HonoAdapter(serveStatic);
serverAdapter.setBasePath('/admin/queues');

createBullBoard({
  queues: [new BullMQAdapter(orderQueue), new BullMQAdapter(shippingQuoteQueue)],
  serverAdapter,
});

export default serverAdapter.registerPlugin();
Bull Board provides real-time visibility into:
- Active, waiting, completed, and failed jobs
- Job data and return values
- Retry history and error messages
- Queue throughput metrics
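Mount the exported routes in the main app under the same base path (assuming the main Hono app lives in server.ts):

// server.ts (sketch)
import { Hono } from 'hono';
import adminRoutes from './routes/admin';

const app = new Hono();
app.route('/admin/queues', adminRoutes);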
Production Checklist
Before deploying event-driven architecture:
Redis Configuration
- Set maxmemory-policy=noeviction to prevent data loss
- Configure appropriate memory limits
- Enable persistence (RDB or AOF) for durability
- Use managed Redis (Azure Cache, AWS ElastiCache) for production
Queue Configuration
- Set appropriate removeOnComplete and removeOnFail limits
- Configure backoff strategies for each queue type
- Implement deduplication for critical jobs
- Add error event handlers to all queues
Worker Configuration
- Set maxRetriesPerRequest: null for worker connections
- Configure appropriate concurrency limits
- Implement graceful shutdown handlers
- Add comprehensive logging
Monitoring
- Deploy Bull Board for visibility
- Set up alerts for failed jobs
- Monitor Redis memory and connection counts
- Track job processing times
Key Takeaways
- Event-driven architecture decouples request handling from processing, enabling responsive APIs for long-running operations
- BullMQ provides richer job semantics than cloud queues at lower cost for most SaaS applications
- The step orchestrator pattern with WaitingChildrenError enables complex workflows with partial retries
- Idempotent processing and deduplication ensure data integrity despite retries
- Proper shutdown handling prevents data loss during deployments
- Bull Board provides essential visibility into queue health and job status
Event-driven architecture requires more upfront design than synchronous APIs, but the resilience, scalability, and user experience improvements justify the investment for any application with non-trivial background processing needs.
Ready to build with us? Check out our architecture consulting services or get in touch to discuss your project.