
Azure AI Foundry: Building Production AI Agents with TypeScript
Deploy AI agents on Azure with managed identity, Terraform provisioning, rate limit handling, and BullMQ job orchestration. Complete setup guide with working code.
Running AI agents in production is different from calling an API in a notebook. You need capacity planning, authentication that doesn't involve hardcoded keys, graceful handling of rate limits, and infrastructure that scales with demand. Azure AI Foundry provides the foundation — but the documentation assumes you'll figure out the production patterns yourself.
This guide covers the complete setup: provisioning Azure AI Foundry with Terraform, configuring model deployments with appropriate capacity, authenticating from containerized applications using managed identity, and orchestrating agent workloads through BullMQ job queues. Every code example runs in production.
[Figure: Azure AI Foundry architecture showing Terraform provisioning, Container Apps with managed identity, an AI project with model deployments, and BullMQ job orchestration]
Why Azure AI Foundry
Before diving into implementation, understand the architectural decision. Azure AI Foundry (formerly Azure AI Studio) provides:
- Unified endpoint for multiple model providers (OpenAI, Meta, Mistral, Cohere)
- Managed capacity with predictable pricing through PTU (Provisioned Throughput Units)
- Enterprise security with VNet integration, private endpoints, and managed identity
- Built-in agent capabilities with the Azure AI Agent Service
The alternative — calling OpenAI directly — works for prototypes but lacks enterprise controls. Azure AI Foundry wraps the same models with Azure's security and compliance guarantees.
Infrastructure Setup with Terraform
Azure AI Foundry infrastructure can be provisioned almost entirely with Terraform using the azurerm provider, with the azapi provider filling the gaps for resource types azurerm doesn't cover yet. The official Terraform documentation covers the foundational setup, which we extend here with production patterns.
AI Foundry Hub and Project
Azure AI Foundry uses a hub-and-spoke model. The hub contains shared resources (compute, storage, key vault), while projects contain model deployments and agent configurations:
# AI Foundry Hub - shared infrastructure
resource "azurerm_ai_foundry" "hub" {
name = "${var.prefix}-${var.environment}-ai-hub"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
storage_account_id = azurerm_storage_account.ai.id
key_vault_id = azurerm_key_vault.main.id
identity {
type = "SystemAssigned"
}
# Network isolation for production
public_network_access = var.environment == "prod" ? "Disabled" : "Enabled"
}
# AI Foundry Project - contains model deployments
resource "azurerm_ai_foundry_project" "main" {
name = "${var.prefix}-${var.environment}-ai-project"
ai_foundry_id = azurerm_ai_foundry.hub.id
location = azurerm_resource_group.main.location
identity {
type = "SystemAssigned"
}
}
Azure OpenAI Service and Model Deployment
The AI Foundry project connects to an Azure OpenAI resource for model access:
# Azure OpenAI Service
resource "azurerm_cognitive_account" "openai" {
name = "${var.prefix}-${var.environment}-openai"
resource_group_name = azurerm_resource_group.main.name
location = var.openai_location # Not all regions support all models
kind = "OpenAI"
sku_name = "S0"
identity {
type = "SystemAssigned"
}
# Network rules for production
network_acls {
default_action = var.environment == "prod" ? "Deny" : "Allow"
virtual_network_rules {
subnet_id = azurerm_subnet.container_apps.id
}
}
}
# Model deployment with capacity planning
resource "azurerm_cognitive_deployment" "gpt4" {
name = "gpt-4o"
cognitive_account_id = azurerm_cognitive_account.openai.id
model {
format = "OpenAI"
name = "gpt-4o"
version = "2024-11-20"
}
sku {
name = "GlobalStandard"
capacity = var.gpt4_capacity # Tokens per minute in thousands
}
}
resource "azurerm_cognitive_deployment" "gpt4_mini" {
name = "gpt-4o-mini"
cognitive_account_id = azurerm_cognitive_account.openai.id
model {
format = "OpenAI"
name = "gpt-4o-mini"
version = "2024-07-18"
}
sku {
name = "GlobalStandard"
capacity = var.gpt4_mini_capacity
}
}
Capacity Planning
Capacity is expressed in units of 1,000 tokens per minute (TPM): a capacity value of 30 in Terraform means a 30K TPM limit on the deployment:
# environments/dev.tfvars
gpt4_capacity = 30 # 30K TPM - sufficient for development
gpt4_mini_capacity = 60 # 60K TPM - higher for bulk operations
# environments/prod.tfvars
gpt4_capacity = 150 # 150K TPM - production workload
gpt4_mini_capacity = 300 # 300K TPM - high-volume tasks
Tip: Azure also enforces Requests Per Minute (RPM) limits alongside TPM. A deployment with 150K TPM typically allows ~900 RPM. For agent workloads with many small requests, RPM often becomes the bottleneck before TPM.
Start conservative and scale based on actual usage. Azure OpenAI offers two billing models: Standard deployments use pay-per-token pricing where you only pay for actual consumption, while Provisioned Throughput Units (PTU) are billed hourly regardless of usage. For most applications, Standard deployments with adequate TPM capacity provide the best balance of cost and performance — you get predictable rate limits without paying for idle capacity. PTU deployments make sense only when you need guaranteed throughput for latency-sensitive workloads and can consistently utilize the provisioned capacity.
When to consider PTU: If your application consistently uses 80%+ of provisioned Standard capacity during business hours, PTU may offer cost savings. Calculate:
(hourly_PTU_cost × hours) vs (token_price × average_daily_tokens). PTU also guarantees latency SLAs that Standard deployments don't provide.
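To make that comparison concrete, here is a small sketch of the arithmetic. The 6-RPM-per-1K-TPM ratio, the prices, and the PTU count are placeholders, not official figures; substitute the numbers from your Azure price sheet and your deployment's quota page.
// lib/ai/capacity-math.ts (illustrative only - substitute current Azure prices for your region and model)
interface CapacityAssumptions {
  capacityUnits: number;         // the Terraform `capacity` value, i.e. thousands of TPM
  rpmPerThousandTpm: number;     // ~6 for many chat models; confirm on your deployment's quota page
  hourlyPtuCost: number;         // placeholder $/PTU-hour from your price sheet
  ptuCount: number;              // PTUs you would provision
  pricePerMillionTokens: number; // placeholder blended $ per 1M tokens on Standard
  dailyTokens: number;           // observed average daily consumption
}

// Rate limits implied by a Standard deployment's capacity value
export function estimatedLimits(a: CapacityAssumptions) {
  return { tpm: a.capacityUnits * 1000, rpm: a.capacityUnits * a.rpmPerThousandTpm };
}

// Rough monthly cost comparison between Standard (pay-per-token) and PTU (billed hourly)
export function monthlyCostComparison(a: CapacityAssumptions) {
  const standard = (a.dailyTokens / 1_000_000) * a.pricePerMillionTokens * 30;
  const ptu = a.hourlyPtuCost * a.ptuCount * 24 * 30;
  return { standard, ptu, ptuIsCheaper: ptu < standard };
}
The output is only a first-order estimate: it ignores PTU minimums, reservations, and the fact that PTU throughput is measured differently from TPM, but it is usually enough to decide whether a detailed PTU sizing exercise is worth the effort.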
Managed Identity for Container Apps
The Container App authenticates to Azure OpenAI using managed identity — no API keys to rotate or leak. User-assigned identities are preferred over system-assigned because they can be pre-created and assigned RBAC roles before the Container App exists:
# User-assigned managed identity
resource "azurerm_user_assigned_identity" "api" {
name = "${var.prefix}-${var.environment}-api-identity"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
}
# Grant Cognitive Services User role to the identity
resource "azurerm_role_assignment" "api_openai" {
scope = azurerm_cognitive_account.openai.id
role_definition_name = "Cognitive Services OpenAI User"
principal_id = azurerm_user_assigned_identity.api.principal_id
}
# Assign identity to Container App
resource "azurerm_container_app" "api" {
name = "${var.prefix}-${var.environment}-api"
container_app_environment_id = azurerm_container_app_environment.main.id
resource_group_name = azurerm_resource_group.main.name
revision_mode = "Single"
identity {
type = "UserAssigned"
identity_ids = [azurerm_user_assigned_identity.api.id]
}
template {
container {
name = "api"
image = "${azurerm_container_registry.main.login_server}/api:${var.image_tag}"
cpu = var.api_cpu
memory = var.api_memory
# Environment variables for SDK configuration
env {
name = "AZURE_OPENAI_ENDPOINT"
value = azurerm_cognitive_account.openai.endpoint
}
env {
name = "AZURE_CLIENT_ID"
value = azurerm_user_assigned_identity.api.client_id
}
env {
name = "AZURE_OPENAI_DEPLOYMENT_GPT4"
value = azurerm_cognitive_deployment.gpt4.name
}
env {
name = "AZURE_OPENAI_DEPLOYMENT_GPT4_MINI"
value = azurerm_cognitive_deployment.gpt4_mini.name
}
}
}
}
TypeScript SDK Setup
Dependencies
Install the OpenAI SDK, which ships the AzureOpenAI client, along with the Azure identity library:
pnpm add openai @azure/identity
Client Configuration with Managed Identity
The SDK automatically uses managed identity when running on Azure:
// lib/azure-openai.ts
import { AzureOpenAI } from 'openai';
import { DefaultAzureCredential, getBearerTokenProvider } from '@azure/identity';
// Singleton client instance
let client: AzureOpenAI | null = null;
export function getOpenAIClient(): AzureOpenAI {
if (client) return client;
const endpoint = process.env.AZURE_OPENAI_ENDPOINT;
if (!endpoint) {
throw new Error('AZURE_OPENAI_ENDPOINT is required');
}
// DefaultAzureCredential automatically uses:
// - Managed Identity in Azure (Container Apps, VMs, etc.)
// - Azure CLI credentials locally
// - Environment variables (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID)
const credential = new DefaultAzureCredential({
managedIdentityClientId: process.env.AZURE_CLIENT_ID,
});
// The AzureOpenAI client from the openai package takes a token provider rather
// than a raw credential; scope it to Cognitive Services for Entra ID auth
const azureADTokenProvider = getBearerTokenProvider(credential, 'https://cognitiveservices.azure.com/.default');
client = new AzureOpenAI({
endpoint,
azureADTokenProvider,
apiVersion: '2024-10-21',
});
return client;
}
// Deployment names from environment
export const deployments = {
gpt4: process.env.AZURE_OPENAI_DEPLOYMENT_GPT4 || 'gpt-4o',
gpt4Mini: process.env.AZURE_OPENAI_DEPLOYMENT_GPT4_MINI || 'gpt-4o-mini',
};
Basic Completion Example
// lib/ai/completion.ts
import { getOpenAIClient, deployments } from '../azure-openai';
interface CompletionOptions {
prompt: string;
systemPrompt?: string;
model?: 'gpt4' | 'gpt4Mini';
maxTokens?: number;
temperature?: number;
}
export async function generateCompletion(options: CompletionOptions): Promise<string> {
const { prompt, systemPrompt, model = 'gpt4Mini', maxTokens = 1000, temperature = 0.7 } = options;
const client = getOpenAIClient();
const deployment = deployments[model];
const messages = [];
if (systemPrompt) {
messages.push({ role: 'system' as const, content: systemPrompt });
}
messages.push({ role: 'user' as const, content: prompt });
const response = await client.chat.completions.create({
model: deployment,
messages,
max_tokens: maxTokens,
temperature,
});
const content = response.choices[0]?.message?.content;
if (!content) {
throw new Error('No content in response');
}
return content;
}
Rate Limit Handling
Azure OpenAI enforces rate limits at the deployment level based on Tokens Per Minute (TPM) and Requests Per Minute (RPM). When you exceed these limits, the API returns 429 Too Many Requests with a retry-after header indicating when to retry. Production systems must handle this gracefully.
Retry with Exponential Backoff
// lib/ai/retry.ts
import OpenAI from 'openai';
interface RetryOptions {
maxRetries?: number;
baseDelay?: number; // Initial delay in ms
maxDelay?: number; // Cap on delay to prevent excessive waits
}
export async function withRetry<T>(fn: () => Promise<T>, options: RetryOptions = {}): Promise<T> {
const { maxRetries = 5, baseDelay = 1000, maxDelay = 60000 } = options;
let lastError: Error | null = null;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
// Only retry transient errors - fail fast on permanent errors
if (!isRetryable(error)) {
throw error;
}
if (attempt === maxRetries) {
break;
}
// Exponential backoff with jitter prevents thundering herd
// Attempt 0: 1-2s, Attempt 1: 2-3s, Attempt 2: 4-5s, etc.
const delay = Math.min(baseDelay * Math.pow(2, attempt) + Math.random() * 1000, maxDelay);
// IMPORTANT: Always prefer retry-after header when available
// Azure provides this header on 429 responses with exact wait time
const retryAfter = getRetryAfter(error);
const waitTime = retryAfter ? retryAfter * 1000 : delay;
console.log(`Rate limited. Retrying in ${waitTime}ms (attempt ${attempt + 1}/${maxRetries})`);
await sleep(waitTime);
}
}
throw lastError;
}
function isRetryable(error: unknown): boolean {
if (error instanceof OpenAI.APIError) {
// 429 Too Many Requests - rate limited
// 503 Service Unavailable - temporary outage
// 500 Internal Server Error - sometimes transient
return [429, 503, 500].includes(error.status ?? 0);
}
// Network errors are retryable
if (error instanceof Error && error.message.includes('ECONNRESET')) {
return true;
}
return false;
}
function getRetryAfter(error: unknown): number | null {
if (error instanceof OpenAI.APIError && error.headers) {
// Headers may be a plain record or a Headers instance depending on SDK version
const headers = error.headers as Record<string, string | undefined> | Headers;
const value = headers instanceof Headers ? headers.get('retry-after') : headers['retry-after'];
if (value) {
const seconds = parseInt(value, 10);
return Number.isNaN(seconds) ? null : seconds;
}
}
return null;
}
function sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
Usage with Retry Wrapper
// lib/ai/completion.ts
import { withRetry } from './retry';
export async function generateCompletionWithRetry(options: CompletionOptions): Promise<string> {
return withRetry(() => generateCompletion(options), {
maxRetries: 5,
baseDelay: 2000,
});
}
Building AI Agents
Agents combine LLM reasoning with tool execution. The Azure AI Agent Service provides hosted agent infrastructure, but you can also build agents directly with the SDK.
Bing Grounding for Real-Time Web Data
One of the most powerful features in Azure AI Foundry is Bing grounding — the ability to augment agent responses with live web search results. This enables agents to answer questions about current events, verify facts, and provide up-to-date information beyond the model's training cutoff.
Infrastructure: Bing Grounding Resource
Bing grounding requires a separate Microsoft.Bing/accounts resource connected to your AI Foundry project. This uses the azapi provider since the resource type isn't available in the standard azurerm provider:
# Bing Grounding resource (global location)
resource "azapi_resource" "bing_grounding" {
type = "Microsoft.Bing/accounts@2020-06-10"
name = "${var.prefix}-bing-grounding"
parent_id = azurerm_resource_group.main.id
location = "global"
schema_validation_enabled = false
body = {
kind = "Bing.Grounding"
sku = {
name = "G1"
}
properties = {
statisticsEnabled = false
}
}
response_export_values = ["properties.endpoint"]
}
# Get API keys for the Bing resource
data "azapi_resource_action" "bing_keys" {
type = "Microsoft.Bing/accounts@2020-06-10"
resource_id = azapi_resource.bing_grounding.id
action = "listKeys"
response_export_values = ["*"]
}
Connecting Bing to AI Foundry
The Bing resource must be connected to your AI Foundry project via a connection resource. This is the critical step that makes Bing grounding available as a tool for your agents. The connection uses the azapi provider because AI Foundry connections aren't available in the standard azurerm provider. Note that the snippets below reference an account-based AI Foundry setup (azurerm_cognitive_account.foundry and azapi_resource.ai_foundry_project) rather than the hub and project resources defined earlier; adjust the references to match however you provisioned the project.
Understanding the connection architecture:
- Bing resource (Microsoft.Bing/accounts) — the grounding service itself
- Connection (Microsoft.CognitiveServices/accounts/connections) — links Bing to your AI Foundry project
- Agent tool — references the connection ID when creating agents
# Connection: Bing Grounding to AI Foundry GA Project
# This enables Bing grounding capabilities for AI agents in the project
# API Version: 2025-06-01 (latest GA as of early 2026)
resource "azapi_resource" "bing_connection" {
type = "Microsoft.CognitiveServices/accounts/connections@2025-06-01"
name = "${var.prefix}-bing-connection"
parent_id = azurerm_cognitive_account.foundry.id # Your AI Foundry account
schema_validation_enabled = false # Required - schema validation fails on this resource type
body = {
properties = {
# Category must be "ApiKey" for Bing connections (not "BingGrounding")
category = "ApiKey"
target = azapi_resource.bing_grounding.output.properties.endpoint
authType = "ApiKey"
isSharedToAll = true # Makes connection available to all projects under this hub
metadata = {
Location = azurerm_resource_group.main.location # Can differ from Bing resource location
ApiType = "Azure"
ResourceId = azapi_resource.bing_grounding.id
}
credentials = {
key = data.azapi_resource_action.bing_keys.output.key1
}
}
}
depends_on = [
data.azapi_resource_action.bing_keys, # Ensure keys are available
azapi_resource.ai_foundry_project # Project must exist first
]
}
Important: The category must be "ApiKey" for Bing connections, not "BingGrounding". This is a common source of confusion — the category describes the authentication method, not the service type.
The connection ID format is specific and must be constructed correctly for the SDK:
# Pass to Container App environment
env {
name = "BING_CONNECTION_ID"
# Format: {project_id}/connections/{connection_name}
value = "${azapi_resource.ai_foundry_project.id}/connections/${azapi_resource.bing_connection.name}"
}
Troubleshooting: If your agent fails to use Bing grounding, verify the connection ID format matches exactly. The SDK silently ignores invalid connection IDs rather than throwing an error.
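Because the SDK won't complain about a malformed ID, a cheap startup assertion can surface the problem before the first agent run. This is a hypothetical guard, not part of the SDK; it only checks the {project_id}/connections/{connection_name} shape described above.
// lib/ai/assert-bing-connection.ts (hypothetical startup check, assuming the format above)
export function assertBingConnectionId(id: string | undefined): string {
  if (!id) {
    throw new Error('BING_CONNECTION_ID is not set');
  }
  // Azure resource IDs start with /subscriptions/..., and the connection name comes last
  const looksValid = id.startsWith('/subscriptions/') && /\/connections\/[^/]+$/.test(id);
  if (!looksValid) {
    throw new Error(`BING_CONNECTION_ID does not look like a project connection id: ${id}`);
  }
  return id;
}
Call it once when the service boots (for example, inside the AzureAgentService constructor) so a bad Terraform output fails loudly instead of producing ungrounded answers.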
Agent Service with Bing Grounding
The @azure/ai-agents SDK provides built-in support for Bing grounding through the ToolUtility class:
// lib/ai/azure-agent.service.ts
import { AgentsClient, ToolUtility, isOutputOfType } from '@azure/ai-agents';
import type { MessageTextContent, ThreadMessage, MessageTextUrlCitationAnnotation } from '@azure/ai-agents';
import { DefaultAzureCredential } from '@azure/identity';
interface AgentResponse {
content: string;
citations?: Array<{
title: string;
url: string;
}>;
metadata: {
agentId: string;
threadId: string;
runId: string;
tokensUsed?: {
prompt: number;
completion: number;
total: number;
};
};
}
export class AzureAgentService {
private client: AgentsClient;
private modelDeploymentName: string;
private bingConnectionId?: string;
constructor() {
const projectEndpoint = process.env.PROJECT_ENDPOINT;
this.modelDeploymentName = process.env.MODEL_DEPLOYMENT_NAME || 'gpt-4o';
this.bingConnectionId = process.env.BING_CONNECTION_ID;
if (!projectEndpoint) {
throw new Error('PROJECT_ENDPOINT environment variable is required');
}
// DefaultAzureCredential handles managed identity in Azure,
// service principal locally, and Azure CLI as fallback
const credential = new DefaultAzureCredential();
this.client = new AgentsClient(projectEndpoint, credential);
}
/**
* Create an agent with Bing grounding enabled
*/
async createAgent(name: string, instructions?: string) {
const tools = [];
if (this.bingConnectionId) {
// Configure Bing grounding with optimized search count
const bingTool = ToolUtility.createBingGroundingTool([
{
connectionId: this.bingConnectionId,
count: 20, // Number of search results (max: 50, default: 5)
},
]);
tools.push(bingTool.definition);
}
const agent = await this.client.createAgent(this.modelDeploymentName, {
name,
instructions,
tools: tools.length > 0 ? tools : undefined,
temperature: 0.2, // Low temperature for factual responses
});
return agent;
}
/**
* Send a message and get a grounded response with citations
*/
async sendMessage(agentId: string, userMessage: string): Promise<AgentResponse> {
// Create a thread for this conversation
const thread = await this.client.threads.create();
try {
// Send user message
await this.client.messages.create(thread.id, 'user', userMessage);
// Run agent and poll for completion
// IMPORTANT: The SDK's createAndPoll handles the async nature of agent runs.
// Agent runs can take 10-60+ seconds depending on:
// - Model complexity (GPT-4o vs GPT-4o-mini)
// - Tool usage (Bing grounding adds latency for web searches)
// - Token count in prompt and response
//
// Polling interval considerations:
// - Default (1 second) can exhaust rate limits with parallel operations
// - 15 seconds works well for production with multiple concurrent jobs
// - Calculate based on: capacity / (parallel_operations * polls_per_run)
const runStatus = await this.client.runs.createAndPoll(thread.id, agentId, {
pollingOptions: {
intervalInMs: 15000, // 15 seconds - tuned for 400 capacity with parallel jobs
},
});
if (runStatus.status === 'failed') {
// Common failure reasons:
// - Rate limit exceeded (429)
// - Invalid tool configuration
// - Context length exceeded
// - Content policy violation
const errorCode = runStatus.lastError?.code || 'unknown';
const errorMessage = runStatus.lastError?.message || 'Unknown error';
throw new Error(`Agent run failed: ${errorMessage} (code: ${errorCode})`);
}
// Extract response with citations
const response = await this.getLatestAgentMessage(thread.id);
return {
content: response.content,
citations: response.citations,
metadata: {
agentId,
threadId: thread.id,
runId: runStatus.id,
tokensUsed: runStatus.usage
? {
prompt: runStatus.usage.promptTokens,
completion: runStatus.usage.completionTokens,
total: runStatus.usage.totalTokens,
}
: undefined,
},
};
} finally {
// Clean up thread
await this.client.threads.delete(thread.id);
}
}
/**
* Extract text content and Bing citations from agent response
*/
private async getLatestAgentMessage(threadId: string) {
const messagesIterator = this.client.messages.list(threadId, {
order: 'desc',
limit: 1,
});
let message: ThreadMessage | undefined;
for await (const msg of messagesIterator) {
if (msg.role === 'assistant') {
message = msg;
break;
}
}
if (!message) {
throw new Error('No assistant message found');
}
const contentParts: string[] = [];
const citations: Array<{ title: string; url: string }> = [];
for (const content of message.content) {
if (isOutputOfType<MessageTextContent>(content, 'text')) {
contentParts.push(content.text.value);
// Extract URL citations from Bing grounding
if (content.text.annotations) {
for (const annotation of content.text.annotations) {
if (isOutputOfType<MessageTextUrlCitationAnnotation>(annotation, 'url_citation')) {
citations.push({
title: annotation.urlCitation.title || annotation.urlCitation.url,
url: annotation.urlCitation.url,
});
}
}
}
}
}
return {
content: contentParts.join('\n'),
citations,
};
}
async deleteAgent(agentId: string): Promise<void> {
await this.client.deleteAgent(agentId);
}
}
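A minimal usage sketch for the service above, assuming PROJECT_ENDPOINT and BING_CONNECTION_ID are configured as described; the agent name, instructions, and prompt are illustrative.
// scripts/ask-with-grounding.ts (illustrative usage of AzureAgentService)
import { AzureAgentService } from '../lib/ai/azure-agent.service';

async function main() {
  const service = new AzureAgentService();
  const agent = await service.createAgent(
    'research-agent',
    'Answer using current web sources and cite them.',
  );
  try {
    const result = await service.sendMessage(agent.id, 'What changed in the latest Azure OpenAI API version?');
    console.log(result.content);
    for (const citation of result.citations ?? []) {
      console.log(`- ${citation.title}: ${citation.url}`);
    }
    console.log('tokens used:', result.metadata.tokensUsed?.total);
  } finally {
    // Agents persist server-side until deleted - clean up ad-hoc agents
    await service.deleteAgent(agent.id);
  }
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});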
Model Compatibility
Bing grounding works with most Azure OpenAI models, but has specific version requirements:
- Compatible: gpt-4o, gpt-4, gpt-4-turbo, gpt-3.5-turbo
- Not compatible: gpt-4o-mini-2024-07-18 (older mini version)
Always verify compatibility with your specific model version in the Azure AI documentation.
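If you route between deployments dynamically, a small allow-list check keeps incompatible models from being paired with the Bing tool. The lists below simply mirror the bullets above and are an assumption to keep in sync with the official compatibility table.
// lib/ai/grounding-compat.ts (assumed allow-list - keep in sync with the Azure docs)
const GROUNDING_COMPATIBLE_PREFIXES = ['gpt-4o', 'gpt-4', 'gpt-4-turbo', 'gpt-3.5-turbo'];
const KNOWN_INCOMPATIBLE = ['gpt-4o-mini-2024-07-18'];

export function supportsBingGrounding(model: string): boolean {
  // Exact matches on the known-bad list win over the loose prefix check
  if (KNOWN_INCOMPATIBLE.includes(model)) return false;
  return GROUNDING_COMPATIBLE_PREFIXES.some((prefix) => model.startsWith(prefix));
}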
Polling Strategy for Agent Runs
Agent runs are asynchronous — the SDK's createAndPoll method handles the polling loop, but the default 1-second interval can cause problems in production:
// ❌ Default polling - can exhaust rate limits with parallel operations
const runStatus = await this.client.runs.createAndPoll(threadId, agentId);
// ✅ Production polling - tuned for your capacity and concurrency
const runStatus = await this.client.runs.createAndPoll(threadId, agentId, {
pollingOptions: {
intervalInMs: 15000, // 15 seconds between status checks
},
});
Why 15 seconds? Consider this calculation for a deployment provisioned with a capacity of 400 (400K TPM, which maps to roughly 2,400 RPM, or about 40 requests per second):
- 3 parallel BullMQ jobs running agent queries
- Each agent run takes ~30 seconds average
- Default 1-second polling = 30 polls per run × 3 jobs = 90 requests
- 15-second polling = 2 polls per run × 3 jobs = 6 requests
The longer interval dramatically reduces API calls while still providing responsive completion detection. Agent runs typically take 10-60 seconds depending on model, tools, and token count — polling every second wastes capacity.
Tip: Calculate your polling interval as capacity / (parallel_operations × expected_polls_per_run) to stay within rate limits while maintaining responsiveness.
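That heuristic can be expressed as a small helper. This is a sketch, assuming you know how much of your RPM budget you are willing to spend on status checks and how many runs poll concurrently; the 2-second floor is an arbitrary default.
// lib/ai/polling.ts (heuristic sketch for choosing a polling interval)
interface PollingBudget {
  pollingRpmBudget: number;    // requests per minute you are willing to spend on status checks
  parallelRuns: number;        // concurrent agent runs (e.g. BullMQ worker concurrency)
  expectedRunSeconds: number;  // typical end-to-end run duration
}

export function pollingIntervalMs({ pollingRpmBudget, parallelRuns, expectedRunSeconds }: PollingBudget): number {
  // Split the polling budget evenly across concurrent runs
  const pollsPerRunPerMinute = Math.max(pollingRpmBudget / Math.max(parallelRuns, 1), 1);
  const intervalSeconds = 60 / pollsPerRunPerMinute;
  // Never poll slower than a typical run takes, and no faster than every 2 seconds
  return Math.round(Math.min(Math.max(intervalSeconds, 2), expectedRunSeconds) * 1000);
}

// Example: a 12 RPM polling budget shared by 3 workers with ~30s runs
// pollingIntervalMs({ pollingRpmBudget: 12, parallelRuns: 3, expectedRunSeconds: 30 }) === 15000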
Agent with Tool Calling
// lib/ai/agent.ts
import { getOpenAIClient, deployments } from '../azure-openai';
import { withRetry } from './retry';
import type { ChatCompletionTool, ChatCompletionMessageParam } from 'openai/resources/chat/completions';
// Define available tools
const tools: ChatCompletionTool[] = [
{
type: 'function',
function: {
name: 'search_database',
description: 'Search the database for relevant records',
parameters: {
type: 'object',
properties: {
query: { type: 'string', description: 'Search query' },
limit: { type: 'number', description: 'Maximum results' },
},
required: ['query'],
},
},
},
{
type: 'function',
function: {
name: 'fetch_external_data',
description: 'Fetch data from external API',
parameters: {
type: 'object',
properties: {
endpoint: { type: 'string', description: 'API endpoint' },
params: { type: 'object', description: 'Query parameters' },
},
required: ['endpoint'],
},
},
},
];
// Tool implementations
const toolHandlers: Record<string, (args: Record<string, unknown>) => Promise<unknown>> = {
search_database: async (args) => {
// Implement database search
return { results: [], query: args.query };
},
fetch_external_data: async (args) => {
// Implement external API call
return { data: null, endpoint: args.endpoint };
},
};
export async function runAgent(userMessage: string, context?: string): Promise<string> {
const client = getOpenAIClient();
const messages: ChatCompletionMessageParam[] = [];
// System prompt with context
messages.push({
role: 'system',
content: `You are a helpful assistant. ${context || ''}
Use the available tools when you need to search for data or fetch external information.
Always explain your reasoning before taking actions.`,
});
messages.push({ role: 'user', content: userMessage });
// Agent loop - continue until no more tool calls
const maxIterations = 10;
for (let i = 0; i < maxIterations; i++) {
const response = await withRetry(() =>
client.chat.completions.create({
model: deployments.gpt4,
messages,
tools,
tool_choice: 'auto',
}),
);
const choice = response.choices[0];
const message = choice.message;
// Add assistant message to history
messages.push(message);
// Check if we're done
if (choice.finish_reason === 'stop' || !message.tool_calls?.length) {
return message.content || '';
}
// Execute tool calls
for (const toolCall of message.tool_calls) {
const handler = toolHandlers[toolCall.function.name];
if (!handler) {
throw new Error(`Unknown tool: ${toolCall.function.name}`);
}
const args = JSON.parse(toolCall.function.arguments);
const result = await handler(args);
// Add tool result to messages
messages.push({
role: 'tool',
tool_call_id: toolCall.id,
content: JSON.stringify(result),
});
}
}
throw new Error('Agent exceeded maximum iterations');
}
Orchestrating Agents with BullMQ
Long-running agent tasks belong in background jobs. BullMQ provides reliable job processing with retry logic, and critically, it prevents your API from timing out while waiting for AI responses.
Why background jobs for AI agents?
- Agent runs take 10-60+ seconds — too long for synchronous HTTP requests
- Rate limits require queuing — bursts of requests will hit 429 errors without throttling
- Retries need state — failed jobs should retry with exponential backoff, not fail silently
- Observability — job queues provide metrics on processing time, failure rates, and throughput
Queue Configuration
// lib/queues/agent-queue.ts
import { Queue } from 'bullmq';
import { Redis } from 'ioredis';
const connection = new Redis(process.env.REDIS_URL!, {
maxRetriesPerRequest: null,
});
// Job data interface
export interface AgentJobData {
taskId: string;
prompt: string;
context?: string;
userId: string;
webhookUrl?: string;
}
// Job result interface
export interface AgentJobResult {
taskId: string;
response: string;
tokensUsed: number;
duration: number;
}
// Create the queue
export const agentQueue = new Queue<AgentJobData, AgentJobResult>('agent-tasks', {
connection,
defaultJobOptions: {
attempts: 3,
backoff: {
type: 'exponential',
delay: 5000,
},
removeOnComplete: { count: 1000 },
removeOnFail: { count: 5000 },
},
});
Worker Implementation
// workers/agent-worker.ts
import { Worker, Job } from 'bullmq';
import { Redis } from 'ioredis';
import { runAgent } from '../lib/ai/agent';
import { db } from '../lib/db';
import { tasks } from '../lib/db/schema';
import { eq } from 'drizzle-orm';
import type { AgentJobData, AgentJobResult } from '../lib/queues/agent-queue';
const connection = new Redis(process.env.REDIS_URL!, {
maxRetriesPerRequest: null,
});
const worker = new Worker<AgentJobData, AgentJobResult>(
'agent-tasks',
async (job: Job<AgentJobData>) => {
const { taskId, prompt, context, webhookUrl } = job.data;
const startTime = Date.now();
// Update task status
await db.update(tasks).set({ status: 'processing' }).where(eq(tasks.id, taskId));
try {
// Run the agent
const response = await runAgent(prompt, context);
const duration = Date.now() - startTime;
// Update task with result
await db
.update(tasks)
.set({
status: 'completed',
result: response,
completedAt: new Date(),
})
.where(eq(tasks.id, taskId));
// Notify via webhook if configured
if (webhookUrl) {
await notifyWebhook(webhookUrl, { taskId, status: 'completed', response });
}
return {
taskId,
response,
tokensUsed: 0, // Extract from response metadata
duration,
};
} catch (error) {
// Update task with error
await db
.update(tasks)
.set({
status: 'failed',
error: error instanceof Error ? error.message : 'Unknown error',
})
.where(eq(tasks.id, taskId));
throw error;
}
},
{
connection,
// Concurrency: How many jobs run in parallel
// Each job makes multiple API calls (create thread, send message, poll, get response)
// With 150K TPM capacity: 5 concurrent jobs × ~4 calls each = 20 parallel API calls
concurrency: 5,
// Rate limiter: Prevents overwhelming the API during burst traffic
// This is a secondary safeguard - polling intervals handle steady-state limiting
limiter: {
max: 10, // Max 10 jobs started per duration window
duration: 60000, // Per minute - aligns with Azure's RPM limits
},
},
);
// Event handlers for observability
worker.on('completed', (job, result) => {
console.log(`Job ${job.id} completed in ${result.duration}ms`);
});
worker.on('failed', (job, error) => {
console.error(`Job ${job?.id} failed:`, error.message);
});
// IMPORTANT: Handle worker errors to prevent silent failures
worker.on('error', (error) => {
console.error('Worker error:', error.message);
});
async function notifyWebhook(url: string, payload: Record<string, unknown>): Promise<void> {
await fetch(url, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(payload),
});
}
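One production detail worth adding on top of this worker: Container Apps sends SIGTERM when a revision is replaced or scaled in, so the worker should finish in-flight jobs before exiting. A minimal shutdown hook, assuming the worker and Redis connection from the file above:
// workers/agent-worker.ts (continued) - graceful shutdown on container stop
async function shutdown(signal: string) {
  console.log(`${signal} received, closing worker...`);
  // close() stops taking new jobs and waits for active jobs to finish
  await worker.close();
  await connection.quit();
  process.exit(0);
}

process.on('SIGTERM', () => void shutdown('SIGTERM'));
process.on('SIGINT', () => void shutdown('SIGINT'));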
API Endpoint to Queue Tasks
// routes/tasks.ts
import { Hono } from 'hono';
import { zValidator } from '@hono/zod-validator';
import { z } from 'zod';
import { agentQueue } from '../lib/queues/agent-queue';
import { db } from '../lib/db';
import { tasks } from '../lib/db/schema';
import { and, eq } from 'drizzle-orm';
const app = new Hono();
const createTaskSchema = z.object({
prompt: z.string().min(1).max(10000),
context: z.string().optional(),
webhookUrl: z.string().url().optional(),
});
app.post('/', zValidator('json', createTaskSchema), async (c) => {
const user = c.get('user');
const { prompt, context, webhookUrl } = c.req.valid('json');
// Create task record
const [task] = await db
.insert(tasks)
.values({
userId: user.id,
prompt,
status: 'pending',
})
.returning();
// Queue the job
await agentQueue.add(
'process',
{
taskId: task.id,
prompt,
context,
userId: user.id,
webhookUrl,
},
{
priority: user.tier === 'premium' ? 1 : 10, // Premium users get priority
},
);
return c.json({ taskId: task.id, status: 'queued' }, 202);
});
app.get('/:id', async (c) => {
const user = c.get('user');
const taskId = c.req.param('id');
const task = await db.query.tasks.findFirst({
where: and(eq(tasks.id, taskId), eq(tasks.userId, user.id)),
});
if (!task) {
return c.json({ error: 'Task not found' }, 404);
}
return c.json(task);
});
export default app;
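From the caller's side, the 202 response means the work is queued, not done. Here is a hedged usage sketch that submits a task and polls the status endpoint until it resolves; the base URL, auth header, and poll timing are placeholders.
// examples/submit-task.ts (illustrative client for the endpoints above)
const API_URL = process.env.API_URL ?? 'https://api.example.com/tasks'; // placeholder

export async function submitAndWait(prompt: string, token: string): Promise<unknown> {
  const created = await fetch(API_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` },
    body: JSON.stringify({ prompt }),
  });
  if (created.status !== 202) {
    throw new Error(`Unexpected response: ${created.status}`);
  }
  const { taskId } = (await created.json()) as { taskId: string };

  // Poll until the worker marks the task completed or failed (max ~5 minutes here)
  for (let attempt = 0; attempt < 60; attempt++) {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const res = await fetch(`${API_URL}/${taskId}`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    const task = (await res.json()) as { status: string; result?: string; error?: string };
    if (task.status === 'completed') return task.result;
    if (task.status === 'failed') throw new Error(task.error ?? 'Task failed');
  }
  throw new Error('Timed out waiting for task');
}
If the caller supplies a webhookUrl instead, skip the polling loop entirely and let the worker push the result when it finishes.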
Optimization Strategies
Model Selection by Task Complexity
Use cheaper models for simple tasks:
// lib/ai/model-router.ts
import { generateCompletionWithRetry } from './completion';
type TaskComplexity = 'simple' | 'moderate' | 'complex';
function assessComplexity(prompt: string): TaskComplexity {
const wordCount = prompt.split(/\s+/).length;
const hasCodeRequest = /code|function|implement|debug/i.test(prompt);
const hasAnalysisRequest = /analyze|compare|evaluate|reason/i.test(prompt);
if (hasCodeRequest || hasAnalysisRequest || wordCount > 500) {
return 'complex';
}
if (wordCount > 100) {
return 'moderate';
}
return 'simple';
}
export async function smartCompletion(prompt: string, systemPrompt?: string): Promise<string> {
const complexity = assessComplexity(prompt);
const modelMap: Record<TaskComplexity, 'gpt4' | 'gpt4Mini'> = {
simple: 'gpt4Mini',
moderate: 'gpt4Mini',
complex: 'gpt4',
};
return generateCompletionWithRetry({
prompt,
systemPrompt,
model: modelMap[complexity],
});
}
Response Caching
Cache identical requests to reduce API calls:
// lib/ai/cache.ts
import { Redis } from 'ioredis';
import { createHash } from 'crypto';
import { generateCompletionWithRetry } from './completion';
const redis = new Redis(process.env.REDIS_URL!);
const CACHE_TTL = 3600; // 1 hour
function hashRequest(prompt: string, systemPrompt?: string): string {
return createHash('sha256').update(JSON.stringify({ prompt, systemPrompt })).digest('hex');
}
export async function getCachedCompletion(prompt: string, systemPrompt?: string): Promise<string | null> {
const key = `ai:completion:${hashRequest(prompt, systemPrompt)}`;
return redis.get(key);
}
export async function cacheCompletion(prompt: string, response: string, systemPrompt?: string): Promise<void> {
const key = `ai:completion:${hashRequest(prompt, systemPrompt)}`;
await redis.setex(key, CACHE_TTL, response);
}
// Wrapper with caching
export async function cachedCompletion(prompt: string, systemPrompt?: string): Promise<string> {
const cached = await getCachedCompletion(prompt, systemPrompt);
if (cached) return cached;
const response = await generateCompletionWithRetry({ prompt, systemPrompt });
await cacheCompletion(prompt, response, systemPrompt);
return response;
}
Token Usage Tracking
Monitor consumption for cost control:
// lib/ai/usage.ts
import { db } from '../db';
import { aiUsage } from '../db/schema';
import { and, eq, gte, sql } from 'drizzle-orm';
interface UsageRecord {
userId: string;
model: string;
promptTokens: number;
completionTokens: number;
totalTokens: number;
}
export async function trackUsage(usage: UsageRecord): Promise<void> {
await db.insert(aiUsage).values({
...usage,
timestamp: new Date(),
});
}
export async function getUserUsage(userId: string, since: Date): Promise<number> {
const result = await db
.select({ total: sql<number>`sum(total_tokens)` })
.from(aiUsage)
.where(and(eq(aiUsage.userId, userId), gte(aiUsage.timestamp, since)));
return result[0]?.total || 0;
}
// Enforce usage limits
export async function checkUsageLimit(userId: string, limit: number): Promise<boolean> {
const startOfMonth = new Date();
startOfMonth.setDate(1);
startOfMonth.setHours(0, 0, 0, 0);
const usage = await getUserUsage(userId, startOfMonth);
return usage < limit;
}
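To make the limit actually bite, call checkUsageLimit before accepting new work. A sketch of how that could be wired into the Hono routes as middleware; the monthly budget value and the middleware name are assumptions.
// middleware/usage-limit.ts (sketch) - reject requests once the monthly budget is spent
import { createMiddleware } from 'hono/factory';
import { checkUsageLimit } from '../lib/ai/usage';

const MONTHLY_TOKEN_LIMIT = 2_000_000; // assumed per-user budget - tune to your pricing tiers

export const usageLimit = createMiddleware(async (c, next) => {
  const user = c.get('user');
  const withinLimit = await checkUsageLimit(user.id, MONTHLY_TOKEN_LIMIT);
  if (!withinLimit) {
    return c.json({ error: 'Monthly AI usage limit reached' }, 429);
  }
  await next();
});

// Usage in routes/tasks.ts:
// app.post('/', usageLimit, zValidator('json', createTaskSchema), handler)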
Environment Variable Format
The correct format for Azure OpenAI environment variables:
# Azure OpenAI endpoint - full URL including resource name
AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
# Managed identity client ID (for user-assigned identity)
AZURE_CLIENT_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
# Deployment names - must match Terraform deployment names
AZURE_OPENAI_DEPLOYMENT_GPT4="gpt-4o"
AZURE_OPENAI_DEPLOYMENT_GPT4_MINI="gpt-4o-mini"
# API version - use latest stable
AZURE_OPENAI_API_VERSION="2024-10-21"
Common mistakes (see the validation sketch after this list):
- Using the API key instead of managed identity
- Missing the https:// prefix on the endpoint
- Confusing model names with deployment names
- Using an outdated API version
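Since the stack already uses zod for request validation, the same approach can catch these mistakes at startup instead of at the first API call. A sketch, using the variable names listed above:
// lib/env.ts (startup validation sketch using zod, matching the variables above)
import { z } from 'zod';

const envSchema = z.object({
  AZURE_OPENAI_ENDPOINT: z.string().url().startsWith('https://'),
  AZURE_CLIENT_ID: z.string().uuid().optional(), // only needed for user-assigned identity
  AZURE_OPENAI_DEPLOYMENT_GPT4: z.string().default('gpt-4o'),
  AZURE_OPENAI_DEPLOYMENT_GPT4_MINI: z.string().default('gpt-4o-mini'),
  AZURE_OPENAI_API_VERSION: z.string().default('2024-10-21'),
});

// Throws at boot with a readable message listing every missing or malformed variable
export const env = envSchema.parse(process.env);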
Key Takeaways
Building production AI agents on Azure requires deliberate infrastructure:
- Managed identity eliminates API key management and rotation
- Terraform provisioning ensures reproducible deployments across environments
- Capacity planning balances cost against rate limit headroom
- Exponential backoff with retry-after headers handles rate limits gracefully
- BullMQ orchestration moves long-running tasks to background workers
- Model routing optimizes costs by matching task complexity to model capability
- Response caching reduces redundant API calls
The Azure AI Foundry ecosystem provides enterprise-grade infrastructure for AI workloads. The patterns here scale from prototype to production without architectural rewrites.
Ready to build AI-powered features? Check out our development services or get in touch to discuss your project.
Further Reading
Official Microsoft documentation for the technologies covered in this guide:
- Create an Azure AI Foundry hub using Terraform — Infrastructure provisioning fundamentals
- Create Azure AI Foundry resources using Terraform — Projects, deployments, and connections
- Azure OpenAI Service quotas and limits — TPM/RPM limits and quota management
- Provisioned Throughput Units (PTU) onboarding — When and how to use PTU deployments
- Managed identities in Azure Container Apps — Authentication without secrets
- Bing grounding tool for Azure AI Agents — Real-time web search integration
- Azure AI Agents SDK for JavaScript — Getting started with the agents SDK
- Azure OpenAI Service REST API reference — Complete API documentation