BoxLang AI 3.0 Series · Part 4 of 7
Here's the question every team eventually asks about their AI agents: how do we test these things?
Agents make live LLM calls. They invoke real tools. They have non-deterministic outputs. Standard unit testing approaches fall apart. You can't mock every provider. You can't replay a conversation from three weeks ago. You can't confidently tell stakeholders that the agent you deployed today behaves the same way it did when you signed off on it.
And before testing, there's production: how do you add logging without touching provider code? How do you retry transient failures without wrapping every call? How do you block dangerous tool invocations without forking the agent logic?
BoxLang AI 3.0 solves all of this with middleware. Six battle-tested middleware classes ship out of the box, covering the most common cross-cutting concerns. And if none of them fit your use case exactly, writing your own is a matter of extending one class or defining a struct of closures.
The Middleware Architecture
Middleware sits between the agent's run() call and the actual LLM invocations and tool calls. Both AiModel and AiAgent support it. When an agent runs, its middleware is prepended to the model's middleware: agent hooks fire first, then model hooks.
There are two hook styles:
Sequential hooks: fire in registration order (or in reverse for after* hooks). Return an AiMiddlewareResult to control flow.
| Hook | Fires | Direction |
|---|---|---|
beforeAgentRun( context ) | Before agent starts | Forward |
afterAgentRun( context ) | After agent completes | Reverse |
beforeLLMCall( context ) | Before each LLM call | Forward |
afterLLMCall( context ) | After each LLM call | Reverse |
beforeToolCall( context ) | Before each tool invocation | Forward |
afterToolCall( context ) | After each tool returns | Reverse |
onError( context ) | When any hook throws | N/A
Wrap hooks: nested closures. Call handler() to proceed, then intercept the result.
| Hook | Purpose |
|---|---|
wrapLLMCall( context, handler ) | Surround each LLM provider call |
wrapToolCall( context, handler ) | Surround each tool invocation |
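The agent-before-model ordering described above can be sketched as follows. This is an illustrative setup, assuming aiModel() accepts a provider name, that AiModel exposes the same withMiddleware() fluent method the agent does, and that aiAgent() accepts a model argument:

```boxlang
// Model-level middleware: applies to every agent using this model
model = aiModel( "openai" )
	.withMiddleware( new RetryMiddleware( maxRetries: 3 ) )

// Agent-level middleware is prepended, so its hooks fire first
agent = aiAgent(
	name       : "ordered-bot",
	model      : model,
	middleware : new LoggingMiddleware( logToConsole: true )
)

// beforeLLMCall order: LoggingMiddleware (agent), then RetryMiddleware (model)
// afterLLMCall order:  RetryMiddleware (model), then LoggingMiddleware (agent)
agent.run( "Hello!" )
```

Because after* hooks run in reverse, the stack behaves like nested layers: the agent's middleware wraps the model's.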
AiMiddlewareResult: Typed Flow Control
Every sequential hook must return an AiMiddlewareResult. The static factory methods make this expressive:
import bxModules.bxai.models.middleware.AiMiddlewareResult;
// Continue normally - chain proceeds
return AiMiddlewareResult.continue()
// Stop everything immediately
return AiMiddlewareResult.cancel( "Rate limit exceeded for this tenant." )
// Human approved - used by HITL middleware
return AiMiddlewareResult.approve()
// Human rejected - terminal, stops the chain
return AiMiddlewareResult.reject( "Operator rejected: amounts over $1000 require VP approval." )
// Human edited the tool args - patched args flow to the tool
return AiMiddlewareResult.edit( { correctedArgs: { amount: 100 } } )
// Suspend for async human review - terminal
return AiMiddlewareResult.suspend( { toolName: "transferFunds", args: toolArgs } )
Terminal results (cancel, reject, suspend) stop the chain immediately. Non-terminal results continue to the next middleware.
// Predicates for checking results
result.isContinue() // chain continues
result.isCancelled() // was stopped
result.isApproved() // human approved
result.isRejected() // human rejected (terminal)
result.isEdit() // args were modified
result.isSuspended() // waiting for async input (terminal)
result.isTerminal() // cancelled OR rejected OR suspended
LoggingMiddleware: Instant Observability
Drop this in and every LLM call, tool invocation, agent run start/end, and error gets logged to BoxLang's ai log file and optionally to the console, with zero code changes to your agents:
agent = aiAgent(
name : "support-bot",
middleware : new LoggingMiddleware(
logToConsole : true,
logLevel : "info",
prefix : "[SupportBot]"
)
)
The implementation is a clean example of how sequential hooks compose:
// From LoggingMiddleware.bx
AiMiddlewareResult function beforeAgentRun( required struct context ) {
emit( "Agent run starting | input: #left( toString( context.input ), 120 )#" )
return AiMiddlewareResult.continue()
}
AiMiddlewareResult function afterToolCall( required struct context ) {
var toolName = context.tool?.getName() ?: "unknown"
emit( "Tool call complete | tool: #toolName# | result: #left( toString( context.result ), 120 )#" )
return AiMiddlewareResult.continue()
}
AiMiddlewareResult function onError( required struct context ) {
emit( "Error in phase '#context.phase#': #context.error?.message#", "error" )
return AiMiddlewareResult.continue() // don't stop the chain on logging errors
}
Options:
| Option | Default | Description |
|---|---|---|
logToFile | true | Write to BoxLang ai log |
logToConsole | false | Also print to stdout |
logLevel | "info" | info, debug, warning, error |
prefix | "[AI Middleware]" | Prepended to every message |
RetryMiddleware: Resilience Without Boilerplate
LLM providers have rate limits. Networks have transient failures. RetryMiddleware wraps both LLM calls and tool calls with exponential backoff, transparently and without any code in your tools or agents:
agent = aiAgent(
name : "analyst",
middleware : new RetryMiddleware(
maxRetries : 5,
initialDelay : 2000,
backoffMultiplier : 1.5,
maxDelay : 30000
)
)
It uses the wrapLLMCall and wrapToolCall hooks: the outer wrap catches exceptions, sleeps, and retries up to maxRetries times. Non-retryable exceptions (like InvalidInput or MaxInteractionsExceeded) surface immediately:
| Option | Default | Description |
|---|---|---|
maxRetries | 3 | Attempts after first failure |
initialDelay | 1000 | First retry delay in ms |
backoffMultiplier | 2 | Multiplier applied per failure |
maxDelay | 30000 | Hard cap on delay |
nonRetryableTypes | "InvalidInput,MaxInteractionsExceeded" | Exception types to skip |
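The wrap-with-backoff pattern can be sketched like this. It is an illustrative reimplementation of the behaviour described above, not the shipped RetryMiddleware source:

```boxlang
function wrapLLMCall( required struct context, required function handler ) {
	var attempt = 0
	var delay   = variables.initialDelay
	while ( true ) {
		try {
			// proceed to the real LLM call (or the next wrap in the chain)
			return handler()
		} catch ( any e ) {
			// non-retryable exceptions and exhausted budgets surface immediately
			if ( variables.nonRetryableTypes.listFindNoCase( e.type ) || attempt >= variables.maxRetries ) {
				rethrow
			}
			attempt++
			sleep( min( delay, variables.maxDelay ) )
			delay *= variables.backoffMultiplier
		}
	}
}
```

With the defaults (1000ms initial delay, multiplier 2), failed attempts wait roughly 1s, 2s, 4s before giving up after the third retry.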
GuardrailMiddleware: Defense in Depth
Block dangerous tools entirely, or reject tool calls whose arguments match regex patterns, before they ever reach the tool:
guardrail = new GuardrailMiddleware(
blockedTools : [ "deleteRecord", "dropTable", "truncateAll" ],
argPatterns : {
runSql : [ { query: "(?i)drop|truncate|delete" } ],
sendMail: [ { to: "@competitor\\.com$" } ]
}
)
agent = aiAgent( name: "db-assistant", middleware: guardrail )
The check runs in beforeToolCall: the tool name is matched against blockedTools first, then each argument is validated against the configured regex patterns:
// From GuardrailMiddleware.bx
AiMiddlewareResult function beforeToolCall( required struct context ) {
var toolName = context.toolName ?: (context.tool?.getName() ?: "")
// 1. Blocked tool list check
if ( variables.blockedTools.findNoCase( toolName ) > 0 ) {
return AiMiddlewareResult.reject(
"GuardrailMiddleware: tool '#toolName#' is in the blocked tools list."
)
}
// 2. Argument pattern checks
if ( variables.argPatterns.keyExists( toolName ) ) {
// ... check each rule against the resolved tool arguments
}
return AiMiddlewareResult.continue()
}
| Option | Default | Description |
|---|---|---|
blockedTools | [] | Tool names always rejected (case-insensitive) |
argPatterns | {} | { toolName: [{ paramName: "regex" }] } |
HumanInTheLoopMiddleware: Keeping Humans in Control
This middleware intercepts specific tool calls and requires a human to approve, reject, or edit before execution proceeds. Two modes, two very different use cases.
CLI mode: blocks on stdin. Perfect for local scripts, automation tools, and development workflows:
agent = aiAgent(
name : "finance-bot",
middleware : new HumanInTheLoopMiddleware(
toolsRequiringApproval : [ "transferFunds", "placeOrder" ],
showArguments : true
)
)
When the LLM calls transferFunds, the terminal shows:
════════════════════════════════════════════════════
║            HUMAN APPROVAL REQUIRED               ║
════════════════════════════════════════════════════
Tool: transferFunds
Args: {"amount": 5000, "account": "12345"}
[A]pprove [R]eject [Q]uit
Decision:
Web mode: suspends the run and returns AiMiddlewareResult.suspend(). The calling code checkpoints state and presents the approval request asynchronously, via email, Slack, a web UI, whatever fits your workflow:
agent = aiAgent(
name : "finance-bot",
middleware : new HumanInTheLoopMiddleware(
mode : "web",
toolsRequiringApproval: [ "placeOrder" ]
),
checkpointer: aiMemory( "cache" )
)
// First request β the LLM wants to place an order
result = agent.run( "Order 50 units of product SKU-789", {}, { userId: "alice" } )
if ( result.isSuspended() ) {
// Send approval request to alice's manager via Slack, email, etc.
notifyManager( result.getData() )
// Store threadId for resume
session.pendingApproval = result.getData().threadId
}
// After the manager approves (in a separate request/thread)
agent.resume( "approve", session.pendingApproval )
// Or if they edit the quantity
agent.resume( "edit", session.pendingApproval, { correctedArgs: { quantity: 10 } } )
The resume path in HumanInTheLoopMiddleware reads the _resumeContext injected by AiAgent.resume(), honours the decision, and either continues, rejects, or patches the tool arguments. It then clears the context so subsequent tool calls in the same run go through the normal HITL flow again.
| Option | Default | Description |
|---|---|---|
toolsRequiringApproval | [] | Tools needing sign-off |
mode | "cli" | "cli" or "web" |
showArguments | true | Show args in CLI prompt |
approvalCallback | (none) | Custom approval function
FlightRecorderMiddleware: AI Testing Solved
This is the one that changes how you think about testing AI agents.
The problem: agent behaviour is non-deterministic. The LLM might phrase something differently each run. The tool call order might vary. Writing assertions against agent output directly is fragile. And running tests against live providers is slow, expensive, and requires network access in CI.
FlightRecorderMiddleware solves this with a record/replay approach. Record a real run once, capturing every LLM round-trip and tool invocation to a JSON fixture file. Then replay that fixture in CI without any live calls.
Three modes:
// RECORD - calls real providers and tools, saves every interaction
agent = aiAgent(
name : "weather-bot",
middleware : new FlightRecorderMiddleware( mode: "record" )
)
agent.run( "What's the weather in London and should I bring an umbrella?" )
// → Writes: .ai/flight-recorder/weather-bot-20260402-143022.json
// REPLAY - zero live calls, fully deterministic
agent = aiAgent(
name : "weather-bot",
middleware : new FlightRecorderMiddleware(
mode : "replay",
fixturePath : "tests/fixtures/weather-bot.json"
)
)
agent.run( "What's the weather in London and should I bring an umbrella?" )
// → Returns the exact same response as the recorded run
// PASSTHROUGH (default) - no recording, calls pass through normally
agent = aiAgent(
name : "weather-bot",
middleware : new FlightRecorderMiddleware() // mode: "passthrough"
)
The fixture format is human-readable JSON that you can inspect, edit, and commit to version control:
{
"version": "1",
"recordedAt": "2026-04-02T14:30: 22",
"agentName": "weather-bot",
"interactions": [
{
"seq": 1,
"type": "llm",
"request": { "model": "gpt-4o", "messages": [...], "tools": [...] },
"response": { "choices": [{ "message": { "tool_calls": [...] } }] }
},
{
"seq": 2,
"type": "tool",
"toolName": "getWeather",
"arguments": { "city": "London" },
"result": "15Β°C, overcast, 80% chance of rain"
},
{
"seq": 3,
"type": "llm",
"request": { ... },
"response": { "choices": [{ "message": { "content": "Yes, bring an umbrella..." } }] }
}
]
}
One implementation detail worth noting: the recorder flushes to disk after every interaction, not just at the end. This means if your agent crashes mid-run, the partial recording is preserved and can be inspected:
// From FlightRecorderMiddleware.bx
private void function _appendInteraction( required struct interaction ) {
var seq = variables._tape.interactions.len() + 1
arguments.interaction.seq = seq
variables._tape.interactions.append( arguments.interaction )
_saveSnapshot() // flush after every interaction - crash-safe
}
Strict vs lenient replay:
// Strict (default): throw on type mismatch, e.g. "expecting llm but tape has tool"
new FlightRecorderMiddleware( mode: "replay", strict: true )
// Lenient: skip forward to find next matching interaction type
new FlightRecorderMiddleware( mode: "replay", strict: false )
| Option | Default | Description |
|---|---|---|
mode | "passthrough" | "passthrough", "record", or "replay" |
fixturePath | "" | Path to fixture file |
fixtureDir | ".ai/flight-recorder" | Auto-generated fixture directory |
recordTools | true | Whether to capture tool interactions |
strict | true | Throw on type mismatch in replay |
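Replay mode makes agent tests CI-friendly: record once against live providers, commit the fixture, and assert against the deterministic replay. Here is a sketch using a TestBox-style spec; the fixture path and the string assertion are illustrative, and the exact shape of the run() result may differ:

```boxlang
it( "recommends an umbrella when rain is forecast", () => {
	var agent = aiAgent(
		name       : "weather-bot",
		middleware : new FlightRecorderMiddleware(
			mode        : "replay",
			fixturePath : "tests/fixtures/weather-bot.json"
		)
	)
	var result = agent.run( "What's the weather in London and should I bring an umbrella?" )
	// Deterministic: the replayed response never changes between runs,
	// so a content assertion is safe even in CI with no network access
	expect( toString( result ) ).toInclude( "umbrella" )
} )
```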
MaxToolCallsMiddleware: Runaway Agent Prevention
Simple but essential in production: caps the total number of tool invocations per agent run:
agent = aiAgent(
name : "research-bot",
middleware : new MaxToolCallsMiddleware( maxCalls: 10 )
)
The counter resets at the start of each new run() call. If the cap is hit mid-run, the chain is cancelled with a clear error message, preventing infinite tool-call loops in complex multi-step reasoning tasks.
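Conceptually, this takes just two sequential hooks: reset the counter in beforeAgentRun, then increment and check it in beforeToolCall. A sketch of the idea (not the shipped source):

```boxlang
AiMiddlewareResult function beforeAgentRun( required struct context ) {
	variables.callCount = 0 // fresh counter for every run()
	return AiMiddlewareResult.continue()
}

AiMiddlewareResult function beforeToolCall( required struct context ) {
	variables.callCount++
	if ( variables.callCount > variables.maxCalls ) {
		// terminal result: stops the run before the tool executes
		return AiMiddlewareResult.cancel(
			"MaxToolCallsMiddleware: exceeded #variables.maxCalls# tool calls in a single run."
		)
	}
	return AiMiddlewareResult.continue()
}
```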
Writing Your Own Middleware
Two approaches, depending on how much structure you want.
Struct of closures: lightweight, no class needed:
agent.withMiddleware( {
beforeToolCall: ( ctx ) => {
if ( ctx.tool?.getName() == "dangerousTool" ) {
return AiMiddlewareResult.cancel( "This tool is not allowed." )
}
return AiMiddlewareResult.continue()
},
wrapLLMCall: ( ctx, handler ) => {
var start = getTickCount()
var result = handler()
metricsService.record( "llm.latency", getTickCount() - start )
return result
},
onError: ( ctx ) => {
alertService.notify( "Agent error in #ctx.phase#: #ctx.error.message#" )
return AiMiddlewareResult.continue()
}
} )
Structs are automatically wrapped in StructMiddlewareAdapter β you only define the hooks you need.
Class-based: reusable, configurable, independently testable:
import bxModules.bxai.models.middleware.BaseAiMiddleware;
import bxModules.bxai.models.middleware.AiMiddlewareResult;
class extends="BaseAiMiddleware" {
property name="tenantId" type="string";
function init( required string tenantId ) {
variables.tenantId = arguments.tenantId
variables.name = "Tenant Audit Middleware"
variables.description = "Logs all AI tool calls to the tenant audit trail"
return this
}
AiMiddlewareResult function beforeToolCall( required struct context ) {
auditLog.record(
tenantId : variables.tenantId,
tool : context.tool?.getName() ?: "unknown",
args : context.toolCall?.function?.arguments ?: "{}"
)
return AiMiddlewareResult.continue()
}
}
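Assuming the class above is saved in a file such as TenantAuditMiddleware.bx (the file name is illustrative), attaching it looks like any other middleware:

```boxlang
agent = aiAgent(
	name       : "tenant-bot",
	middleware : new TenantAuditMiddleware( tenantId: "acme-corp" )
)
```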
Composing Middleware
Middleware stacks compose cleanly; just pass an array:
agent = aiAgent(
name : "production-agent",
middleware : [
new LoggingMiddleware( logToConsole: false ),
new RetryMiddleware( maxRetries: 3 ),
new GuardrailMiddleware( blockedTools: [ "deleteRecord" ] ),
new MaxToolCallsMiddleware( maxCalls: 15 ),
new HumanInTheLoopMiddleware( toolsRequiringApproval: [ "placeOrder" ] )
]
)
Or fluently, one at a time:
agent
.withMiddleware( new LoggingMiddleware() )
.withMiddleware( new RetryMiddleware( maxRetries: 3 ) )
.withMiddleware( new GuardrailMiddleware( blockedTools: [ "deleteRecord" ] ) )
In production, logging + retry + guardrails is the baseline stack. Add MaxToolCallsMiddleware for complex reasoning agents. Add HumanInTheLoopMiddleware for any agent touching money, data, or external systems. Use FlightRecorderMiddleware in record mode during QA and replay mode in CI.
What's Next
In Part 5, we continue with a deep dive into BoxLang AI's provider architecture: how the capability system works, how BaseService and OpenAIService are structured, how to add custom providers, and a tour of the full 17-provider ecosystem.
Full Documentation
Install Today: install-bx-module bx-ai
Professional Support