
How AI Is Transforming Corporate Treasury in 2026

Forecast accuracy benchmarks, anomaly detection that actually catches fraud, agentic copilots vs. hype, and what the auditors will demand from your model.

Arxa Intelligence team · April 7, 2026 · 13 min read · AI · Anomaly detection · Agentic · Audit

Two years ago, AI in corporate treasury was mostly a slide in a vendor deck. In 2026, it is a working layer in the cash-management stack of most mid-market and enterprise treasuries we talk to. But the gap between what works in production and what the marketing team is promising remains uncomfortably wide. This article is an attempt to close that gap with numbers: real MAPE bands, real precision and recall figures from anomaly detection, and an honest assessment of what agentic copilots actually do today versus what they will do in 2027.

Where we actually are in 2026

The honest summary: AI has moved from pilot to production in forecasting, anomaly detection, and natural-language reporting. It has not moved into autonomous payment execution, despite what some vendor pitches suggest. The treasuries that are getting measurable value from AI in 2026 share three traits: they invested in their data layer first, they kept a human in the loop on every money-moving decision, and they treat their models as auditable artifacts rather than black boxes.

The treasuries that are not getting value tend to fall into one of two camps. The first bought a "treasury AI" SKU as an add-on to their TMS and never connected it to clean data. The second tried to leapfrog directly to agentic execution without first earning trust on forecasting and anomaly detection. Both end up with a dashboard nobody looks at and a renewal conversation nobody enjoys.

The rest of this article walks through each of the major AI surfaces in a 2026 treasury stack: predictive forecasting, anomaly detection in payment flows, natural language queries, agentic copilots, and the audit and compliance posture that ties them all together. We close with a sequencing playbook for treasurers planning a 12-month rollout.

Forecasting accuracy: what is actually achievable

The single most cited AI use case in treasury is cash forecasting. It is also the one with the most inflated claims. We see decks promising "95%+ accuracy" with no definition of horizon, no definition of error metric, and no baseline. The honest framing starts with a metric and a horizon.

MAPE is not a perfect metric (it penalizes over-forecasts more heavily than under-forecasts, since the error is scaled by the actual, and it breaks down when actuals are near zero), but it is the lingua franca of forecasting accuracy and it is what your CFO will ask about. The following table shows median MAPE bands we observe across a population of 140+ Arxa Intelligence customers in 2025-2026, segmented by horizon and by the maturity of the forecasting approach.
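
Because MAPE definitions vary from deck to deck, here is a minimal sketch of the computation we mean, with a floor guard for the near-zero problem. The example numbers are illustrative, not drawn from the table below.

```python
import numpy as np

def mape(actuals, forecasts, floor=1e-9):
    """Mean absolute percentage error, in percent.

    Errors are scaled by the actual, which is why over-forecasts can
    exceed 100% while under-forecasts cannot, and why near-zero actuals
    blow the ratio up (hence the floor guard).
    """
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    denom = np.maximum(np.abs(actuals), floor)
    return float(np.mean(np.abs(actuals - forecasts) / denom) * 100.0)

# Illustrative: four weekly actuals vs. forecasts for one entity (EUR m).
print(f"{mape([12.4, 9.8, 15.1, 11.2], [11.9, 10.6, 13.8, 12.0]):.1f}%")  # 7.0%
```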

Horizon | Spreadsheet baseline | Classical ML (XGBoost, Prophet) | Modern ML (gradient boosting + transformer ensemble)
1 week | 18-25% | 5-9% | 4-7%
4 weeks | 28-35% | 9-14% | 6-10%
8 weeks | 35-42% | 13-19% | 10-16%
12 weeks | 40-50% | 18-26% | 15-22%
26 weeks | 50%+ (effectively unusable) | 28-38% | 24-32%
Median cash-forecast MAPE by horizon and approach (Arxa Intelligence customer base, 2025-2026, n=140+)

A few honest observations on these numbers. First, the gap between classical and modern ML at the 1-4 week horizon is small. If your team is good with XGBoost and has a clean feature pipeline, you can get within two percentage points of a state-of-the-art ensemble. The bigger lift is going from spreadsheet to any ML approach. That move alone typically halves the error.

Second, accuracy degrades non-linearly with horizon. The marginal information available at week 12 is much weaker than at week 1, and no amount of model sophistication recovers it. We have seen vendors quote sub-10% MAPE at 12 weeks; in every case we have audited, this was either cherry-picked, computed on aggregate (group-level) cash rather than entity-level, or measured against a forecast that was revised weekly (which is not a 12-week forecast).

Third, the variance within each band is wide. A retailer with stable weekly receipts and predictable supplier payments will land at the bottom of the modern-ML band (4-5% MAPE at 1 week). A SaaS company with lumpy ARR collections, multi-currency exposures, and quarterly commission true-ups will sit at the top of the band or above. Model choice matters less than data quality and book complexity.

Why ML beats spreadsheets, mechanically

The spreadsheet baseline is not bad because spreadsheets are bad. It is bad because spreadsheet forecasts in practice are built from rolling averages, last-year-same-week heuristics, and sales-team commits. ML approaches add three things: they ingest hundreds of features per entity (DSO trends, supplier payment terms by vendor segment, FX rate paths, calendar effects, payroll cycles), they fit non-linear relationships between those features and cash flow, and they are retrained on a cadence that matches the velocity of your business. None of those things are individually magical. Together they compound.
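
To make those mechanics concrete, here is a minimal sketch of the feature-plus-gradient-boosting pattern on synthetic data. The column names and feature set are illustrative, and a production pipeline would forecast multi-step horizons recursively rather than reading hold-out lags directly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic daily net cash flow for one entity; stands in for a real feed.
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=730, freq="D")
flows = pd.DataFrame({"date": dates, "net_flow": rng.normal(100, 20, 730)})

# Calendar and lag features of the kind described above.
flows["dow"] = flows["date"].dt.dayofweek              # weekly receipt pattern
flows["dom"] = flows["date"].dt.day                    # payroll / rent cycles
flows["month_end"] = flows["date"].dt.is_month_end.astype(int)
for lag in (1, 7, 28):                                 # recent history as features
    flows[f"lag_{lag}"] = flows["net_flow"].shift(lag)
flows = flows.dropna()

features = ["dow", "dom", "month_end", "lag_1", "lag_7", "lag_28"]
train, test = flows.iloc[:-28], flows.iloc[-28:]       # hold out the last 4 weeks

model = GradientBoostingRegressor(n_estimators=300, max_depth=3)
model.fit(train[features], train["net_flow"])
pred = model.predict(test[features])
print((abs(test["net_flow"] - pred) / abs(test["net_flow"])).mean() * 100)
```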

Anomaly detection in payment flows: the false-positive tax

Anomaly detection on payment flows is, in our experience, the highest-ROI AI use case in treasury in 2026. It is also the one where precision matters more than recall, and where a tone-deaf deployment can destroy operational trust within a quarter.

The canonical anomaly types we catch fall into three buckets:

  • Never-seen IBAN. A payment instruction to a beneficiary IBAN that has not appeared in the last 24 months for this entity, vendor, or payment category. This is the single highest-signal alert and the most common vector for invoice-redirection fraud.
  • Duplicate payment. A near-identical instruction (same amount, same beneficiary, same reference within tolerance) to a previously executed payment within a configurable window. These are usually not fraud but cost real money in recovery effort.
  • Amount spike. A payment whose amount is more than k standard deviations above the rolling distribution for this beneficiary or category, controlling for seasonality. The typical k is 3-4; lower thresholds drown the team in noise. (A minimal sketch of this check follows the list.)
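
As referenced in the last bullet, here is a minimal sketch of the amount-spike check, assuming a payments table with beneficiary and amount columns (names are illustrative). The seasonality controls mentioned above are left out for brevity.

```python
import pandas as pd

def amount_spike_alerts(payments: pd.DataFrame, k: float = 3.5,
                        window: int = 26) -> pd.DataFrame:
    """Flag payments more than k rolling standard deviations above the
    per-beneficiary mean. shift(1) keeps each payment out of its own
    baseline; min_periods suppresses alerts on thin history."""
    amounts = payments.groupby("beneficiary")["amount"]
    mean = amounts.transform(lambda s: s.rolling(window, min_periods=8).mean().shift(1))
    std = amounts.transform(lambda s: s.rolling(window, min_periods=8).std().shift(1))
    return payments[(payments["amount"] - mean) / std > k]
```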

The right metrics to evaluate an anomaly detection system are precision (of the alerts we raise, how many were genuinely actionable) and recall (of the genuine anomalies, how many did we catch). The trade-off between them is the entire game.

Anomaly type | Precision | Recall | Operational note
Never-seen IBAN | 0.78-0.88 | 0.94-0.98 | High recall is critical; this is the fraud-prevention case. The 12-22% false-positive rate is the cost of doing business.
Duplicate payment | 0.92-0.97 | 0.85-0.92 | Precision is high because the signal is strong (same amount, same beneficiary). Recall is bounded by reference-field hygiene.
Amount spike (>3 sigma) | 0.55-0.70 | 0.80-0.90 | The hardest class. Genuine amount spikes happen at quarter-end, payroll true-ups, tax settlements. Heavy false-positive tax.
Off-hours payment | 0.40-0.60 | 0.95+ | Mostly noise in multinationals with 24/7 operations. Useful for SMB-only deployments.
Anomaly detection performance by anomaly type (Arxa Intelligence customer base, 2025-2026)

The number that matters most operationally is not precision in isolation. It is the false-positive rate per analyst per day. A treasury team of three people that gets 40 alerts a day, of which 6 are genuine, will burn out on the workflow within six weeks. We target a maximum of 8-12 alerts per analyst per day with a minimum precision of 0.7. That is the operational envelope inside which a payment-anomaly system survives its first quarterly review.
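
The arithmetic behind that burnout scenario is worth spelling out:

```python
# Back-of-envelope check of the operational envelope, using the numbers above.
analysts, alerts_per_day, genuine_per_day = 3, 40, 6

precision = genuine_per_day / alerts_per_day   # 0.15, far below the 0.7 floor
load = alerts_per_day / analysts               # ~13.3 alerts/analyst/day, above the 8-12 band
dead_ends = alerts_per_day - genuine_per_day   # 34 fruitless investigations per day
print(precision, round(load, 1), dead_ends)    # 0.15 13.3 34
```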

Natural language queries: what works, what doesn't

Natural language interfaces to treasury data are now table stakes in any modern TMS. They are also the place where the gap between demo and reality is widest. Here is the honest split.

What works in 2026

Retrieval-style queries against structured data work well. Examples:

  • Show me cash by entity as of yesterday
  • List all payments above EUR 500k executed last week, grouped by counterparty
  • What is our USD net exposure across all entities right now?
  • Compare actual vs forecasted cash for entity X over the last 8 weeks

These work because they map cleanly to a SQL query against a well-modeled warehouse. A modern LLM with a schema-aware text-to-SQL layer (with a retrieval step over the schema, not a zero-shot prompt) hits roughly 88-94% query accuracy on this class. The remaining 6-12% are usually ambiguous date scoping or misunderstood aggregation. Both fail loudly enough to catch.
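
A minimal sketch of what schema-aware means here. `retrieve` and `llm_complete` are stand-ins for whatever embedding search and model client you run, not a specific vendor API, and the table docs are invented.

```python
# Illustrative schema docs; a real deployment indexes the warehouse catalog.
SCHEMA_DOCS = {
    "payments": "payments(id, entity, counterparty, amount, currency, value_date)",
    "balances": "balances(entity, account, currency, balance, as_of_date)",
    "forecasts": "forecasts(entity, horizon_weeks, forecast_cash, as_of_date)",
}

def text_to_sql(question: str, retrieve, llm_complete) -> str:
    """Retrieve only the relevant table docs, then ask the model for SQL.
    Grounding the prompt in real table and column names, rather than
    prompting zero-shot, is what lifts accuracy on retrieval queries."""
    relevant = retrieve(question, SCHEMA_DOCS, top_k=2)  # e.g. embedding similarity
    prompt = (
        "You write SQL for a treasury warehouse.\n"
        "Schema:\n" + "\n".join(relevant) + "\n"
        f"Question: {question}\n"
        "Answer with a single SQL query."
    )
    return llm_complete(prompt)
```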

What still doesn't work

Counterfactual and planning queries are still painful. Examples:

  • Forecast revenue if I cut headcount by 10%
  • What would our cash position look like if we delayed supplier payments by 15 days?
  • Should we draw on the revolver next week?

These fail not because the LLM cannot parse the question. They fail because the underlying causal model is missing. A headcount cut does not deterministically reduce revenue; it depends on which roles, what attrition, what customer-facing impact. A supplier payment delay affects working capital, supplier relationships, and DPO ratchets in ways that are not encoded in any structured table. The honest answer from an AI copilot in 2026 is: here is what the cash-flow line item would look like if we held everything else equal, and here are three risks that "everything else equal" misses.

"The natural language layer is genuinely useful for getting numbers out of the TMS in seconds instead of minutes. But the moment my CFO asks a what-if question, I am back in the planning tool with my FP&A team. That handoff has not gotten shorter."

Group Treasurer, listed European industrial group, 2026

Agentic copilots: today's reality vs. the marketing

The term "agentic" has been doing more marketing work than engineering work for two years. In treasury specifically, the gap between "agentic copilot" pitches and what is safely deployed in production is enormous. Here is the honest split, by capability.

Capability | Marketing pitch | What actually ships in production
Cash forecasting | Autonomous, self-tuning forecasts | Models retrain on a schedule, treasurer reviews and overrides assumptions weekly
Payment execution | AI executes routine payments end-to-end | AI proposes payment runs, human approves every batch above a low threshold
FX hedging | Autonomous hedge-ratio adjustment | AI suggests hedge adjustments based on policy bands, treasurer signs every trade
Anomaly response | AI investigates and resolves alerts | AI clusters and prioritizes alerts, human investigates and dispositions
Bank reconciliation | Fully autonomous match and post | 85-92% auto-match, human queue for the long tail (this one is genuinely close to autonomous)
Counterparty risk | Autonomous credit-limit adjustment | AI flags deteriorating counterparties, credit committee makes the call
Agentic capabilities in treasury: marketing vs. production reality (2026)

The pattern is clear. Agentic AI in treasury today proposes; humans approve. The exceptions are bank reconciliation (genuinely close to autonomous in 2026 because the cost of an error is bounded and reversible) and intra-day liquidity sweeps within pre-approved policy bands (autonomous because the policy bands are the approval).
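
A policy band in this sense is just a machine-checkable predicate; anything outside it falls back to a human queue. A minimal sketch, with field names invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class SweepPolicy:
    """Pre-approved band for intra-day sweeps: the band itself is the approval."""
    min_operating_balance: float         # never sweep an account below this
    max_sweep_amount: float              # per-transaction cap
    allowed_pairs: set[tuple[str, str]]  # permitted (source, target) account pairs

def sweep_is_autonomous(balance: float, amount: float,
                        source: str, target: str, policy: SweepPolicy) -> bool:
    """True only if the proposed sweep sits entirely inside the policy band;
    False routes it to the human approval queue instead."""
    return ((source, target) in policy.allowed_pairs
            and amount <= policy.max_sweep_amount
            and balance - amount >= policy.min_operating_balance)
```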

That does not mean agentic AI is hype. It means the value is in compression of the human workflow, not removal of the human. A good 2026 treasury copilot collapses a 90-minute payment-run review into a 15-minute review by surfacing the right exceptions, drafting the right journal entries, and pre-populating the right approval paths. That is a real productivity gain. It is just not the science-fiction version.

Audit and compliance: what regulators want in 2026

The compliance posture around AI in financial operations has tightened materially since the EU AI Act came into force and since national regulators (the AMF, the DGCCRF on the consumer protection side, the ECB on the systemic-bank side, and equivalent bodies elsewhere) published guidance on AI use in financial decision-making. The artifacts auditors and regulators consistently ask for in 2026 are the following.

  1. Model card. A standardized description of each model in production: training data scope, feature list, retraining cadence, intended use, known limitations, and the human role in the decision loop. The model card is now the entry point for any AI-related audit conversation.
  2. Decision log. An immutable record of every AI-influenced decision: input features, model version, output, the human who approved or overrode, and the timestamp. This is the single most important artifact and the one most often missing. (A minimal sketch follows this list.)
  3. Human-in-the-loop boundary. A documented description of which decisions are AI-proposed-only, which are AI-executed within bands, and which are AI-suggested-human-approved. Auditors want this written down and tested.
  4. Backtest and drift evidence. Periodic evidence that the model is performing within expected accuracy bands on out-of-sample data, plus alerting on drift. Quarterly is the typical cadence; monthly is increasingly the expectation.
  5. Vendor and outsourcing documentation. Where the AI is provided by a third party (which is most cases), evidence of the vendor's own model governance, data residency, and incident response. The ECB's outsourcing-of-AI guidance specifically calls this out.
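
As referenced in item 2, here is a minimal sketch of a tamper-evident decision log. The hash chain is one common way to make an append-only record concrete, not something the guidance mandates, and the field names are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

def append_decision(log: list, *, model_version: str, inputs: dict,
                    output: dict, approver: str, action: str) -> dict:
    """Append one AI-influenced decision. Each entry hashes its predecessor,
    so any later edit to the log breaks the chain and is detectable."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "inputs": inputs,            # the features the model saw
        "output": output,            # what the model proposed
        "approver": approver,        # who approved or overrode
        "action": action,            # "approved" or "overridden"
        "prev_hash": log[-1]["hash"] if log else "genesis",
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)
    return entry
```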

On the regulator side specifically: the AMF and equivalent national market authorities have been clear that material AI-driven decisions in cash and risk management require demonstrable human oversight, explainability proportionate to materiality, and incident reporting that includes AI failure modes. The DGCCRF in France focuses more on consumer-facing AI but has begun publishing guidance that B2B treasuries are reading carefully because the principles transfer. The ECB's stance on outsourced AI applies directly to any treasury that consumes AI from a SaaS vendor, which is essentially all of them.

Practical rollout: how to actually sequence AI in treasury

The single most common rollout mistake is starting with the most exciting use case (agentic copilot, autonomous hedging) instead of the most boring one (clean data, then forecasting, then anomaly detection). The boring sequence is the one that compounds. Here is the 12-month playbook we recommend to treasurers planning their first serious AI deployment.

Months 1-3: data foundation

Connect every bank, every entity, every payment rail to a single warehouse. Standardize on ISO 20022 where available. Reconcile your chart of accounts. This is unglamorous and it is the entire game. Without it, no AI use case downstream will produce trustworthy numbers. Plan for two thirds of your year-one effort to land here.

Months 4-6: forecasting

Deploy a weekly cash forecast at the entity level. Measure MAPE weekly, publish it to the treasury team, and benchmark against the spreadsheet baseline you replaced. Aim for 8-12% MAPE at the 4-week horizon by month six. If you are above 15%, the bottleneck is data quality, not model choice.

Months 7-9: anomaly detection

Turn on never-seen-IBAN and duplicate-payment detection first. Calibrate thresholds for one full month before turning on amount-spike detection. Track alerts-per-analyst-per-day as your operational health metric. Tune for precision aggressively in the first quarter of operation; recall improvements come naturally as the team's feedback loop tightens the model.

Months 10-12: copilot and audit hardening

Deploy a natural-language query layer for the retrieval use cases. Do not connect it to write-paths in payment systems in year one. In parallel, stand up the audit artifacts (model card, decision log, human-in-the-loop documentation) before your first audit cycle of the year. Year two is when you start expanding agentic surfaces within tight policy bands; year one is when you earn the right to do so.

Closing thought

The honest version of the AI-in-treasury story in 2026 is less exciting than the marketing version, and that is a good thing. The gains are real, measurable, and compounding: better forecasts, fewer fraud losses, faster reporting, tighter audits. The losses come from chasing the autonomous-execution mirage and skipping the data-foundation work. The treasurers who will look smart in 2027 are the ones spending 2026 on the boring sequence and writing down their decision logs.


Written by the Arxa Intelligence team — finance leaders, engineers, and treasury operators sharing what we've learned in the field. We don't ghostwrite under fake names; if you want to talk to whoever wrote a piece, email us at hello@arxaintelligence.com.