Most AI tools entering audit promise 80%+ reductions in testing time. Pilots are widespread, but scaled adoption has not arrived. We believe one reason lies in a technical problem that most vendors are not discussing openly: the underlying models are probabilistic, and audit has almost no tolerance for that.
This executive brief presents what we learned from 700+ control test executions about why AI accuracy degrades in audit workflows and what it takes to produce workpapers that can withstand professional scrutiny.
A 75% tumor reduction is a medical breakthrough, but a 75% accuracy rate in an audit working paper is a serious professional risk. The domains where AI has succeeded so far either tolerate non-determinism or are experimental frontiers where errors are expected.
The large language models powering these tools can produce different outputs from the same input. Despite major breakthroughs over the past year, the underlying technology remains probabilistic. In fields such as software development, organizations can often absorb that variability. Adoption among developers has accelerated despite evidence that AI-generated code produces 1.7x more bugs than human code and despite 61% of developers saying AI output “looks correct but isn’t reliable.”
A key governance function like internal audit has almost no tolerance for these errors. In our evaluation of available tools, we found limited transparency on how products account for non-determinism to ensure accuracy and consistency of outputs.
Other critical issues include a tendency to accept the auditor’s framing and to surface only what the model deems important. These systems do not reliably signal uncertainty and may present incorrect conclusions with unwarranted confidence.
In our testing across more than 700 control test executions with and without AI, workpapers prepared by a junior auditor with visible errors in testing, reasoning, and documentation were often easier to review and correct. AI-generated workpapers, by contrast, tended to present confident tone, structured prose, and plausible reasoning even when the conclusions were wrong. Auditors therefore spent time not only evaluating the control, but also disproving incorrect AI-generated analysis.
The layers of technology built around these models, including retrieval, tool-use frameworks, and memory systems, can help, but each one also introduces a new failure point. Retrieval may improve general analysis, yet in our testing both retrieval and tool execution were themselves inconsistent across runs, which weakened the technical quality of the resulting workpapers.
A 2025 study in the Journal of Business Research suggested that structured guardrails can reduce non-deterministic responses. In our testing, however, overly rigid guardrails also constrained reasoning and weakened technical quality.
For AI to work in internal audit, systems must be designed to manage probabilistic behavior and produce audit-grade workpapers and reports that can withstand professional scrutiny.
These technical challenges have a direct human cost. When AI produces errors in audit working papers, the consequences extend far beyond the individual mistake.
In one of our experiments, we tested a bank reconciliation control 44 times. In one of those tests, during Phase 2, the AI generated a complete working paper with detailed evidence references and structured conclusions. A detailed review, however, revealed inconsistencies in the testing.
We interviewed the auditor who led the experimental test; the first reaction was straightforward: “If this error is here, what else is wrong?” There was no practical way to satisfy that professional skepticism without returning to the source documents and starting over.
The resulting rework largely erased the productivity gain and materially degraded the auditor experience.
Other industries are seeing similar patterns, but the consequences are more severe in audit because tolerance for error is lower. For a control that would typically require 20 hours of manual testing, an AI-generated workpaper with errors consumed 14 hours of rework, or 70% of the original manual effort.
Reviewing AI output creates overhead that does not exist when you draft the work yourself. Each test-step conclusion propagates into the executive summary, overall conclusion, exception details, and cross-references. Our test auditors reported performing rigorous line-by-line reviews to ensure no contradictory language or residual AI-generated phrasing remained.
For larger sample sizes or more complex controls such as lease accounting and revenue recognition, rework time exceeded the total manual effort by even wider margins.
We ran 64 identical tests of a bank reconciliation control (9 test steps per run) against the same evidence documents over four months. The key evaluation criteria were whether the AI correctly followed the test steps, performed accurate calculations, and documented the work in adequate detail. We also evaluated the AI’s ability to produce audit-grade workpapers that could withstand auditor scrutiny.
Phase 1 — Direct Prompting (Runs 1–9): 33% accuracy. We passed control evidence and test criteria to the LLM in a single unstructured prompt. Results were inconsistent, and the AI rarely tested more than 3 of the 9 steps accurately. Both Pass and Fail verdicts appeared essentially accidental, with no way to distinguish a correct conclusion from a lucky one.
Phase 2 — Structured Prompting (Runs 10–26): 33–67% accuracy, volatile. We broke test steps into individual structured prompts with explicit pass/fail criteria. Accuracy improved but the inconsistency remained. The same test procedure could produce different results on consecutive runs.
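To make the Phase 2 approach concrete, here is a minimal sketch of per-step structured prompting. The step wording, criteria, and function names are hypothetical illustrations, not the actual prompts used in our testing.

```python
# Illustrative sketch only: the step wording and pass/fail criteria below
# are hypothetical examples, not the prompts used in the study.

def build_step_prompt(step_id, instruction, pass_if, fail_if, evidence):
    """Render one test step as its own prompt with explicit binary
    criteria, instead of packing all steps into a single request."""
    return (
        f"Test step {step_id}: {instruction}\n"
        f"PASS if: {pass_if}\n"
        f"FAIL if: {fail_if}\n"
        f"Evidence:\n{evidence}\n"
        "Answer PASS or FAIL and cite the evidence you relied on."
    )

steps = [
    ("1", "Confirm the reconciliation was prepared within 5 business days of month end.",
     "the preparation date is within 5 business days of month end",
     "the preparation date is missing or later than 5 business days"),
    ("2", "Confirm the preparer and reviewer are different individuals.",
     "the preparer's name differs from the reviewer's name",
     "the preparer and reviewer are the same person"),
]

prompts = [build_step_prompt(*s, evidence="<evidence excerpt>") for s in steps]
print(len(prompts), "prompts, one per test step")
```

Splitting the work this way gives each step its own explicit pass/fail criteria, which is what distinguished Phase 2 from the single-prompt approach of Phase 1.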
Phase 3 — Optimized Prompting with Context (Runs 27–38): 56–67% accuracy, somewhat more stable. We refined prompts with audit-specific context, few-shot examples, and more deliberate reasoning structure. Accuracy improved, but the underlying inconsistency remained, and the AI still tended to fabricate facts to support its reasoning. Prompt engineering, additional rules, tighter guardrails, multi-step structures, and review layers all showed diminishing returns.
Across Phases 1–3, we systematically mapped the variables that drive audit testing outcomes: evidence quality, test step structure, reasoning complexity, documentation requirements, and the interactions between them. The lessons from each phase showed where AI was struggling and what audit-grade results would require, and they led us to take a different approach in Phase 4.
Phase 4 — Agentic Architecture with Guardrails (Runs 39–64): correct classification across all 26 consecutive runs on the controlled bank reconciliation benchmark. We stopped fighting the non-determinism and built a highly structured workflow that works with it rather than in spite of it. The AI provided correct opinions, including passed, failed, and even inconclusive, at every step according to the evidence. The architecture was subsequently validated across 64 control test instances spanning 80+ control types, achieving 99% step-level classification accuracy as scored by a qualified auditor against predefined pass/fail criteria.
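One way to work with non-determinism rather than against it is to treat disagreement across runs as a signal. The sketch below is a simplified, hypothetical guardrail in that spirit, not the Phase 4 architecture itself: a verdict is accepted only when repeated runs of a step agree unanimously, and anything less is escalated as inconclusive.

```python
from collections import Counter

def classify_with_consensus(run_step, n_runs=3):
    """Run the same test step several times; accept a verdict only if
    every run agrees, otherwise flag the step as INCONCLUSIVE so a
    human reviews it instead of receiving a confident wrong answer."""
    verdicts = [run_step() for _ in range(n_runs)]
    top_verdict, count = Counter(verdicts).most_common(1)[0]
    return top_verdict if count == n_runs else "INCONCLUSIVE"

# Deterministic stand-ins for a non-deterministic model call:
print(classify_with_consensus(lambda: "PASS"))         # prints PASS
answers = iter(["PASS", "FAIL", "PASS"])
print(classify_with_consensus(lambda: next(answers)))  # prints INCONCLUSIVE
```

The key design choice is that instability surfaces as an explicit "inconclusive" rather than disappearing into a confidently worded but unreliable conclusion.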
Based on our testing and client conversations, we identified several conditions that can help the transition from pilots to production.
Existing control frameworks were written for human auditors. They often include non-binary assertions and nested, compound instructions that humans can follow more easily than machines.
Test steps written for human auditors often require substantial reworking for AI. Vague instructions such as “verify appropriateness” produced unreliable results across runs, whereas precise, well-scoped steps produced results that outperformed human auditors.
• Identify the control candidates for AI testing, and then restructure the test procedures for absolute clarity.
• Early candidates for AI testing should be document-heavy controls with clear pass/fail criteria and structured evidence.
• Breaking down complex controls into smaller steps that can each be checked independently reduces the rework rate significantly.
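As a hypothetical illustration of that last point, a vague human-oriented instruction can be restructured into sub-steps that each yield an independent binary verdict. The step IDs, checks, and threshold below are invented for illustration only.

```python
# Hypothetical example: restructuring one vague, human-oriented test step
# into independently checkable sub-steps. IDs and thresholds are invented.

vague_step = "Verify the appropriateness of the bank reconciliation."

precise_steps = [
    {"id": "BR-1", "check": "Reconciliation is dated within 5 business days of period end"},
    {"id": "BR-2", "check": "Adjusted bank balance equals adjusted book balance"},
    {"id": "BR-3", "check": "Every reconciling item over 10,000 has a supporting reference"},
]

# Each sub-step can be tested, reviewed, and reworked independently, so
# one wrong conclusion no longer forces a redo of the whole workpaper.
for step in precise_steps:
    print(f'{step["id"]}: {step["check"]}')
```

Decomposition also localizes rework: when one sub-step fails review, only that sub-step needs to be re-tested.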
AI needs the full data set and rule set required to test controls, including not only obvious inputs such as corporate policy, but also the broader guidance human auditors bring implicitly. If people need certain information to test a control well, the AI needs it too.
Missing context about the control environment significantly degraded accuracy. Any gap in the required inputs forced the model to infer, and its results stopped matching what was actually in the evidence.
• Review your controls and evaluate your evidence packages to ensure full context is provided to the AI for testing.
• AI handles unstructured, messy data exceptionally well in general; in audit tasks, however, clean, structured data achieves noticeably higher accuracy.
Vendor demos typically perform well on a curated set of controls. It is worth verifying that the demo reflects the actual production tool rather than a pre-built walkthrough.
• Black-box testing is a disqualifier. If the tool produces a finished working paper without showing how each conclusion was reached, the auditor can only accept it or redo it.
• Look for tools with highly optimized prompts scoped to specific control types, and with built-in tolerance for the inconsistencies of real-world evidence. Ask vendors to demonstrate one control of their choice and another on demand based on your team’s priorities.
• The tool should not require significant investment in testing infrastructure prior to adoption. It needs the flexibility to adapt and to provide baseline outputs for your organization.
• Ask whether the tool provides only binary Pass/Fail (Effective/Ineffective) verdicts or can also return inconclusive or not applicable. What does the tool do when it encounters evidence it cannot interpret? Does it flag uncertainty, or produce a confident wrong answer?
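A tool that can say “inconclusive” needs that option wired into its output schema. Below is a minimal sketch of what such a result type might look like; the type names, fields, and guardrail logic are assumptions for illustration, not any vendor’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"      # evidence could not be interpreted
    NOT_APPLICABLE = "not_applicable"  # control not relevant to the sample

@dataclass
class StepResult:
    step_id: str
    verdict: Verdict
    evidence_ref: str   # pointer back into the source document
    rationale: str      # reasoning a reviewer can independently check

def guarded_result(step_id, evidence_text):
    """Hypothetical guardrail: refuse to guess when evidence is empty
    or unreadable, instead of producing a confident wrong answer."""
    if not evidence_text.strip():
        return StepResult(step_id, Verdict.INCONCLUSIVE, "n/a",
                          "Evidence missing or unreadable; escalated for human review")
    return StepResult(step_id, Verdict.PASS, "doc-page-1",
                      "Placeholder: real evaluation logic would go here")

print(guarded_result("BR-1", "").verdict)  # prints Verdict.INCONCLUSIVE
```

A four-valued verdict plus a mandatory rationale field makes uncertainty a first-class output rather than something the reviewer has to detect by re-performing the test.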
Lead the conversation before someone else does. CAEs have to set realistic expectations for what AI can deliver in audit today and define the methodology on their own terms rather than being pulled along by market hype.
Accountability does not move. Auditors remain accountable for the conclusions in their workpapers regardless of how those conclusions were generated. In a function with such low tolerance for error, tools have to acknowledge that risk explicitly by surfacing uncertainty, making reasoning transparent, and supporting reviewer judgment rather than obscuring it.
Consider external auditor reliance. Where external auditors plan to rely on internal audit work under PCAOB AS 2605, AI-assisted workpapers must meet the same evidential standards as manually prepared ones. If AI-generated outputs cannot demonstrate a clear chain of reasoning from evidence to conclusion, external audit teams may decline reliance, negating the efficiency gain and potentially increasing overall audit cost.
Evolve your controls framework for AI execution. Your current test procedures are designed around how a human auditor works. Beyond adapting control definitions and test steps, AI opens the door to rethinking testing methodology entirely. Sample-based testing may eventually expand toward population-level testing, but token limits and human review capacity remain real constraints.
Audits are not all about testing. A strong tool must do more than generate testing outputs or workpapers. It should enable faster evidence transformation, higher-quality testing, and more rigorous review by your team.
Ask whether tool developers have accelerated their own workflows. If a tool or team cannot demonstrate accuracy on a defined set of controls within a few weeks, the issue usually lies in the tool, the approach, or the fit. Identify which one it is, adjust, and measure again.
This paper draws on 12 months of research spanning ITGC, ICFR, ESG, procurement, IFRS, revenue recognition, leasing, and other domains.
7,400+ unique pipeline executions across development, validation, and control testing
3,000+ control-oriented executions across 80+ distinct control types
700+ control test executions across 80+ distinct control types that produced reviewable workpaper outputs
The findings in this brief are grounded primarily in the control-testing subset, not the full engineering corpus.
SoxAudit.ai was founded by a dual-credentialed CPA/CA with 15+ years in internal audit and risk management, including leadership roles at Big 4 and other consulting firms supporting FAANG and Fortune 500 companies across SOX, ITGC, ESG, and financial process audits. The platform is built by practitioners who have led audit functions, transformed audit operations, and presented to audit committees.
These findings inform the design of SoxAudit.ai.
If you are evaluating AI for your audit function, or if your current pilot is producing mixed results, we can walk you through the evaluation framework and scoring criteria we use. We welcome the conversation.
See it live. We run a control test on demand — your control or ours — so you can evaluate the output before committing to a conversation.
SoxAudit.ai
For feedback or enquiries: try@soxaudit.ai
Schedule a live test: https://calendly.com/soxaudit/
The table below summarizes the 700+ control test executions based on the Phase 4 architecture referenced in this brief. Each execution represents a full pipeline run — from evidence ingestion through workpaper generation — on a named audit control.
| Domain | Control Types | Total Tests | Representative Controls |
|---|---|---|---|
| Banking / ICFR | 3 | 297 | Bank reconciliation, wire transfer authorization, check processing verification |
| Inventory | 2 | 238 | Inventory reception, inventory reconciliation |
| Leasing / IFRS | 1 | 95 | Lease revenue recognition and compliance |
| SOX Controls | 14 | 28 | Revenue cutoff, AP three-way match, payroll processing, journal entry approval, duplicate payment detection |
| ITGC / Access Controls | 3 | 48 | User access review (privileged accounts), change management, segregation of duties |
| Financial Close | 1 | 21 | Period-end close process |
| Regulatory | 1 | 12 | RERA marketing compliance |
| Other | 10+ | 22+ | Balance sheet reconciliation, SOC 2, Excel-based controls |
| Total | 35+ | 761 | |
Testing was conducted between July 2025 and March 2026 across 3 different setups. The broader research corpus includes 3,000+ control-oriented executions and 7,400+ total pipeline executions, encompassing individual audit-function tasks such as planning, testing, and reporting.