
Why AI Copilots Can’t Reliably Perform Audit Sample Testing

  • Writer: Dhashyanth Nedumaran
  • 5 days ago
  • 13 min read

Executive Summary

  • Audit sampling is the process of testing a subset of financial items to form conclusions about the whole (as defined in ISA 530). It is critical because auditors must obtain “sufficient appropriate audit evidence” efficiently.

  • AI copilots (e.g. Microsoft 365 Copilot) are conversational AI assistants used to draft documents, analyze data, and summarize information. They are designed to boost productivity, not to replace professional judgment.

  • Key limitations: AI copilots lack the human judgment and professional skepticism auditors need. They may hallucinate false information, cannot ensure data integrity, and produce outputs that do not meet audit documentation standards. They also do not inherently comply with ISA/PCAOB evidence and confidentiality rules.

  • Impacts of misuse: Relying on AI for sampling risks missing material misstatements, breaching regulations (ISA, PCAOB, SOX), and damaging credibility. High-profile incidents (e.g. Deloitte’s 2025 AI report error, lawyers sanctioned for AI-generated fake citations) show real-world consequences of unchecked AI outputs.

  • Recommendations: Auditors must retain human-in-the-loop control, validate all AI outputs, use AI only for supportive tasks, protect client data, document all procedures, and adhere to standards. Ultimately, human judgment remains irreplaceable in auditing, as regulators and experts emphasize.


What Is Audit Sample Testing?

Audit sample testing refers to the practice of examining less than 100% of transactions in a population to draw conclusions about the whole. For example, ISA 530 defines audit sampling as “applying audit procedures to less than 100% of items within an account balance or class of transactions”. This allows auditors to infer the accuracy of the full set without inspecting every item. Audit sampling is critical because audits must still meet the requirement of ISA 500 that the auditor obtain “sufficient appropriate audit evidence”. In other words, auditors design sample tests to efficiently gather evidence about key balances or controls. Without sampling, checking every transaction would be prohibitively expensive or impossible for large data sets. Auditing experts note that “it is not feasible to audit and check every single item”, so sampling “enables auditors to make conclusions and express fair opinions… without having to check all of the items”. In summary, audit sampling is how auditors use a manageable workload to obtain the evidence needed for a reliable audit opinion.
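
To make the mechanics concrete, the following is a minimal sketch, in Python, of the two steps this definition implies: selecting a reproducible random sample and projecting the sample result onto the whole population. The invoice population, sample size, seed, and ratio-projection method are illustrative assumptions, not a prescribed audit procedure.

```python
# Minimal illustrative sketch of audit sampling (not audit guidance):
# 1) select a reproducible random sample, 2) project the sample's
# misstatement onto the population. All figures here are assumptions.
import random

population = {f"INV-{i:05d}": 100.0 + i for i in range(1, 5001)}  # id -> book value

rng = random.Random(2025)                 # fixed seed => selection is reproducible
sample_ids = rng.sample(sorted(population), 60)

# Suppose testing the sampled items found these misstatements (illustrative):
misstatements = {sample_ids[0]: 25.0, sample_ids[7]: 10.0}

sample_value = sum(population[i] for i in sample_ids)
population_value = sum(population.values())

# Ratio projection: scale the sample's misstatement rate up to the population.
projected = sum(misstatements.values()) / sample_value * population_value
print(f"Sampled {len(sample_ids)} of {len(population)} items; "
      f"projected misstatement ~ {projected:,.2f}")
```

Every one of these choices (population definition, sample size, selection method, projection technique, tolerable misstatement) is a judgment call the auditor must own.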


What Are AI Copilots and How Are They Used?

AI copilots are generative-AI assistants integrated into software tools. For example, Microsoft 365 Copilot is an AI assistant built into Word, Excel, Teams, and Outlook. Microsoft describes a copilot as “a conversational, AI-powered assistant that helps boost productivity and streamline workflows by offering contextual assistance, automating routine tasks, and analyzing data.” In practice, business users employ Copilot to draft text (e.g. emails, reports), to analyze or visualize data in Excel, and to summarize meetings or documents. A Microsoft scenario guide explains using Copilot “in the flow of work as a meeting assistant, to draft documents and emails, [and] perform complex data analysis”. Copilots rely on large language models: they take user prompts and generate natural-language outputs. While they excel at producing plausible text and basic analysis, they are not specialized audit tools and lack awareness of audit rules or evidence requirements.


Key Limitations of AI Copilots for Audit Sampling

Using AI copilots to perform audit sampling faces fundamental obstacles. Below are five critical reasons, each backed by authoritative sources:

  • Lack of Professional Judgment and Skepticism: Audit sampling requires nuanced judgment at every step (e.g. defining sample scope, evaluating exceptions). Copilots do not “understand” context or risk – they simply match patterns in data. In fact, a 2026 survey found that 88% of auditors worry that AI tools “carry a risk of undermining professional judgment”. Auditors insist on human oversight: 64% said AI outputs used in professional work must be “always validated” by a human. For example, a finance expert warned that “AI isn’t a truth-teller; it’s a tool meant to provide answers that fit your questions”, meaning AI can confidently generate content that is not truly reliable. In audit sampling, decisions about sample size, selection method (random vs judgmental), and response to unexpected results demand auditor reasoning. An AI copilot cannot exercise professional skepticism or adapt its approach when anomalies arise. It cannot discern the significance of a potentially fraudulent entry or decide when a deviation is a mere clerical error versus a red flag.

  • Data Integrity and Confidentiality Issues: Auditors work with the client’s actual accounting records, which are protected and often not accessible to AI models. Copilots operate on pre-trained data and whatever input is provided, including possible internet sources. PCAOB outreach reports that firms “have policies in place that limit the extent to which staff can use GenAI tools in an audit” because of “data privacy concerns and other risks”. Some firms prohibit using GenAI in audit procedures altogether for these reasons. In practice, auditors cannot feed sensitive audit data into a cloud AI without breaching confidentiality and potentially violating SOX/GAAP controls. Even when data is supplied, the AI’s output depends on the quality and currency of that data. Auditors know that the underlying population must be complete; if a copilot is given only a slice of the data (or stale data), its analysis will be misleading. PCAOB observers note that GenAI may sometimes use “source data … that is incomplete or compromised”. Standard audit practice, by contrast, requires the entire population or a truly representative sample drawn from it, and a copilot offers no built-in assurance of data completeness or accuracy (a minimal completeness-check sketch follows this list). Moreover, logging the AI’s internal process to prove how it used the data is not straightforward.

  • Prone to Hallucinations (False Outputs): Generative AI is known to “hallucinate” – i.e. produce plausible but false statements. The IMF warns that AI models “could produce wrong but plausible-sounding answers”, a concern that is “significant” for financial services. In auditing, factual accuracy is non-negotiable. Yet a copilot can invent numbers, citations, or explanations. We’ve seen this happen: for example, Deloitte Australia used AI to help draft a report and later found it contained “fabricated citations” and even a fake quote attributed to a judge. This required a correction and refund. Similarly, in 2023 two U.S. lawyers were sanctioned after ChatGPT helped them include six fictitious case citations in a legal brief. These incidents underscore how AI can quietly insert entirely made-up content. In audit sampling, a hallucination could mean the copilot claims a sample result or anomaly that doesn’t exist. Auditing standards assume all evidence is rooted in real data; an AI “finding” with no verifiable source would be unacceptable. PCAOB outreach also notes that GenAI output “may not always be reliable” because it can be “false or misleading”. Human auditors, by contrast, must verify each finding against source documents.

  • Lack of Transparency and Audit Trail: Auditing standards (ISA and PCAOB rules) require documentation of how sampling was done and evidence collected. AI copilots generate text without an inherent audit log or rationale. As one AI security expert notes, they “generate new text based on probabilistic patterns” but do not document how each piece of output was derived. In practice, this means an auditor cannot trace a copilot’s suggestion back to specific transactions. PCAOB firms emphasize they need “auditability of both the underlying source data … and GenAI-created content”. Some firms are building controls so that any AI query logs its sources and prompts. Without that, Copilot outputs are a black box. In audit sampling, this is fatal: auditors must be able to review and defend the logic behind sample selection and results. A copilot could say “Test transactions 100–200” or “this batch contains no errors” with no way for the auditor to replicate or substantiate that statement. In contrast, a human auditor’s workpapers show exactly which invoices were picked, how they were tested, and what was found. Copilot-assisted work offers no such clear audit trail unless the firm manually creates one around it.

  • Regulatory and Standards Constraints: Finally, current auditing and regulatory frameworks were not written with AI in mind. There is no provision in ISA/PCAOB for replacing an auditor’s work with AI-generated content. On the contrary, auditing rules assume auditors maintain control. PCAOB outreach found firms stressing that GenAI should “augment, but not replace, humans in auditing” and that “human involvement remains essential”. Any reliance on Copilot would still need to comply with those same rules: documentation requirements, professional ethics, independence, and confidentiality. If a copilot is used improperly, it could violate Sarbanes-Oxley or PCAOB standards. For example, who would be responsible if a Copilot “hallucination” leads to an audit failure? Auditors themselves remain fully accountable. PCAOB guidance implies auditors must supervise any AI tool as they would any staff: “an engagement team member who uses [GenAI] is still responsible for the results and documentation of the work”, and supervisors must apply the “same level of diligence” as usual. In short, Copilots do not come with built-in compliance controls. Firms would have to implement new policies and training (just as some are already doing).
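
As flagged in the data-integrity point above, no sample should be drawn from a population that has not been reconciled. Below is a minimal sketch, assuming invoice records as (id, amount) pairs, of the kind of completeness check an auditor performs before any selection; the field names and ledger control totals are assumptions.

```python
# Minimal sketch: before sampling, reconcile the population's count and
# total to an independent control total from the general ledger.
from decimal import Decimal

def check_population_completeness(records, ledger_count, ledger_total):
    """Refuse to sample from a population that doesn't tie to the ledger."""
    count = len(records)
    total = sum(Decimal(str(amount)) for _, amount in records)
    if count != ledger_count or total != Decimal(str(ledger_total)):
        raise ValueError(
            f"Population does not reconcile: {count} items / {total} "
            f"vs ledger {ledger_count} items / {ledger_total}"
        )
    return True

records = [("INV-00001", "1200.50"), ("INV-00002", "89.99")]
check_population_completeness(records, ledger_count=2, ledger_total="1290.49")
```

A copilot given a spreadsheet has no way to know whether that spreadsheet is the complete population; this reconciliation step is exactly the kind of control the auditor must perform outside the AI tool.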


The table below summarizes how AI copilots compare to the requirements of proper audit sampling:

| Aspect | AI Copilot (e.g. Microsoft Copilot) | Audit Sampling Requirement |
| --- | --- | --- |
| Professional Judgment | Lacks real context-specific judgment; produces answers based on statistical patterns. | Requires an experienced auditor’s professional skepticism and judgment at every step. |
| Data Usage & Integrity | Uses the model’s training data plus any user input; may pull from broad public sources. | Uses the client’s complete accounting records under strict control; data must be current and accurate. |
| Accuracy (Hallucinations) | Prone to fabricating plausible but false information (AI “hallucinations”). | Must verify every sample item against primary evidence; no tolerance for unsupported or invented data. |
| Transparency & Audit Trail | Outputs are black-box text with no built-in reasoning log. | Every sampling step and result must be documented (why samples were chosen, how they were tested, what evidence was obtained) for review. |
| Regulatory Compliance | Not designed to enforce ISA/PCAOB/SOX rules; no inherent controls for confidentiality or documentation. | Must comply with all audit standards and laws (document workpapers, safeguard client data, maintain independence, etc.). |


Impacts of Relying on Copilots for Audit Testing

Allowing AI copilots to handle audit sampling would significantly increase risk exposure:

  • Missed Misstatements and Audit Failure: If the AI errs, material misstatements could slip through. This would undermine the reasonable assurance audit objective (ISA 200). A flawed audit could lead to restatements, rescinded opinions, or regulatory sanctions. Historically, audit quality failures (even without AI) have led to fines and penalties. For instance, in 2024 India’s NFRA penalized Deloitte for audit lapses. If AI-assisted sampling misses a major error, the firm could face similar enforcement.

  • Regulatory and Legal Issues: Using Copilots in violation of standards could itself be a violation. Auditors could be accused of negligence if they rely on unverified AI output. As noted, auditors must remain in control of the work. Regulators like the PCAOB are already researching AI use (to see if guidance or rules need updating). An audit conducted improperly with AI could breach PCAOB standards or Sarbanes-Oxley requirements for audit documentation and data retention.

  • Reputational Damage: High-profile errors erode trust. The Deloitte AI incident (2025) led to a public rebuke and refund. Clients and the public may lose confidence in audited financial statements if they know a machine, not a human, made the judgment. In a survey, nearly half of auditors said uncontrolled AI use could “erode public trust in the audit profession”. Any scandal (like fabricated AI citations) can have a chilling effect on a firm’s credibility. The image of “audits by AI” could alarm stakeholders, even if most output was correct.

  • Business Decision Risks: Beyond compliance, bad AI outputs can mislead management. For example, Apple's AI news alerts in 2025 falsely reported major events. In the audit context, a false AI “finding” could misguide risk assessments or resource allocations. Senior management might make decisions based on faulty audit analysis. The financial consequences of such errors could be severe.


These impacts are not theoretical. Recent incidents underscore the dangers:


[Figure: timeline of AI hallucination and audit incidents, 2023–2025, covering the ChatGPT, Apple, and Deloitte cases.]

Each event above involved AI generating false information with real consequences. In one case, a Sydney University researcher found “up to 20 errors” (fake references and misquotes) in Deloitte’s AI-assisted report. In another, an AI news feature had to be disabled after it spread misinformation to millions. These examples illustrate that hallucinations are a persistent risk. In each incident, human oversight eventually caught the mistakes. It’s a stark reminder that if auditors blindly rely on AI, errors could go undetected until after damage is done.


Recent Examples of AI Misuse in Finance/Auditing (2023–2025)

  • Legal Brief Hallucinations (2023): In June 2023, a U.S. judge sanctioned two lawyers who had submitted a brief containing six fictitious case citations generated by ChatGPT. The lawyers claimed it was an honest mistake; the court stressed that attorneys must “ensure the accuracy of their filings.” This case is often cited as a warning: even routine research tasks can be derailed by AI hallucinations.

  • Apple News Alert Glitches (Jan 2025): Apple introduced an AI-driven summary feature for news alerts but quickly suspended it when users reported blatantly false headlines (e.g. one falsely reporting a tech CEO’s suicide). The BBC complained about the misuse of its logo on these false alerts. Apple’s case shows that even sophisticated AI features can produce outlandish errors if not carefully controlled. Finance professionals should note that AI models do not verify real-world facts.

  • Deloitte Government Report (2025): Deloitte used an Azure OpenAI GPT tool to draft an Australian government compliance review. Published in July 2025, the report had to be reissued after it was found to contain “fabricated references” and a misquoted court ruling. Deloitte agreed to partially refund its AU$440,000 fee. The firm attributed the problems to AI usage and insisted the main findings were unchanged, but critics decried it as a “human intelligence problem.” Auditor reliance on AI in this case led directly to reputational harm and financial penalty.

  • Academic Paper Retractions (2024): Several academic journals have retracted papers because AI-generated content (especially fake citations) went unchecked. This mirrors the auditing world: a recently published study noted that “nearly six out of 10 employees admit to making mistakes in their work due to AI errors,” and many are unsure if their organizations even allow AI use. These trends indicate that across industries, professionals are grappling with AI reliability issues.

These real-world incidents (many from 2023–2025) all involve AI “hallucinations” or misuse in contexts that rely on accurate information. They demonstrate why auditors cannot simply trust Copilot for critical tasks. Instead, these cases underscore the need for strict controls and human review.


Recommendations for Responsible AI Use in Auditing

Given the limitations above, auditors should adopt careful AI policies to avoid compromising audit quality:

  • Keep the Human-in-the-Loop: Treat AI outputs as draft suggestions. Require auditors to review and validate everything a copilot produces. As one expert advises: “Accountants have to own the work, check the output, and apply their judgment”. In practice, no audit conclusion should rely solely on AI without independent verification.

  • Use AI Only for Non-Judgmental Tasks: Restrict Copilot to supportive work (e.g. formatting workpapers, summarizing meeting notes, or preliminary analytics). Do not allow AI to select the actual statistical sample or make substantive judgments. For instance, an auditor might ask Copilot to list potential risk areas, but the auditor must decide which and how many transactions to test. Firms should explicitly forbid using AI to make final audit decisions.

  • Protect Data and Privacy: Never upload proprietary or confidential client data to a public AI without safeguards. Use only company-controlled AI environments (e.g. on-premises or tenant-specific models) that comply with data protection policies. Auditors must follow firm IT policies and regulations (ISA/PCAOB confidentiality rules) when using any AI tool. If necessary, use anonymized or aggregate data when experimenting with AI features.

  • Document Everything: Record all AI usage in the audit trail. This includes saving the prompts given to the copilot, the AI’s responses, and any source data it used. Audit firms are starting to build tools to “document the relevant underlying source data” used by AI. Auditors should do likewise: attach AI outputs to workpapers and annotate how they were generated (a minimal logging sketch follows this list). This preserves transparency and aids review.

  • Train and Update Audit Teams: Educate audit staff about AI risks: how models work, what hallucinations look like, and how to detect them. Provide examples (like those above) and train auditors to question improbable outputs. Update audit methodology to include AI governance. For example, establish review checkpoints for any AI-assisted work. Given that 66% of auditors surveyed called for a “globally harmonized AI framework” for audit, firms should proactively develop their own guidelines now.

  • Follow Standards Diligently: Continue to apply all ISA/PCAOB procedures. Use AI tools like Excel Copilot with caution: any analytics or sampling done via AI must still meet the standards for audit evidence. For example, if AI suggests a sample of transactions, the auditor should verify the randomness or risk basis of that selection manually. Supervisors must review AI-augmented work with the same skepticism as any work. In short, AI tools should fit within the established audit process, not replace it.
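
For the “Document Everything” point above, here is a minimal sketch of how a team might capture every copilot interaction in an append-only log suitable for attaching to workpapers. The file name, record fields, and hashing choice are assumptions, not a standard.

```python
# Minimal sketch: append each prompt/response pair, with a timestamp and a
# content hash, to a JSON-lines log that can be attached to workpapers.
import hashlib
import json
from datetime import datetime, timezone

LOG_PATH = "ai_usage_log.jsonl"  # hypothetical workpaper attachment

def log_ai_interaction(prompt: str, response: str, source_refs: list[str]):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "source_refs": source_refs,  # data the auditor supplied to the tool
        "sha256": hashlib.sha256((prompt + response).encode()).hexdigest(),
    }
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_ai_interaction(
    prompt="Summarize exceptions in the Q3 receivables sample",
    response="(copilot output pasted here)",
    source_refs=["AR_subledger_2025Q3.xlsx"],
)
```

The hash and timestamp make the log tamper-evident enough for review purposes, and the source references tell a reviewer exactly what data the tool saw.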

By following these steps, auditors can harness AI for efficiency gains (e.g. automating low-level tasks) without relinquishing control over audit quality. The key is to recognize AI as a tool, not a substitute, and to integrate it within existing audit controls and review procedures.


Felamity Technologies: Driving Efficient and Responsible AI Testing

In this evolving landscape, Felamity Technologies stands out as a key enabler of responsible and efficient AI adoption in audit sample testing.

Leveraging its RAGSUITE ecosystem and expertise in private AI deployment, Felamity Technologies ensures that AI-driven audit processes are not only efficient but also aligned with enterprise-grade security, compliance, and reliability standards.


Key Contributions to Audit Sample Testing

  • High-Efficiency Data Processing

    Felamity’s AI systems are designed to handle large-scale audit datasets with speed and precision, enabling faster sample evaluation without compromising accuracy.

  • Private and Secure AI Deployment

    Unlike generic AI copilots, Felamity focuses on private AI environments, ensuring sensitive financial data remains secure and compliant with regulatory requirements.

  • Context-Aware AI with RAG Architecture

    Through Retrieval-Augmented Generation (RAG), Felamity’s solutions ground AI outputs in verified enterprise data, significantly reducing hallucinations and improving reliability in audit scenarios (a generic sketch of the RAG pattern follows this list).

  • Enhanced Audit Traceability

    The platform emphasizes transparency by enabling traceable outputs, helping auditors maintain proper documentation and audit trails—critical for compliance with standards like ISA and PCAOB.

  • Human-in-the-Loop Design

    Felamity’s approach reinforces the role of auditors by embedding AI as a support system rather than a decision-maker, ensuring professional judgment remains central to the audit process.
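
To illustrate the general RAG pattern referenced above (a generic sketch, not Felamity’s actual implementation or API), the toy example below retrieves the most relevant documents from a verified store, builds a grounded prompt, and returns the source references that give auditors a trace. The keyword-overlap scoring and the corpus contents are placeholders.

```python
# Generic RAG sketch: retrieve verified documents, answer only from them,
# and return the sources so the output is traceable. Toy scoring only.
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(words & set(corpus[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def answer_with_sources(query: str, corpus: dict[str, str]) -> dict:
    doc_ids = retrieve(query, corpus)
    context = "\n".join(corpus[d] for d in doc_ids)
    # In a real system this grounded prompt would go to a private LLM;
    # here we return it so the grounding and the trail are visible.
    return {
        "prompt": f"Answer using ONLY the context below.\n{context}\n\nQ: {query}",
        "sources": doc_ids,  # the traceable references for the auditor
    }

corpus = {
    "WP-12": "Receivables sample of 60 invoices tested for existence.",
    "WP-13": "Two exceptions noted, both immaterial clerical errors.",
}
print(answer_with_sources("What exceptions were noted in the sample?", corpus))
```

The key property is that every answer carries the document IDs it was grounded in, so a human reviewer can check the output against the cited workpapers.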


Conclusion: Human Judgment Remains Irreplaceable

Audit sampling is inherently an evidence-based, judgment-driven process. No matter how advanced, an AI copilot lacks the contextual understanding, ethics, and skepticism that auditing demands. Auditing standards (ISA, PCAOB) were written for human practitioners; regulators and firms stress that AI can augment, but not replace, humans in auditing. Recent incidents (Deloitte, ChatGPT sanctions, etc.) remind us that AI will produce errors that only diligent humans catch.

In today’s environment, there is strong pressure to adopt AI to speed up finance workflows. However, this must never come at the expense of compliance and accuracy. As a 2026 survey emphasizes, professionals must “always validate” AI outputs with human expertise. In sample testing, auditors must remain vigilant: they must personally design the sample, gather evidence, and form the opinion. AI may help with mundane tasks or preliminary analysis, but the final audit judgment rests squarely on human shoulders.

In conclusion, while AI copilots like Microsoft Copilot offer impressive assistance, they cannot be relied on to conduct audit sample testing. The risks – from hallucinated data to undocumented processes – are too great. Auditors and risk managers must preserve the human elements of the audit: rigorous skepticism, professional judgment, and strict adherence to standards. Only with humans in control can audit quality and trust be maintained.

