Do AI Detectors Work? We’re not sure they do

I have a sneaking suspicion that AI detectors are little more than guesswork and smoke and mirrors. They are meant to detect AI-generated content and have quickly become the internet's gatekeepers of authenticity.

This research started a couple of weeks ago when a customer contacted us asking why undetectable.ai was flagging content we had written for them as AI generated. This took the team by surprise, as nothing for that customer was AI generated, and it led us down a rabbit hole that deserves proper scrutiny.

How reliable are these gatekeepers, really? How can we trust their output when it is contradicted in multiple places? To find out, we conducted a simple experiment: we used AI to generate a descriptive product review and ran it through a series of popular AI detectors.

This article explores the inconsistencies we uncovered and what this might mean for content creators, educators, and professionals. We really need to know if AI detection is currently more guesswork than science.

Now, given that we knew the text was generated entirely by AI from a very basic prompt, you'd expect it to show as 100% AI in every detector.

The results? Wildly inconsistent.

| Detector | Result |
| --- | --- |
| Phrasely.ai | 0/9 sentences are likely AI generated |
| undetectable.ai | 81% AI |
| ZeroGPT | 0% AI |
| GPTZero | 100% AI generated |
| OpenAI (Classifier) | “Based on the analysis of linguistic patterns, narrative style, and content depth, I estimate that there is approximately a 30% likelihood that this text was generated by AI” |
| Copyleaks | 100% AI |
| Quillbot | 0% AI |
| Sapling | 100% AI |
| Writer | 0% AI |
| Grammarly | 0% AI |
| Scribbr | 0% AI |
| originality.ai | 100% AI |
| Decopy AI | 66% AI |
| aidetector.com | 0% likely AI generated |
| scispace | 0% AI |
| surgegraph.io | 94% AI |
| detecting-ai.com | 79.8% AI |

All the screenshots from the test are below.

The numbers don’t lie, even if the detectors apparently do. Out of 17 tools tested, only 4 correctly identified our AI-generated content as 100% artificial. I’m pretty sure that any other QA or testing system with that accuracy would be removed immediately.

The remaining 13 detectors offered an array of confidence scores. One pegged the content at 79.8% AI, another at roughly 30%, and a surprising number confidently declared it entirely human-authored.

This feels like more than a minor calibration issue; it looks like a fundamental breakdown in the core promise these tools make to their users.

I don't know about you, but a 23.5% success rate on something that could be used to dock a student's grades, or to expose a business to claims that its work was not done by a human, is eye-opening.
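As a quick sanity check on the arithmetic above, here is a minimal sketch that recomputes the success rate from the scores in our results table (treating only a flat 100% verdict as "correct", which is our own definition):

```python
# Scores reported by each detector for our known-AI text (% AI),
# copied from the results table above. "Correct" here means a flat 100%.
scores = {
    "Phrasely.ai": 0, "undetectable.ai": 81, "ZeroGPT": 0, "GPTZero": 100,
    "OpenAI (Classifier)": 30, "Copyleaks": 100, "Quillbot": 0,
    "Sapling": 100, "Writer": 0, "Grammarly": 0, "Scribbr": 0,
    "originality.ai": 100, "Decopy AI": 66, "aidetector.com": 0,
    "scispace": 0, "surgegraph.io": 94, "detecting-ai.com": 79.8,
}

# Count the detectors that called the AI text 100% AI.
correct = [name for name, pct in scores.items() if pct == 100]
success_rate = len(correct) / len(scores) * 100

print(f"{len(correct)}/{len(scores)} detectors scored 100% AI "
      f"({success_rate:.1f}% success rate)")
# → 4/17 detectors scored 100% AI (23.5% success rate)
```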


The Methodology Behind Our Madness

Our experiment was deliberately simple to avoid any variables that might skew results. We used ChatGPT to generate a product review based on a random Amazon product whose details and existing reviews we had scraped. The content was straightforward, informative, and typical of what you'd find on any e-commerce platform.

We then fed this identical text through 17 different AI detection tools, ranging from free online checkers to premium services. The platforms included all the popular ones like phrasely.ai, Copyleaks, Grammarly, GPTZero, and numerous others that appear in Google’s search results for “AI content detector.”

What we discovered was nothing short of a chaotic landscape of contradictory verdicts.

But it goes deeper. The API documentation of undetectable.ai states the following:

[Screenshot: undetectable.ai API documentation]

They literally state that the scores they show from external platforms are approximate guesses. And yet they promote these scores as something people should trust to tell them whether a piece of content is AI or not.

If these tools are supposed to offer certainty in the age of synthetic text, they’ve done the exact opposite.


The Technical Smoke and Mirrors

To understand why AI detectors fail so spectacularly, we would need to examine how they actually work, except we can't, because they are another closed black box, like the LLMs themselves. Most current detection tools seem to rely on statistical analysis of text patterns, looking for characteristics they've learned to associate with AI generation. The aggregators seem to just call the APIs of the others and assume the response is accurate.

The problem is that factors like sentence structure variety, vocabulary diversity, punctuation patterns, and stylistic consistency aren’t exclusive to AI writing. Humans can write in patterns that trigger AI flags, especially when writing professionally or following style guidelines.
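Since the real detectors are closed boxes, we can only guess at the features they use. Here is a toy sketch of the kind of stylometric profiling described above; the feature choices (vocabulary diversity, sentence-length variation) are our assumptions, not any vendor's actual method:

```python
import re
import statistics

def stylometric_features(text: str) -> dict:
    """Toy stylometric profile of the sort detectors appear to use.

    Returns vocabulary diversity (type-token ratio) and sentence-length
    statistics. Illustrative only: real detectors are closed boxes, so
    these features are an educated guess, not a reverse-engineering.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    lengths = [len(re.findall(r"[a-zA-Z']+", s)) for s in sentences]
    return {
        "type_token_ratio": len(set(words)) / len(words),
        "mean_sentence_len": statistics.mean(lengths),
        "sentence_len_stdev": statistics.pstdev(lengths),
    }

profile = stylometric_features(
    "The product arrived quickly. It works well. I would buy it again."
)
print(profile)
```

Note that a careful human writer producing short, uniform sentences would score exactly like a machine on features such as these, which is the core problem.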

I never even noticed an “em dash” before 2022 or so, but sure enough they are in every book I read that was published before this AI boom.


The Perplexity Problem

Many AI detectors rely heavily on perplexity scores, which essentially measure how “surprised” a language model would be by each word choice in the text.

Low perplexity (predictable word choices) supposedly indicates AI generation, while high perplexity (unexpected word choices) suggests human authorship.

But this approach has fundamental flaws. Skilled human writers often make predictable word choices, especially when clarity is the goal. Technical writing, instruction manuals, blog content, marketing collateral and formal business communication are typically low perplexity pieces of work.
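To make the idea concrete, here is a minimal sketch of a perplexity calculation. The unigram probabilities are a made-up toy model standing in for the large neural models real detectors use:

```python
import math

# A toy "language model": made-up probabilities for each word.
# Real detectors query large neural models; this dict is purely illustrative.
word_prob = {"the": 0.3, "cat": 0.05, "sat": 0.04, "quantum": 0.0001}

def perplexity(words, model, floor=1e-6):
    """Perplexity = exp of the average negative log-probability.

    Predictable text -> low perplexity; surprising text -> high.
    Unknown words get a small floor probability.
    """
    logs = [math.log(model.get(w, floor)) for w in words]
    return math.exp(-sum(logs) / len(logs))

predictable = perplexity(["the", "cat", "sat"], word_prob)
surprising = perplexity(["the", "quantum", "cat"], word_prob)
print(predictable, surprising)  # the surprising sequence scores far higher
```

The flaw is visible even in this toy: a clear, well-edited instruction manual would sit on the "predictable" side and get flagged, exactly as described above.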


The Training Data Trap

Another critical issue likely lies in the training data used to build these detectors. Most AI detectors are trained on datasets comparing known AI outputs from older models (like GPT-3 or early GPT-4) against human writing samples. As AI models evolve rapidly, the characteristics these detectors learned to identify become outdated.

GPT-4o, Claude 4, and other current models produce text that’s qualitatively different from their predecessors. They’ve been trained to avoid many of the telltale patterns that earlier detectors learned to identify. This creates a constant arms race where detection tools are always fighting the last war.

Moreover, the human writing samples used in training datasets often come from specific sources like academic papers, news articles etc. This could create an unintended bias where detectors flag writing styles that don’t match their training data as “AI-generated”, even when produced by humans from different backgrounds, cultures, styles or writing contexts.

It could be that AI writing is becoming indistinguishable from human writing, in which case AI detectors will eventually never be able to tell the difference, and we will simply have to live in a world where we assume everything is at least partially AI.


The Confidence Game

Perhaps most troubling is how confidently these tools present their flawed results. It is the entire reason this post ended up being written: concerns over some work we had done for a customer. The confidence of the result, and the false positive shown to our customer, brought our entire brand into doubt. Are we being honest? Can our customer trust us to do what we said we would do? One free online tool brought an entire business relationship into question.

This is especially bad when these questions are in stark contrast to our core values.

Rather than acknowledging uncertainty (to be fair, some like Quillbot do have a disclaimer under the results but it is easy to miss) or providing confidence intervals, most AI detectors display precise percentages and definitive classifications. A tool might declare content “87% AI-generated” with no indication that this number might be meaningless.
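For contrast, here is a sketch of what more honest reporting could look like. The ±15-point band is our own illustrative assumption, not any vendor's published error margin:

```python
def report_score(ai_pct: float, margin: float = 15.0) -> str:
    """Present a detector score as a range rather than false precision.

    `margin` is an illustrative assumption (no vendor publishes one);
    the range is clamped to the valid 0-100% interval.
    """
    lo = max(0.0, ai_pct - margin)
    hi = min(100.0, ai_pct + margin)
    return f"Estimated {lo:.0f}-{hi:.0f}% likelihood of AI generation"

print(report_score(87))
# → Estimated 72-100% likelihood of AI generation
```

Even a crude band like this tells the user something a bare “87% AI-generated” does not: that the number is an estimate, not a measurement.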

[Screenshot: detector result presenting a precise percentage]

This false precision creates an illusion of scientific accuracy that masks the fundamental unreliability. Users lacking a technical background in machine learning naturally assume that a tool confident enough to provide specific percentages must be accurate.

The reality is that current AI detection technology exists in an inaccurate state that is largely hidden from its users.


But what’s the solution?

The solution isn’t necessarily better AI detectors, though technical improvements and making the UI more explanatory to non-technical users would certainly help. The solution is honesty about the current limitations of this technology. Unless AI detection technology matures significantly, we need to stop treating these tools as accurate and start acknowledging them for what they really are: sophisticated guessing machines.

Want to know something really ironic? I have sat here for the past six hours writing this blog post, and out of interest I threw the text through Phrasely.ai and undetectable.ai… I'll let you be the judge of how fake I am.

[Screenshots: Phrasely.ai and undetectable.ai results for this post]
