I have a sneaking suspicion that AI detectors are nothing but guesswork and smoke and mirrors. They are meant to detect AI-generated content and have quickly become the internet's gatekeepers of authenticity.
This research started a couple of weeks ago when a customer contacted us asking why undetectable.ai was flagging pieces of content we had written for them as AI generated. This took the team by surprise, as nothing for that customer was AI generated, and it led us down a rabbit hole that really needed exploring.
How reliable are these gatekeepers, really? How can we trust their output when it is contradicted in multiple places? To find out, we conducted a simple experiment: we used AI to generate a descriptive product review and ran it through a series of popular AI detectors.
This article explores the inconsistencies we uncovered and what they might mean for content creators, educators, and professionals. We really need to know whether AI detection is currently more guesswork than science.
Now, since we knew the text was generated entirely by AI from a very basic prompt, you'd expect it "should" show as 100% AI in every detector.
The results? Wildly inconsistent.
| Detector | Result |
|---|---|
| Phrasely.ai | 0/9 sentences are likely AI generated |
| undetectable.ai | 81% AI |
| ZeroGPT | 0% AI |
| GPTZero | 100% AI generated |
| OpenAI (Classifier) | "Based on the analysis of linguistic patterns, narrative style, and content depth, I estimate that there is approximately a 30% likelihood that this text was generated by AI" |
| Copyleaks | 100% AI |
| Quillbot | 0% AI |
| Sapling | 100% AI |
| Writer | 0% AI |
| Grammarly | 0% AI |
| Scribbr | 0% AI |
| originality.ai | 100% AI |
| Decopy AI | 66% AI |
| aidetector.com | 0% likely AI generated |
| scispace | 0% AI |
| surgegraph.io | 94% AI |
| detecting-ai.com | 79.8% AI |
All the screenshots from the test are below.
The numbers don’t lie, even if the detectors apparently do. Out of 17 tools tested, only 4 correctly identified our AI-generated content as 100% artificial. I’m pretty sure that any other QA or testing system with that accuracy would be removed immediately.
The remaining 13 detectors offered an array of confidence scores. One put the content at 79.8% AI, another estimated only a 30% likelihood that it was AI-generated, and a surprising number confidently declared it entirely human-authored.
This feels like more than a minor calibration issue; it looks like a fundamental breakdown in the core promise these tools make to their users.
I don't know about you, but a 23.5% success rate is eye-opening for something that could be used to penalise a student's grades or open a business up to claims that its work was not done by a human.
The Methodology Behind Our Madness
Our experiment was deliberately simple to avoid any variables that might skew results. We used ChatGPT to generate a product review based on a random Amazon product, feeding it the scraped product details and existing reviews. The content was straightforward, informative, and typical of what you'd find on any e-commerce platform.
We then fed this identical text through 17 different AI detection tools, ranging from free online checkers to premium services. The platforms included all the popular ones like phrasely.ai, Copyleaks, Grammarly, GPTZero, and numerous others that appear in Google’s search results for “AI content detector.”
What we discovered was nothing short of a chaotic landscape of contradictory verdicts.
But it goes deeper. The API documentation for undetectable.ai includes the following:
They literally state that the scores they show from external platforms are approximate guesses. And yet they promote themselves as a trustworthy way to tell people whether a piece of content is AI or not.
If these tools are supposed to offer certainty in the age of synthetic text, they’ve done the exact opposite.
The Technical Smoke and Mirrors
To understand why AI detectors fail so spectacularly, we would need to examine how they actually work, except we can't, because each one is another closed black box, just like the LLMs themselves. Most current detection tools seem to rely on statistical analysis of text patterns, looking for characteristics they've learned to associate with AI generation. The aggregators seem to just call the APIs of the others and assume the response is accurate.
The problem is that factors like sentence structure variety, vocabulary diversity, punctuation patterns, and stylistic consistency aren’t exclusive to AI writing. Humans can write in patterns that trigger AI flags, especially when writing professionally or following style guidelines.
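To give a sense of what that kind of statistical analysis might look like, here is a minimal sketch of the sort of surface features a detector could compute. The specific features, thresholds and sample text are our own illustration, not any vendor's actual method; nobody publishes those, which is exactly the black-box problem.

```python
# Minimal sketch of the kind of surface statistics a detector might compute.
# The features here are purely illustrative; no vendor discloses its real model.
import re
import statistics

def surface_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentence_lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return {
        # "Burstiness": how much sentence length varies. Very uniform lengths
        # are often (rightly or wrongly) treated as an AI signal.
        "sentence_length_stdev": statistics.pstdev(sentence_lengths) if sentence_lengths else 0.0,
        # Vocabulary diversity: unique words divided by total words.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        # Punctuation density, e.g. em dashes per 100 words.
        "em_dashes_per_100_words": 100 * text.count("\u2014") / max(len(words), 1),
    }

sample = "This blender is great. It crushes ice easily. I use it every day."
print(surface_features(sample))
```

None of these numbers says anything definitive about authorship; a careful human technical writer can score exactly like a language model on all three.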
I never even noticed an "em dash" before probably 2022, but sure enough they are in every book I read that was published before this AI boom.
The Perplexity Problem
Many AI detectors rely heavily on perplexity scores, which essentially measure how "surprised" a language model would be by each word choice in the text.
Low perplexity (predictable word choices) supposedly indicates AI generation, while high perplexity (unexpected word choices) suggests human authorship.
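For the curious, this is roughly what a perplexity check looks like using an open model (GPT-2 via the Hugging Face transformers library), since the commercial detectors don't expose their scoring. The cut-off of 40 below is invented purely for illustration; real tools calibrate (or don't) in ways they don't disclose.

```python
# Rough sketch of a perplexity check with an open model. The threshold is made up.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Labels = inputs, so the model returns the average cross-entropy of
        # predicting each token from the tokens before it.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

score = perplexity("This blender crushes ice quickly and is easy to clean.")
# Low perplexity = predictable text, which these tools tend to read as "AI".
print(f"perplexity: {score:.1f}", "-> flagged as AI" if score < 40 else "-> looks human")
```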
But this approach has fundamental flaws. Skilled human writers often make predictable word choices, especially when clarity is the goal. Technical writing, instruction manuals, blog content, marketing collateral and formal business communication are typically low perplexity pieces of work.
The Training Data Trap
Another critical issue lies in the training data used to build these detectors. Most AI detectors are trained on datasets comparing known AI outputs from older models (like GPT-3 or early GPT-4) against human writing samples. As AI models evolve rapidly, the characteristics these detectors learned to identify become outdated.
GPT-4o, Claude 4, and other current models produce text that’s qualitatively different from their predecessors. They’ve been trained to avoid many of the telltale patterns that earlier detectors learned to identify. This creates a constant arms race where detection tools are always fighting the last war.
Moreover, the human writing samples used in training datasets often come from specific sources like academic papers, news articles and so on. This could create an unintended bias where detectors flag writing styles that don't match their training data as "AI-generated", even when produced by humans from different backgrounds, cultures, styles or writing contexts.
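To make the "fighting the last war" problem concrete, here is a toy version of how such a detector might be built: a classifier fitted on a handful of outputs from an older model versus a handful of human reviews. Every line of data below is invented for illustration; the point is that whatever the classifier learns is only as current, and as broad, as its training set.

```python
# Toy illustration of the training-data trap: a classifier fitted on outputs
# from an *older* AI model plus a few human samples. Its decision boundary
# reflects yesterday's quirks and yesterday's humans. All data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

old_ai_samples = [
    "In conclusion, this product delivers exceptional value for its price point.",
    "Overall, it is important to note that the build quality is remarkable.",
]
human_samples = [
    "Honestly? Bought it on a whim and it's fine. Lid sticks a bit.",
    "My kids broke the handle in a week, so take that how you will.",
]

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(old_ai_samples + human_samples, ["ai", "ai", "human", "human"])

# A newer model that avoids those telltale phrases, or a human who writes
# formally, lands on whichever side of the boundary its surface patterns
# happen to overlap with; that arbitrariness is the trap.
print(detector.predict([
    "The blades stay sharp and cleanup takes about a minute.",
    "It is important to me that appliances are easy to maintain.",
]))
```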
It could be that AI writing is becoming indistinguishable from human writing, that AI detectors will eventually never be able to tell the difference, and that we will simply have to live in a world where we assume everything is at least partially AI.
The Confidence Game
Perhaps most troubling is how confidently these tools present their flawed results. That confidence is the entire reason this post ended up being written: a false positive on work we had done for a customer, delivered with total assurance, cast doubt on our entire brand. Are we being honest? Can our customer trust us to do what we say we will do? A free online tool brought an entire business relationship into question.
This is especially bad when those questions stand in stark contrast to our core values.
Rather than acknowledging uncertainty (to be fair, some like Quillbot do have a disclaimer under the results but it is easy to miss) or providing confidence intervals, most AI detectors display precise percentages and definitive classifications. A tool might declare content “87% AI-generated” with no indication that this number might be meaningless.
This false precision creates an illusion of scientific accuracy that masks the fundamental unreliability. Users without a technical background in machine learning naturally assume that a tool confident enough to provide specific percentages must be accurate.
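As a thought experiment, here is what more honest output could look like: score each sentence separately and report an interval rather than one authoritative percentage. The per-sentence scores below are made up; the point is the presentation, not the numbers.

```python
# Sketch of communicating uncertainty: per-sentence scores plus a bootstrap
# interval instead of a single precise-looking percentage. Scores are invented.
import random

sentence_scores = [0.92, 0.31, 0.88, 0.15, 0.77, 0.64, 0.40, 0.95, 0.22]

def bootstrap_interval(scores, n_resamples=10_000, seed=0):
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

low, high = bootstrap_interval(sentence_scores)
point = sum(sentence_scores) / len(sentence_scores)
# "58% AI" sounds authoritative; a wide interval is a more honest admission.
print(f"point estimate: {point:.0%}, 95% interval: {low:.0%} to {high:.0%}")
```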
The reality is that current AI detection technology exists in an inaccurate state that is largely hidden from its users.
But what’s the solution?
The solution isn't necessarily better AI detectors, though technical improvements and UIs that explain themselves to non-technical users would certainly help. The solution is honesty about the current limitations of this technology. Until AI detection matures significantly, we need to stop treating these tools as accurate and start acknowledging them for what they really are: sophisticated guessing machines.
Want to know something really ironic? I have sat here for the past six hours writing this blog post, and out of interest I threw the text through Phrasely.ai and undetectable.ai… I'll let you be the judge of how fake I am.