alexi.sh

How Do AI Detectors Work? (And How Reliable Are They, 2026)

PrivSec Lab · 5 min read

AI detectors flag text as machine-written using signals like perplexity and burstiness, trained classifiers, and watermarking. How they actually work, why their false-positive rates are high, and what they're really worth.

"Was this written by AI?" is now a daily question for teachers, editors, recruiters and platform moderators, and a small industry of AI detectors promises a yes-or-no answer. This guide explains how those tools actually work under the hood, the signals they rely on, and the uncomfortable truth about how reliable they really are.

What an AI detector is trying to do

An AI text detector estimates the probability that a passage was generated by a language model rather than written by a person. Crucially, it doesn't understand the text or check whether it's true. It looks at surface statistics (the shape and predictability of the words) and outputs a likelihood. That distinction matters, because it's the root of every limitation that follows.

To see why these statistics exist, it helps to know how the text was produced in the first place: an LLM generates writing one token at a time, assigning a probability to every possible next token and then picking one, usually from among the most likely candidates. That very process leaves a faint statistical signature, and detectors hunt for it.

The three core techniques

1. Perplexity and burstiness

The oldest and most common approach measures two things:

  • Perplexity: how surprised a language model is by each word. Because an LLM writes by choosing high-probability words, AI text tends to be very predictable, so it scores low perplexity. Human writing is messier and less predictable.
  • Burstiness: how much sentence length and complexity vary across a passage. People write in bursts: a long winding sentence, then a short one. Machine text is often flatter and more uniform.

A detector combines low perplexity and low burstiness into a "this looks machine-written" signal. It's intuitive, but it's also exactly why plain, well-structured human writing gets misjudged.
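As a rough illustration of the two signals, here is a minimal Python sketch. It rests on assumptions the article does not spell out: perplexity is computed as the exponential of the average negative log-probability per token, the log-probabilities are assumed to come from some scoring language model (here they are just passed in as a list), and burstiness is approximated as the coefficient of variation of sentence lengths. Real detectors may define both differently, and the function names are made up for this example.

```python
import math
import re
import statistics


def perplexity(token_logprobs):
    """Exponential of the average negative log-probability per token.

    `token_logprobs` is assumed to hold natural-log probabilities that some
    scoring language model assigned to each token of the passage.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)


def burstiness(text):
    """Coefficient of variation of sentence lengths, measured in words.

    Sentences are split naively on ., ! and ?, which is good enough for a sketch.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


# A model that gives every token probability 0.5 is "choosing between two
# equally likely words" at each step, so the perplexity is 2.
predictable = perplexity([math.log(0.5)] * 10)

flat = burstiness("The cat sat down. The dog ran off. The bird flew away.")
varied = burstiness("Stop. The sentence wandered on well past the point of comfort. Fine.")
```

In this sketch `flat` is 0 because every sentence has the same length, while `varied` is well above 0. A detector in this style would treat low values of both measures as a hint of machine text, which is also why uniform, tidy human prose can look suspicious to it.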

2. Trained classifiers

The modern approach is a machine-learning classifier. The tool is shown large numbers of human-written and AI-written samples and learns, on its own, the patterns that separate them, then outputs a probability for new text. This is the same family of technique behind spam filters, applied to authorship.

The catch: a classifier is only as good as its training data. It learns the styles of the models and topics it saw, and it can be confidently wrong on anything outside that distribution: new models, edited text, or writers whose natural style resembles the "AI" patterns it learned.
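To make the idea concrete, here is a toy sketch of such a classifier: a small logistic regression trained by gradient descent in plain Python. Everything specific in it is an assumption for illustration. The two features (a scaled perplexity and a burstiness score), the six training samples and their labels are invented, and production detectors typically use far larger neural models trained on real text rather than two hand-picked numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = prediction - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_ai_probability(model, features):
    """Probability, according to the toy model, that the text is AI-written."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, features)) + bias)


# Invented toy data: (scaled perplexity, burstiness). Label 1 = AI, 0 = human.
TOY_SAMPLES = [(0.2, 0.1), (0.3, 0.2), (0.25, 0.15), (0.8, 0.7), (0.9, 0.6), (0.7, 0.8)]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

model = train_logistic(TOY_SAMPLES, TOY_LABELS)

typical_ai = predict_ai_probability(model, (0.2, 0.1))
typical_human = predict_ai_probability(model, (0.9, 0.7))
# A human who happens to write plainly and evenly lands in the "AI" region.
plain_human = predict_ai_probability(model, (0.3, 0.2))
```

The last line is the point of the sketch: the model has no notion of who wrote the text, only of where the features fall, so a plain-writing human with AI-like numbers receives an AI-like probability.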

3. Watermarking

A fundamentally different idea: instead of guessing after the fact, the AI provider subtly biases the model's word choices following a secret pattern as it generates. A matching detector that knows the pattern can then spot it. In principle this is the most robust method, but it only works if the provider actually watermarks output and the watermark survives. Paraphrasing or even moderate editing tends to wash it out.

How reliable are they, really?

This is where the marketing and the evidence part ways. AI detectors make two kinds of errors, and both are common:

  • False positives: flagging genuine human writing as AI. Because detectors reward "plainness," clear, formulaic, well-organized human writing can score as machine-made.
  • False negatives: missing real AI text, especially after a human lightly edits or paraphrases it.
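The two error types, and why they matter in practice, can be worked through with a little arithmetic. The sketch below uses made-up numbers purely for illustration (no real detector's figures are implied): it computes the false-positive and false-negative rates from a hypothetical confusion matrix, then asks what share of all flagged texts were actually written by humans when only a minority of submissions are AI-generated.

```python
def error_rates(true_pos, false_pos, true_neg, false_neg):
    """Return (false-positive rate, false-negative rate) from confusion-matrix counts."""
    fpr = false_pos / (false_pos + true_neg)   # humans wrongly flagged / all humans
    fnr = false_neg / (false_neg + true_pos)   # AI texts missed / all AI texts
    return fpr, fnr


def share_of_flags_that_are_wrong(fpr, fnr, ai_share):
    """Fraction of flagged texts that are actually human-written."""
    flagged_human = fpr * (1 - ai_share)
    flagged_ai = (1 - fnr) * ai_share
    return flagged_human / (flagged_human + flagged_ai)


# Hypothetical batch of 1,000 texts, 100 of them AI-written:
# 80 AI texts caught, 20 missed, 45 human texts wrongly flagged, 855 cleared.
fpr, fnr = error_rates(true_pos=80, false_pos=45, true_neg=855, false_neg=20)

wrong_share = share_of_flags_that_are_wrong(fpr, fnr, ai_share=0.10)
```

In this invented scenario the false-positive rate is only 5%, yet 36% of the texts the detector flags are human-written, because honest writers far outnumber AI submissions. The lower the real share of AI text, the larger the fraction of accusations that land on the wrong people.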

Two public facts anchor the skepticism:

  1. OpenAI discontinued its own AI Text Classifier in July 2023, citing its low rate of accuracy. The company that builds the leading models could not ship a reliable detector for them.
  2. Researchers have raised the alarm on bias. A widely cited 2023 Stanford study (Liang et al., published in Patterns) found that detectors disproportionately flag writing by non-native English speakers, whose simpler, more predictable phrasing reads as "low perplexity," risking unfair accusations.

The deeper problem is structural: detection is a guess about surface patterns, and anything that changes those patterns defeats it, including the ordinary editing every careful writer does anyway.

Why detectors are easy to fool

Because the signal is statistical rather than semantic, lots of mundane actions lower a detector's confidence: rephrasing sentences, varying their length, swapping a few words, asking the model to write in a more "human" or varied style, or running the text through a paraphraser. Watermark detection only helps when a watermark was added and survived; frequently it wasn't, or didn't. This is a classic cat-and-mouse race, and the cat is losing.

What to do instead

For anything with real consequences (grades, jobs, publication, moderation), a single detector score is the wrong tool. Better signals come from process and context:

  • Look at draft history and version control rather than just the final text.
  • Ask follow-up questions about the work, or compare against a known writing sample.
  • Judge whether the content is actually correct, original and useful. An LLM's genuine weakness isn't that it's detectable; it's hallucination: stating false things confidently. Verifying facts catches more real problems than any detector.
  • If you must use a detector, treat its output as one weak input, document the false-positive risk, and never automate a decision or accusation on it alone.

For related context on how these models handle your data and where the real risks lie, see whether ChatGPT is safe to use.

The bottom line

AI detectors work by measuring the statistical fingerprints of machine text (low perplexity, low burstiness, learned classifier patterns, or provider watermarks), never by understanding meaning. That design makes them fundamentally probabilistic: prone to false positives (especially against plain or non-native writing), easy to defeat with light editing, and unreliable enough that even OpenAI pulled its own detector. Use them, if at all, as a faint hint, and base real decisions on process, context and whether the writing is actually any good.

FAQ

How do AI detectors work?
AI text detectors look for statistical fingerprints of machine writing rather than 'reading' for meaning. The two classic signals are perplexity (how predictable each word is; AI text tends to be very predictable, so low perplexity) and burstiness (how much sentence length and complexity vary; humans vary more, AI tends to be flatter). Most modern tools also run a trained classifier: a model shown many human and AI samples that learns to output a probability that a passage is machine-generated. A third approach is watermarking, where the AI provider biases word choice in an invisible pattern that a matching detector can later spot. None of these inspects facts or intent; they all estimate likelihood from surface patterns.

Are AI detectors reliable?
Not reliable enough to be used as proof. They produce both false positives (flagging genuine human writing as AI) and false negatives (missing real AI text, especially after light editing or paraphrasing). OpenAI publicly discontinued its own AI Text Classifier in July 2023, citing its low rate of accuracy. Because detectors key on statistical 'plainness,' clear and formulaic human writing can trip them, while a few human edits or a paraphrasing pass can defeat them. Treat any score as a weak signal, never a verdict.

Do AI detectors give false positives?
Yes, and that's their most serious weakness. A detector measures how 'predictable' text looks, so straightforward, well-structured human writing (the kind students and professionals are taught to produce) can score as AI. Published research has also raised concerns that detectors disproportionately flag text written by non-native English speakers, whose phrasing tends to be simpler and more predictable. Acting on a false positive (for example, accusing a student or rejecting a writer) can cause real harm, which is why no responsible policy should rely on a detector alone.

Can AI detectors be fooled?
Easily, in practice. Light editing, rephrasing, swapping a few words, asking the model to write in a more varied or 'human' style, or running the text through a paraphrasing tool can all lower a detector's confidence. Watermark-based detection only works if the provider added a watermark and it survived editing, which it often doesn't. Because detection is an estimate of surface patterns, anything that changes those patterns, including normal human editing, degrades it. This cat-and-mouse dynamic is why detection alone can't be a dependable gate.

What should I use instead of an AI detector?
For anything consequential, lean on process and context rather than a single score. Look at draft history and version control, ask follow-up questions about the work, compare against a known writing sample, and judge whether the content is actually correct, original and useful; an LLM's real weakness is hallucination, not detectability. If you use a detector at all, treat it as one weak input among many, document the false-positive risk, and never make an accusation or automated decision on its output alone.