"Was this written by AI?" is now a daily question for teachers, editors, recruiters and platform moderators, and a small industry of AI detectors promises a yes-or-no answer. This guide explains how those tools actually work under the hood, the signals they rely on, and the uncomfortable truth about how reliable they really are.
What an AI detector is trying to do
An AI text detector estimates the probability that a passage was generated by a language model rather than written by a person. Crucially, it doesn't understand the text or check whether it's true. It looks at surface statistics (the shape and predictability of the words) and outputs a likelihood. That distinction matters, because it's the root of every limitation that follows.
To see why these statistics exist, it helps to know how the text was produced in the first place: an LLM generates writing one token at a time, repeatedly picking a likely next token from the probabilities it predicts. That very process leaves a faint statistical signature, and detectors hunt for it.
The three core techniques
1. Perplexity and burstiness
The oldest and most common approach measures two things:
- Perplexity: how surprised a language model is by each word. Because an LLM writes by choosing high-probability words, AI text tends to be very predictable, so it scores low perplexity. Human writing is messier and less predictable.
- Burstiness: how much sentence length and complexity vary across a passage. People write in bursts: a long winding sentence, then a short one. Machine text is often flatter and more uniform.
A detector combines low perplexity and low burstiness into a "this looks machine-written" signal. It's intuitive, but it's also exactly why plain, well-structured human writing gets misjudged.
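Both measures can be sketched in a few lines. The example below is a toy under stated assumptions, not how any particular detector is built: it swaps the LLM for an add-one-smoothed unigram model, and it defines burstiness as the coefficient of variation of sentence lengths, which is only one of several possible definitions. The function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; real detectors use a model's subword tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_perplexity(text, reference_corpus):
    """Perplexity of `text` under an add-one-smoothed unigram model.

    Assumption: the unigram model is a toy stand-in for an LLM, which
    would supply context-dependent per-token probabilities instead.
    """
    ref_tokens = tokenize(reference_corpus)
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    total = len(ref_tokens)

    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text has no tokens")
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    # Perplexity is exp of the average negative log-probability per token.
    return math.exp(-log_prob / len(tokens))


def burstiness(text):
    """Coefficient of variation of sentence lengths, measured in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    if mean == 0:
        return 0.0
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return math.sqrt(variance) / mean
```

Three equally long sentences give a burstiness of exactly zero, which is the "flat" profile a detector reads as machine-like. A careful human writer who happens to favor even sentence lengths would score the same way.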
2. Trained classifiers
The modern approach is a machine-learning classifier. The tool is shown large numbers of human-written and AI-written samples and learns, on its own, the patterns that separate them, then outputs a probability for new text. This is the same family of technique behind spam filters, applied to authorship.
The catch: a classifier is only as good as its training data. It learns the styles of the models and topics it saw, and it can be confidently wrong on anything outside that distribution: new models, edited text, or writers whose natural style resembles the "AI" patterns it learned.
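To make the idea concrete, here is a minimal sketch of such a classifier: a bag-of-words Naive Bayes model, the same kind of technique long used in spam filtering. Everything in it is an assumption for illustration. The class name is hypothetical, the four training sentences are invented toy data, and production detectors typically use far larger neural models.

```python
import math
import re
from collections import Counter


def words(text):
    return re.findall(r"[a-z']+", text.lower())


class TinyNaiveBayes:
    """Bag-of-words Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(words(text))
        self.vocab = set()
        for counter in self.word_counts.values():
            self.vocab.update(counter)
        return self

    def predict_proba(self, text):
        """Return {label: probability} for one passage."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for c in self.classes:
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in words(text):
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[c][w] + 1) / denom)
            log_scores[c] = score
        # Normalize log scores into probabilities, shifted for stability.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


# Invented toy samples; a real training set would be orders of magnitude larger.
TOY_HUMAN = [
    "honestly i dunno, the movie kinda dragged but whatever",
    "ugh, missed the bus again. typical monday!",
]
TOY_AI = [
    "In conclusion, it is important to note that there are several key factors.",
    "Overall, it is important to consider the various key aspects of this topic.",
]
clf = TinyNaiveBayes().fit(
    TOY_HUMAN + TOY_AI,
    ["human"] * len(TOY_HUMAN) + ["ai"] * len(TOY_AI),
)
print(clf.predict_proba("It is important to note the key factors"))
```

The toy model scores that last sentence as very likely "ai", although a person could easily have written it. That is the training-data catch in miniature: formal human prose simply looks like the "ai" samples the classifier happened to see.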
3. Watermarking
A fundamentally different idea: instead of guessing after the fact, the AI provider subtly biases the model's word choices following a secret pattern as it generates. A matching detector that knows the pattern can then spot it. In principle this is the most robust method, but it only works if the provider actually watermarks output and the watermark survives. Paraphrasing or even moderate editing tends to wash it out.
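One way to picture this is a "green list" scheme, sketched below as a toy. A keyed hash of the previous token marks roughly half of the vocabulary "green" at each step, the generator prefers green words, and the detector counts how many green words appear compared with chance. The key, the vocabulary, the random-word "model" and all function names are assumptions for illustration; this is not any provider's actual scheme.

```python
import hashlib
import math
import random

SECRET_KEY = "demo-key"  # hypothetical secret shared by generator and detector
GREEN_FRACTION = 0.5     # share of the vocabulary marked "green" at each step


def is_green(prev_token, token):
    """A keyed hash of (previous token, candidate) decides green vs. red."""
    data = f"{SECRET_KEY}|{prev_token}|{token}".encode()
    return hashlib.sha256(data).digest()[0] / 256 < GREEN_FRACTION


def generate(vocab, length, rng, watermark):
    """Toy 'model': picks random words; with watermark on, prefers green ones."""
    tokens = [rng.choice(vocab)]
    while len(tokens) < length:
        candidates = rng.sample(vocab, k=8)
        if watermark:
            green = [t for t in candidates if is_green(tokens[-1], t)]
            candidates = green or candidates  # fall back if no green candidate
        tokens.append(candidates[0])
    return tokens


def z_score(tokens):
    """How far the green count sits above what unwatermarked text would give."""
    n = len(tokens) - 1
    if n < 1:
        raise ValueError("need at least two tokens")
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / std


rng = random.Random(0)
vocab = [f"w{i}" for i in range(200)]
marked = generate(vocab, 200, rng, watermark=True)
plain = generate(vocab, 200, rng, watermark=False)
# Simulate heavy editing: replace every other token with a random word.
edited = [rng.choice(vocab) if i % 2 else t for i, t in enumerate(marked)]
print(z_score(marked), z_score(plain), z_score(edited))
```

The watermarked sequence scores far above chance, while the unwatermarked one hovers near zero. The edited version falls back toward zero as well, because every adjacent pair now contains a replaced token, which illustrates how editing washes the signal out.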
How reliable are they, really?
This is where the marketing and the evidence part ways. AI detectors make two kinds of errors, and both are common:
- False positives: flagging genuine human writing as AI. Because detectors treat "plainness" as a sign of machine authorship, clear, formulaic, well-organized human writing can score as machine-made.
- False negatives: missing real AI text, especially after a human lightly edits or paraphrases it.
Two public facts anchor the skepticism:
- OpenAI discontinued its own AI Text Classifier in July 2023, citing its low rate of accuracy. The company that builds the leading models could not ship a reliable detector for them.
- Researchers have raised the alarm on bias. A widely cited 2023 Stanford study (Liang et al., published in Patterns) found that detectors disproportionately flag writing by non-native English speakers, whose simpler, more predictable phrasing reads as "low perplexity," risking unfair accusations.
The deeper problem is structural: detection is a guess about surface patterns, and anything that changes those patterns defeats it, including the ordinary editing every careful writer does anyway.
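How much damage false positives do also depends on base rates. As a rough sketch, the function below applies Bayes' rule to estimate how likely a flagged passage is to be AI-written. The function name and every number in the example are illustrative assumptions, not measured rates for any real detector.

```python
def prob_ai_given_flag(prevalence, true_positive_rate, false_positive_rate):
    """Bayes' rule: P(text is AI-written | the detector flagged it)."""
    flagged_ai = prevalence * true_positive_rate
    flagged_human = (1 - prevalence) * false_positive_rate
    if flagged_ai + flagged_human == 0:
        raise ValueError("the detector never flags anything with these rates")
    return flagged_ai / (flagged_ai + flagged_human)


# Assumed scenario: 10% of submissions are AI-written, the detector catches
# 90% of them, and it wrongly flags 5% of human-written submissions.
print(round(prob_ai_given_flag(0.10, 0.90, 0.05), 3))  # 0.667
```

Under these assumed numbers, about one flag in three lands on a human writer, even though the hypothetical detector is wrong about human text only 5% of the time.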
Why detectors are easy to fool
Because the signal is statistical rather than semantic, lots of mundane actions lower a detector's confidence: rephrasing sentences, varying their length, swapping a few words, asking the model to write in a more "human" or varied style, or running the text through a paraphraser. Watermark detection only helps when a watermark was added and survived; frequently it wasn't, or didn't. This is a classic cat-and-mouse race, and the cat is losing.
What to do instead
For anything with real consequences (grades, jobs, publication, moderation), a single detector score is the wrong tool. Better signals come from process and context:
- Look at draft history and version control rather than just the final text.
- Ask follow-up questions about the work, or compare against a known writing sample.
- Judge whether the content is actually correct, original and useful. An LLM's genuine weakness isn't that it's detectable; it's hallucination: stating false things confidently. Verifying facts catches more real problems than any detector.
- If you must use a detector, treat its output as one weak input, document the false-positive risk, and never automate a decision or accusation on it alone.
For related context on how these models handle your data and where the real risks lie, see whether ChatGPT is safe to use.
The bottom line
AI detectors work by measuring the statistical fingerprints of machine text (low perplexity, low burstiness, learned classifier patterns, or provider watermarks), never by understanding meaning. That design makes them fundamentally probabilistic: prone to false positives (especially against plain or non-native writing), easy to defeat with light editing, and unreliable enough that even OpenAI pulled its own detector. Use them, if at all, as a faint hint, and base real decisions on process, context and whether the writing is actually any good.


