
How Do AI Detectors Work? (And How Reliable Are They, 2026)

PrivSec Lab·June 20, 2026·5 min read

A fountain pen writing cursive text on lined paper

AI detectors flag text as machine-written using signals like perplexity and burstiness, trained classifiers, and watermarking. How they actually work, why their false-positive rates are high, and what they're really worth.

"Was this written by AI?" is now a daily question for teachers, editors, recruiters and platform moderators. And a small industry of AI detectors promises a yes-or-no answer. This guide explains how those tools really work, the signals they rely on, and the hard truth about how reliable they really are.

What an AI detector is trying to do

An AI text detector estimates the odds that a passage came from a language model rather than a person. Crucially, it doesn't understand the text or check whether it's true. It looks at surface statistics - the shape and predictability of the words - and outputs a likelihood. That gap matters. It's the root of every limit that follows.

To see why these statistics exist, it helps to know how the text was made. An LLM writes by repeatedly predicting which tokens are likely to come next and picking from the likelier candidates. That process leaves a faint statistical trace, and detectors hunt for it.


The three core techniques

1. Perplexity and burstiness

The oldest and most common approach measures two things:

Perplexity - how surprised a language model is by each word. An LLM writes by choosing high-odds words. So AI text tends to be very predictable, and it scores low perplexity. Human writing is messier and less predictable.
Burstiness - how much sentence length and complexity vary across a passage. People write in bursts: a long winding sentence, then a short one. Machine text is often flatter and more even.

A detector blends low perplexity and low burstiness into a "this looks machine-written" signal. The idea makes sense. But it is also exactly why plain, well-built human writing gets misjudged.
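The two signals can be sketched in a few lines of Python. This is a toy illustration, not how any particular detector is built. It assumes a unigram word-frequency model in place of a real LLM's next-token probabilities, and it assumes burstiness is measured as the spread of sentence lengths divided by their mean. The function names and the reference text are made up for the example.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase word tokens. A real detector would use an LLM's tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_perplexity(text, reference_counts):
    """Perplexity = exp(mean negative log-probability per token).

    Probabilities come from add-one-smoothed unigram counts, a toy
    stand-in (assumption) for a language model's next-token probabilities.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text has no tokens")
    total = sum(reference_counts.values())
    vocab_size = len(reference_counts) + 1  # one extra bucket for unseen words
    neg_log_probs = [
        -math.log((reference_counts[t] + 1) / (total + vocab_size))
        for t in tokens
    ]
    return math.exp(mean(neg_log_probs))


def burstiness(text):
    """Coefficient of variation of sentence lengths (std / mean), in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [n for n in (len(tokenize(s)) for s in sentences) if n > 0]
    if not lengths:
        return 0.0
    return pstdev(lengths) / mean(lengths)


# Made-up reference text standing in for "what the model expects".
reference_counts = Counter(
    tokenize("The cat sat on the mat. The dog sat on the rug.")
)

predictable = unigram_perplexity("the cat sat on the mat", reference_counts)
surprising = unigram_perplexity(
    "quantum marmalade negotiates sideways", reference_counts
)

flat = burstiness("One two three. Four five six. Seven eight nine.")
bursty = burstiness(
    "Stop. The long winding sentence keeps going and going "
    "until it finally ends. Done."
)
```

Here `predictable` comes out lower than `surprising`, and `flat` comes out lower than `bursty`. A detector built on these signals would lean toward "machine-written" for the low-scoring text in each pair, which is also how a tidy human paragraph ends up flagged.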

2. Trained classifiers

The modern approach is a machine-learning classifier. The tool is shown many human-written and AI-written samples. On its own, it learns the patterns that tell them apart. Then it outputs a score for new text. This is the same kind of method behind spam filters, but aimed at who wrote the text.

The catch: a classifier is only as good as its training data. It learns the styles of the models and topics it saw. So it can be confidently wrong on anything outside that set - new models, edited text, or writers whose natural style looks like the "AI" patterns it learned.
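A minimal sketch of the classifier idea, under heavy assumptions: the training rows below are synthetic numbers invented for illustration, the two features (perplexity and burstiness) are a simplification of what real tools learn, and the model is a plain logistic regression written by hand. Real detectors typically use far larger models and datasets.

```python
import math

# Synthetic, hand-made rows for illustration only (not real measurements):
# (perplexity, burstiness) -> 1 = AI-written, 0 = human-written.
TRAIN = [
    ((12.0, 0.15), 1), ((15.0, 0.20), 1), ((18.0, 0.25), 1), ((14.0, 0.10), 1),
    ((45.0, 0.70), 0), ((60.0, 0.90), 0), ((38.0, 0.60), 0), ((52.0, 0.80), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, epochs=2000, lr=0.1):
    """Train logistic regression with per-sample gradient descent."""
    xs = [x for x, _ in rows]
    n_features = len(xs[0])
    # Standardize each feature so both contribute on a similar scale.
    means = [sum(x[j] for x in xs) / len(xs) for j in range(n_features)]
    stds = [
        math.sqrt(sum((x[j] - means[j]) ** 2 for x in xs) / len(xs)) or 1.0
        for j in range(n_features)
    ]

    def scale(x):
        return [(x[j] - means[j]) / stds[j] for j in range(n_features)]

    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, label in rows:
            z = scale(x)
            score = sum(w * v for w, v in zip(weights, z)) + bias
            error = sigmoid(score) - label
            weights = [w - lr * error * v for w, v in zip(weights, z)]
            bias -= lr * error

    def predict_proba(x):
        """Estimated probability that the text behind `x` is AI-written."""
        z = scale(x)
        return sigmoid(sum(w * v for w, v in zip(weights, z)) + bias)

    return predict_proba


predict_proba = fit(TRAIN)

typical_ai = predict_proba((13.0, 0.12))
typical_human = predict_proba((55.0, 0.85))
# A hypothetical human who simply writes plainly and evenly.
plain_human = predict_proba((16.0, 0.20))
```

The last line is the catch in miniature. The classifier has no way to know that the third text came from a person. Its features sit among the "AI" examples, so `plain_human` scores above 0.5 just as `typical_ai` does.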

3. Watermarking

This is a whole different idea. Instead of guessing after the fact, the AI provider gently skews the model's word choices as it writes, following a secret pattern. A matching detector that knows the pattern can then spot it. In theory this is the most robust method. But it only works if the provider actually watermarks the output and the watermark survives. Rewording, or even moderate editing, tends to wash it out.
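One scheme described in the research literature, often called a green-list watermark, gives a feel for how this can work. The sketch below is a simplified toy version and an assumption about the "secret pattern", not any provider's actual method. A keyed hash of the previous token marks about half the vocabulary as "green", the generator only picks green tokens, and the detector counts green tokens and converts the count to a z-score. The key, the vocabulary and the uniform-random "model" are all made up for the example.

```python
import hashlib
import math
import random

GAMMA = 0.5              # expected fraction of tokens that are "green"
SECRET_KEY = "demo-key"  # hypothetical secret shared by provider and detector
VOCAB = [f"w{i}" for i in range(50)]  # toy vocabulary


def is_green(prev_token, token, key=SECRET_KEY):
    """Keyed hash of (previous token, token) decides green-list membership."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 256 < GAMMA


def generate(length, watermark, rng):
    """Toy 'model': uniform random tokens, restricted to green if watermarked."""
    tokens = [rng.choice(VOCAB)]
    while len(tokens) < length:
        candidates = VOCAB
        if watermark:
            green = [t for t in VOCAB if is_green(tokens[-1], t)]
            candidates = green or VOCAB  # fall back if the green list is empty
        tokens.append(rng.choice(candidates))
    return tokens


def z_score(tokens):
    """How far the green count sits above what unmarked text would show."""
    n = len(tokens) - 1  # number of (previous, current) pairs
    green = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


rng = random.Random(0)
marked = generate(200, watermark=True, rng=rng)
plain = generate(200, watermark=False, rng=rng)
# Simulated rewording: replace roughly half the tokens at random.
edited = [rng.choice(VOCAB) if rng.random() < 0.5 else t for t in marked]

z_marked, z_plain, z_edited = z_score(marked), z_score(plain), z_score(edited)
```

In this toy setup the watermarked text scores far above the unmarked text, and the reworded copy falls back toward it. Published schemes usually nudge the model toward green tokens rather than forcing them, which keeps the text natural but makes the signal weaker still.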

Image: a circuit board and microchip. Most detectors run a trained classifier, a model that has learned the statistical patterns separating human from machine text.

How reliable are they, really?

This is where the marketing and the evidence part ways. AI detectors make two kinds of errors, and both are common:

False positives - flagging real human writing as AI. Detectors reward "plainness." So clear, by-the-book, well-ordered human writing can score as machine-made.
False negatives - missing real AI text, above all after a human lightly edits or rewords it.

Two public facts anchor the doubt:

OpenAI shut down its own AI Text Classifier in July 2023, citing its low rate of accuracy. The firm that builds the leading models could not ship a reliable detector for them.
Researchers have raised the alarm on bias. A widely cited 2023 Stanford study (Liang et al., published in Patterns) found that detectors flag writing by non-native English speakers more often. Their simpler, more predictable phrasing reads as "low perplexity," which risks unfair accusations.

The deeper problem is built in. Detection is a guess about surface patterns. So anything that changes those patterns defeats it - including the plain editing every careful writer does anyway.
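A quick piece of arithmetic shows why even a small false-positive rate matters. The sketch below applies Bayes' rule to a detector flag. Every number in it is hypothetical and chosen only to make the sums easy to follow. None of them is a measured rate for any real tool.

```python
def prob_ai_given_flag(prior_ai, true_positive_rate, false_positive_rate):
    """Bayes' rule: chance a flagged text is really AI-written."""
    flagged_ai = true_positive_rate * prior_ai
    flagged_human = false_positive_rate * (1 - prior_ai)
    if flagged_ai + flagged_human == 0:
        raise ValueError("detector never flags anything with these rates")
    return flagged_ai / (flagged_ai + flagged_human)


# Hypothetical numbers for illustration only:
# 10% of submissions are AI-written, the detector catches 80% of those,
# and it wrongly flags 5% of human-written submissions.
posterior = prob_ai_given_flag(
    prior_ai=0.10, true_positive_rate=0.80, false_positive_rate=0.05
)

# Out of 1,000 human-written submissions, how many get flagged anyway?
false_flags_per_1000_humans = round(1000 * 0.05)
```

With these made-up rates, `posterior` is about 0.64, so roughly one flag in three points at a human writer. And 50 of every 1,000 honest submissions would be flagged. The fewer AI-written texts there are in the pool, the worse that ratio gets.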

Why detectors are easy to fool

The signal is statistical, not based on meaning. So lots of plain actions lower a detector's confidence. You can reword sentences, vary their length, or swap a few words. You can ask the model to write in a more "human" or varied style. Or you can run the text through a rewording tool. Watermark detection only helps when a watermark was added and survived. Often it wasn't, or didn't. This is a classic cat-and-mouse race, and the cat is losing.

What to do instead

For anything with real stakes - grades, jobs, publication, moderation - a single detector score is the wrong tool. Better signals come from process and context:

Look at draft history and version control, not just the final text.
Ask follow-up questions about the work, or compare against a known writing sample.
Judge whether the content is correct, fresh and useful. An LLM's real weakness isn't that it's easy to detect. It's hallucination - stating false things with full confidence. Checking facts catches more real problems than any detector.
If you must use a detector, treat its output as one weak input. Note the false-positive risk. And never automate a decision or an accusation based on it alone.

For more on how these models handle your data and where the real risks lie, see whether ChatGPT is safe to use.

The bottom line

AI detectors work by measuring the statistical fingerprints of machine text - low perplexity, low burstiness, learned classifier patterns, or provider watermarks. They never understand meaning. That design makes them a matter of odds, not proof. They are prone to false positives, above all against plain or non-native writing. They are easy to defeat with light editing. And they are unreliable enough that even OpenAI pulled its own detector. Use them, if at all, as a faint hint. Base real decisions on process, context and whether the writing is any good.

Photo: Unsplash


FAQ

How do AI detectors work?

AI text detectors look for statistical fingerprints of machine writing. They do not 'read' for meaning. The two classic signals are perplexity and burstiness. Perplexity is how predictable each word is. AI text tends to be very predictable, so it has low perplexity. Burstiness is how much sentence length and complexity vary. Humans vary more, while AI tends to be flatter. Most modern tools also run a trained classifier. This is a model shown many human and AI samples. It learns to output the odds that a passage is machine-made. A third approach is watermarking. Here the AI provider skews word choice in a hidden pattern that a matching detector can later spot. None of these checks facts or intent. They all guess likelihood from surface patterns.

Are AI detectors reliable?

Not reliable enough to be used as proof. They make two kinds of mistakes. False positives flag real human writing as AI. False negatives miss real AI text, above all after light editing or rewording. OpenAI publicly shut down its own AI Text Classifier in July 2023. It cited the tool's low rate of accuracy. Detectors key on statistical 'plainness.' So clear, by-the-book human writing can trip them. And a few human edits or one rewording pass can beat them. Treat any score as a weak signal, never a verdict.

Do AI detectors give false positives?

Yes, and that's their most serious weakness. A detector measures how 'predictable' text looks. So clear, well-built human writing can score as AI. That is the kind of writing students and pros are taught to produce. Published research has also raised a concern about bias. Detectors flag text by non-native English speakers more often, because their phrasing tends to be simpler and more predictable. Acting on a false positive can cause real harm. You might accuse a student or reject a writer. That is why no sound policy should rely on a detector alone.

Can AI detectors be fooled?

Easily, in practice. Many simple moves lower a detector's confidence. You can edit lightly, reword, swap a few words, or ask the model to write in a more varied or 'human' style. You can also run the text through a rewording tool. Watermark-based detection only works if the provider added a watermark and it survived editing. Often one of those conditions fails. Detection is a guess about surface patterns. So anything that changes those patterns weakens it, even normal human editing. This cat-and-mouse race is why detection alone can't be a firm gate.

What should I use instead of an AI detector?

For anything that matters, lean on process and context, not a single score. Look at draft history and version control. Ask follow-up questions about the work, and compare it against a known writing sample. Then judge whether the content is correct, fresh and useful. An LLM's real weakness is hallucination, not how easy it is to detect. If you use a detector at all, treat it as one weak input among many. Note the false-positive risk. And never make an accusation or an automated decision on its output alone.
