AI Detector Benchmark

Independent accuracy testing on our curated dataset

  • 11 detectors tested
  • 19 AI models
  • 100% best accuracy

Detector Accuracy Ranking

| Rank | Detector | Accuracy | FP | FN | Images Tested |
| --- | --- | --- | --- | --- | --- |
| 🥇 1 | — | 100% | 0% | 0% | 9 |
| 🥈 2 | — | 94% | 20% | 0% | 18 |
| 🥉 3 | — | 92% | 3% | 5% | 263 |
| 4 | — | 85% | 8% | 17% | 1282 |
| 5 | — | 82% | 8% | 21% | 199 |
| 6 | — | 76% | 22% | 3% | 98 |
| 7 | — | 75% | 25% | 0% | 8 |
| 8 | — | 65% | 7% | 38% | 1122 |
| 9 | — | 61% | 17% | 37% | 1282 |
| 10 | — | 16% | 9% | 75% | 1282 |
| 11 | — | 0% | — | — | — |

Benchmark Methodology

Dataset

Our benchmark uses a curated dataset containing both AI-generated images and real photographs. AI images are generated from popular models including Midjourney, Stable Diffusion (SDXL, SD 3.5), DALL-E 3, Flux, Adobe Firefly, Leonardo.ai, Runway, Google Imagen, and Ideogram. Real images are sourced from photography databases to test for false positives. The dataset is designed to represent realistic use cases — not cherry-picked easy examples.

How Detectors Are Tested

Each image in the dataset is submitted to every detector under the same conditions. We record whether the detector classified the image as AI-generated or real, along with any confidence scores provided. Results are aggregated to compute accuracy, false positive rate, and false negative rate for each detector.

Metrics Explained

  • Accuracy — percentage of correct predictions across all images (both AI and real).
  • False Positive Rate (FP) — percentage of real images incorrectly flagged as AI-generated. A high FP rate means the detector is too aggressive.
  • False Negative Rate (FN) — percentage of AI images the detector missed (classified as real). A high FN rate means the detector is too lenient.

A detector with high accuracy but a high false positive rate may be unsuitable for contexts where wrongly accusing someone of using AI has consequences. Conversely, a low false negative rate matters most when catching AI content is the priority.
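As a minimal sketch of the aggregation described above, the three metrics can be computed from a list of (ground truth, detector verdict) pairs. The function name, field values, and data shape here are illustrative assumptions, not the Arena's actual pipeline:

```python
def score_detector(results):
    """Compute accuracy, FP rate, and FN rate from (truth, verdict) pairs.

    Each element of `results` is a (truth, verdict) tuple, where both
    values are "ai" or "real". This is an illustrative sketch; the
    Arena's real pipeline and field names are not published here.
    """
    # Accuracy: correct predictions over all images, AI and real alike.
    correct = sum(1 for truth, verdict in results if truth == verdict)
    accuracy = correct / len(results)

    # FP rate: fraction of real images flagged as AI.
    real = [v for t, v in results if t == "real"]
    fp_rate = sum(1 for v in real if v == "ai") / len(real) if real else 0.0

    # FN rate: fraction of AI images classified as real.
    ai = [v for t, v in results if t == "ai"]
    fn_rate = sum(1 for v in ai if v == "real") / len(ai) if ai else 0.0

    return accuracy, fp_rate, fn_rate
```

Note that the two error rates use different denominators (real images for FP, AI images for FN), which is why a detector can post high overall accuracy while still being unusable in one direction.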

Arena Elo Rankings

In addition to the static benchmark, the Arena provides live rankings using an Elo rating system. Users are shown an image alongside two detector verdicts and vote for the better answer. Over thousands of votes, detectors that consistently give correct answers rise in the rankings. Elo ratings complement the benchmark by incorporating real-world user judgment.
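A single head-to-head vote updates both detectors' ratings with the standard Elo formula. The K-factor of 32 below is an assumed value for illustration; the Arena's actual parameters are not stated here:

```python
def elo_update(rating_a, rating_b, winner_a, k=32):
    """Apply one Elo update after a head-to-head vote.

    `winner_a` is True if detector A's verdict won the vote.
    K=32 is an assumed factor, not the Arena's published setting.
    """
    # Expected score for A given the current rating gap.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if winner_a else 0.0
    # Zero-sum update: A gains exactly what B loses.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

With equal ratings the expected score is 0.5, so a win moves each rating by K/2; upsets against higher-rated detectors move ratings further, which is what lets consistently correct detectors climb over thousands of votes.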

Frequently Asked Questions

What is AI Detector Arena?
AI Detector Arena is an independent benchmark platform that tests AI image detectors against a curated dataset of AI-generated and real images. We measure accuracy, false positive rates, and false negative rates across multiple detectors and AI models. The Arena also features live Elo-based rankings where users vote on which detector performs better in head-to-head comparisons.
How does the AI detector benchmark work?
We submit the same set of images — both AI-generated and real photographs — to each detector and record their verdicts. From these results we calculate accuracy (percentage of correct predictions), false positive rate (real images incorrectly flagged as AI), and false negative rate (AI images missed by the detector). All detectors are tested on the same dataset for a fair comparison.
Which AI detectors are tested?
The benchmark includes major commercial and research AI detectors such as Hive Moderation, SightEngine, AI or Not, Illuminarty, Content at Scale, and others. We continuously add new detectors as they become available and re-test existing ones as they update their models.
Which AI image models are in the dataset?
Our dataset includes images from popular AI generators: Midjourney, Stable Diffusion (SDXL, SD 3.5), DALL-E 3, Flux, Adobe Firefly, Leonardo.ai, Runway, Google Imagen (Gemini), and Ideogram. The dataset also includes real photographs to test for false positives.
How is detector accuracy calculated?
Accuracy is the percentage of correct predictions out of all images tested. A prediction is correct when the detector identifies an AI image as AI-generated or a real image as real. We also track false positive rate (real images wrongly flagged as AI) and false negative rate (AI images wrongly classified as real) because overall accuracy alone can be misleading.
What is the Arena Elo rating?
The Arena uses an Elo rating system — the same system used in chess rankings. Users are shown an image and two detector verdicts side by side, then vote for the detector that gave the better answer. Over time, detectors that consistently win accumulate higher Elo scores. This provides a complementary ranking to the static benchmark.
Is the benchmark independent?
Yes. AI Detector Arena is not affiliated with any AI detector vendor or AI image generator. We do not accept payment for rankings or favorable placement. All results are based on automated testing against our curated dataset.