AI Detector Rankings

Combined benchmark accuracy & community arena ratings

11 Detectors · 19 AI Models · 98 Top Score

Detector Rankings

Rank   Detector   Score
🥇 1              97.7
🥈 2              96.2
🥉 3              88.7
   4              87.8
   5              87.3
   6              81.1
   7              77.2
   8              72.3
   9              50.7

How Rankings Work

Combined Score Formula

The Combined Score balances three key metrics: F1 Score (50%), False Positive Rate (30%), and False Negative Rate (20%).

Score = 0.5 × F1 + 0.3 × (1 - FPR) + 0.2 × (1 - FNR)

This formula rewards detectors that balance precision (avoiding false positives) with recall (catching AI images).
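As a quick illustration, the weighting can be written as a small function. This is a minimal sketch, not the Arena's own code; scaling the result to 0-100 to match the leaderboard numbers is an assumption.

```python
def combined_score(f1: float, fpr: float, fnr: float) -> float:
    """Benchmark score from the stated weights: 50% F1, 30% (1 - FPR), 20% (1 - FNR).

    Inputs are fractions in [0, 1]; the 0-100 scaling is an assumption made
    to match the leaderboard numbers.
    """
    return 100 * (0.5 * f1 + 0.3 * (1 - fpr) + 0.2 * (1 - fnr))

# Example: F1 = 0.96, FPR = 0.02, FNR = 0.05  ->  96.4
print(round(combined_score(0.96, 0.02, 0.05), 1))
```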

F1 Score

F1 is the harmonic mean of Precision and Recall. A high F1 means the detector is good at both catching AI images (low FNR) and not flagging real images (low FPR). Formula: F1 = 2 × Precision × Recall / (Precision + Recall).
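The same definition in code, as a minimal sketch for illustration only:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, per the formula above."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: precision 0.95, recall 0.90 -> F1 ≈ 0.924
print(round(f1_score(0.95, 0.90), 3))
```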

FPR & FNR

FPR (False Positive Rate) — percentage of real images incorrectly flagged as AI. Lower is better. FNR (False Negative Rate) — percentage of AI images missed. Lower is better.
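Both rates follow directly from confusion-matrix counts. A minimal sketch, with illustrative variable names rather than the benchmark's actual code:

```python
def error_rates(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """FPR and FNR from confusion-matrix counts, where "positive" means "flagged as AI"."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # real images wrongly flagged
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # AI images missed
    return fpr, fnr

# Example: 90 AI images caught, 10 missed, 3 of 100 real images flagged
print(error_rates(tp=90, fp=3, tn=97, fn=10))  # (0.03, 0.1)
```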

Dataset

Our benchmark uses images from Midjourney, Stable Diffusion (SDXL, SD 3.5), DALL-E 3, Flux, Adobe Firefly, Leonardo.ai, Runway, Google Imagen, and Ideogram. Real images are sourced from photography databases to test for false positives.

Frequently Asked Questions

What is AI Detector Arena?
AI Detector Arena is an independent benchmark platform that tests AI image detectors against a curated dataset of AI-generated and real images. We measure accuracy, false positive rates, and false negative rates across multiple detectors and AI models. The Arena also features live Elo-based rankings where users vote on which detector performs better in head-to-head comparisons.
How does the Combined Score work?
The Combined Score merges two ranking systems: Benchmark Accuracy (60%) and Arena Elo (40%). Benchmark Accuracy comes from automated testing on our curated dataset. Arena Elo comes from community votes in head-to-head battles. This gives you a single number that reflects both objective testing and real-world comparative judgment.
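In code, the 60/40 blend might look like the sketch below. The min-max normalization used to put Elo on a 0-100 scale, and the 800-1600 range, are assumptions for illustration; the Arena's actual normalization is not specified here.

```python
def overall_score(benchmark_accuracy: float, elo: float,
                  elo_min: float = 800.0, elo_max: float = 1600.0) -> float:
    """Blend benchmark accuracy (0-100) with Arena Elo at the stated 60/40 weights.

    The Elo normalization range is an assumption, not the Arena's published method.
    """
    elo_scaled = 100 * (elo - elo_min) / (elo_max - elo_min)
    return 0.6 * benchmark_accuracy + 0.4 * elo_scaled

# Example: 92.0 benchmark accuracy and 1300 Elo -> 0.6*92 + 0.4*62.5 = 80.2
print(round(overall_score(92.0, 1300), 1))
```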
How does the AI detector benchmark work?
We submit the same set of images — both AI-generated and real photographs — to each detector and record their verdicts. From these results we calculate accuracy (percentage of correct predictions), false positive rate (real images incorrectly flagged as AI), and false negative rate (AI images missed by the detector). All detectors are tested on the same dataset for a fair comparison.
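Conceptually, the evaluation loop looks like the sketch below; the Detector interface and its is_ai() method are hypothetical names standing in for each vendor's real API.

```python
from typing import Protocol

class Detector(Protocol):
    # Hypothetical interface: each real detector would wrap its own vendor API.
    def is_ai(self, image_path: str) -> bool: ...

def run_benchmark(detector: Detector, dataset: list[tuple[str, bool]]) -> dict[str, float]:
    """dataset holds (image_path, is_ai_label) pairs, identical for every detector."""
    tp = fp = tn = fn = 0
    for path, label_is_ai in dataset:
        verdict = detector.is_ai(path)
        if label_is_ai and verdict:
            tp += 1   # AI image correctly flagged
        elif label_is_ai:
            fn += 1   # AI image missed
        elif verdict:
            fp += 1   # real image wrongly flagged
        else:
            tn += 1   # real image correctly cleared
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```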
What is the Arena Elo rating?
The Arena uses an Elo rating system — the same system used in chess rankings. Users are shown an image and two detector verdicts side by side, then vote for the detector that gave the better answer. Over time, detectors that consistently win accumulate higher Elo scores. This provides a complementary ranking to the static benchmark.
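For reference, a standard Elo update after a single vote looks like this; K = 32 is a common default, not necessarily the Arena's setting.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update for one head-to-head vote between detectors A and B."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Example: a 1200-rated detector beats a 1300-rated one and gains about 20 points
print(elo_update(1200.0, 1300.0, a_won=True))
```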
Which AI detectors are tested?
The benchmark includes major commercial and research AI detectors such as Hive Moderation, SightEngine, AI or Not, Illuminarty, Content at Scale, and others. We continuously add new detectors as they become available and re-test existing ones as they update their models.
Which AI image models are in the dataset?
Our dataset includes images from popular AI generators: Midjourney, Stable Diffusion (SDXL, SD 3.5), DALL-E 3, Flux, Adobe Firefly, Leonardo.ai, Runway, Google Imagen (Gemini), and Ideogram. The dataset also includes real photographs to test for false positives.
Is the benchmark independent?
Yes. AI Detector Arena is not affiliated with any AI detector vendor or AI image generator. We do not accept payment for rankings or favorable placement. All results are based on automated testing against our curated dataset.