Category: AI Model Comparisons

10 Min Read

on February 24, 2026

University of Montreal Study Proves AI Beats Average Humans on Creativity Tests—But Top 10% Still Outperform GPT-4

The world’s largest creativity study just revealed an uncomfortable truth: half of humanity is now less creative than a language model.…

8 Min Read

Artur Markus

on February 22, 2026

Google Gemini 3.1 Pro Scores 77.1% on ARC-AGI-2—2.5x Jump Over Predecessor in Single Generation

Google just more than doubled its ARC-AGI-2 reasoning score in 90 days while keeping the price identical. The assumption that frontier AI improves linearly…

10 Min Read

Artur Markus

on February 16, 2026

Snorkel AI Commits $3M to Open Benchmarks Grant—Targeting the ‘Biggest Blind Spot’ Where AI Models Excel on Tests But Fail in Production

Claude Opus 4.6 just scored 76% on MRCR v2—up from 18.5% on its predecessor. GPT-5.3-Codex hit 77.3% on Terminal-Bench 2.0. Neither score…

10 Min Read

Artur Markus

on February 11, 2026

Claude Opus 4.6 Scores 76% on Long-Context Retrieval—4X Better Than Its Predecessor at 18.5%

A 310% improvement in a single release isn’t iteration—it’s a discontinuity. Anthropic just proved that model performance can…

8 Min Read

Artur Markus

on February 4, 2026

ChatGPT’s Market Share Drops to 61.3% as Gemini Surges 237% Year-Over-Year—The AI Chatbot Monopoly Era Ends

ChatGPT lost 25 percentage points of market share in 12 months. The company that ate its lunch isn’t a startup—it’s Google, the…

10 Min Read

Artur Markus

on January 22, 2026

TII’s Falcon-H1R 7B Outperforms 47B Models on Math Reasoning While Running on a 16GB Laptop

A 7-billion parameter model just scored 88.1% on AIME-24 math reasoning, beating models with 47 billion parameters. The parameter count arms…

9 Min Read

Artur Markus

on January 20, 2026

Google’s 12B TranslateGemma Outperforms Its Own 27B Model: Open Translation Hits 55 Languages with MetricX Score of 3.60

Google’s smaller translation model just beat its larger sibling on standardized benchmarks, forcing us to reconsider everything we…

10 Min Read

Artur Markus

on January 15, 2026

TII’s Falcon-H1R 7B Outperforms 47B Models on Math Reasoning While Running on a 16GB Laptop

A 7-billion parameter model just scored 88.1% on AIME-24 math reasoning, crushing NVIDIA’s 47B Nemotron at 49.7%. The assumption that…

12 Min Read

Artur Markus

on January 9, 2026

Inverse Scaling in Test-Time Compute: When More ML Reasoning Tokens Systematically Destroy Performance

The industry just spent billions convincing you that longer AI thinking equals better results. New research proves that’s…

11 Min Read

Artur Markus

on January 2, 2026

The Model Size Paradox: Why Anthropic’s October 2025 Research Proves That 250 Poisoned Documents Can Backdoor Any LLM—And Scaling to GPT-5 Won’t Save You

The security assumption that justified your $50 million scaling budget was just proven false by the company building the models you’re…

