In early 2023, if you wanted access to the smartest AI available, you paid about $30 per million tokens to use GPT-4. It was expensive, it lived in the cloud, and you couldn't run it yourself. Three years later, that same level of intelligence costs under a dollar per million tokens. In many cases, you can run it on a laptop for free.
At the cheaper end of that range, that's roughly a 200x cost reduction in under three years.
Numbers like this are easy to skim past, so let me put it in plainer terms. Imagine if the price of a new car dropped from $30,000 to $150 in the same window of time, and the car got faster and safer in the process. That is what has happened with artificial intelligence. And the trend line is still accelerating.
For association leaders, this reshapes the calculus of almost every conversation about what AI can and can't do for your organization. Many of the ideas that got tabled in 2023 because the math didn't work are now trivially affordable. Some of the ideas being tabled in early 2026 will be trivially affordable by year-end.
Here's how it happened, and why it should change how you plan.
Every major AI lab now ships a family of models instead of one flagship product. There's a big, expensive model that sets the top of the intelligence curve. There's a smaller, faster, cheaper model built for scale. Sometimes there's a third model in the middle.
The pattern that keeps repeating, cycle after cycle, is this: the smaller model catches up to where the flagship was just months earlier.
Each release cycle, the floor moves up. What required a top-tier model last spring runs on a mid-tier model by fall, and on an open-source model anyone can download by the following spring.
The same compression is playing out beyond the big commercial labs, often at a faster pace.
In mid-2024, Meta released a 405-billion-parameter flagship. A few months later, they released a 70-billion-parameter model (roughly one-sixth the size) that matched its performance. The technique behind that jump is called distillation, where the larger model essentially teaches the smaller one by generating training examples.
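The core of distillation can be sketched in a few lines. The idea is that the student model is trained to match the teacher's full probability distribution over answers, not just the single "right" answer, which transfers far more information per example. This is a minimal illustration of the loss involved, not Meta's actual training recipe; the logits and temperature below are made-up numbers:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw model scores into probabilities, softened by temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    Training the student to minimize this pushes its outputs toward the
    teacher's judgments across every option, not just the top answer.
    """
    p = softmax(teacher_logits, temperature)  # the teacher's "soft labels"
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The loss shrinks as the student's outputs approach the teacher's.
far = distillation_loss([4.0, 1.0, 0.5], [0.1, 2.0, 1.5])
near = distillation_loss([4.0, 1.0, 0.5], [3.8, 1.1, 0.4])
```

In practice the teacher also generates the training examples themselves, so the smaller model learns from a curriculum the larger model wrote.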
A Chinese lab called DeepSeek showed that a massive budget isn't a prerequisite for frontier performance. The base model behind their reasoning system was reportedly trained for about $6 million in compute, versus the hundreds of millions spent by comparable U.S. labs, and the resulting model performed at state-of-the-art levels.
And in April 2026, Google released Gemma 4. The flagship variant is a 31-billion-parameter model that ranks number three among all open models in the world, outperforming models twenty times its size. A smaller variant activates just 3.8 billion of its 26 billion parameters at a time using a technique called mixture of experts, making it dramatically cheaper to run. Edge versions run on phones, including vision and audio capabilities. Fully open release, with no commercial restrictions.
That's remarkable on its own. What makes it more remarkable: this open, free model performs roughly on par with a paid commercial product from the same company.
Here's what the current state of affairs looks like when you plot intelligence against cost.
The vertical axis is a composite intelligence score. The horizontal axis is cost per million tokens. What you want to see is a dot that's high on the chart (smart) and far to the left (cheap). Many of the recent releases sit in exactly that corner.
Models priced well under $2 per million tokens are scoring within a few points of the handful that cost five or ten times more. Near-state-of-the-art intelligence is available for roughly one-tenth the price of actual state-of-the-art.
Zoom the chart out to include the models that were state-of-the-art two years ago, and the shift becomes even more vivid.
The old flagships (Claude 3 Opus at around $30, GPT-4 at around $37) sit in a lonely corner of the chart at high prices and middling intelligence by today's standards. The rest of the field has collapsed toward the left edge, nearly all of it smarter than those older models and dramatically cheaper to run.
The bottleneck used to be budget. Any project that required serious model capability started with a cost estimate, and that estimate usually meant scaling the project down or shelving it.
The bottleneck now is imagination. Knowing what to ask the technology to do.
One concrete example. Many associations manage content taxonomies that have grown, contracted, and reshaped over decades. New topics emerge. Old classifications fall out of use. Retagging decades of articles against a current taxonomy used to be an impractical project, the kind of thing that would run into many thousands of dollars or tie up a team of humans for months.
With today's small models, that same project runs in days for tens or hundreds of dollars. The classification quality is often better than what a team of humans would produce, because the model reads every article in full and applies the same judgment consistently across the whole corpus.
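The back-of-envelope math behind "tens or hundreds of dollars" is worth seeing once. The archive size, tokens per article, and per-token price below are illustrative assumptions, not figures from any specific vendor's price list:

```python
def retagging_cost(num_articles, tokens_per_article, price_per_million_tokens):
    """Rough cost of running every article in an archive through a model once."""
    total_tokens = num_articles * tokens_per_article
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical archive: 50,000 articles, ~2,000 tokens each,
# classified by a small model priced at $0.50 per million input tokens.
cost = retagging_cost(50_000, 2_000, 0.50)  # → $50
```

Even tripling every assumption keeps the project in the hundreds of dollars, which is why the constraint has moved from budget to knowing the project is worth doing at all.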
The same pattern shows up almost everywhere you look. Member segmentation. Content summarization. Multi-language translation of legacy materials. Automated research reports drawing from internal data. Content quality review at scale. Projects that were prohibitively expensive a year ago, now trivially cheap.
The planning takeaway: when you're thinking about what your organization could do with AI in the next year or two, don't start from "what's affordable." Start from "what would be useful if intelligence were effectively free," and work backward. That framing is a much closer match to where the technology is heading.
A fair response to all of this is: things are moving fast, so maybe the right move is to wait for the dust to settle before committing resources.
The issue with that instinct is subtle.
The real risk of waiting shows up somewhere unexpected: in what doesn't happen during the wait. Your team doesn't build the judgment to recognize which problems are good AI problems. Your operations don't develop the muscle memory for working alongside AI systems. Your members don't develop expectations that you're keeping pace.
Familiarity with AI compounds the same way the underlying technology does. A year from now, associations that have been actively experimenting will be able to move quickly on the next generation of capability, because their people already know how to think with these tools. Associations that spent the same year waiting will be starting from zero against a much more capable technology. The gap between the two groups is widening, not narrowing.
The compression trend shows no sign of slowing. Every six to nine months, the industry delivers roughly twice the performance per dollar. Multiple research groups are publishing work that suggests the trend is accelerating, not tapering off.
The implication for planning is straightforward. If performance per dollar keeps doubling on that cadence, the models you can afford to run at scale a year from now will be smarter than the flagship models of today. Budget and design accordingly.
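The compounding claim is just exponent arithmetic. Using the six-to-nine-month doubling cadence stated above:

```python
def capability_multiplier(months, doubling_period_months):
    """Performance per dollar gained after `months`, given a doubling cadence."""
    return 2 ** (months / doubling_period_months)

# Over 12 months, a 6-to-9-month doubling period compounds to roughly 2.5x-4x.
low = capability_multiplier(12, 9)   # slower cadence: about 2.5x
high = capability_multiplier(12, 6)  # faster cadence: 4x
```

So a budget that buys mid-tier capability today buys something closer to today's flagship tier a year out, without growing at all.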
The associations that internalize this earliest will build capabilities over the next few years that are hard for peers to match. The ones still waiting will be learning the same lessons later, against dramatically more capable tools.