The Little AI Models That Could | [Sidecar Sync Episode 130]

Summary:

This week on the Sidecar Sync, Amith Nagarajan and Mallory Mejias trace one of the biggest stories in AI: how cutting-edge intelligence keeps getting compressed into smaller, faster, cheaper models. From GPT-4-era pricing to today’s open models like Gemma 4, they unpack the forces driving this shift, including distillation, mixture-of-experts architectures, quantization, and Google Research’s new TurboQuant breakthrough. Along the way, they explore what this means for associations: lower costs, better knowledge assistants, hybrid AI systems, and a future where organizations that start learning now will be far better positioned to capitalize on an era of abundant, inexpensive intelligence. 

Timestamps:

00:00 - Puppies & the Innovation Hub 
05:48 - The Little Models That Could
09:16 - The Cost-vs-Intelligence Curve
15:51 - Why Waiting on AI Is a Mistake
17:50 - Hybrid AI Architectures for Associations
23:49 - TurboQuant and the Math Behind Compression
32:52 - What TurboQuant Means for RAG and Knowledge Bases
40:59 - Jevons Paradox and the Future of Association AI
45:57 - Readiness and What Comes Next

 📍 Join us at Sidecar Innovation Hub on Tuesday, April 21, 2026 in Chicago for a hands-on day of AI strategy for associations:
https://sidecar.ai/innovation-hub 

👥Provide comprehensive AI education for your team

https://learn.sidecar.ai/teams

📅 Register for digitalNow 2026:

https://digitalnow.sidecar.ai/digitalnow

🤖 Join the AI Mastermind:

https://sidecar.ai/association-ai-mas...

🎀 Use code AIPOD50 for $50 off your Association AI Professional (AAiP) certification

https://learn.sidecar.ai/

📕 Download ‘Ascend 3rd Edition: Unlocking the Power of AI for Associations’ for FREE

https://sidecar.ai/ai

🛠 AI Tools and Resources Mentioned in This Episode:

Gemma 4 ➔ https://deepmind.google/models/gemma/gemma-4/

Gemini ➔ https://gemini.google.com

Claude ➔ https://www.anthropic.com/claude

ChatGPT ➔ https://chatgpt.com

NotebookLM ➔ https://notebooklm.google.com

DeepSeek ➔ https://www.deepseek.com

Llama ➔ https://www.llama.com

Mistral ➔ https://mistral.ai/en/

Artificial Analysis ➔ https://artificialanalysis.ai

MemberJunction ➔ https://memberjunction.org/

👍Please Like & Subscribe!

https://www.linkedin.com/company/sidecar-global

https://twitter.com/sidecarglobal

https://www.youtube.com/@SidecarSync


⚙️ Other Resources from Sidecar: 

More about Your Hosts:

Amith Nagarajan is the Chairman of Blue Cypress 🔗 https://BlueCypress.io, a family of purpose-driven companies and proud practitioners of Conscious Capitalism. The Blue Cypress companies focus on helping associations, non-profits, and other purpose-driven organizations achieve long-term success. Amith is also an active early-stage investor in B2B SaaS companies. He’s had the good fortune of nearly three decades of success as an entrepreneur and enjoys helping others in their journey.

📣 Follow Amith on LinkedIn:
https://linkedin.com/amithnagarajan

Mallory Mejias is passionate about creating opportunities for association professionals to learn, grow, and better serve their members using artificial intelligence. She enjoys blending creativity and innovation to produce fresh, meaningful content for the association space.

📣 Follow Mallory on Linkedin:
https://linkedin.com/mallorymejias

Read the Transcript

🤖 Please note this transcript was generated using (you guessed it) AI, so please excuse any errors 🤖

[00:00:00:14 - 00:00:09:17]
Amith
Welcome to the Sidecar Sync Podcast, your home for all things innovation, artificial intelligence and associations.

[00:00:09:17 - 00:00:24:18]
Amith
My name is Amith Nagarajan.

[00:00:24:18 - 00:00:26:15]
Mallory
 And my name is Mallory Mejias.

[00:00:26:15 - 00:00:35:18]
Amith
 And we are your hosts. And here we go again. We've got all sorts of craziness going on in the world of AI and associations, as always. Mallory, how are you doing today?

[00:00:35:18 - 00:00:49:05]
Mallory
 You know, Amith, I'm doing pretty well. I'm kind of chuckling to myself because we had a bit of a rocky start to the podcast this morning with puppy Winston, just being adorable and kind of, you know, doing puppy things. How are you today?

[00:00:49:05 - 00:01:07:19]
Amith
 Other than being exhausted, chasing him around and being up all night and all that, we're doing great. We're having a great time with him, and he turned 10 weeks on Sunday. So we're making progress, but so far his record is sleeping for four and a half hours, I think, without getting upset.

[00:01:07:19 - 00:01:20:00]
Mallory
 I was going to ask. I think that's better than the last time. Or maybe he had just hit a few-hour stretch the last time we recorded a pod. Four and a half hours, not bad, but not ideal when you're in deep sleep and you've got to pop up and take the dog out.

[00:01:20:00 - 00:01:32:05]
Amith
 Yeah, exactly. So we're working through that, but over the next two, three weeks, I'm optimistic we'll be making some improvements. So other than that, things are great. How's everything on your end?

[00:01:32:05 - 00:02:30:04]
Mallory
 Everything's been good over here. Still in full-blown house mode. I'm trying to think if we've made any progress since the last recording. Our cabinets have handles. That's big. Our electrical is working better. We did go through a time, Amith, where every time I used the microwave, it would flip a breaker. So that was a problem, but we did get that fixed. And other than that, I mean, we're just enjoying the warm weather in Atlanta. I know we've got some new listeners on the pod, so I want to say welcome to you all. If you haven't officially met us, or maybe this is your first time listening: someone on LinkedIn, I believe her name was Tanya, hopefully I'm pronouncing that correctly, commented on one of my posts about the podcast and said she always listens to us when she's getting her nails done every week. And I love that. I guess I don't often think about where our listeners are when you're consuming this podcast. Maybe you're making yourself a cup of coffee right now, you're on a walk, or you're getting your nails done. But welcome, and hopefully you enjoy this episode.

[00:02:30:04 - 00:02:54:22]
Amith
 I love that. Well, thanks for however you weave us into your weekly routine. We appreciate you being here and going on this AI journey with us. It's a lot of fun. It's a lot of crazy. And, you know, Mallory, things keep compressing in terms of time. We end up with more and more happening in less and less time. And thematically, that seems to be impacting what we chose to speak about on this episode as well, right?

[00:02:54:22 - 00:03:14:16]
Mallory
 Right. We're talking all about model compression: models getting smaller and cheaper and more powerful. Before we kick that off, though, I know we have an event coming up for Sidecar and the greater Blue Cypress family: the Innovation Hub, I believe next week, April 21st in Chicago. Will you be attending? And can you tell us anything about that event?

[00:03:14:16 - 00:04:07:13]
Amith
 I'm glad you reminded me. I'm so exhausted from puppy chasing, I might've forgotten it. Probably not, even with the puppy, but yes, I will be up there. It's Tuesday, April 21st, in Chicago. It's a full-day, one-day Innovation Hub event put on by Sidecar, and we expect to have over a hundred people there, so it's going to be the biggest Innovation Hub yet. This is our smaller springtime event that we do each year to really welcome the community to come in and talk about their use cases, what they've been doing with AI. We have a number of different speakers planned, a few people from our end, I'll be speaking briefly as well, but we have a lot of different folks from the community who are going to be sharing things they've been doing with AI. So it's always a fun time. It's a small, intimate gathering. It's a single track; there are no breakout sessions or anything like that, just one big room with everyone. And there are a lot of great conversations. So I'm looking forward to that.

[00:04:07:13 - 00:04:28:05]
Mallory
 That sounds like a great event, and the American College of Surgeons has a beautiful space; we've held the Innovation Hub there before. I think we started the Innovation Hub a few years ago, and I want to say that first year we only had about 35, maybe 45 attendees. So it's very exciting to see that it's growing, but still small and intimate enough that you can have those meaningful conversations.

[00:04:28:05 - 00:05:35:22]
Amith
 Yeah. We launched the event in order to have something that was a counterweight to digitalNow in the fall every year. DigitalNow, if you're not familiar, is our big fall event where we have hundreds of association execs and people from the community getting together. This year we expect to have over 400 people, and this is going to be the best digitalNow ever. But we wanted to have a springtime event that could complement it and be more of a community-driven event: more conversation, a lot lighter impact on the schedule, just being a one-day event. In the past we ran it in both Chicago and Washington, but we actually ended up realizing that we had people traveling in both directions; maybe because of schedules, it might work better for people in DC to go to the Chicago one, or vice versa. So this year we decided on an experiment. We said we'll just run it once, and we'll run it in the city where we're not having digitalNow. DigitalNow this fall will be in late October in Washington, DC, so we're having Innovation Hub this year in Chicago. Next year we're going to bring digitalNow back to Chicago, and we'll have the springtime Innovation Hub event in the DC area. So I'm pretty excited about it. We think the new cadence is going to work pretty well.

[00:05:35:22 - 00:05:47:10]
Mallory
 Well, I'm excited to hear how that goes. We'll definitely chat about it on our next pod. And any listeners, if you're going to be there, make sure you tell Amith hello and tell him to give Winston a hug for you.

[00:05:48:24 - 00:06:26:05]
Mallory
 All right. Today we are dedicating the full episode to one overarching topic. Since the very first episode of the show, we have tracked a trend: intelligence is being compressed into smaller and smaller packages. A development last week brought the whole story to a head. Google released Gemma 4, an open source model that rivals their own paid products. And right before that, Google Research published a compression breakthrough called TurboQuant that shows how far the engineering behind this trend has come. So today we want to zoom all the way out and tell that full story: the story of the little models that could.

[00:06:27:13 - 00:06:31:00]
Mallory
 All right. So first let's talk about the compression story from the beginning.

[00:06:32:00 - 00:06:56:09]
Mallory
 In early 2023, if you wanted the best AI available, you were paying roughly $30 per million tokens to use GPT-4 through OpenAI's cloud. It was big, expensive, and you couldn't run it yourself. Today, that same level of capability costs under a dollar, and in many cases you can run it on your own hardware for free. That's roughly a 200-fold cost reduction in under three years.

[00:06:57:10 - 00:09:01:07]
Mallory
 Every major AI company now releases a family of models. So they've got a powerful flagship and then a smaller, faster, cheaper version, and sometimes one in the middle. Anthropic has Haiku. Google has Flash. OpenAI has Mini. And what keeps happening is the small model catches up to where the big model was just months earlier. So Claude 3.5 Haiku, the smaller model, matched Claude 3 Opus, the most powerful model, within eight months. Google's Gemini 3 Flash outperforms their own 2.5 Pro while being three times faster. On the open source side, in mid-2024, Meta released Llama 3.1 with a massive 405 billion parameter flagship. Just a few months later, they released Llama 3.3 at 70 billion parameters, roughly a sixth of the size, and it matched the larger model's performance. They did that through a technique called distillation, where you essentially use the big model's outputs to teach the smaller one. And then we've got DeepSeek, a Chinese AI lab that showed you don't even need a massive budget to compete. Their R1 reasoning model was reportedly trained for just $6 million, compared to the hundreds of millions that US labs were spending, and it performed at the frontier. That brings us all the way to last week. Google released Gemma 4, the latest generation of their open source model family, a line we've tracked since episode 18 of the Sidecar Sync podcast. Gemma 4's flagship is a 31 billion parameter model that ranks number three among all open models in the world, outperforming models 20 times its size. Amith, your take when you initially shared this was that it's roughly as capable as Google's own Gemini 3 Flash, which is a paid product. A free open model matching a commercial offering from the same company. There's also a mixture-of-experts variant that only activates 3.8 billion of its 26 billion parameters at a time, and edge models that run on phones with vision and audio. It's released fully open source, no commercial restrictions.
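
For the technically curious, here's a minimal sketch of the distillation idea: the small "student" model is trained to imitate the full output distribution of the big "teacher" model, not just the right answers. This is a toy written in PyTorch, not Meta's actual Llama training recipe, and the shapes are placeholders:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Push the student's output distribution toward the teacher's."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence measures how far the student's predictions are from the teacher's.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2

# Toy example: a batch of 4 positions over a 10-token vocabulary.
teacher_logits = torch.randn(4, 10)                      # frozen big model's outputs
student_logits = torch.randn(4, 10, requires_grad=True)  # small model being trained
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                          # gradients flow only to the student
```

The temperature "softens" both distributions so the student also learns which wrong answers the teacher considered plausible, which carries far more signal than the right answer alone.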

[00:09:02:10 - 00:09:56:00]
Mallory
 The techniques driving all of this, distillation, mixture of experts, and quantization (which we'll talk about in the next topic), are compounding on one another. What was cutting edge and cloud-only 12 months ago is running on laptops and phones today. Right now I am sharing my screen with a visual. If you're listening audio-only, Amith and I will do our best to describe exactly what we're seeing, but I do recommend you check this out on YouTube if you would like to see the chart we have in front of us. On the X axis we're seeing price, in US dollars per million tokens. On the Y axis we're seeing the Artificial Analysis Intelligence Index, so kind of the overall intelligence of some of the most popular and used models today. There are 28 models plotted on this chart. Amith, can you describe what you're seeing when you look at this visual?

[00:09:56:00 - 00:11:28:01]
Amith
 Yes. When you look at the Y axis, the up and down on the chart, you see roughly the same thing on the left side of the chart and the right side, right? How high the little dot is on the chart means the model is smarter. Claude Opus 4.6, GPT 5.4 on extra high, and Gemini 3.1 Pro are the highest bubbles. They're all at around 56, I think, on the Artificial Analysis benchmark. And by the way, listeners, if you're not familiar with Artificial Analysis, definitely check out their website. It's a really cool way of comparing a bunch of different models. They have a lot of really good analytics on there, and they have their own index, which is essentially a composition of a number of other benchmarks that they bring together to form a single number showing overall intelligence. There are lots of ways of measuring models, but I like this one. It's just a simple way of saying, what's the overall intelligence level? So you can see Opus, GPT 5.4, and Gemini 3.1 Pro are at the very top. However, right below that, you have GLM 5.1, which is an open source model, and you have Gemini 3 Flash. And if you go even further to the left, anything that's above 50, you have Minimax 2.7, which is basically right up there at almost the same level of intelligence. And the cost is dramatically lower, right? Because what's further to the left is less expensive. So that's really the key message here: you're getting almost state-of-the-art intelligence for almost nothing, relatively speaking, in terms of cost.

[00:11:28:01 - 00:11:48:05]
Mallory
 So now I'm sharing another version of that same image, but I went ahead and added Claude 3 Opus and GPT-4, just to really drive home what Amith just said. When you add in these models, you can very clearly see basically a vertical line where all of these models are approaching zero in terms of cost. Is that what you're seeing, Amith?

[00:11:48:05 - 00:12:40:01]
Amith
 Yeah, essentially. I mean, the cost obviously isn't quite zero yet, but the trend line is what matters. And we have to remember to look at this from an exponential perspective. This isn't a historical chart that says 30 years ago something cost dramatically more than now. We're talking about Claude 3 Opus and GPT-4, which were not very long ago. Those models at the time were stunningly intelligent, state-of-the-art models, and they were very expensive. Obviously competition drove some of that pricing down, but what you're seeing here is that delivery of intelligence relative to the size of the model is a key component, because the smaller the model, the less expensive it is to run, essentially. So when you look on the left side of the chart, these are all smaller models, including even something like Mistral Small, from the French open source AI company, or GPT-OSS-120B, which is an open source model from OpenAI.

[00:12:41:11 - 00:13:01:10]
Amith
 And what you're seeing here, in comparison to the first chart, is that all of the more recent model releases are indeed clustered together the way you described, Mallory, on the left side of the chart, because they're so much less expensive than the models that were state of the art. And they're also a lot smarter. Even if you look at something like Mistral Small, it's dramatically more intelligent than GPT-4.

[00:13:02:11 - 00:15:51:02]
Amith
 And it's incredibly cheap to run. So the broader trend line is this. When we look at this cost curve, what we're seeing is that near-state-of-the-art, near-frontier intelligence is available at about one tenth the cost, right? And that will continue to hold. What you'll also find is that the same will be true for what's currently frontier intelligence if you just wait about six to 12 months. And many of the applications we are interested in running in associations, and many other organizations for that matter, are applications that don't need even today's state of the art. Claude 4.6 Opus, GPT 5.4 on extra high, Gemini 3.1 Pro: those are the three strongest models in the world at the moment, not including the new models that are coming that we've been hearing about, Methos, for example, from Anthropic, stuff like that. But these are the most powerful available models at the moment, and we don't even need that for most of the things we do. We can get by just fine with something like Gemini 3.1 Flash-Lite. You mentioned in the intro that that model is the one I indicated is roughly on par with Gemma 4. And that's pretty stunning, because Gemini 3.1 Flash-Lite is a really smart model. It's super, super fast. It's really cheap. And that's a commercial model. Then you have Gemma 4, which has a couple of different variants, and those are basically near free to run. You can run them yourself and they're effectively free, or you can run them on inference providers in the cloud and they're very low cost. And those are very close to Gemini 3.1 Flash-Lite. So the main takeaway, when you think about intelligence curves and adoption, is an application like classifying every document you've ever written in your association's history, or getting the tags from it. A lot of associations have these taxonomies that are very difficult to maintain because the taxonomies themselves are changing. And if a new topic has come out in your taxonomy, how do you go back and retag all of the old articles you've written, hundreds of thousands of articles over decades of time? That would obviously be an insurmountable obstacle classically. But now you can adapt your taxonomy to reflect current topics and subtopics in your profession and reclassify all of your old content very, very quickly and at almost no cost. You're talking about basically tens or hundreds of dollars to classify a massive corpus of content, versus tens of thousands or hundreds of thousands of dollars with models like GPT-4. And by the way, the classification results will be better as well, because these models are more intelligent. So that's just one example of an at-scale problem where someone might say, hey, I want to be able to do a lot more, but I'm limited by cost.
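
To make that retagging use case concrete, here's a hedged sketch of what the loop could look like. It assumes an OpenAI-compatible chat API; the model name and the taxonomy are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and an API key in the environment

TAXONOMY = ["advocacy", "certification", "events", "membership", "workforce"]  # placeholder tags

def tag_article(text: str) -> str:
    """Ask a small, cheap model to pick tags from the current taxonomy."""
    response = client.chat.completions.create(
        model="small-cheap-model",  # placeholder; substitute your provider's small model
        messages=[
            {"role": "system",
             "content": "Classify the article into one or more of these tags: "
                        + ", ".join(TAXONOMY)
                        + ". Reply with a comma-separated list of tags only."},
            {"role": "user", "content": text[:8000]},  # truncate very long articles
        ],
    )
    return response.choices[0].message.content

# Rough cost intuition: 1,000,000 articles x ~1,000 tokens each is about 1B tokens.
# At GPT-4-era pricing (~$30/M tokens) that's ~$30,000; at ~$0.10/M it's ~$100.
```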

[00:15:51:02 - 00:16:09:07]
Mallory
 What would you say, Amith, to the association leader who's looking at that chart we just shared and thinking: well, I'm glad I didn't do anything with GPT-4 or Claude 3 Opus, because everything's getting cheaper and more powerful. So you know what? I'll just wait another year until the models are more powerful and even cheaper, and then maybe I'll give it a go.

[00:16:09:07 - 00:17:25:22]
Amith
 What do you say? Great question, Mallory. And unfortunately, a lot of people are doing exactly that, not necessarily because they're thinking ahead in the way you describe, but just because they're saying, well, I'll just see how this thing shakes out more generally. And that's a big, big mistake, because the biggest problem you're going to have is not the model's ability to do what you want, but your own team's ability to understand what these things can do. To know what you want the model to do, you have to have some reasonable level of familiarity with AI, and the absence of experimentation, adoption, and real use in an organization comes from this "we'll just wait and see" mindset. I would say probably about half the association market right now is still, if they were honest with themselves, basically in a wait-and-see mindset. The other half, I think, is actively starting to deploy and really dive deeper. But there's roughly half the market where I see people giving lip service, saying, yeah, we're doing stuff, when really what that means is they've said AI is OK to use in some cases, but they haven't really pushed it at all. The unfortunate reality for those organizations is they're going to have a really hard time understanding which use cases are going to drive value. They're going to be at the starting line, and the models are going to be so much further ahead. It's like right now we're trying to catch a train that's already moving.

[00:17:26:24 - 00:17:48:22]
Amith
 That's tough, especially if the train has already left the station, but maybe it's going slow enough that you can run and jump on it, or maybe get on horseback and try to jump on that train. But pretty soon it's going to be going at Mach 10 or something, so you're going to need something else in order to figure out how to catch it. And that has to do not so much with the tech. Again, it's our brains, our ability to understand this stuff. That's the big issue.

[00:17:50:12 - 00:18:07:02]
Mallory
 It seems like the emerging pattern is hybrid architectures. We've talked about this on the pod before, Amith, where a smaller model handles the bulk of the routine work cheaply and then escalates complex cases to a frontier, more powerful model. Can you help paint a picture of what that looks like for associations?

[00:18:07:02 - 00:20:06:06]
Amith
 Yeah, I'll give you a really concrete example. When you think about breaking down a problem, forget about AI for a minute; just think about your own team. You might start off with the brain trust, a lot of people together thinking about a problem overall: how should we structure our annual meetings to be more engaging? That's a big, complex, challenging problem. It requires a lot of different skill sets, a lot of different data, a lot of different ideas to think through. How do we make our annual meeting more engaging? It's a fairly generalized question, but it's a question a lot of people are asking. To solve for that, you might need a lot of different skills, fairly high levels of experience, and access to a lot of data. But let's say you go through this process and that brain trust comes up with a plan, and you say, hey, we really like this plan. We're excited about it. This is what we're going to do for the 2027 annual meeting. We're super jazzed about it. It's going to have all these new innovative formats. We're going to have this new technology that's going to engage attendees in new ways, et cetera, et cetera. So you have a plan built. Now you say, okay, I need to break that plan up into 10, 15, 100 different subcomponents to go execute. And to execute that plan, I need to give whoever I'm assigning it to very clear instructions. I need to have them understand the overall plan. But they don't necessarily need the level of experience the planners had, right? The people at that meeting who created the plan. They just need to know how to execute it. In the world of AI, you can do similar things. You can use frontier models to create complex projects and plans, and then you can use worker bees, worker-bee AIs like Claude Sonnet or Haiku or GLM or Gemini Flash, to do a lot of the day-to-day work. And then you have the bigger, more intelligent AI oversee and review the work from the smaller AI. As a common pattern, I'll give you an example out of MemberJunction, our data platform. This is the open source data platform, if you haven't heard me speak about it before.

[00:20:07:06 - 00:22:19:12]
Amith
 And this is a totally free open source thing you can download. MemberJunction has an agent built into it called Research Agent. What Research Agent does is, it's capable of basically trying to solve any arbitrary research question you throw at it. It's kind of like how deep research works in Google's Gemini or in Claude or in ChatGPT. The difference is that this tool has access to your internal data, your structured database, but also your internal files, access to things like Slack and Teams, and it can do a really good job of helping you combine insights from all those things along with the public web. So let's say you go to Research Agent and you say, I have this complex question about how to solve for a particular problem. Or maybe you're just curious about something. You say something like, I really want to understand what motivated our attendees to come to last year's annual conference. Go figure that out. Well, you probably have a lot of data out there between emails, files, maybe stuff on the web, maybe your online community, and if you went through and read all of it and analyzed it and classified it, you might get a pretty good sense of the answer to that question. To come up with the plan of how to do that, you might want to put some time in, right? You don't want to just send off a dozen researchers without really good questions that you want answered. So Research Agent works exactly that way. It comes up with a hypothesis you want to test, and then it basically spawns all of these subagents to go off and do the actual tasks. Now, those subagents don't need to be nearly as smart as the overseer, the coordinator, essentially the orchestrator at the top level. And then finally, at the very end of the Research Agent process, you might want a slightly higher-level AI to put together the report, to actually distill down that knowledge and give you something actionable. So we use a high-end AI on the front end and a high-end AI on the back end, like Gemini 3.1 Pro. But then we might use something like Gemini 3 Flash or Flash-Lite or one of these other models that's very, very inexpensive to do the bulk of the work. And that's more token efficient, meaning more cost efficient, but it's also faster and it scales more. Because if I can do a report like that for five cents instead of five dollars, you'll probably do more of them.
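
As a rough sketch of that planner/worker/reviewer split, and to be clear, this is illustrative, not MemberJunction's actual Research Agent code, the pattern could look something like this, with placeholder model names standing in for the frontier and cheap tiers:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def research(question: str) -> str:
    # 1. A frontier model breaks the question into narrow subtasks.
    plan = ask("frontier-model",  # placeholder name
               f"Break this research question into 5 narrow subtasks, one per line:\n{question}")
    # 2. Cheap worker models handle each subtask independently.
    findings = [ask("small-cheap-model", f"Answer concisely: {task}")
                for task in plan.splitlines() if task.strip()]
    # 3. The frontier model reviews the workers' output and distills a report.
    return ask("frontier-model",
               f"Question: {question}\nFindings:\n" + "\n".join(findings)
               + "\nReview these findings and write a short, actionable report.")
```

The design choice mirrors the team analogy: expensive intelligence only at the planning and review steps, cheap intelligence everywhere in between.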

[00:22:19:12 - 00:22:21:23]
Mallory
 And it's also more energy efficient, right?

[00:22:21:23 - 00:22:23:03]
Amith
 That's true as well.

[00:22:24:10 - 00:25:27:13]
Mallory
 Now I want to move to topic two. We've been talking about the trend: models getting smaller and cheaper while staying just as capable. And I want to spend a few minutes on something that doesn't grab headlines but is quietly making all of this possible, and that is the engineering of compression. At the end of March, Google Research published a paper on a new technique called TurboQuant, being presented at ICLR 2026, one of the top machine learning conferences. It's a good window into the work happening behind the scenes. So here's the problem that TurboQuant solves, and I'm going to use an analogy to make it more concrete. Imagine you're writing a really long research paper, and as you write, you keep a set of sticky notes next to you: reminders of key facts, definitions, earlier points you made, so you don't have to flip back through the entire document every time you need to reference something. I know for me, when I was younger in elementary school, we would always use index cards. You would do all your research, you'd have a stack of index cards, hopefully organized, and then you would reference those as you wrote the research paper. AI models do essentially the same thing. When a model is processing a request, it maintains what's called a key-value cache, a running set of notes about everything it's already figured out, so it doesn't have to start from scratch with each new word it generates. We've talked on the show before about why models can seem forgetful in long conversations, though I feel like that's gotten a lot better recently. This cache is directly related. The longer the conversation or document, the bigger the stack of notes gets, and at some point it starts overwhelming the model's working memory. That's when things slow down, costs go up, and the model starts losing track of what you told it earlier. So the question becomes: can you shrink those notes without losing any of the important information? That's what quantization does. It takes the precise numbers the model uses internally, imagine each one stored as a long decimal, and rounds them off to take up less space. The tricky part has always been that the rounding process itself creates errors, and the tools you need to manage those errors take up their own space, which partially defeats the purpose. What TurboQuant figured out is how to compress those internal numbers down to just three bits each. For reference, a standard number in these models is stored at 32 bits, so this is roughly a 10x reduction, with zero loss in accuracy and no need to retrain the model. In testing, it cut memory usage by at least 6x and made the model run up to 8x faster on NVIDIA's H100 chips, which are the same GPUs powering most of the AI infrastructure in the world right now. It also showed major improvements for vector search, the technology behind how AI tools search through and retrieve information from a knowledge base, which is exactly what's happening when an association builds a member-facing knowledge assistant that pulls answers from its own content. So, I mean, this is a little bit dense. I tried to use the analogy to help everybody understand: okay, what are we talking about here, and why is it so important? What is exciting about TurboQuant to you?
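
To get a feel for the rounding problem Mallory describes, here's a toy example of plain uniform quantization: squeeze float values into 3-bit codes and measure what's lost. TurboQuant's actual math is far more sophisticated, which is the breakthrough; this just shows why naive rounding loses information:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=10_000).astype(np.float32)  # stand-in for KV-cache entries

bits = 3
levels = 2**bits                         # only 8 representable values
lo, hi = values.min(), values.max()
step = (hi - lo) / (levels - 1)

codes = np.round((values - lo) / step)   # each value becomes a 3-bit code (0..7)
recovered = codes * step + lo            # dequantize back to floats

print(f"storage: 32 bits -> {bits} bits per value (~{32 / bits:.1f}x smaller)")
print(f"mean absolute rounding error: {np.abs(values - recovered).mean():.4f}")
```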

[00:25:29:00 - 00:25:34:13]
Amith
 Well, you know, actually, first of all, Mallory, I've got to say I'm impressed. You were using index cards in elementary school?

[00:25:34:13 - 00:25:35:10]
Mallory
 I was.

[00:25:35:10 - 00:25:36:16]
Amith
 Wow, that's impressive.

[00:25:36:16 - 00:25:39:00]
Mallory
 Wait, why are you impressed by that? We have to unpack this.

[00:25:39:00 - 00:25:44:13]
Amith
 Because I've barely learned how to use index cards, I think, in high school or something. So you're ahead of the curve. I don't know who

[00:25:44:13 - 00:25:51:13]
Mallory
 I was and you keep them on a little ring and then you kind of organize them. I was using index cards. Was not an iPad kid, everybody, just so you know.

[00:25:51:13 - 00:28:44:16]
Amith
 Yeah, that's impressive. That's impressive. Well, in any event, I think that's actually a great analogy, and I think the idea is, how does the model keep track of conversations as it goes on? It's not even just a multi-turn conversation, where you're asking a question, it responds, and you ask your next question. It's even within the initial response. As the model starts to prepare its initial response, how is it continuing to predict the next word? That's what we say a lot about the way these models work: it's predicting the next token or the next word. And the KV cache is an instrumental part of that. You described it well for our audience. I'd also say that the other thing we're compressing is the model weights themselves, which is the parameter count we keep talking about. We say a model has a trillion parameters or a hundred billion parameters; these are all numbers, essentially. And if the numbers all have 32 bits of precision versus a much, much smaller amount of precision in terms of memory space, I can load the model into a much smaller amount of memory. Think about it this way; this is just a really simple way for all of our brains to compute this. If I say 10 plus 10, you immediately say 20, because it's two very simple numbers. If I say 10.1 plus 10.1, it's 20.2. Still pretty simple. But if I have, let's say, five or seven more trailing decimals after that, and then instead of addition I do multiplication, it gets a lot harder. So the idea is that even though the numbers themselves are very similar, the precision trails off. Now, the precision is super important. In the past, quantization efforts, this compression technique, have always resulted in a loss of information, lossy compression, which means the models aren't as smart. And the more you compressed something, the less useful it was. If I had 10.124579 but I make it 10.1, the model is now going to have collisions between all these different weights that used to mean very different things, between these very specific numbers and these very general numbers. What that means now, though, and this is the key innovation from Google with TurboQuant, is that they've been able, through what looks like mathematical magic from my point of view, to eliminate the loss. And that's pretty amazing. The way I would think about it, from an association leader's perspective, goes back to the first topic directly. It makes it possible to take the exact same performance of something like a Gemini 3.1 Flash and make it available in something that's a fifth the size, that costs a fifth as much to run, and potentially can run on hardware that's more accessible. That's part of the trend line happening here. So it's a really important part of why models can continue to deliver more and more value relative to cost on smaller and smaller hardware. This is one big lever being pulled. And TurboQuant is brand new, by the way. Google's deployed it, I believe, in Gemma 4.

[00:28:45:19 - 00:29:00:04]
Amith
 As far as anyone else, since it's brand new research, probably not yet. But they did put it out there as open research, so you're going to see this everywhere. And there's going to be a stacking of additional math and engineering solutions on top of this that further improve it over time.

[00:29:01:12 - 00:29:13:02]
Mallory
 It almost does sound like mathematical magic, because hearing the way you explained it to me, if you round, it does seem like you're going to lose information. So the fact that they have solved for that, I don't fully understand it, but it sounds exciting.

[00:29:13:02 - 00:29:51:10]
Amith
 I don't fully understand it either. I'm not a hardcore math person. I did read the paper, and I read Google's blog post on this, and it's pretty fascinating stuff. If you're interested in taking a deeper dive down that rabbit hole, we'll include Google's blog post announcing TurboQuant in the show notes and in the blog post that accompanies this podcast episode. I would encourage anyone who's curious about it to take a deeper dive. The other thing you can do is grab that link from the Google blog and take it over to Gemini, since it's Google, or Claude if you prefer, and say, hey, explain this to me. Or make a podcast about it with NotebookLM. Go deeper on that topic if you'd like to. But it's pretty fascinating stuff.

[00:29:52:13 - 00:30:37:19]
Amith
 The key thing, I think, is that we're also seeing broader acceleration of all of the hard sciences, right? You're seeing this throughout engineering. You're seeing this even in some areas of fundamental science. You're seeing AI help an awful lot with acceleration there. So this is the compounding effect. We're very much riding the exponential. Multiple exponentials are converging. It's going to go faster and faster. But this is a key part of it, because we have to remember all these super powerful models. If you think about the Anthropic Claude Methos preview that only 50 organizations in the world have access to right now, this is the thing everyone's freaking out about, because it's broken all of the different systems out there that we all rely on for security. It's found tons of vulnerabilities.

[00:30:38:19 - 00:30:47:13]
Amith
 That particular model is not available generally, which is probably really good. It reportedly has something on the order of five to 10 trillion parameters. It's a massive, massive model.

[00:30:49:09 - 00:30:57:11]
Amith
 Now, that model, as brilliant as it is, is still roughly one tenth of the organizational complexity of a single human brain.

[00:30:58:12 - 00:31:26:13]
Amith
 And so even though it's amazing, it still isn't quite on the level of complexity of our brains yet. I don't know how much power it would take to run Claude Methos, but most likely a whole heck of a lot, because of the size of the model. Whereas our brains run on about the same amount of power as a light bulb. So we've got a long way to go before we get to what we know is a possible level of intelligence relative to efficiency, and this is one step along the way. It's exciting.

[00:31:26:13 - 00:31:55:09]
Mallory
 I feel like over the past few years we've seen an explosion of AI-powered knowledge bases in the association community, and that's why I mentioned it with this topic. But I think it's worth discussing a bit. Typically these knowledge bases use RAG, or retrieval-augmented generation. Can you talk about how the discoveries from TurboQuant relate to RAG and might affect associations that have AI-powered knowledge bases?

[00:31:55:09 - 00:33:46:02]
Amith
 Yeah, it means faster search. RAG is essentially a technique that works like this. Mallory, imagine you are this unbelievably well-trained, intelligent person who came out of, pick the best university in the world in your mind, but you don't know anything about my association. My association has a body of knowledge that's very specialized. So you're brilliant and you have incredible fundamental training, but you don't know my content. And what I do is say, hey, Mallory, I'm going to have you answer all my questions, but before each question is asked of you, I'm going to help you a little bit. I'm going to run a search on my knowledge base, and I'm going to give you three to 10 chunks of information that I think are really, really relevant to what the user's asking. And then I'm going to ask you to use that information in formulating your response. So I've essentially said, hey, Mallory, here's a bunch of source material. Now you have a few paragraphs to read; now come up with an answer. Your probability of getting it right is much, much higher if the RAG system is designed extremely well. But this is actually very easy to get to about 80 percent and very, very hard to get to 100 percent. It's very difficult to get this quite right, because there's a lot of nuance to it. RAG systems use a lot of vector search; that's a big part of what they do. And the faster, the smaller, and the cheaper you can make vector search, the better, and the more of it you can do to get to the right answer, because you're using a combination of this vector search coupled with reasoning through the AI models to come up with answers. The best of these systems do this iteratively. They don't just one-shot it by grabbing content through search, chucking it into a model, and producing a response, because it's highly questionable whether that response is correct. Was it grounded? Was it free of hallucinations? You typically have to do multiple iterations to ensure the answer is actually correct. And that matters a whole lot to the association community.
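
The vector search step at the heart of this can be sketched in a few lines. The embeddings below are random stand-ins; a real system uses an embedding model and adds the reranking and iteration Amith mentions:

```python
import numpy as np

rng = np.random.default_rng(1)
chunks = ["Chunk about dues renewal...",       # your knowledge base content
          "Chunk about CE credits...",
          "Chunk about the annual meeting..."]
chunk_vecs = rng.normal(size=(len(chunks), 384))  # pretend embeddings
query_vec = rng.normal(size=384)                  # pretend query embedding

def top_k(query, matrix, k=2):
    # Cosine similarity: the vector-search core that quantization makes faster.
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return np.argsort(sims)[::-1][:k]

selected = [chunks[i] for i in top_k(query_vec, chunk_vecs)]
prompt = ("Answer using ONLY these sources:\n" + "\n".join(selected)
          + "\nQuestion: <member question here>")
```

Quantizing those vectors (as TurboQuant's results suggest) shrinks the index and speeds up exactly that similarity computation.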

[00:33:47:12 - 00:34:27:03]
Amith
 And so in those scenarios, the faster you can make each of these iterations, the closer to real time you can get. For example, we have a product in this space many are familiar with called Betty that does a lot of the things I'm describing. A typical request to Betty might take anywhere from two seconds up to 15 to 20 seconds, depending on the complexity of the request. And that's pretty reasonable. But if we want to make Betty closer to real time, there are a lot of improvements our engineering team at Betty is working hard on, that we've got ideas for. Model improvements, the speed of small models, better vector search: all of these things are going to contribute to effectively real-time capabilities at that level of quality. So that's very exciting.

[00:34:27:03 - 00:34:39:13]
Mallory
 Yeah. And I feel like you really hit the nail on the head, Amith. It's not just faster answers, which is great for the end user, but also better answers in the end, because it can iterate faster. I think that's really the key point with this.

[00:34:39:13 - 00:35:38:12]
Amith
 Yeah. Another example, Mallory: let's say you take my example and break it up into multiple chunks, where I say, hey, Mallory is one of the people we've got to help answer this question, but we also have three others, and we're going to ask all of them the same question at the same time, because we want to get the best possible answer. And then we're going to have somebody else review all four or all five or all 10 answers from all these parallel agents running at the same time, and we're going to produce the best answer by comparing and contrasting those four great answers. Systems can do that now, but the cost and the time to do it are significant. So the lower the cost and the faster the speed, while either preserving or improving quality, the better, right? And the more you can do these higher-order things. Right now, if you can get an answer to 99.9% accuracy or thereabouts, that's really good. But what if we want to take it to 99.9999%? No RAG system can do that yet, but that will be possible in the not-too-distant future with these types of improvements.
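
For the curious, that fan-out/fan-in pattern can be sketched as follows, assuming an OpenAI-compatible async client; the model names are placeholders, and production systems would add grounding checks on top:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes an OpenAI-compatible endpoint

async def ask(model: str, prompt: str) -> str:
    resp = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def best_answer(question: str) -> str:
    workers = ["small-model-a", "small-model-b", "small-model-c"]  # placeholder names
    # Fan out: every worker answers concurrently, so wall-clock time stays flat.
    answers = await asyncio.gather(*(ask(m, question) for m in workers))
    # Fan in: a stronger judge compares the candidates and returns the best one.
    return await ask("frontier-judge-model",
                     f"Question: {question}\nCandidate answers:\n"
                     + "\n---\n".join(answers)
                     + "\nCompare the candidates and return the single best answer.")

# asyncio.run(best_answer("What motivated attendees to come to last year's conference?"))
```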

[00:35:38:12 - 00:35:52:09]
Mallory
 Are there any other engineering advancements you're keeping an eye on, Amith? I know we've talked about TurboQuant and some techniques like distillation and mixture of experts. Is there anything else people should keep on their horizon?

[00:35:52:09 - 00:36:41:14]
Amith
 Just broadly, I think if you're interested in this stuff at a deeper level, go and subscribe to the blogs of all the leading research labs. You can do this with OpenAI, with Anthropic, with Google, of course, and Microsoft Research as well. Just read the stuff they put out, because there is a lot of fascinating work happening at the fundamental model architecture level, which is a mixture of engineering effort and fundamental scientific research with respect to computer science and AI. And there are new, novel model architectures being explored. For example, in the past on this pod, I think we talked about diffusion-based LLMs as another concept, different from transformer-based LLMs. We've also talked a little bit about the concept of liquid-style AI models, which are models that don't have fixed weights, weights that can change over time, which is both quite interesting and also potentially very scary.

[00:36:42:22 - 00:39:32:01]
Amith
 And there's a lot happening in the world of AI and ML research right now that will contribute to further compounding this exponential. But if you were to say the model architecture is largely fixed, which of course is not true, but if you were to assume that, then I'd say incremental advances like this are going to be big, big game changers, because we know this model architecture works. And so by being able to stretch it and push it and pull it and prod it, we're finding ways to turn a model architecture that's actually inherently extraordinarily inefficient into something that performs quite well at real-world tasks. The transformer architecture is what I'm referring to. The big fundamental choke point with it is that for every single word you put into the model, it has to compare that word against every word in the preceding sequence. So the longer the sequence is, it's not incrementally more work; the work grows quadratically. There have been all sorts of really smart things that have happened with math and with engineering that have helped reduce that problem. Lots of smart shortcuts, lots of ways of doing compression. So within the current model architecture, as well as future model architectures, these kinds of advancements really do help accelerate and improve quality. Ultimately, again, I would zoom back out for our typical listener here at the Sidecar Sync, an association leader thinking about how to contextualize this. For your planning purposes, as you're thinking ahead about the types of things you can do, you shouldn't view the world through the lens of constraint and scarcity. You should view the world through the assumption that models are effectively cheap, effectively free even, and that you can do just about anything you can imagine. And here's why: every six to nine months, we're getting a doubling in power relative to price. That's what that chart really showed as well, Mallory, compounded out over just a handful of years. Incredibly powerful. You keep that doubling going, and as we're seeing, it's actually accelerating. In the next one year, two years, three years, it's insane. We can't even conceptualize what this means. So we can't look at the world and say, "Oh, well, in order to do that, that would have required a thousand employees. We don't have a thousand employees. We have 30 employees." That might have sounded crazy not too long ago, but that's essentially the way I need you to think about the world in the context of AI to take advantage of the exponential. Because you will have the equivalent of thousands of employees, or millions of employees, effectively, in AI because of these advancements. So the science and engineering, I think, is super cool, but the most important thing for association leaders is to recognize the curve you're riding, and to understand what you can do now so that your brain figures out what you should be doing in six, 12, 18 months. And then remember to take the constraint thinking out of your mind in terms of cost and speed, because those problems are going to be solved.
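
A quick back-of-the-envelope on that scaling point: attention builds a score for every (token, earlier-token) pair, so the work grows with the square of the context length. Quadratic rather than literally exponential, but painful all the same:

```python
# Each token attends to every token before it, so the attention score
# matrix grows with the square of the context length.
for n in [1_000, 2_000, 4_000, 8_000]:
    print(f"{n:>5} tokens -> {n * n:>12,} pairwise attention scores")
# 1,000 tokens -> 1,000,000 scores; 8,000 tokens -> 64,000,000 scores:
# an 8x longer context costs roughly 64x the attention work.
```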

[00:39:34:08 - 00:40:29:24]
Mallory
 So we've told the story of how we got here and what's driving it under the hood. We want to finish this episode by talking about where this goes. Researchers at Epoch AI have tracked that frontier-level AI performance typically becomes available on consumer hardware within about a year of its initial release. If you extend the cost curve, AI capabilities that cost $10,000 to run in 2022 might cost single digits by the end of this year. The industry consensus is moving fast. Gartner predicts that by 2027, organizations will use small, task-specific AI models three times more than general-purpose large language models. ARM, the company that designs the chips in most smartphones, predicts that reasoning per joule, how much intelligence you get per unit of energy, will become a standard benchmark for AI models, right alongside accuracy and speed. The conversation is shifting from how big can we make these models to how efficient can we make them.

[00:40:31:00 - 00:41:31:07]
Mallory
 There's also an interesting economic dynamic at play here. There's a concept called Jevons paradox that we've talked about on the podcast before: the observation that when you make a resource dramatically more efficient, people don't just save money, they use more of it. We've seen this play out in computing over and over. Cheaper storage led to more data; cheaper bandwidth led to streaming video. The same thing is happening with artificial intelligence. As the cost of inference, or running an AI model, drops, the number of tasks worth throwing AI at explodes. A million documents that were too expensive to process in 2023 become trivial today. And that is a real opportunity for associations. So, Amith, we've talked about Jevons paradox on the podcast before. Honestly, I really enjoyed that episode, and I really enjoyed the blogs that came out of it. So I'm excited to talk about it again. How do you think about this compression story that we've been telling and Jevons paradox, and how those intersect with what's to come for associations?

[00:41:32:16 - 00:41:48:09]
Amith
 Part of the issue people have a hard time grasping with this is not understanding what the applications are going to be. You pointed that out as you were talking about data. There's an iconic moment from the last couple of decades: Steve Jobs on stage announcing the original iPod.

[00:41:50:14 - 00:42:01:19]
Amith
 That device had a little hard drive in it that could store a thousand songs. His thing was a thousand songs in your pocket. I think it was $300, maybe something like that. It was fairly reasonable.

[00:42:02:19 - 00:42:24:07]
Amith
 And prior to that, first of all, that sounded insane. How can you carry a thousand songs in your pocket? People were used to having CDs in their cars and stuff like that, or even tapes. It was remarkable. But then if you were to fast forward and say, "What about a hundred thousand songs?" Or, "What about a billion songs?" Well, how many songs are there? How many songs does any individual person want to have?

[00:42:25:20 - 00:42:44:19]
Amith
 You might ask, "Well, that's kind of silly. Why would I need anything more than a thousand songs?" Maybe some avid music fans might have five thousand or ten thousand songs, but a million songs? But that's the thing: the application isn't songs. The application then became video. Then the application became other things that are dramatically more information dense than just audio.

[00:42:45:23 - 00:43:06:01]
Amith
 Then the same thing happened with smartphones and cameras on smartphones. People were like, "I don't know if I'm going to take photos." The initial cameras were pretty lousy, but people started to use them. Then applications like Instagram became a thing, which made it easy for people to share photos and also tweak them a whole bunch. And the internet made it a lot less expensive to get those photos moving around, zipping around the internet.

[00:43:07:05 - 00:43:11:24]
Amith
 You had these things converge together. Enough processing power to do that on a smartphone,

[00:43:13:10 - 00:43:20:09]
Amith
 cameras that were good enough to have acceptable resolutions, storage that was cheap, and bandwidth that became basically free.

[00:43:21:20 - 00:43:26:00]
Amith
 These things compounded to result in new applications people probably would not have envisioned.

[00:43:27:00 - 00:44:31:06]
Amith
 The same thing is happening now in the world of AI. We start off with simple chatbots that we're having conversations with, answering one question at a time; then deep research that's asking a thousand questions in parallel and coming back with a stunningly interesting report; then autonomous agents that are now able to continuously look for and solve problems in your workflow. For example, autonomous agents that might say, "Hey, every day I'm going to look at my member renewal pipeline and I'm going to identify problems. Are there problems with engagement? Are there problems with our content stream?" Things like that, right? Or looking for opportunities: what types of content should I be writing, and doing research, and just continually iterating on that? The applications don't come to us right away. We have to really learn in order to come up with these applications. Coming back to Jevons paradox: you come up with new applications. In the original case of Jevons paradox, with coal, the point was that more coal would be consumed; there's not one fixed use of it. And of course, that's exactly what happened. The same thing is happening now. Intelligence is the ultimate thing to commoditize, because that unlocks everything else.

[00:44:31:06 - 00:45:56:19]
Mallory
 I want to go back to something you've been talking about on this episode, Amith, which is laying the groundwork, in your brain, with your team, and in your organization's culture, even before you know of a project you might want to embark on when it pertains to AI. I feel like a good example of that with the Sidecar team and the greater Blue Cypress family is Grace, our AI audio agent that currently lives on the website. We have a recent episode all about Grace and how she works. I'm sure she's even improved in the couple of weeks since we recorded that episode. I feel like that's a great example, because you, Amith, and leadership across Blue Cypress have been talking about AI audio for years. I feel like that warmed up our team: okay, what's possible here? Let's familiarize ourselves with an audio agent. What could that do? What could that look like? Maybe the tech was a little bit clunky at first. If we had implemented an AI audio agent a few years ago, the latency would have been really high, the sound may not have been great, but we had the ideas in place. When we felt the technology was ready, we could move on it really quickly. We talked about it in that episode; I think it was 60 days from the initial idea to implementation. I feel like that's the importance of this episode. Even if maybe the technology is not exactly where you need it now for something you might want to implement in your association, just lay that groundwork so that when it's ready, you are ready.

[00:45:56:19 - 00:46:35:02]
Amith
 I agree with that completely, and I think you said it really well. Our preparedness is a function of the role we play in the sector, our backgrounds, things we've been doing for a long time that continue to compound. But then also making that investment in experimenting with new technologies, things that are definitionally not yet ready for prime time; not necessarily using them for prime-time applications, but using them for things we think could be interesting test cases, experiments. Grace herself is a research preview, and we're not saying Grace is ready for 100% prime time. In our role in this market, we think it's particularly important for us to experiment openly with technology. That's part of the reason we're able to do that.

[00:46:37:07 - 00:46:40:12]
Amith
 The other thing that you have to keep in mind with all this stuff is,

[00:46:41:23 - 00:47:19:16]
Amith
 the opportunities that present themselves are only ultimately realized to the extent that you have a degree of preparation. And the vast majority of that preparation is in your mind and in the minds of your team members. That's why we at Sidecar are so focused on what we do with this podcast, our blogs, our newsletters, and obviously our course curriculum and the AAiP certification, because we're trying to help people be prepared. Ultimately, with what's happening with this exponential curve, all of us have the same biology, we have the same hardware, and we have the opportunity to use it in different ways. And if we continually attune ourselves to the idea of change and adaptation,

[00:47:20:18 - 00:47:27:08]
Amith
 we'll be more prepared to take advantage of the things that come our way. Everyone's going to have access to equivalent AI plus or minus a little bit,

[00:47:28:08 - 00:47:33:04]
Amith
 but very few people will take full advantage of that. And that's been true throughout time with all technologies.

[00:47:34:04 - 00:48:57:09]
Amith
 The last thing I'll say about this is that I speak with a lot of association CEOs and other leaders, and many times they come to me and say, "Hey, I want to do X, Y, and Z in terms of AI. Where should I start first?" That's a typical question I get. I always start with learning; I always have that conversation with folks. But the things people are looking to solve for right now, honestly, we could have solved with GPT-4-class intelligence, meaning the technology that was available at the very end of 2022, early 2023. That level of AI technology was sufficiently intelligent to solve most of the problems, I'd say 80 to 85, maybe 90 percent of the problems, that association leaders are asking to solve. Which, by the way, isn't a criticism. It doesn't mean those are the wrong problems. Those are absolutely the problems that should be solved. The point is that the technology, the intelligence level, was available some time ago. Now it's been democratized far more. You can get that technology for almost no cost, and it's very, very fast. So on the adoption curve, this might be the right time for some of those applications. But with what's possible with even Opus 4.6 and Gemini 3.1 Pro, we're not even beginning to scratch the surface of what those models can do, and very soon those models are going to be considered last year's news. I mean, they kind of are already. So I think that's the key: you continually adapt and evolve your brain in order to understand what the applications could be.

[00:48:57:09 - 00:49:23:00]
Mallory
 Well, everybody, that is a wrap on our compression episode, the little models that could. Hopefully you all know by now: AI is not slowing down, it's accelerating. What required a trillion-parameter cloud model two years ago can run on a smartphone today. What cost $30 per million tokens now costs under a dollar. And as it gets cheaper, it doesn't just save money. It opens up entirely new categories of work that were not even

[00:49:23:00 - 00:49:28:16]
 (Music Playing)

[00:49:39:09 - 00:49:56:08]
Mallory
 Thanks for tuning into the Sidecar Sync podcast. If you want to dive deeper into anything mentioned in this episode, please check out the links in our show notes. And if you're looking for more in-depth AI education for you, your entire team, or your members, head to sidecar.ai.

[00:49:56:08 - 00:49:59:14]
 (Music Playing)
