Sidecar recently launched Grace, a voice AI agent that lives on our website. You can talk to her about Sidecar's offerings, ask questions about the AI Learning Hub, and she'll walk you through purchasing options for yourself or your team. She pulls up visuals during the conversation, adapts her responses to your specific questions, and can take you all the way through checkout without you ever leaving the experience.
Grace went from concept to live deployment in about 60 days. She's not perfect — she occasionally stumbles on acronyms, and she can't listen and talk at the same time yet — but she's functional, impressive, and getting better fast. If you visit sidecar.ai, you'll find her icon in the top right corner. She works on mobile too.
We wanted to write this blog for the association professionals in our audience who might be wondering how something like Grace actually works. What's the technology? How do the pieces fit together? What are the real limitations? Consider this a look under the hood — the kind of breakdown you could share with your team if you were thinking about building something similar for your own organization.
Grace runs on what's called a speech-to-text-to-LLM-to-text-to-speech pipeline. The concept is more straightforward than the name suggests.
When you speak to Grace, the system captures your audio and converts it into text (the speech-to-text step). But it doesn't just transcribe your words. It also picks up on tonal cues, hesitation, inflection, and emotional signals, packaging those as supplementary notes alongside the transcript. All of that feeds into a large language model (LLM), which is where Grace does her actual thinking. The LLM reads the transcribed text plus those tonal notes, considers the full context of the conversation so far, and generates a response. That response then gets converted from text back into natural-sounding audio (the text-to-speech step), and you hear Grace reply.
The whole loop happens in seconds, fast enough that it almost feels like a real conversation, with natural pauses rather than awkward silences.
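To make the pipeline concrete, here's a minimal sketch of one conversational turn. Every function name here is a hypothetical stand-in, not the actual ElevenLabs or Anthropic API; the point is the shape of the loop, including the tonal notes that travel alongside the transcript and the history that preserves context between turns.

```python
def transcribe(audio_chunk: bytes) -> dict:
    """Hypothetical STT step: returns a transcript plus tonal notes."""
    return {"text": "I want to get my whole team in the mix",
            "tone_notes": ["enthusiastic", "decisive"]}

def generate_reply(history: list, turn: dict) -> str:
    """Hypothetical LLM step: transcript, tonal notes, and the
    conversation so far all go into the prompt."""
    prompt = (f"Conversation so far: {history}\n"
              f"User said: {turn['text']}\n"
              f"Tone: {turn['tone_notes']}")
    # Placeholder reply; a real agent would send `prompt` to the LLM.
    return "It sounds like the Teams AI Learning Hub is the right fit."

def synthesize(text: str) -> bytes:
    """Hypothetical TTS step: reply text becomes audio."""
    return text.encode()  # stand-in for synthesized speech

def voice_turn(history: list, audio_chunk: bytes) -> bytes:
    """One full loop: speech in, speech out, context retained."""
    turn = transcribe(audio_chunk)
    reply = generate_reply(history, turn)
    history.append((turn["text"], reply))  # keep context for later turns
    return synthesize(reply)
```

The key design point is that the LLM never hears audio directly; everything it knows about your voice arrives as text.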
One thing that surprised us during testing was how well Grace picks up on conversational nuance. During our demo on the Sidecar Sync podcast, my co-host Amith said "I want to get my whole team in the mix," and Grace immediately understood that meant interest in the Teams AI Learning Hub. She inferred it from context and tone, which is exactly the kind of thing that makes voice AI feel different from a chatbot.
The LLM in the middle of that pipeline is doing the heavy cognitive lifting, and which model you choose has a direct impact on how the experience feels.
Anthropic's Claude offers three model tiers: Haiku (smallest and fastest), Sonnet (mid-range, balancing speed and intelligence), and Opus (the most powerful but slowest). For Grace, speed was the priority. A visitor isn't going to wait four or five seconds for a reply mid-conversation the way they might tolerate a pause in a chatbot. The experience has to feel immediate.
Haiku 4.5 gave us what we needed. It's intelligent enough to handle nuanced conversations — understanding context, adapting tone, reasoning through what to say next — while being fast enough that visitors don't notice any processing time. Sonnet or Opus would produce more sophisticated reasoning, but the latency would break the conversational flow. For audio AI, a slightly less brilliant answer that arrives instantly beats a perfect answer that arrives three seconds late.
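That tradeoff can be framed as a simple selection rule: take the most capable model whose typical response time still fits the latency budget. The model names and latency figures below are illustrative placeholders, not measured numbers.

```python
# Models listed fastest-first; latencies are illustrative, not benchmarks.
MODELS = [
    ("claude-haiku",  0.8),   # fast, lighter reasoning
    ("claude-sonnet", 2.0),   # mid-range
    ("claude-opus",   4.5),   # strongest, slowest
]

def pick_model(latency_budget_s: float) -> str:
    """Return the most capable model that fits the latency budget,
    falling back to the fastest model if nothing fits."""
    fitting = [name for name, latency in MODELS if latency <= latency_budget_s]
    return fitting[-1] if fitting else MODELS[0][0]
```

With a voice agent's roughly one-second budget, only the fastest tier qualifies; a text chatbot with a looser budget could afford the stronger models.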
Grace's speech-to-text and text-to-speech components run on technology from ElevenLabs. Their Agent Studio product provides the scaffolding for building voice agents, handling audio capture, transcription, voice synthesis, and browser-based interaction.
One of the features that shaped Grace's design is something ElevenLabs calls client-side tools. These allow Grace to trigger actions directly in your browser during the conversation. When Grace decides it would be helpful to show you something — a product overview, a pricing breakdown, a feature comparison — she pulls up a visual on your screen while she keeps talking.
Grace has access to a library of about 30 pre-built slides, each tagged with a title and description. Based on where the conversation is heading, she decides which visual to display and then narrates around it. She's not reading from a script when she does this. She's reasoning about what to tell you based on everything you've discussed so far. A visitor asking about team pricing for 50 people gets a different walkthrough of the same slide than someone exploring options for themselves. The visual is the same. The conversation around it is personalized.
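A rough way to picture the slide-selection step: score each slide's title and description against the recent conversation and display the best match. The slide data and keyword scoring below are illustrative only; in Grace's case the LLM itself makes this choice and triggers the display through a client-side tool call.

```python
# Illustrative slide library; real entries would cover ~30 tagged slides.
SLIDES = [
    {"id": "teams-pricing", "title": "Teams AI Learning Hub pricing",
     "description": "per-seat pricing for teams and organizations"},
    {"id": "individual-overview", "title": "Individual Learning Hub",
     "description": "self-paced courses for one learner"},
]

def pick_slide(conversation: str) -> str:
    """Pick the slide whose title + description share the most words
    with the conversation so far (a toy stand-in for LLM reasoning)."""
    words = set(conversation.lower().split())
    def score(slide):
        tags = set((slide["title"] + " " + slide["description"]).lower().split())
        return len(words & tags)
    return max(SLIDES, key=score)["id"]
```

The visual is fixed; only which one appears, and what gets said around it, changes with context.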
Those slides were designed by the Sidecar team using Google's Nano Banana 2 model, an updated version of Nano Banana Pro with improved text rendering and faster, cheaper generation. They're pre-built, not dynamically generated, though dynamic generation is something we're exploring for future versions of Grace.
Every voice agent needs a knowledge base, and there's a temptation to index everything — every blog post, help article, product page, and webinar transcript. Sidecar actually has a tool built for that kind of deep knowledge retrieval: Betty, one of the products in the Blue Cypress family. Betty has access to essentially everything Sidecar has ever written or said.
But Betty wasn't the right fit for Grace. A deep retrieval system like Betty can take five to ten seconds to process a request. That's fine for a text-based tool where you can wait for a thorough response. For a real-time voice conversation, that delay kills the experience.
So we built Grace with a curated, lightweight knowledge base instead. She knows enough about Sidecar's products, pricing, and key differentiators to hold a substantive conversation. She maintains full context throughout the discussion, building on earlier points so she doesn't repeat herself or lose the thread. And her access to those tagged visual resources extends what she can communicate without needing to search a massive database.
The tradeoff is real. Grace occasionally won't have an answer that Betty would. But for a conversational experience, speed and flow matter more than encyclopedic coverage. As the underlying models get faster, these two approaches will eventually converge. For now, the constraint is deliberate.
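The curated approach can be sketched very simply: the knowledge base is small enough to ride along inside the system prompt on every request, so there's no retrieval step and no retrieval latency. The facts below are illustrative placeholders, not Grace's actual prompt.

```python
# Illustrative curated knowledge base -- small enough to fit in a prompt.
CURATED_FACTS = [
    "Sidecar offers the AI Learning Hub for individuals and teams.",
    "Team subscriptions are priced per seat.",
    "Grace can complete checkout through a Stripe integration.",
]

def build_system_prompt(facts: list[str]) -> str:
    """Everything the agent knows travels with every request --
    fast, but deliberately not encyclopedic."""
    bullet_list = "\n".join(f"- {f}" for f in facts)
    return f"You are Grace, Sidecar's voice agent. You know:\n{bullet_list}"
```

A deep-retrieval system like Betty would instead search a large index at question time, which is where the five-to-ten-second cost comes from.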
One of Grace's more compelling capabilities is handling transactions inside the conversation itself. Through a Stripe integration, a visitor can go from learning about a product to completing a purchase without ever leaving the experience. If you've used Stripe on other websites, your payment information is pre-populated, making the process fast and secure. Grace can handle both individual and team subscriptions, walking you through the options and completing a B2B-level transaction in the same conversation where you first asked "what is this?"
This matters because so many interactions stall at the point of transaction. Someone is interested in a course or event, they get the information they need, and then they hit a registration form or a "contact us for pricing" page. Each step away from the conversation is a chance for that person to drop off. Grace removes that friction entirely.
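For readers curious what the checkout handoff looks like, here's a sketch of the kind of Stripe Checkout Session payload an agent could assemble mid-conversation. The price ID and URLs are hypothetical; in a real integration this payload would be passed to Stripe's API server-side with a secret key.

```python
def checkout_payload(price_id: str, seats: int) -> dict:
    """Build a subscription checkout request for a given plan and
    seat count (illustrative values, not Sidecar's actual plans)."""
    return {
        "mode": "subscription",
        "line_items": [{"price": price_id, "quantity": seats}],
        "success_url": "https://example.com/thanks",
        "cancel_url": "https://example.com/cancelled",
    }
```

The conversation supplies the parameters, in other words, so a visitor who says "fifty seats for my team" never touches a form.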
Grace is half-duplex, meaning she can either listen or speak, but not both at the same time. When you interrupt her, there's a brief delay. The system has to stop generating audio, recognize that new input is coming in, process it, and restart. We simulate something close to full duplex by keeping the microphone always on and stopping Grace's audio output as soon as we detect someone speaking, but the pause is noticeable. It works. It's manageable. It just doesn't feel quite as natural as a human conversation, where two people can talk over each other.
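That workaround amounts to a small state machine: the mic stays open, and the moment speech is detected while the agent is talking, playback cuts and the turn flips back to listening. The energy threshold below is an illustrative stand-in for real voice activity detection.

```python
class HalfDuplexAgent:
    """Toy model of the barge-in behavior: always listening, but
    only ever in one of two states -- speaking or listening."""

    def __init__(self, vad_threshold: float = 0.5):
        self.speaking = False
        self.vad_threshold = vad_threshold  # stand-in for real VAD

    def start_reply(self):
        self.speaking = True

    def on_mic_frame(self, energy: float) -> str:
        # The mic is always on, even while the agent is speaking.
        if energy > self.vad_threshold and self.speaking:
            self.speaking = False  # barge-in: cut playback immediately
            return "interrupted"
        return "speaking" if self.speaking else "listening"
```

The noticeable pause comes from everything this sketch hides: tearing down audio generation, processing the new input, and spinning the pipeline back up.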
Pronunciation can be inconsistent. During our podcast demo, Grace handled some names well and stumbled on others. She said "AAP" instead of "AAIP" at one point. Acronyms and proper nouns are still a weak spot for audio AI broadly.
These are characteristics of early technology, and they're improving quickly. Six months from now, what feels impressive about Grace today will probably feel rudimentary.
Full-duplex audio, where Grace can truly listen and speak at the same time, is on the horizon. NVIDIA released a model called PersonaPlex in January that demonstrates audio-to-audio processing, skipping the text translation step entirely. That means the model handles hearing, thinking, and speaking all in one place, with no information lost in translation between formats. It's not widely available from cloud providers yet, but the trajectory is clear.
For Grace specifically, the Sidecar team is planning to add short video walkthroughs in the next 30 to 45 days. These would be five- to thirty-second clips showing what it actually looks like inside the Sidecar LMS — course navigation, available content, the learning experience. Grace would play the video and narrate over it, giving visitors something close to a live demo without a human on the other end. Dynamic content generation is also on the roadmap, where Grace could produce visuals on the fly based on what you're discussing.
Longer term, Grace will expand beyond the Sidecar website. The plan is to bring her into the AI Learning Hub itself and eventually make her an active participant in Sidecar's mastermind sessions — synchronous, instructor-led learning where Grace could contribute in real time.
Sidecar built Grace to solve a specific problem: visitors come to our site wanting to understand our products, and our human team can't be available around the clock to guide every person through every question. Many associations face something similar. Members visit your website at all hours trying to find the right resource, register for an event, or understand their benefits. Most of the time, they're on their own.
The tech stack behind Grace isn't proprietary. ElevenLabs, Claude, Stripe — these are commercially available tools. The cost is manageable, and the build took a small team about two months. Grace could just as easily live on an association's website, helping members navigate a credentialing program, explore event options, or renew their membership.
We plan to keep sharing what we learn as Grace evolves. If you want to see her in action, head to sidecar.ai and click the icon in the top right corner. Talk to her. Test her. And if you walk away thinking about what a voice agent could do for your own members, that's exactly the reaction we were hoping for.