The date is September 15, 2023. I'm on my fifth attempt at getting my favorite AI image generator, Midjourney, to spell "Cimatri" correctly in a cartoon image. Despite increasingly detailed prompts and variations, the best result I can get is "Cimitto." This is a true story; see below.
This wasn't just a Midjourney problem. Every major AI image generator available at that time struggled with text rendering. The results ranged from laughably wrong to almost-but-not-quite correct, making them unusable for professional applications that required accurate text.
Fast forward to today, and everything has changed.
Text rendering has been the Achilles heel of AI image generators since their inception. Even as these tools became increasingly sophisticated at creating stunning visuals, they stumbled when asked to include simple text elements. This limitation severely restricted their usefulness for creating infographics, logos, educational materials, or any content that required labeled visuals.
The release of OpenAI's GPT-4o image generation has finally solved this persistent problem. When we tested it ourselves at Sidecar, we were amazed to see no spelling errors, no bizarre fonts, and no mysterious additional characters.
What makes this possible is GPT-4o's omni-modal architecture. Unlike previous AI systems that treated text and image generation as separate functions, GPT-4o processes everything within a single model that understands both visual and textual elements simultaneously. This means when you ask for specific text in an image, the model understands exactly what you're requesting and treats the text as a critical component of the visual, not an afterthought.
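For readers who want to try this programmatically rather than in the chat interface, here's a minimal sketch using OpenAI's Python SDK and the gpt-image-1 model, the API-facing counterpart to GPT-4o image generation. The prompt and output filename are illustrative placeholders, not anything from our tests:

```python
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask for an image whose embedded text must be rendered exactly.
result = client.images.generate(
    model="gpt-image-1",  # API counterpart to GPT-4o image generation
    prompt='A cartoon robot holding a sign that reads exactly "Cimatri"',
    size="1024x1024",
)

# gpt-image-1 returns the image as base64-encoded data.
with open("cimatri_cartoon.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```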
To test GPT-4o's capabilities, we decided to transform our podcast headshots into cartoon characters wearing Sidecar Sync branded caps.
First, I uploaded my professional headshot and gave it a simple prompt: "Turn me into a cartoon wearing a yellow cap that says Sidecar Sync." The result was astonishing! Not only did it render a recognizable cartoon version of me, but it captured details down to the leaves in the background of my original photo, my earrings, and my outfit. Most importantly, the yellow cap featured perfectly rendered "Sidecar Sync" text.
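If you'd rather script this step, the same transformation can be run through the image edit endpoint, which accepts an uploaded photo plus a plain-language instruction. A minimal sketch, assuming a local file named headshot.png and API access to gpt-image-1:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Transform an uploaded photo according to a natural-language prompt.
with open("headshot.png", "rb") as headshot:  # hypothetical input file
    result = client.images.edit(
        model="gpt-image-1",
        image=headshot,
        prompt='Turn this person into a cartoon wearing a yellow cap '
               'that says "Sidecar Sync"',
    )

with open("cartoon.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```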
We then repeated the experiment with my co-host Amith's headshot, with equally impressive results. His cartoon captured the brick background from his original photo and even the stripes on his shirt. Again, the Sidecar Sync text on the cap was flawless.
Taking it a step further, we asked GPT-4o to create an infographic for our podcast using both cartoons. It seamlessly combined the two images it had just created and added "Sidecar Sync Podcast: The Intersection of AI and Associations" text—all perfectly spelled and formatted.
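The combination step can also be reproduced outside the chat interface: the edit endpoint accepts a list of input images, so you can pass both cartoons and describe the composite you want. A sketch with hypothetical filenames:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Pass both cartoons as inputs and describe the composite we want.
result = client.images.edit(
    model="gpt-image-1",
    image=[
        open("host_cartoon.png", "rb"),    # hypothetical filenames for the
        open("cohost_cartoon.png", "rb"),  # two previously generated images
    ],
    prompt='Combine these two cartoon characters into a podcast infographic '
           'titled "Sidecar Sync Podcast: The Intersection of AI and '
           'Associations"',
)

with open("infographic.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```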
What would have previously been a multi-step process requiring either collaboration with a design team or significant time learning graphic design software was accomplished in less than five minutes with minimal technical expertise.
The breakthrough capabilities of GPT-4o go beyond just text rendering. What truly sets it apart is the conversational, iterative process it enables. Unlike standalone image generators that require carefully crafted one-shot prompts, GPT-4o allows you to refine your images through natural conversation.
After generating an initial image, you can simply say, "Make the colors brighter" or "Change the background to blue" or "Add our logo in the corner," and the system understands your request in the context of the entire conversation.
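In the API, this multi-turn refinement maps onto the Responses endpoint, where an image_generation tool call can be revised in follow-up turns that reference the previous response. A rough sketch, assuming the tool behaves as documented at the time of writing; the initial flyer prompt is a made-up example:

```python
import base64

from openai import OpenAI

client = OpenAI()

def save_image(response, path):
    # Pull the base64 image out of the image_generation tool call.
    for item in response.output:
        if item.type == "image_generation_call":
            with open(path, "wb") as f:
                f.write(base64.b64decode(item.result))

# First turn: generate the initial image.
first = client.responses.create(
    model="gpt-4o",
    input="Create a simple event flyer for an association conference",
    tools=[{"type": "image_generation"}],
)
save_image(first, "flyer_v1.png")

# Second turn: refine conversationally; the previous response carries
# the full context, so the instruction needs no restating of details.
second = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input="Make the colors brighter and change the background to blue",
    tools=[{"type": "image_generation"}],
)
save_image(second, "flyer_v2.png")
```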
This represents a fundamental advantage that standalone image generators may find difficult to match. Because the omni-modal approach gives the model the full conversation history and context, images can be improved iteratively and continuously until they holistically represent what the conversation is about. It's a dimension that image-only models will struggle to replicate, because they lack this contextual understanding of the ongoing interaction.
The conversation-based approach creates a workflow that feels more like collaborating with a designer than wrestling with a technical tool.
For associations, this advancement opens up numerous possibilities for member-facing visual content. The ability to quickly iterate on these materials means you can experiment with different approaches and refine your visual communication strategy without significant time or resource investment.
GPT-4o's image generation capabilities signal a broader trend toward integrated, omni-modal AI systems that blur the lines between different media types. Rather than having separate tools for text, image, and soon video generation, we're moving toward unified systems that understand and create across multiple formats.
For associations, the implication of this trend is that we're entering an era where the limiting factor isn't technical capability but creative vision. The question becomes not "Can we create this?" but "What should we create?"
Looking back at my comical attempts to get Midjourney to spell words correctly in September 2023, I'm struck by how quickly this limitation has been overcome. What seemed impossible less than two years ago is now solved so completely that it will surely be taken for granted soon.
This pace of advancement suggests that capabilities we currently consider impossible or impractical may soon become routine. For association leaders, the key is to stay curious, experimental, and focused on the value these tools can create for members.
As you explore GPT-4o's image generation capabilities, remember that the goal isn't to replace human creativity but to amplify it—giving you more ways to communicate, educate, and engage with your members than ever before.