The date is September 15, 2023. I'm on my fifth attempt at getting my favorite AI image generator, Midjourney, to spell "Cimatri" correctly in a cartoon image. Despite increasingly detailed prompts and variations, the best result I can get is "Cimitto." This is a true story; see below.
This wasn't just a Midjourney problem. Every major AI image generator available at that time struggled with text rendering. The results ranged from laughably wrong to almost-but-not-quite correct, making them unusable for professional applications that required accurate text.
Fast forward to today, and everything has changed.
Text rendering has been the Achilles heel of AI image generators since their inception. Even as these tools became increasingly sophisticated at creating stunning visuals, they stumbled when asked to include simple text elements. This limitation severely restricted their usefulness for creating infographics, logos, educational materials, or any content that required labeled visuals.
The release of OpenAI's GPT-4o image generation has finally solved this persistent problem. When we tested it ourselves at Sidecar, we were amazed to see no spelling errors, no bizarre fonts, and no mysterious additional characters.
What makes this possible is GPT-4o's omni-modal architecture. Unlike previous AI systems that treated text and image generation as separate functions, GPT-4o processes everything within a single model that understands both visual and textual elements simultaneously. This means when you ask for specific text in an image, the model understands exactly what you're requesting and treats the text as a critical component of the visual, not an afterthought.
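For readers who want to try this programmatically rather than in the chat interface, here's a minimal sketch using OpenAI's Python SDK and the gpt-image-1 model, the API-facing counterpart to GPT-4o image generation. The prompt and output filename are illustrative placeholders, not anything from our tests:

```python
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask for an image whose embedded text must be rendered exactly.
result = client.images.generate(
    model="gpt-image-1",  # API counterpart to GPT-4o image generation
    prompt='A cartoon robot holding a sign that reads exactly "Cimatri"',
    size="1024x1024",
)

# gpt-image-1 returns the image as base64-encoded data.
with open("cimatri_cartoon.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```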
To test GPT-4o's capabilities, we decided to transform our podcast headshots into cartoon characters wearing Sidecar Sync branded caps.
First, I uploaded my professional headshot and gave it a simple prompt: "Turn me into a cartoon wearing a yellow cap that says Sidecar Sync." The result was astonishing! Not only did it render a recognizable cartoon version of me, but it captured details down to the leaves in the background of my original photo, my earrings, and my outfit. Most importantly, the yellow cap featured perfectly rendered "Sidecar Sync" text.
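If you'd rather script this step, the same transformation can be run through the image edit endpoint, which accepts an uploaded photo plus a plain-language instruction. A minimal sketch, assuming a local file named headshot.png and API access to gpt-image-1:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Transform an uploaded photo according to a natural-language prompt.
with open("headshot.png", "rb") as headshot:  # hypothetical input file
    result = client.images.edit(
        model="gpt-image-1",
        image=headshot,
        prompt='Turn this person into a cartoon wearing a yellow cap '
               'that says "Sidecar Sync"',
    )

with open("cartoon.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```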
We then repeated the experiment with my co-host Amith's headshot, with equally impressive results. His cartoon captured the brick background from his original photo and even the stripes on his shirt. Again, the Sidecar Sync text on the cap was flawless.
Taking it a step further, we asked GPT-4o to create an infographic for our podcast using both cartoons. It seamlessly combined the two images it had just created and added "Sidecar Sync Podcast: The Intersection of AI and Associations" text—all perfectly spelled and formatted.
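The combination step can also be reproduced outside the chat interface: the edit endpoint accepts a list of input images, so you can pass both cartoons and describe the composite you want. A sketch with hypothetical filenames:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Pass both cartoons as inputs and describe the composite we want.
result = client.images.edit(
    model="gpt-image-1",
    image=[
        open("host_cartoon.png", "rb"),    # hypothetical filenames for the
        open("cohost_cartoon.png", "rb"),  # two previously generated images
    ],
    prompt='Combine these two cartoon characters into a podcast infographic '
           'titled "Sidecar Sync Podcast: The Intersection of AI and '
           'Associations"',
)

with open("infographic.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```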
What would have previously been a multi-step process requiring either collaboration with a design team or significant time learning graphic design software was accomplished in less than five minutes with minimal technical expertise.
The breakthrough capabilities of GPT-4o go beyond just text rendering. What truly sets it apart is the conversational, iterative process it enables. Unlike standalone image generators that require carefully crafted one-shot prompts, GPT-4o allows you to refine your images through natural conversation.
After generating an initial image, you can simply say, "Make the colors brighter" or "Change the background to blue" or "Add our logo in the corner," and the system understands your request in the context of the entire conversation.
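In the API, this multi-turn refinement maps onto the Responses endpoint, where an image_generation tool call can be revised in follow-up turns that reference the previous response. A rough sketch, assuming the tool behaves as documented at the time of writing; the initial flyer prompt is a made-up example:

```python
import base64

from openai import OpenAI

client = OpenAI()

def save_image(response, path):
    # Pull the base64 image out of the image_generation tool call.
    for item in response.output:
        if item.type == "image_generation_call":
            with open(path, "wb") as f:
                f.write(base64.b64decode(item.result))

# First turn: generate the initial image.
first = client.responses.create(
    model="gpt-4o",
    input="Create a simple event flyer for an association conference",
    tools=[{"type": "image_generation"}],
)
save_image(first, "flyer_v1.png")

# Second turn: refine conversationally; the previous response carries
# the full context, so the instruction needs no restating of details.
second = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input="Make the colors brighter and change the background to blue",
    tools=[{"type": "image_generation"}],
)
save_image(second, "flyer_v2.png")
```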
This represents a fundamental advantage that standalone image generators may find difficult to match. Because the omni-modal approach gives the model the full conversation history and context, images can be improved iteratively and continuously until they holistically represent what the conversation is about. It's a dimension that image-only models will struggle to replicate, because they lack this contextual understanding of the ongoing interaction.
The conversation-based approach creates a workflow that feels more like collaborating with a designer than wrestling with a technical tool.
For associations, this advancement opens up numerous possibilities for member-facing visual content. The ability to quickly iterate on these materials means you can experiment with different approaches and refine your visual communication strategy without significant time or resource investment.
GPT-4o's image generation capabilities signal a broader trend toward integrated, omni-modal AI systems that blur the lines between different media types. Rather than having separate tools for text, image, and soon video generation, we're moving toward unified systems that understand and create across multiple formats.
For associations, the implication of this trend is that we're entering an era where the limiting factor isn't technical capability but creative vision. The question becomes not "Can we create this?" but "What should we create?"
Looking back at my comical attempts to get Midjourney to spell words correctly in September 2023, I'm struck by how quickly this limitation has been overcome. What seemed impossible less than two years ago is now solved so completely that it will surely be taken for granted soon.
This pace of advancement suggests that capabilities we currently consider impossible or impractical may soon become routine. For association leaders, the key is to stay curious, experimental, and focused on the value these tools can create for members.
As you explore GPT-4o's image generation capabilities, remember that the goal isn't to replace human creativity but to amplify it—giving you more ways to communicate, educate, and engage with your members than ever before.