Hey, remember when I demoed six text-to-video sites?
Cat genitals, psychedelics, creepy chimeras, and other shenanigans? Ring a bell?
Today, I want to do the same for AI images. (Hopefully with 100% fewer cat penises.)
By my latest count, we now have seven primary public text-to-image models[1]:
DALL-E 3 (OpenAI)
Emu (Meta)
Firefly Image 2 (Adobe)
Ideogram (Ideogram)
Imagen (Google)
Midjourney 5.2 (Midjourney)
SDXL (Stability AI)[2]
Let’s check out the images they generate and learn more about the models.
The process
This won’t be a deep-dive showdown like my SDXL 1.0 vs. Midjourney 5.2 post.
Instead, I’ll briefly introduce each model and showcase the visuals it generates. To keep things consistent and comparable, I’ll be using the same 6 prompts for each model:
Tulips in a meadow, golden hour, watercolor painting
Parrot on a branch, wildlife photography, National Geographic
Portrait of a woman wearing sunglasses, pencil sketch
Abstract shapes, acrylic paint
Ice cream shop, minimal line logo
Colorful banner that says “Digital Art”
I tried to pick prompts that cover a range of picture types, art mediums, and styles.
Because some of the models can only generate square images at the moment, I’ll be sticking to the 1:1 aspect ratio for all images.
Off we go!
1. DALL-E 3 (OpenAI)
DALL-E 3 is the latest image model from OpenAI, having replaced DALL-E 2 in October 2023.
What makes DALL-E 3 special is its ability to faithfully follow long, detailed prompts, thanks to being trained on images with synthetically enriched captions.
DALL-E 3 is great for cartoons with speech bubbles and other images that include writing, because it tends to handle text better than most other models.
Sample images:
DALL-E 3 at a glance:
Interface: Web
Standout features: Prompt adherence and text generation
Is free? Yes (via Bing)
Where to try: Bing Image Creator (free) or ChatGPT (for paid Plus users)
2. Emu (Meta)
Emu, by Meta AI, was first announced in early October 2023, started rolling out to select users in November, and finally became available to all US residents through a standalone site in early December.
The interface is pretty barebones for now: just a simple text box to input your prompt, which generates four alternative square pictures.
Sample images:
Emu at a glance:
Interface: Web (and inside Meta products like Facebook and WhatsApp)
Standout feature: Built-in watermarking for transparency
Is free? Yes
Where to try: imagine.meta.com (if you’re in the US or use a VPN)
3. Firefly Image 2 (Adobe)
First announced at the Adobe MAX conference in October, Firefly Image 2 replaced the first version of Adobe’s in-house image model.
It’s available on the standalone Adobe Firefly site and also powers the company’s suite of products, including Adobe Photoshop and Adobe Illustrator.
Because Firefly Image 2 is built into more advanced Adobe interfaces, you can use it for inpainting, changing image styles, and a whole lot more.
Sample images:
Firefly Image 2 at a glance:
Interface: Web (and inside most Adobe products)
Standout features: Additional editing options like style transfer, text-to-vector, text effects, generative fill, recolor, and more.
Is free? Yes (25 credits per month)
Where to try: firefly.adobe.com
4. Ideogram (Ideogram)
Ideogram arrived seemingly out of nowhere in late August 2023.
It’s the only text-to-image model on the list by a company that wasn’t on the scene until this year. The founders of Ideogram all previously worked on Google’s Imagen (see below) before leaving to start their own thing.
Ideogram was trained from scratch to solve the issue of gibberish text inside images. It was the first model to reliably generate text[3] before DALL-E 3 caught up.
Sample images:
Ideogram at a glance:
Interface: Web
Standout features: Text generation and image remixing
Is free? Yes (25 prompts per day)
Where to try: Ideogram.ai
5. Imagen (Google)
The Imagen research paper first came out back in May 2022, when Midjourney was just starting out and DALL-E 2 wasn’t out yet. Then, while Midjourney and OpenAI rapidly iterated and released public-facing image models, Google just sat on its research. (I even threw a mocking jab at it in this post.)
But in October 2023, Google quietly made image generation available within SGE (Search Generative Experience), using Imagen.
Then, just as I was writing this article and testing the model, Google announced Imagen 2, which so far is only available to developers via Vertex AI.
As far as I know, my images below use Imagen 1, so the text accuracy caught me off guard.
Sample images:
Imagen at a glance:
Interface: Web
Standout features: Text generation, prompt understanding, watermarking
Is free? Yes (if you have SGE available and enabled)
Where to try: Google search (type “draw a picture of [prompt]”)
6. Midjourney 5.2 (Midjourney)
Ah, Midjourney, my greatest obsession[4].
Trained by a relatively small team that shunned all external funding, Midjourney is still seen as the gold standard within image generation. The latest version is 5.2, but version 6 is just around the corner.
Midjourney continues to thrive despite the inconvenient Discord interface and the lack of a free plan. No small feat.
Sample images:
Midjourney at a glance:
Interface: Discord (but the alpha web version is out for power users and coming soon to all)
Standout features: Inpainting, outpainting, Style Tuner, blend, and more
Is free? No (plans start at $10 / month)
Where to try: Midjourney.com
7. SDXL (Stability AI)
Stable Diffusion is why this newsletter exists; it’s the first text-to-image model I ever tried.
Stable Diffusion is currently the only open-source[5] model on this list. It can be downloaded and installed locally, customized, and iterated upon to create even better spinoff models (like Playground 2).
Stable Diffusion XL (SDXL) is the latest “vanilla” version, which I compared to Midjourney 5.2 a few months ago.
Sample images:
SDXL at a glance:
Interface: Web and local install (also Discord, if you insist)
Standout features: Open-source, infinitely customizable, can run locally
Is free? Yes
Where to try: Dozens of image creation sites
Observations
The most obvious conclusion here is that text-to-image models are converging. At the start of the year, there were clear leaders. Now, it’s often impossible to tell AI image models apart in terms of quality.
While Midjourney might still have a slight edge, the gap is closing fast! We’ll see if version 6 does anything to shake that up.
But the biggest surprise by far was Imagen.
Google released it without much fanfare, so I didn’t realize just how good it was, especially when it comes to text. Until now, I thought Ideogram and DALL-E 3 were the only models capable of rendering text.
The sample image I picked wasn’t a fluke. Here’s the entire grid:
Yup: Correctly spelled text in every single image. If this is indeed only Imagen 1, I can’t wait to see what Imagen 2 brings to the table.
But Google also has unexpectedly strict filters when it comes to generating people.
It took around 10 tries before the pencil sketch came out, almost by accident, after which Google refused to return any more images.
All in all, we are truly spoiled for choice.
It’s incredible that—just one year after Stable Diffusion’s debut—seven models of this caliber are available to us, for free…with the notable exception of Midjourney.
What a time to be alive!
Side-by-side comparison
Here’s a look at all 7 models and 6 prompts in a single image:
Over to you…
What’s your favorite AI image model? Do you agree with my observations? Have I overlooked a model in this roundup?
As always, I’d love to hear your input. Send an email to whytryai@substack.com or leave a comment below.
[1] A lot has changed since my April article comparing text-to-image models to smartphone operating systems.
[2] There are excellent sites like Leonardo and Playground with tuned models that easily outperform the vanilla Stable Diffusion XL. But they are spinoffs of Stable Diffusion, and my focus here is on the underlying models, so I won’t look into the many great SD spawns.
[3] Although DeepFloyd IF by Stability AI did have a decent success rate.
[4] I’ve dedicated no fewer than 25 posts to Midjourney thus far, but who’s counting?
[5] Although there’s some debate about the specific definitions of “open-source.”
I like that you provide a side-by-side comparison — do you mind sharing your prompts?
Also, do you think each model has distinct features that the others don’t?
You said it well here: "The most obvious conclusion here is that text-to-image models are converging. At the start of the year, there were clear leaders. Now, it’s often impossible to tell AI image models apart in terms of quality."
I feel this way about all generative AI. The worst model today is probably better than the best model a year ago all across the board, and it's only going to get more competitive from here on out.
Super cool that Google is getting words right! I am sure we'll look back on this problem as trivial one day, but it's sure frustrating today.