The "Secret Sauce" Behind DALL-E 3: How Is It So Good At Following Instructions?
I share my key takeaways from OpenAI's "Improving Image Generation with Better Captions" research paper.
In a crowded text-to-image field, one thing makes DALL-E 3 stand out: It’s freakishly good at prompt adherence.
Go ahead and ask Midjourney for a “watercolor painting of a giraffe, pig, and hedgehog dancing in a meadow.”
I dare you!
Too scared? Here, I did it for you:
Yup, that’s straight-up nightmare fuel.
DALL-E 3, on the other hand, kills it:
DALL-E 3 is so good, it can generate entire single-panel cartoons in one go.
So, what gives?
How come DALL-E 3 is that precise while Midjourney tends to Frankenstein incompatible objects into unholy amalgams with no regard for our sanity?
The short answer: better image captions.
The slightly longer answer can be found in a research paper released by OpenAI called Improving Image Generation with Better Captions (PDF).
At only 14 pages (+appendix), it’s a quick read, as far as research papers go.
But I went ahead and broke down its main findings below.
The issue (or “Why image models suck at understanding exactly what you want”)
Today’s text-to-image models are trained on countless images paired with their text descriptions, mainly scraped from the Internet.
The problem?
Most of these descriptions come from the associated image alt text, and—because this is the Internet—that alt text isn’t always helpful.
OpenAI’s research paper points out two issues:
Incomplete descriptions. Most alt text focuses on the main subject and omits important details like other objects, their relative positions, quantities, colors, sizes, any text appearing in the image, and so on.
Irrelevant descriptions. In the worst-case scenario, the alt text has nothing at all to do with the image. It might be an inside joke, meme, advertising, or random details unrelated to the content of the image itself.
Some examples:
These crappy descriptions sneak their way into the model’s training set and mess it up.
In AI research circles, this is known as the “Shit in, shit out” problem.
The fix (or “Teaching image models to suck less”)
Now that we know the issue, the solution is easy-peasy, right?
Just write better image descriptions!
But wait…how, exactly?
These training datasets contain millions of images. How do you possibly relabel them at scale?
Well, here’s what OpenAI did.
Step 1: Create a better image captioner
Without diving into the technical details—most of which sail right over my head anyway—OpenAI basically trained a custom image captioner (a language model conditioned on CLIP image embeddings) with the explicit purpose of creating text descriptions that are useful for training text-to-image models.
The team trained their model to produce two types of captions:
Short synthetic captions: These are clean, concise descriptions of the main subject and scene.
Descriptive synthetic captions: These dive into every detail of the image to counteract the incomplete and irrelevant alt text issues described above.
Here’s a comparison:
A massive difference, eh?
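To make that a bit more concrete, here's a minimal sketch of what "run a captioner over an image" looks like in code. OpenAI hasn't released its captioner, so this uses the off-the-shelf BLIP model from Hugging Face as a rough stand-in, with output length as a crude proxy for the two caption styles (the real captioner was fine-tuned separately for each style, and the file name here is made up):

```python
# Rough stand-in for OpenAI's unreleased captioner, using the public BLIP model.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def caption(path: str, max_new_tokens: int) -> str:
    """Generate a caption for the image at `path`."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=5)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# "meadow.jpg" is a hypothetical local file.
short_caption = caption("meadow.jpg", max_new_tokens=20)        # short synthetic caption
descriptive_caption = caption("meadow.jpg", max_new_tokens=120)  # descriptive synthetic caption
```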
Step 2: Relabel the dataset
OpenAI then used the new captioner to create both short and descriptive synthetic captions for the entire training dataset.
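Mechanically, this step is just a big loop over the dataset that attaches the new captions to every record while keeping the original alt text around (we'll see why in a second). A minimal sketch, reusing the caption() helper from the previous snippet and a made-up record format:

```python
# Hypothetical records; in reality this loop runs over the entire training set.
dataset = [
    {"path": "images/0001.jpg", "alt_text": "IMG_4032"},
    {"path": "images/0002.jpg", "alt_text": "Click here for our summer sale!"},
]

for record in dataset:
    record["short_caption"] = caption(record["path"], max_new_tokens=20)
    record["descriptive_caption"] = caption(record["path"], max_new_tokens=120)
```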
At this point, OpenAI was ready to test the impact of these new captions.
Step 3: Test and calibrate
The team needed to find out:
Which types of captions performed best
What was the ideal blend of original and synthetic captions
But wait, if descriptive synthetic captions are always better, why mix them with the original captions at all?
That’s because synthetic captions are structured in a similar, predictable way. After all, they’re a product of a pre-trained language model.
So a text-to-image model trained exclusively on synthetic captions overfits to that style, and when it comes across a real-world image description that looks completely different, it won't quite know what to do with it.
As such, your training set needs to contain a certain amount of original human captions, or what OpenAI calls “ground truth” captions.
The impact (or “How much less sucky is it?”)
As you might expect, long, descriptive synthetic captions outperformed all other types of captions.
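"Outperformed" here is measured by CLIP score (see footnote 1): embed an image and a piece of text with CLIP and check how similar the two embeddings are; the paper computes this between generated images and their prompts. Here's a minimal sketch of that core operation using the public CLIP checkpoint on Hugging Face (the paper's exact evaluation setup differs in its details, and the image file is hypothetical):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP's image and text embeddings (higher = closer match)."""
    inputs = clip_processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()

# Example: score a generated image against the prompt that produced it.
generated = Image.open("giraffe_pig_hedgehog.png")
print(clip_score(generated, "a watercolor painting of a giraffe, pig, and hedgehog dancing in a meadow"))
```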
Here’s the CLIP score1 for each:
Similarly, the higher the proportion of descriptive synthetic captions in the training set, the better:
As a result, the final training set for DALL-E 3 was a blend of 95% synthetic captions and 5% ground truth captions.
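The paper doesn't spell out the exact plumbing, but a straightforward way to implement that kind of blend is to decide, image by image, which caption it gets paired with at training time. A minimal sketch using the 95/5 split and the made-up record format from earlier:

```python
import random

def training_caption(record: dict, p_synthetic: float = 0.95) -> str:
    """Pair the image with its descriptive synthetic caption 95% of the time,
    and with the original "ground truth" alt text the remaining 5%."""
    if random.random() < p_synthetic:
        return record["descriptive_caption"]
    return record["alt_text"]

caption_for_this_step = training_caption(dataset[0])
```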
OpenAI then got human testers to evaluate its performance in prompt following, style, and coherence. DALL-E 3 demolished every other model in the test:
With that, DALL-E 3 is now the perfect text-to-image model that can never be further improved!
The end.
Except, of course, not quite…
The limitations (or “Why some things still suck”)
As much of a leap forward as DALL-E 3 is, it still has issues.
1. Shorter prompts = worse output
Because DALL-E 3 is trained on long, descriptive captions, it tends to produce better images when the prompts it receives are equally elaborate.
But most real-world users aren't going to write prompts that are nearly as detailed as the captions produced by OpenAI's specially trained captioner, which means they won't get the best results out of DALL-E 3.
Fortunately, OpenAI has access to an obscure little language model called GPT-4. With a specialized prompt2, GPT-4 can “upsample” short prompts into lengthy, vivid descriptions.
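If you're calling the API directly rather than going through ChatGPT, the same trick is easy to bolt on yourself: run the short prompt through GPT-4 first, then hand the expanded version to the image model. Note that the instruction below is my own rough stand-in, not the actual upsampling prompt from Appendix C:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative stand-in only; the real upsampling prompt lives in Appendix C of the paper.
UPSAMPLE_INSTRUCTION = (
    "Rewrite the user's image prompt as a long, richly detailed description. "
    "Name every subject and spell out positions, colors, lighting, style, and background."
)

def upsample(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": UPSAMPLE_INSTRUCTION},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(upsample("a watercolor painting of a giraffe, pig, and hedgehog dancing in a meadow"))
```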
The differences are quite striking:
This might also explain why OpenAI decided to bring DALL-E 3 into ChatGPT Plus rather than using a separate interface for it. ChatGPT and DALL-E 3 are a match made in heaven.
2. Limited spatial awareness
Despite its impressive prompt adherence, DALL-E 3 still struggles with very precise directions like “to the left of,” “underneath,” and so on.
OpenAI found that this was mainly due to the pre-trained captioner being equally bad at it. So it passed its own problems down to DALL-E 3.
3. Imperfect text rendering
Personally, what blew me away about DALL-E 3 is its ability to write coherent text on a semi-reliable basis. I recently had fun creating imaginary movie posters for “Misunderstood Superheroes” using raw output from DALL-E 3:
But it’s far from flawless.
With the same exact prompt, DALL-E 3 is just as likely to return an image that contains text like this instead:
(This is why each poster in the above series took about a dozen rerolls.)
This time, the main culprit isn't the captioner but the T5 text encoder. OpenAI suspects it makes the model treat text at the whole-word level: DALL-E 3 sees tokens representing entire words and has to map them onto individual letters, instead of rendering each character separately.
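You can see the root of the problem just by looking at how T5 tokenizes text: a phrase arrives as a handful of multi-character pieces, not as individual letters, so the model never directly "sees" the characters it's supposed to paint. A quick illustration (the exact pieces depend on the tokenizer's vocabulary):

```python
from transformers import T5Tokenizer  # requires the sentencepiece package

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# The phrase comes back as a few subword pieces rather than one token per letter,
# so the image model has to map whole chunks onto individually drawn characters.
print(tokenizer.tokenize("MISUNDERSTOOD SUPERHEROES"))
```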
4. Hallucinated specifics
We all know large language models tend to “hallucinate,” which is a term we’ve agreed to use because it sounds better than “make shit up.”
Because of this, OpenAI’s image captioner ended up hallucinating inaccurate details about certain images, especially those of plants and birds for some reason. It then added these hallucinations to the descriptive synthetic captions, thereby polluting the training data for DALL-E 3.
So now when you ask DALL-E 3 to create a specific plant genus, it may draw the wrong flower because of this corrupt training data.
Check out a few examples of failure cases from OpenAI, showcasing the impact of the limitations discussed above:
What’s next for text-to-image models?
Here’s the good news: Most of the limitations come from problems with the custom captioner, not the DALL-E 3 image model itself.
OpenAI believes the majority of these are readily fixable:
Conditioning the image model on “character-level language models” should improve the text rendering.
Reducing captioner hallucinations should help DALL-E 3 accurately draw specific plant and animal species.
Training the captioner to use precise prepositions in its descriptions should make DALL-E 3 better at spatial awareness.
So we can probably look forward to more impressive results when something like DALL-E 4 eventually comes out.
But it gets even better.
What OpenAI’s research paper seems to confirm is that the bulk of what we consider fundamental problems with text-to-image models comes from poor text descriptions in the underlying training set.
As such, I see no reason why Midjourney, Stable Diffusion, etc. shouldn’t be able to replicate the success of DALL-E 3. All they need to do is find a scalable way to create better captions for their own training sets.
Simple!
Over to you…
Did this help you get a better understanding of why DALL-E 3 is as good as it is?
What’s been your own experience with DALL-E 3 so far? Did you see the same issues? Have you stumbled upon some other limitations that aren’t mentioned in the paper?
As always, I’m curious to hear your thoughts!
You can send me an email at whytryai@substack.com or leave a comment.
1. Essentially a measure of how closely the captions correspond to the content of the image.
2. See “Appendix C” in the paper.