The "Secret Sauce" Behind DALL-E 3: How Is It So Good At Following Instructions?
I share my key takeaways from OpenAI's "Improving Image Generation with Better Captions" research paper.
In a crowded text-to-image field, one thing makes DALL-E 3 stand out: It’s freakishly good at prompt adherence.
Go ahead and ask Midjourney for a “watercolor painting of a giraffe, pig, and hedgehog dancing in a meadow.”
I dare you!
Too scared? Here, I did it for you:
Yup, that’s straight-up nightmare fuel.
DALL-E 3, on the other hand, kills it:
DALL-E 3 is so good, it can generate entire single-panel cartoons in one go.
So, what gives?
How come DALL-E 3 is that precise while Midjourney tends to Frankenstein incompatible objects into unholy amalgams with no regard for our sanity?
The short answer: better image captions.
The slightly longer answer can be found in a research paper released by OpenAI called Improving Image Generation with Better Captions (PDF).
At only 14 pages (+appendix), it’s a quick read, as far as research papers go.
But I went ahead and broke down its main findings below.
The issue (or “Why image models suck at understanding exactly what you want”)
Today’s text-to-image models are trained on countless images paired with their text descriptions, mainly scraped from the Internet.
The problem?
Most of these descriptions come from the associated image alt text, and—because this is the Internet—that alt text isn’t always helpful.
OpenAI’s research paper points out two issues:
Incomplete descriptions. Most alt text focuses on the main subject and omits important details like other objects, their relative positions, quantities, colors, sizes, any text appearing in the image, and so on.
Irrelevant descriptions. In the worst-case scenario, the alt text has nothing at all to do with the image. It might be an inside joke, meme, advertising, or random details unrelated to the content of the image itself.
Some examples:
These crappy descriptions sneak their way into the model’s training set and mess it up.
In AI research circles, this is known as the “Shit in, shit out” problem.
The fix (or “Teaching image models to suck less”)
Now that we know the issue, the solution is easy-peasy, right?
Just write better image descriptions!
But wait…how, exactly?
These training datasets contain millions of images. How do you possibly relabel them at scale?
Well, here’s what OpenAI did.
Step 1: Create a better image captioner
Without diving into the technical details—most of which sail right over my head anyway—OpenAI basically trained a custom image captioner (a language model conditioned on CLIP image embeddings) with the explicit purpose of creating text descriptions that are useful for training text-to-image models.
The team trained their model to produce two types of captions:
Short synthetic captions: These are clean, concise descriptions of the main subject and scene.
Descriptive synthetic captions: These dive into every detail of the image to counteract the incomplete and irrelevant alt text issues described above.
Here’s a comparison:
A massive difference, eh?
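To make that a bit more concrete, here's a minimal sketch of what "run a captioner over an image" looks like in code. OpenAI hasn't released its captioner, so this uses the off-the-shelf BLIP model from Hugging Face as a rough stand-in, with output length as a crude proxy for the two caption styles (the real captioner was fine-tuned separately for each style, and the file name here is made up):

```python
# Rough stand-in for OpenAI's unreleased captioner, using the public BLIP model.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def caption(path: str, max_new_tokens: int) -> str:
    """Generate a caption for the image at `path`."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=5)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# "meadow.jpg" is a hypothetical local file.
short_caption = caption("meadow.jpg", max_new_tokens=20)        # short synthetic caption
descriptive_caption = caption("meadow.jpg", max_new_tokens=120)  # descriptive synthetic caption
```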
Step 2: Relabel the dataset
OpenAI then used the new captioner to create both short and descriptive synthetic captions for the entire training dataset.
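Mechanically, this step is just a big loop over the dataset that attaches the new captions to every record while keeping the original alt text around (we'll see why in a second). A minimal sketch, reusing the caption() helper from the previous snippet and a made-up record format:

```python
# Hypothetical records; in reality this loop runs over the entire training set.
dataset = [
    {"path": "images/0001.jpg", "alt_text": "IMG_4032"},
    {"path": "images/0002.jpg", "alt_text": "Click here for our summer sale!"},
]

for record in dataset:
    record["short_caption"] = caption(record["path"], max_new_tokens=20)
    record["descriptive_caption"] = caption(record["path"], max_new_tokens=120)
```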
At this point, OpenAI was ready to test the impact of these new captions.
Step 3: Test and calibrate
The team needed to find out:
Which types of captions performed best
What was the ideal blend of original and synthetic captions
But wait, if descriptive synthetic captions are always better, why mix them with the original captions at all?
That’s because synthetic captions are structured in a similar, predictable way. After all, they’re a product of a pre-trained language model.
So a text-to-image model trained exclusively on synthetic captions overfits to that style, and when it comes across a real-world image description that looks completely different, it won't quite know what to do with it.
As such, your training set needs to contain a certain amount of original human captions, or what OpenAI calls “ground truth” captions.
The impact (or “How much less sucky is it?”)
As you might expect, long, descriptive synthetic captions outperformed all other types of captions.
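"Outperformed" here is measured by CLIP score (see footnote 1): embed an image and a piece of text with CLIP and check how similar the two embeddings are; the paper computes this between generated images and their prompts. Here's a minimal sketch of that core operation using the public CLIP checkpoint on Hugging Face (the paper's exact evaluation setup differs in its details, and the image file is hypothetical):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP's image and text embeddings (higher = closer match)."""
    inputs = clip_processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()

# Example: score a generated image against the prompt that produced it.
generated = Image.open("giraffe_pig_hedgehog.png")
print(clip_score(generated, "a watercolor painting of a giraffe, pig, and hedgehog dancing in a meadow"))
```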
Here’s the CLIP score1 for each:
Similarly, the higher the proportion of descriptive synthetic captions in the training set, the better:
As a result, the final training set for DALL-E 3 was a blend of 95% synthetic captions and 5% ground truth captions.
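The paper doesn't spell out the exact plumbing, but a straightforward way to implement that kind of blend is to decide, image by image, which caption it gets paired with at training time. A minimal sketch using the 95/5 split and the made-up record format from earlier:

```python
import random

def training_caption(record: dict, p_synthetic: float = 0.95) -> str:
    """Pair the image with its descriptive synthetic caption 95% of the time,
    and with the original "ground truth" alt text the remaining 5%."""
    if random.random() < p_synthetic:
        return record["descriptive_caption"]
    return record["alt_text"]

caption_for_this_step = training_caption(dataset[0])
```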
OpenAI then got human testers to evaluate its performance in prompt following, style, and coherence. DALL-E 3 demolished every other model in the test:
With that, DALL-E 3 is now the perfect text-to-image model that can never be further improved!
The end.
Except, of course, not quite…
The limitations (or “Why some things still suck”)
As much of a leap forward as DALL-E 3 is, it still has issues.
1. Shorter prompts = worse output
Because DALL-E 3 is trained on long, descriptive captions, it tends to produce better images when the prompts it receives are equally elaborate.
But most real-world users aren't going to write prompts that are nearly as detailed as the captions produced by OpenAI's specially trained captioner, which means they won't get the best results out of DALL-E 3.
Fortunately, OpenAI has access to an obscure little language model called GPT-4. With a specialized prompt2, GPT-4 can “upsample” short prompts into lengthy, vivid descriptions.
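If you're calling the API directly rather than going through ChatGPT, the same trick is easy to bolt on yourself: run the short prompt through GPT-4 first, then hand the expanded version to the image model. Note that the instruction below is my own rough stand-in, not the actual upsampling prompt from Appendix C:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative stand-in only; the real upsampling prompt lives in Appendix C of the paper.
UPSAMPLE_INSTRUCTION = (
    "Rewrite the user's image prompt as a long, richly detailed description. "
    "Name every subject and spell out positions, colors, lighting, style, and background."
)

def upsample(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": UPSAMPLE_INSTRUCTION},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(upsample("a watercolor painting of a giraffe, pig, and hedgehog dancing in a meadow"))
```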
The differences are quite striking:
This might also explain why OpenAI decided to bring DALL-E 3 into ChatGPT Plus rather than using a separate interface for it. ChatGPT and DALL-E 3 are a match made in heaven.
2. Limited spatial awareness
Despite its impressive prompt adherence, DALL-E 3 still struggles with very precise directions like “to the left of,” “underneath,” and so on.
OpenAI found that this was mainly due to the pre-trained captioner being equally bad at it. So it passed its own problems down to DALL-E 3.
3. Imperfect text rendering
Personally, what blew me away about DALL-E 3 is its ability to write coherent text on a semi-reliable basis. I recently had fun creating imaginary movie posters for “Misunderstood Superheroes” using raw output from DALL-E 3:
But it’s far from flawless.
With the same exact prompt, DALL-E 3 is just as likely to return an image that contains text like this instead:
(This is why each poster in the above series took about a dozen rerolls.)
This time, the main culprit isn't the captioner but the T5 text encoder. OpenAI suspects it makes the model treat text at the whole-word level: DALL-E 3 sees tokens representing entire words and has to map them onto individual letters, instead of rendering each character separately.
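You can see the root of the problem just by looking at how T5 tokenizes text: a phrase arrives as a handful of multi-character pieces, not as individual letters, so the model never directly "sees" the characters it's supposed to paint. A quick illustration (the exact pieces depend on the tokenizer's vocabulary):

```python
from transformers import T5Tokenizer  # requires the sentencepiece package

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# The phrase comes back as a few subword pieces rather than one token per letter,
# so the image model has to map whole chunks onto individually drawn characters.
print(tokenizer.tokenize("MISUNDERSTOOD SUPERHEROES"))
```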
4. Hallucinated specifics
We all know large language models tend to “hallucinate,” which is a term we’ve agreed to use because it sounds better than “make shit up.”
Because of this, OpenAI’s image captioner ended up hallucinating inaccurate details about certain images, especially those of plants and birds for some reason. It then added these hallucinations to the descriptive synthetic captions, thereby polluting the training data for DALL-E 3.
So now when you ask DALL-E 3 to create a specific plant genus, it may draw the wrong flower because of this corrupt training data.
Check out a few examples of failure cases from OpenAI, showcasing the impact of the limitations discussed above:
What’s next for text-to-image models?
Here’s the good news: Most of the limitations come from problems with the custom captioner, not the DALL-E 3 image model itself.
OpenAI believes the majority of these are readily fixable:
Conditioning the image model on “character-level language models” should improve the text rendering.
Reducing captioner hallucinations should help DALL-E 3 accurately draw specific plant and animal species.
Training the captioner to use precise prepositions in its descriptions should make DALL-E 3 better at spatial awareness.
So we can probably look forward to more impressive results when something like DALL-E 4 eventually comes out.
But it gets even better.
What OpenAI’s research paper seems to confirm is that the bulk of what we consider fundamental problems with text-to-image models comes from poor text descriptions in the underlying training set.
As such, I see no reason why Midjourney, Stable Diffusion, etc. shouldn’t be able to replicate the success of DALL-E 3. All they need to do is find a scalable way to create better captions for their own training sets.
Simple!
Over to you…
Did this help you get a better understanding of why DALL-E 3 is as good as it is?
What’s been your own experience with DALL-E 3 so far? Did you see the same issues? Have you stumbled upon some other limitations that aren’t mentioned in the paper?
As always, I’m curious to hear your thoughts!
You can send me an email at whytryai@substack.com or leave a comment.
1. Essentially a measure of how closely the captions correspond to the content of the image.
2. See “Appendix C” in the paper.