Want Better AI Images? Ask a Chatbot!

How to use LLMs to brainstorm scene ideas, remix pictures, and more.

Jan 18, 2024

Happy Thursday, digital dingos!

Almost exactly1 a year ago, I wrote about “splatterprompting.”

That’s the practice of filling text-to-image prompts with dubious descriptors like “64K, HD, hyperrealistic, award-winning, masterpiece, best, epic, legendary, godlike, Captain-Marvelly”…you get the picture.

The gist of my argument was: “Stop that.”

Propaganda poster with a man holding his hand to the camera, saying "Stop. That."

My sentiment still holds.

Today, in line with the “connecting the dots” pledge I made last week, I want to demystify text-to-image prompting for those who still think it’s a complex and incomprehensible science.

Follow me, and you’ll see just how simple it can be…2

Image models have gotten way better

In my now-one-year-old article, I made the case for using natural language instead of splatterprompting:

Splatterprompting is a dying concept. It’s time to shake the habit and start using natural language when talking to AI models. — This could’ve been a text quote, but I wanted to try Substack’s “share a quote” feature, so here we are.

I reasoned that image models would only get better at creating pictures based on straightforward inputs.

Now, I’m not one to toot my own horn, but *TOOT*

My prediction has come true in at least two major ways:

1. Image models make better images now

Hey, remember this meme from late 2022?

Meme with too many hands and fingers trying a handshake, generated by AI

Oh, how we laughed!

Anyway, here’s Midjourney V6:

"A handshake" generated by Midjourney V6 — “Who’s laughing now, human?”

Yup.

Text-to-image models are that good now. People with three limbs and 17 fingers are largely a thing of the past.3

This means you no longer have to guide a model by telling it you want “a hand with five fingers” or using negative prompts like “low-res, bad anatomy, blurry, out of frame, deformed, mutated hands,” etc. (These are all real, by the way.)

To wit, here are two results for the simple one-word prompt “Dog”:

"Dog" results in Midjourney Version 1 (left) and Version 6 (right)

Can you guess which one is Midjourney Version 1 and which one is Version 6?

(Hint: V6 is the one that doesn’t look like a Cronenbergian horror.)

Midjourney V6 doesn’t need a detailed specification of how many eyes dogs have to return a polished, fully realized image.

Instead, you can use the prompt space to focus on vivid descriptions of the exact scene you want.

This brings us to…

2. Image models understand you better now

The launch of DALL-E 3 marked a new era in just how many instructions and descriptors a text-to-image model can reliably handle.

Turns out it’s “many.”

From DALL-E 3 to Google’s Imagen 2 to Midjourney V6, the current generation of image models can successfully respond to long, elaborate scene descriptions written in natural language.

That’s great news for you: You can finally stop chasing “advanced prompts” all over the Internet.

Nowadays, an “advanced prompt” is just you telling the model what you want it to draw in as many words as are necessary to convey your vision.

“But Daniel,” you counter, “this assumes I’m able to describe my vision accurately and exhaustively. In case you haven't noticed, I’m not exactly Hemingway!”

I know, my hypothetical straw man friend.

I also no word so good sometimes.

Luckily, we have another AI tool to help our word-challenged selves…

Enter chatbots

Unlike many of us, chatbots powered by large language models always know the right terminology and are pretty great with descriptions.

If you don’t know what “macro photography” is, they have your back:

User: What's it called when it's like a photo of a thing but, like, uh, very close and zoomed in and you can see all the small stuff?

ChatGPT: It sounds like you're describing "macro photography." Macro photography is a genre where the subject is captured extremely close-up, often revealing details that can't be seen with the naked eye. This type of photography is great for highlighting the intricate details of small subjects, like the texture of an insect's wing or the droplets of water on a leaf. It provides a unique perspective by magnifying small objects or parts of objects.

If you aren’t inspired enough to describe an interesting scene, they have your back there, too:

User: I want to paint a beautiful village scene, can you describe it for me vividly? Keep it to around 70-80 words.

ChatGPT: Imagine a quaint village nestled in a lush valley. Thatched-roof cottages line cobblestone streets, their walls draped in ivy. A sparkling stream meanders through, crossed by quaint wooden bridges. Colorful flower gardens bloom outside every home, creating a tapestry of reds, yellows, and purples. In the distance, rolling green hills meet a clear blue sky, dotted with fluffy white clouds. Villagers, dressed in simple, earthy tones, go about their day, chatting and laughing, adding a lively charm to the serene landscape.

This is perfect for anyone who wants to experiment with image models but doesn’t know where to start.

Chatbots are the ideal middleman between you and a text-to-image model, as expressed by this technical formula (see if you can follow):

Image models that understand natural language
+
LLMs that can describe scenes using natural language
=
Good things!

With that in mind, let’s explore a few specific ways to use AI chatbots in combination with text-to-image tools.

What chatbots can help with

For demo purposes, I’ll be using ChatGPT to generate prompts and Midjourney V6 to create images.

But the takeaways are generally applicable.

You can use your chatbot of choice and plug the resulting prompts into your preferred image model. (Don’t have one? I covered seven text-to-image models a month ago, and most of them are free to use.4)
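If you like to see the hand-off spelled out, here is a minimal Python sketch of that two-step workflow. Note that `ask_chatbot` and `generate_image` are hypothetical stand-ins for whatever chatbot and image tool you use, not real library calls, and the fake versions below exist only so the example runs on its own.

```python
from typing import Callable, List


def prompts_to_images(
    request: str,
    ask_chatbot: Callable[[str], List[str]],
    generate_image: Callable[[str], str],
) -> List[str]:
    """Ask a chatbot for scene descriptions, then pass each one to an image model unchanged."""
    descriptions = ask_chatbot(request)
    return [generate_image(description) for description in descriptions]


# Hypothetical stand-ins so the sketch runs without any real service.
def fake_chatbot(request: str) -> List[str]:
    return ["Small blue alien with large green eyes.", "Fluffy teddy bear alien."]


def fake_image_model(prompt: str) -> str:
    return f"<image for: {prompt}>"


for result in prompts_to_images("Describe two cute aliens.", fake_chatbot, fake_image_model):
    print(result)
```

The point of passing both tools in as plain functions is that you can swap either one without touching the rest: the descriptions go from chatbot to image model exactly as written.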

For an all-in-one experience, you have two great options:

Paid: If you’re a ChatGPT Plus user, you can discuss what you want and generate images in the same chat using DALL-E 3.
Free: You can replicate virtually the same experience with Microsoft Copilot, which is also powered by GPT-4 and DALL-E 3.

Off we go!

1. Brainstorming scene ideas

Let’s say I want to make a kid-friendly alien but don’t know the exact look I’m going for. I ask ChatGPT to do something like this:5

I want to draw a cute alien for my kids, but I don't know what it should look like. Please give me five very different descriptions of about 80 words each. Aim for different styles and color palettes for each. Don't start your descriptions with commands like "Imagine" or "Create." Instead, use words sparingly to describe only the visual elements that can be captured in a single image.

ChatGPT happily obliges:

ChatGPT: A small, round-bodied alien with a soft blue hue. Its large, expressive eyes are a bright emerald green, taking up most of its face. It has two small, pointy ears and a tiny mouth. The alien stands on three stubby legs and has two flexible, arm-like appendages with suction cups at the ends. Its skin has a velvety texture and it emits a gentle, luminous glow, casting a calming light around it.

I’m tempted to clean the descriptions up a bit (e.g. “Tall, slender alien” instead of “This alien is tall and slender”).

But to prove my “keep it simple” point, I copy-pasted ChatGPT’s descriptions directly into Midjourney V6. Here are the three most kid-friendly results:

Fluffy teddy bear alien. Insect-like orange alien. Small blue alien.

All images reflect most of the major prompt elements, with minor exceptions.

Now let’s imagine I know the exact subject I’m going for but am not sure how best to capture it. ChatGPT to the rescue:

I'd like to make a photograph of a crowd at an outdoor festival, but I'm not sure of the best way to do it. Please describe five separate images that use different styles of photography, camera angles, and more. Keep each description around 80 words and only describe what's seen in the image.

Here’s what I get:

ChatGPT: Wide-Angle Shot: This photo captures the expanse of the festival with a wide-angle lens. The image includes a large, diverse crowd of people, all engaged in various festival activities. The wide lens distorts the edges slightly, emphasizing the size and scope of the event. Food stalls, colorful banners, and a stage with musicians are visible in the background. The sky is bright, and the atmosphere feels lively and vibrant. Aerial Vie

Once again, you’ll want to tighten up the descriptions yourself or ask ChatGPT to do it for you. But even if you don’t, check out what Midjourney gives me on the very first try:

Different festival photos by Midjourney V6

Here are three handpicked images:

Close-up of faces in the crowd, aerial view, black and white documentary photo of a music festival

You can use this approach to brainstorm anything from painting types to visual styles to art materials.
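If you find yourself retyping the same request, the brainstorming prompt is easy to turn into a template. Here is a rough Python sketch that follows the wording of my alien prompt; the function name and its parameters (`goal`, `vary`, `count`, `words`) are my own illustrative choices, so adjust them freely.

```python
def build_brainstorm_prompt(goal, vary, count=5, words=80):
    """Assemble a brainstorming request for a chatbot.

    goal -- what you want to make, e.g. "draw a cute alien for my kids"
    vary -- what should differ between ideas, e.g. "styles and color palettes"
    """
    return (
        f"I want to {goal}, but I don't know what it should look like. "
        f"Please give me {count} very different descriptions of about {words} words each. "
        f"Aim for different {vary} for each. "
        "Don't start your descriptions with commands like \"Imagine\" or \"Create.\" "
        "Instead, use words sparingly to describe only the visual elements "
        "that can be captured in a single image."
    )


print(build_brainstorm_prompt("paint a lighthouse", "art materials and painting types"))
```

Paste the printed text into your chatbot, then copy whichever descriptions you like into your image model.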

Give it a go and let me know how this works for you.

2. Mimicking an image or style

For this, you’ll need a multimodal LLM with image recognition.

If you’re using ChatGPT Plus or Microsoft Copilot (my recommendations above), you’ll be fine: Both of them can see and analyze images.

Start by uploading your image. You’ll typically find the “Upload a file” button to the left of the chatbot’s text input. Here’s ChatGPT:

I picked this image from Midjourney’s public showcase:

Illustration portrait of a woman with geometric shapes and different colors

Then I asked this:

I want to create an image similar to this one. Please describe it as accurately as possible, in a way that would let an artist recreate it. Keep your description to a single paragraph of about 80 words.

ChatGPT responded with:

This image features a stylized portrait of a woman in a contemporary one-line drawing technique with sections of geometric color blocking. The line art, done in black, outlines her features in an abstract and flowing manner without lifting the pen, capturing the essence of her face, neck, and a hint of her shoulders. Her facial features are detailed with closed eyes, full lips, and a pronounced nose. Bold colors like blue, yellow, red, and orange fill in various sections of the face and neck, creating a vibrant mosaic of overlapping shapes and colors. The background is white, which accentuates the vivid colors and the dynamic black lines.

Again, even without refining ChatGPT’s wordiness, we get pretty damn close:

If you like the style itself but want a different subject, you can try this:

I want to be able to draw other people and objects in this style. Please provide a detailed description of the style, colors, etc. that I can use. Keep it to around 80 words.

ChatGPT: The style is a fusion of one-line art and abstract color blocking. The continuous line is unbroken and fluid, capturing the subject's essence with minimalistic elegance. Bold, flat colors--primary blues, yellows, and reds, alongside secondary oranges--segment the image into a patchwork, reminiscent of cubist influences. This technique emphasizes the interplay between the defined, sinuous black line and the vivid, geometric color shapes, against a stark white background, creating a striking, modern aesthetic.

Here’s a portrait of a koala, using the exact style description above (I just replaced “The style is a” with “Koala portrait,”):

Koala portrait, fusion of one-line art and abstract color blocking. The continuous line is unbroken and fluid, capturing the subject's essence with minimalistic elegance. Bold, flat colors--primary blues, yellows, and reds, alongside secondary oranges--segment the image into a patchwork, reminiscent of cubist influences. This technique emphasizes the interplay between the defined, sinuous black line and the vivid, geometric color shapes, against a stark white background, creating a striking, modern aesthetic.

As you can see, even without transforming ChatGPT’s first draft into better, cleaner descriptions, Midjourney gives us what we need.
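The subject swap I did by hand is simple enough to script if you want to reuse one style across many subjects. Here is a small Python sketch; `apply_style` is a name I made up, and it assumes the chatbot's description opens with "The style is a " the way mine did (otherwise it just puts the subject in front).

```python
def apply_style(style_description, subject, lead_in="The style is a "):
    """Reuse a style description for a new subject by swapping its opening words."""
    if style_description.startswith(lead_in):
        # Drop the opening words so the subject can take their place.
        body = style_description[len(lead_in):]
    else:
        # Unknown opening: keep the description whole and prepend the subject.
        body = style_description
    return f"{subject}, {body}"


style = (
    "The style is a fusion of one-line art and abstract color blocking. "
    "The continuous line is unbroken and fluid."
)
print(apply_style(style, "Koala portrait"))
# prints: Koala portrait, fusion of one-line art and abstract color blocking. The continuous line is unbroken and fluid.
```

Loop over a list of subjects and you get a matching set of prompts in one go.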

3. “Remixing” an image

The final fun thing to try is reimagining an existing image. This combines the two uses above, so you’ll again need an LLM with image recognition.

Start by uploading your image, as shown above. In my case, I used our koala with the following prompt:

I want to evolve this image into new stylistic directions. Please provide me with five separate descriptions of images that use this one as a starting point but apply different artistic styles and techniques to it. Keep each description to about 80 words. Don't start your descriptions with commands like "Imagine" or "Create." Instead, use words sparingly to describe only the visual elements that can be captured in a single image.