I remember when you had to get a prompt just right. You had a very limited number of attempts per time period, so you had to make 'em count. I've also become a bit of a minimalist over the last year or so, especially as simply iterating on what you've just created is way, way faster than trying to correct a prompt. "No, I meant the other type of X" is an example of the pointed, simple corrective stuff you can get to quickly.
Of course, I make very fast images and use them predominantly to support what I've written, and that's a very different thing than making a standalone image for its own sake, so that context really matters.
Yeah. Another key difference between Midjourney and your experience is that, as far as I recall, you mainly talk to ChatGPT to generate images. This means there's always a middleman (ChatGPT) that turns your requests into prompts for DALL-E 3 to make the image. And then you ask ChatGPT to tweak stuff, which makes it write a new prompt for DALL-E 3, etc.
With Midjourney, the exact prompt that you type into the bar is what the model uses to generate the image. This gives you more direct control over the image generation, but it also sometimes means people end up using voodoo stuff like ISO and shutter speed, etc.
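You can actually see that middleman at work outside the ChatGPT UI, too. Here's a minimal sketch using the official openai Python SDK (the prompt and settings are placeholders I made up): when you send a request to DALL-E 3, the prompt is rewritten before generation, and the API returns the rewritten version in a revised_prompt field, so you can compare what you typed with what the image model was actually asked to draw.

```python
# Minimal sketch (assumes OPENAI_API_KEY is set in the environment).
# The prompt below is a made-up placeholder.
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="a cat wearing a top hat",
    size="1024x1024",
    n=1,
)

image = response.data[0]
print("What I typed:               a cat wearing a top hat")
print("What DALL-E 3 actually got:", image.revised_prompt)
print("Image URL:", image.url)
```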
Even with this being the case (with MJ using exactly the prompt you type), do you find that it's more economical (mainly talking about time here) to prompt quickly to see a "sketch" of what you're going for? I think it's kind of like the iterative approach vs "ready, aim, fire" - both are sensible approaches for different tasks, and maybe the answer here is a blend.
I've been consistently pretty vocal about my approach, which is the “minimum viable prompt” - describing what I want in as few words as necessary. And most of the time it gets me the results I want. When it doesn't, it's usually easy to see what to add/remove to fix it.
That's how I'm trying to play it, too: just create something I can see. Half of the time, that's good enough for me. The other half of the time, iterating probably takes less time, collectively, than crafting some uber-clever prompt, right?
Midjourney very often uses these terms in its own /describe outputs. What do you make of that?
Hey Charles,
That's an awesome and relevant question! Thanks for bringing it up.
While I don't know the exact inner workings of Midjourney's "describe" model, my educated guess is that what you're seeing is the result of several factors working together.
First, there are two separate models at play here. Midjourney uses a diffusion model to power the text-to-image generation (V6.1), but the /describe command (or the "i" button on the web) uses a vision model - essentially an LLM that "sees" the uploaded image and tries its best to describe it. Because the engines are different, what /describe spits out isn't guaranteed to be something the diffusion model will respond to. In fact, when Midjourney launched the original version of /describe, they stressed that the two are distinct models. You might find this old Facebook post from late 2023 helpful: https://www.facebook.com/groups/officialmidjourney/posts/657381126553455/
(That Facebook post actually addresses your exact observation under the "Camera and settings" section.)
Second, the /describe command isn't deterministic: It doesn't always generate the same description, and clicking the "refresh" button on the same image will keep producing new variations indefinitely. Combine that randomness with plain hallucination and you're likely to end up with lots of completely irrelevant terms. For instance, I just uploaded my own photo to Midjourney and used /describe, which at one point said this: "Neil Cherkis is smiling at the camera in front of a crowd on the street." I don't know who Neil Cherkis is, but I'm not him. I tried Googling his name without any luck. Then, after a refresh, /describe said "A photo of the smiling and happy Timur Khvoronov," which, again, I am not - nor is that a name Google recognizes.
All of this is to say that /describe is perfectly capable of spitting out irrelevant gibberish.
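If it helps to see the two-model split in code, here's a rough sketch with open models - to be clear, this is not Midjourney's actual stack, which isn't public. One model captions an image, a separate model generates from that caption, and because the two were trained independently (and the captioner samples its output), you get a different, not-necessarily-faithful description on every run.

```python
# Illustrative only: BLIP stands in for the "describe" side, SDXL-Turbo for
# the text-to-image side. Neither is what Midjourney actually uses.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import AutoPipelineForText2Image

# 1) "Describe": a vision-language captioner describes the uploaded image.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("my_photo.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# do_sample=True means a different caption on every run - like hitting "refresh".
caption_ids = captioner.generate(**inputs, do_sample=True, max_new_tokens=40)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print("Caption:", caption)

# 2) "Generate": a separate diffusion model turns that caption back into an
# image. It was never trained against this captioner's phrasing, so there's
# no guarantee the result resembles the original photo.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")
pipe(prompt=caption, num_inference_steps=1, guidance_scale=0.0).images[0].save("roundtrip.png")
```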
However, none of the above fully answers your "Why does it return camera settings?" observation. This brings us to the third factor...
Third, and this is a bit speculative, but there are good reasons to believe that Midjourney trains the /describe feature on pairings of prompts and the Midjourney images they produced. Note how they mentioned that the March version of /describe was made to output prompts "in the style of our community" (https://x.com/midjourney/status/1765909128344658369) - which tells me they're tapping into the hivemind of existing prompts. And, assuming that's the case, it means /describe picks up the many "ISO 1600, shot on Olympus Flip Flap Mark IV" prompts from the community, like an ouroboros eating its own garbage prompts.
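Purely as a speculative illustration of what "training on community prompts" could look like as data - none of this is anything Midjourney has published, and the file names, fields, and prompts below are made up - the training pairs might be shaped roughly like this:

```python
import json

# Hypothetical (generated image, community prompt) pairs pulled from history.
community_history = [
    {"image": "gen_001.png",
     "prompt": "portrait of an old fisherman, ISO 1600, shot on Olympus Flip Flap Mark IV, f/1.2"},
    {"image": "gen_002.png",
     "prompt": "cozy cabin in the woods, golden hour, 35mm lens, shallow depth of field"},
]

# A captioner fine-tuned on pairs like these learns to imitate community
# prompts - camera jargon included - regardless of whether those tokens
# actually steer the image model.
with open("describe_finetune.jsonl", "w") as f:
    for pair in community_history:
        f.write(json.dumps({"image": pair["image"], "caption": pair["prompt"]}) + "\n")
```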
Putting it all together, the short version is that /describe is an independent image-to-text model trained on Midjourney community prompts, but one that isn't fully aligned with the text-to-image model producing the actual images.
The best way to know what actually works is to test the prompts yourself (as I did in my post) and see whether they have an impact.
I hope that gives you some form of explanation for what might be happening!
I’m guilty of going down the path of complex prompts. It made me feel like I was putting in the work. But then I noticed something - I don't know if you've seen this as a pattern - the simplest prompt, often on its very first generation, gave me the best results.
Yup, that's exactly why I recommend the "Minimum Viable Prompt" approach, especially with image models. Going for a simple description lets the default aesthetic come through. Adding all sorts of conflicting modifiers may simply confuse the model.
So your observations are very much in line with that!
It’s very informative and interesting. Thank you, Daniel. Another few days faffing around, here I come!
Great to hear you found it useful, John! Let me know if you stumble into any interesting observations of your own.
Very useful - and I've fallen victim to the bad prompts. Now I won't.
You have seen the ISO 1600, otherwise known as "the light."
Ha. Ha.
Just a little camera humor.
Happy to hear you found it useful!