The Surprising Parallels Between Text-To-Image AI Models & Mobile Operating Systems
Why I think Stable Diffusion, Midjourney, and DALL-E can be loosely compared to Android, iOS, and Windows Phone, respectively. See if you agree!
Good day, fellow co-inhabitants of this timeline in the multiverse!
First off, thank you for the overwhelmingly positive reception of my new Sunday segment, 10X AI. Judging by the poll votes and Substack engagement stats, it’s been a success. I’m motivated to continue digging up more newsy things every Sunday.
Today, I want to finally put into words something that’s been rattling around in my brain for months now. It grew from a curious observation into a pretty well-formed opinion.
That opinion is: There are strong parallels between the three main text-to-image models and smartphone operating systems.
Like so:
Stable Diffusion (Stability AI)…is like Android (Google)
Midjourney (Midjourney, Inc)…is like iOS (Apple)
DALL-E (OpenAI)…is like Windows Phone (Microsoft)
Just for fun (and to have angry people screaming that I’m wrong in the comments), I wanted to discuss the similarities I see between the three AI models and their smartphone counterparts.
Who knows? It might even help you understand the current text-to-image landscape.
“But Daniel, what about [this other model]?”
As far as I can tell, the above three represent the lion’s share of public-facing models.
Notable mentions:
There’s Imagen by Google, which, judging by Google’s favorite game of “let’s never actually release anything, ever,” we likely won’t see for a while.
There are also DeepAI and Craiyon. Craiyon (formerly DALL-E Mini) is an open-source model inspired by the original DALL-E, and DeepAI may be using its own proprietary model. Neither is as widely adopted as the big three. (If anyone has better info, please let me know so I can correct it.)
There’s BlueWillow, which uses a Midjourney-inspired Discord interface but appears to be an aggregation of models based on Stable Diffusion.
Finally, there’s Adobe Firefly, which actually has a chance to become a serious contender within generative AI. For now, it’s still in beta, and early impressions indicate it lags slightly behind in output quality.
In short, much like you can find minor smartphone players, there may be niche text-to-image models out there that don’t get as much traction (yet).
For this post, I’m focusing exclusively on The Big Three.
1. Stable Diffusion: the Android of the bunch
Stable Diffusion was the model that seduced me into exploring the generative AI scene in the first place. It hit the market in late August 2022.
Since then, it has been updated several times and has spawned hundreds of specially trained spinoff versions and projects built on its core source code.
Here’s what makes Stable Diffusion similar to Android.
Open-source
Just like Android, Stable Diffusion is fully open-source. You can look under the hood. You can download the entire model, use it locally, dissect it, tweak it, train it on your own data, build integrations on top of it, and bake homemade pancakes with it (citation needed).
Customizable and moddable
Because of the above, Stable Diffusion is infinitely customizable and moddable. There are countless pre-trained models based on SD. It’s similar to how Android serves as the foundation for third-party smartphone operating systems like OxygenOS by OnePlus.
Not platform-bound
Stability AI does have its own site that lets you create art with Stable Diffusion. It’s called DreamStudio…and you absolutely don’t have to use it, just like you don’t need a Google Pixel phone to use Android.
There are hundreds of third-party sites and apps that let you create with Stable Diffusion. Here are just a few of them:
NightCafe (which also lets you use DALL-E)
Catbird (runs the same prompt through dozens of different SD models)
…and so, so many more.
The “geekier” option
Put simply, Stable Diffusion puts a lot of power and control in the hands of the end user. You can adjust almost any aspect of image generation, from the sampler used to the number of generation steps to the level of influence your text prompt should have over the end result.
Similarly, Android (especially a rooted phone) offers many ways to tweak and customize things to your exact needs. But it’s a bit of a double-edged sword: with so many buttons to press and knobs to turn, there’s a steeper learning curve to getting the exact results you want.
This makes both Android and Stable Diffusion appeal to the slightly more tech-savvy crowd looking for that additional level of control.
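To make this concrete, here’s a minimal sketch of what those knobs look like when you run Stable Diffusion locally through Hugging Face’s diffusers library (the model ID, sampler choice, and parameter values are just illustrative):

```python
# A minimal sketch of local Stable Diffusion generation via Hugging Face's
# diffusers library. Model ID and parameter values are illustrative.
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the sampler (scheduler) -- one of the many knobs SD exposes.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a cozy cabin in a snowy forest, oil painting",
    num_inference_steps=30,  # number of generation steps
    guidance_scale=7.5,      # how strongly the prompt steers the result
).images[0]
image.save("cabin.png")
```

A dozen lines in, and we’ve already touched a sampler, a step count, and a guidance scale. That’s the kind of control (and complexity) I’m talking about.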
Vibrant but fragmented ecosystem
Thanks to its open-source nature, Stable Diffusion has spawned a massive ecosystem of passionate tweakers and builders using its code to do awesome things.
There’s ControlNet, which gives you precise control over posing your subjects and structuring your scene. There’s Deforum, which lets you create animated scenes by generating and interpolating a series of Stable Diffusion images. And there’s InstructPix2Pix, which lets you edit images via text commands.
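To give you a taste, here’s roughly what a ControlNet workflow looks like through the diffusers library. The model IDs are real public checkpoints, but the pose file name is a hypothetical placeholder:

```python
# A rough sketch of ControlNet guidance: an OpenPose skeleton image steers
# the pose of the generated subject. The file name is a placeholder.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

pose = load_image("pose_reference.png")  # a pre-extracted OpenPose skeleton

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The skeleton constrains the pose; the prompt handles everything else.
image = pipe("a ballet dancer on a stage", image=pose).images[0]
image.save("dancer.png")
```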
And those are just the tip of the iceberg. There’s so much to explore…
…which, again, is both a blessing and a curse. Newcomers may find the sheer amount of options overwhelming and wish for a self-contained, plug-and-play solution.
Which is why they might end up turning to…
2. Midjourney and the iOS experience
Midjourney’s been around since early 2022, but it didn’t truly start to impress people until Version 4 came out in November of that year.
At the time of writing—with the launch of V5—Midjourney is arguably the best, most beginner-friendly text-to-image model on the market.
So what makes Midjourney similar to iOS?
Perfect for beginners
As Steve Jobs famously liked to say about Apple products, they “just work.”
And while you can frequently find sarcastic “It just works” memes of iOS failing, it’s hard to deny that part of its broad appeal is the plug-and-play factor. With iPhones, you know exactly what you get, and you know they’ll work right out of the box.
As I argued before, this is also true for Midjourney. Even the simplest, one-word prompts are likely to spit out impressive images by default.
In Midjourney, the baseline result is simply more polished than in Stable Diffusion.
So if you’re just starting out with text-to-image, Midjourney is far more likely to give you something that “just works.”
Focus on the core experience rather than variety or number of features
With Apple and iOS, there’s only one phone: the iPhone. There are many iPhone models, sure, but the core experience and interface are the same across the board.
Similarly, Midjourney has released five versions of its model so far, but the Discord interface and the way you interact with it have stayed largely the same.
In the same vein—and in contrast to Stable Diffusion—Midjourney users are limited to a few core commands and parameters if they want to exert some influence over their images. There aren’t nearly as many buttons, sliders, and tick boxes to play with.
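For illustration, here’s what a fairly “advanced” Midjourney prompt looks like; it’s still a single Discord command with a handful of parameters tacked on (the values here are just examples):

```
/imagine prompt: a red fox in a snowy forest --ar 3:2 --v 5 --stylize 250
```

That’s more or less the ceiling of Midjourney’s configurability, compared to the wall of sliders a typical Stable Diffusion interface gives you.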
Once again, this is something that beginners will cherish and advanced users might consider a limitation.
Closed platform
Do you want to use iOS? You’ll need an iPhone.
Want to try Midjourney? Their Discord server is the place to go.
Right now, Midjourney’s Discord channel is the only way to use their AI model.
There are a few third-party sites offering a sort of GUI that in turn interacts with Midjourney’s Discord bot. But other than that, Midjourney is very much a self-contained, closed-source platform.
You could argue that Midjourney is even more closed off than iOS. Apple at least has the App Store for third-party apps. The Midjourney team, on the other hand, are reluctant to release even a limited official API.
So it’s all in-house for now.
3. Why DALL-E is (a bit) like Windows Phone
Right off the bat, I’ll admit that this is the iffiest comparison of the three.
After all, Windows Phone went the way of the dodo (may it rest in peace), while DALL-E is very much alive and kicking. It even powers AI image generation in Microsoft’s Bing Chat and other products.
Still, let’s see if we can shoehorn this last contender into our analogy.
(The fact that DALL-E is a model from OpenAI, which closely partners with Microsoft in the AI space, is just a curious coincidence in this case.)
Early mover
As strange as it may sound, the predecessor to Windows Phone, Windows Mobile, was actually out in 2003, well ahead of both iOS (2007) and Android (2008). Along with Nokia’s Symbian, Windows Mobile dominated the early smartphone market.
OpenAI’s first version of DALL-E came out in January 2021, an entire year before either Midjourney or Stable Diffusion was on the map. (A century in AI years.)
DALL-E was the first text-to-image model to truly capture the public’s attention, with deep dives like this one singing its praises. OpenAI even offered a limited demo that allowed users to pick from several combinations of pre-selected subjects and styles.
The first time I personally heard of the existence of image-generating AI was in an article about DALL-E.
Failure to stay relevant
In July 2022, OpenAI gradually made their newer DALL-E 2 available to a limited number of waitlist users in a closed beta.
But by then, they were no longer the only noteworthy player in the text-to-image space.
Midjourney V3 dropped at just about the same time. Crucially—in contrast to OpenAI’s closed beta approach—Midjourney was available to anyone willing to pay the monthly fee and even had a free trial that let people generate a limited number of test images.
One month later, Stability AI released a next-generation, open-source Stable Diffusion model that anyone with a powerful enough computer could download for free and run in the comfort of their own home.
The game changed dramatically!
Despite this, OpenAI only dropped the waitlist for DALL-E 2’s closed beta in late September 2022, more than a month after Stable Diffusion’s public release. (A decade in AI years, to stick to our ultra-scientific scale.)
As I write this, DALL-E is the most outdated of the big three, lagging behind both Stable Diffusion and Midjourney in terms of output quality.
All of this mirrors how Windows Phone never gained momentum and was ultimately discontinued, constantly two steps behind the competition.
Note: This might not be a fair comparison, since OpenAI didn’t necessarily pursue the goal of having the most popular text-to-image model. (Microsoft was explicitly trying to fight for market share with the Windows Phone.)
Over to you…
What do you think? Does this comparison hold true to some extent? Am I way off? Did I miss other obvious overlaps that should be included?
I’m curious to hear your thoughts and will happily adjust the article to incorporate any relevant input. Leave a comment on the site or shoot me an email (reply to this one).
Stay tuned for Sunday’s 10X AI issue!