Battle of the Bands II: MusicFX vs. MusicGen vs. Stable Audio
One new contestant. Five test categories. 15 new prompts. Who will win?
Happy Thursday, sound shapers!
You know what’s been missing on Why Try AI this year?
A bit of music.
We’ve talked about how to prompt chatbots, how to create better AI images, and the concept of the “Minimum Viable Prompt.”
But text-to-music models? Not so much.
Quite a bit has changed since the first Battle of the Bands, and I figured it’s time for a follow-up.
So turn up the volume, and let’s dive in.
The contestants
Ladies and gentlemen! Welcome back to the second season of Battle of the Bands.
Allow me to introduce our three contestants:
MusicFX (Google): MusicFX, formerly known as MusicLM, is finally available to try directly in the AI Test Kitchen. As such, it will no longer get the benefit of only playing preset tunes cherry-picked by Google.
MusicGen (Meta AI): Coming back for another round, MusicGen is looking forward to going head-to-head with MusicFX in a more fairly matched fight.
Stable Audio (Stability AI): Meet our fresh challenger, who had the misfortune of joining the music scene several months after our last contest. Can this plucky newcomer hold its own against our two seasoned veterans?
Keen readers will have noticed that our previous third participant, Riffusion, isn’t making a comeback. There are two reasons for this.
First, Riffusion—with its basic “spectrogram” approach—was simply way out of its depth the last time I tried this. It would hurt my soul to put it on the spot again.
Second, Riffusion.com has since become a completely new concept focused on generating songs with lyrics, à la Suno. Perhaps we might want to see a showdown between them? But that’s a battle for another post!
So what’s the contest going to look like this time around?
The contest
As before, our contestants will compete in five separate categories.
This time, the categories are: [1]
Genres: Can our participants understand and reproduce different genres?
Instruments: Can the model mimic a given instrument?
Instrument + genre combos: Can the model seamlessly fuse instruments with genre prompts?
Settings: How well can each model creatively interpret the mood of a scene without direct guidance about genre or instruments?
Long prompts: How good are our contestants at complex prompt comprehension and adherence?
For each category, I provide three different prompts to give the models a chance to showcase their versatility (or lack thereof).
For MusicFX, I’ll be generating the audio in the AI Test Kitchen.
For MusicGen, I’ll be using this Hugging Face space.
For Stable Audio, I’ll use my free StableAudio.com account.
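If you'd like to rerun this test yourself, the whole setup boils down to a small grid: three models, five categories, three prompts per category. Here's a minimal Python sketch of that grid. The names (MODELS, PROMPTS, build_test_matrix) are my own, purely for illustration, and the code doesn't call any of the models; I generated every clip by hand in each tool's web interface.

```python
# Models and prompts exactly as used in this post.
# All variable and function names are illustrative, not part of any tool's API.
MODELS = ["MusicFX", "MusicGen", "Stable Audio"]

PROMPTS = {
    "Genres": ["Acid jazz", "Tropical house", "Classical"],
    "Instruments": ["Theremin", "Bongos", "Tuba"],
    "Instrument + genre combos": [
        "Saxophone hip-hop",
        "Electric guitar lullaby",
        "Bagpipes punk rock",
    ],
    "Settings": [
        "Deep space exploration",
        "Old Western saloon showdown",
        "Rainy day cafe",
    ],
    "Long prompts": [
        "Intense industrial metal with gritty guitar riffs and electronic "
        "distortions, creating an adrenaline-pumping soundtrack for a "
        "dystopian action scene.",
        "Ambient soundscape blending soft electronic textures with natural "
        "sounds like water and wind, designed to mimic the serene experience "
        "of a dawn walk in a misty forest.",
        "Dynamic bluegrass and techno hybrid with fast-picking banjo melodies "
        "and electronic beats, suitable for an unconventional barn dance "
        "under the stars",
    ],
}


def build_test_matrix(models, prompts):
    """Return one (category, prompt, model) tuple per clip to generate."""
    return [
        (category, prompt, model)
        for category, category_prompts in prompts.items()
        for prompt in category_prompts
        for model in models
    ]


matrix = build_test_matrix(MODELS, PROMPTS)
print(len(matrix))  # 45 clips: 5 categories x 3 prompts x 3 models
```

Looping over the matrix in this order gives you all three models back-to-back for each prompt, which is how the clips are grouped in the results.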
And now, without further ado, let’s hear them play!
The results
We’ve got 45 sound clips to power through, so brace yourselves.
1. Genres
Let’s start simple: Straightforward genre prompts with no fancy additions.
“Acid jazz”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
Oof!
I didn’t expect this many fails right out of the gate.
MusicFX does a pretty convincing job of producing a pleasant, funky track that is the closest to our prompt.
I don’t know what MusicGen is doing, but it doesn’t feel jazzy at all.
And Stable Audio…well, let’s just generously give it partial credit for delivering some form of experimental jazz played by a troop of wild apes high on acid.
“Tropical house”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
That’s better!
All three manage to produce something enjoyable to listen to.
MusicGen is the most “house” while MusicFX is the most “tropical,” for whatever my opinion is worth.
“Classical”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
Whoa, Stable Audio. Well done, pal! That actually sounds like a legitimate classical piece, even if somewhat random.
In the meantime, MusicFX shows us what it’s like to be a high-school music teacher who has to listen to students play untuned instruments without making the slightest effort to coordinate with each other.
With MusicGen, we just have a flutist showing off the superhuman capacity of their lungs.
2. Instruments
But do our models know how to play specific instruments? There’s only one way to find out…
“Theremin”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
Thanks for the blood-curdling 60s horror movie soundtrack, Stable Audio. I wasn’t planning on sleeping for the next few weeks anyway! To be fair, some theremin-esque sounds are vaguely present.
MusicGen just keeps playing the same single note, now with a different instrument that is no longer the flute but is very much not a theremin either.
MusicFX has some theremin in the background that is inexplicably overshadowed by soft synthesizer sounds.
“Bongos”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
MusicFX insists on doing way more than I ask for. Show-off! I even tried using “Bongos solo” as a prompt and still kept getting complex musical arrangements with bongos sort of present.
MusicGen produced clean bongos…that seem to have unfortunately been discovered by an unsupervised toddler.
Stable Audio gets another point for simulating a clean bongo beat. Well done!
“Tuba”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
Dammit, MusicFX. You don’t get points for throwing 17 other instruments into your German polka-inspired composition with vague hints of tuba. Stop that!
MusicGen…what in the name of unholy fuck was that? I mean, “tuba” isn’t an obscure alien instrument. It should be pretty easy to reproduce. (This wasn’t a fluke, by the way. I ran the “tuba” prompt through MusicGen several times with similar results.)
Once again, Stable Audio gets the closest to faithfully reproducing the sound of a standalone tuba, but the result is most certainly not music to our ears.
3. Instrument + genre combos
Let’s turn up the heat.
Can our text-to-music models successfully combine instruments with genres?
“Saxophone hip-hop”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
MusicFX takes the trophy here. A light hip-hop beat with hints of saxophone sprinkled in. Smooth.
MusicGen keeps doing its thing and ignoring prompts. No saxophone, and a vaguely hip-hop-ish track?
Stable Audio went so heavy on the beat that it forgot to even pretend to have a saxophone in there.
“Electric guitar lullaby”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
Again, MusicFX is the only model that actually imitates an electric guitar while also ending up with a nice track.
Both MusicGen and Stable Audio produce something approaching a lullaby with little discernible electric guitar to speak of. Stable Audio even ends up having an unpleasant out-of-tune segment to boot.
“Bagpipes punk rock”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
I’m not sure there are any winners here.
MusicFX has a decent enough track but I’m struggling to find bagpipes or punk rock in it.
MusicGen kind of gives us bagpipes…playing the same single note that MusicGen is apparently so fond of.
Stable Audio is slowly unraveling before our very ears and introducing increasingly noisy, out-of-tune movements into its tracks.
4. Settings
Now for something fun: Can our models work with prompts that don’t explicitly mention musical genres or instruments?
“Deep space exploration”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
You get a point! You get a point. Everybody gets a point for being sufficiently “spacey”!
MusicFX gave us something I can see being used in an epic documentary about the wonders of the universe.
MusicGen and Stable Audio are quite similar. Both are less “music” and more background sound effects, but they’re certainly adjacent to deep space exploration. Stable Audio could work well for a deep-space survival-horror game à la Dead Space.
“Old Western saloon showdown”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
Wow, MusicFX nails it. At least the “Western saloon music” part. Perhaps not so much the “showdown” part.
MusicGen is still stuck playing the same note with different instruments. Frankly, I’m starting to get worried about its sanity.
Stable Audio tried hard to tell a story of a Western showdown but ended up with the cacophony of sounds I’m now starting to expect from it.
“Rainy day cafe”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
I see a pattern developing.
MusicFX ends up with a pleasant track that certainly works as a backdrop to a rainy cafe scene.
MusicGen tried to disguise its favorite one-note gimmick with random atmospheric sounds thrown in, but I’m onto you, MusicGen. You ain't fooling anyone here!
Stable Audio decided “Fuck it, let’s just embrace the noise!” and gave us what sounds like a bunch of people speaking gibberish in a cafe while heavy rain pounds its windows from the outside…as heard via a sped-up cassette tape.
5. Long prompts
We’ve reached our final category. Can our contestants handle long, elaborate prompts with multiple instruments and specific descriptors? [2]
“Intense industrial metal with gritty guitar riffs and electronic distortions, creating an adrenaline-pumping soundtrack for a dystopian action scene.”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
MusicFX certainly delivers on the “adrenaline-pumping” aspect but I struggle to hear any industrial metal or guitar riffs among all the LOUD NOISES!
The second half of MusicGen’s composition is the closest we get to industrial metal and guitar riffs. Yet, we still can’t escape that single note being played throughout the entire track. What is it with you and that note, MusicGen?
Stable Audio is no longer even trying to hide the fact that it’s gone off the rails.
“Ambient soundscape blending soft electronic textures with natural sounds like water and wind, designed to mimic the serene experience of a dawn walk in a misty forest.”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
Everyone delivers, technically.
MusicFX again has the most well-rounded and melodic composition. It’s hard to identify any specific “natural sounds” but the vibe is absolutely on point.
MusicGen…what are we going to do with you, MusicGen? Is it ambient? Sure. Is it a soundscape? It is. Can you stop playing that one note as if you’d fallen asleep face-first on a synthesizer? No, you cannot.
Stable Audio gave up on trying to create discernible music, but at least the track sort of works as an ambient soundscape for a forest walk in this particular instance.
“Dynamic bluegrass and techno hybrid with fast-picking banjo melodies and electronic beats, suitable for an unconventional barn dance under the stars”
MusicFX:
MusicGen:
Stable Audio:
My thoughts:
For once, I kind of like what MusicGen’s got cooking. Reminds me vaguely of Carbon Leaf’s “Desperation Song.” I’d be curious to hear the rest of the track. No techno in sight, though.
I also dig the looping MusicFX track, even though I once again can’t detect any techno in it.
Now, is it possible for an AI music model to get tired and simply give up after generating over a dozen tracks? Because it certainly looks like that’s what’s happening to poor Stable Audio. For this final challenge, it took the basic ingredients of “music” and just threw them together haphazardly with no attempt to arrange them into any semblance of order.
Guess you’re not so stable after all, are you, Stable Audio?
Verdict and observations
Well, this is going to be a very tough call.
It’s so hard to pick a clear winn—
Just kidding: It’s MusicFX.
Obviously.
Here are my model-by-model observations:
MusicFX
MusicFX consistently produces the most pleasant-sounding tracks and, with minor exceptions, is the best at following a given prompt. It also generates rich, high-fidelity audio.
The only “downside” of MusicFX is that it’s not great at reproducing standalone sounds or instruments. It skews towards interpreting even simple prompts as requests for a complete music track. [3] That’s why, in niche cases where you might need a sound effect or a one-instrument track, Stable Audio might be the better choice.
MusicGen
To me, MusicGen was the biggest surprise and disappointment of today’s test. I recall being quite impressed with MusicGen during the first Battle of the Bands, especially its ability to creatively incorporate an input melody into a new, original track.
This time around, MusicGen mostly produced underwhelming single-note “snippets” rather than fully formed tunes. It also often missed the prompts, and the output quality wasn’t up to scratch, either.
Granted, some of this could be due to the limited “Hugging Face” implementation, so it’s hard to make an objective judgment.
Stable Audio
If you want a simple isolated sound, Stable Audio is often your best bet. In contrast to MusicFX, Stable Audio tends to interpret short prompts as sound effect suggestions rather than requests for music. So unless you explicitly specify that you’re after a music track, Stable Audio is likely to produce a soundscape instead.
Stable Audio ends up with the noisiest tracks of all three models and gets progressively worse as prompts increase in length, complexity, and specificity. Because of this, it’s the most likely to produce cacophonous, out-of-tune tracks.
Over to you…
Do you agree with my opinions about each track and the overall verdict? Which tracks stood out for you?
Have you tried using MusicGen, Stable Audio, or MusicFX on your own?
If you know of any other recent text-to-music models worth trying, let me know!
Leave a comment or shoot me an email at whytryai@substack.com.
[1] The “Melody conditioning” category isn’t making a comeback, because neither the AI Test Kitchen nor Stable Audio offers the option to provide melody prompts at this stage.
[2] Fun fact: I had ChatGPT study previous long prompts and come up with the ones I used in this category.
[3] Then again, it is called MusicFX.