Battle of the Bands: MusicLM vs. MusicGen vs. Riffusion
Comparing music samples generated by three recent text-to-music AI models. How does MusicLM fare against MusicGen and Riffusion?
Hey kids, do you like music?
How about music made by cold, unfeeling AI models pre-trained on millions of data points?
Well, I sure hope you’ve answered “Yes” to both questions, because today we’ll be listening to a whole lotta text-to-music shenanigans.
Welcome to the Battle of the Bands, AI edition!
Let’s meet the contestants.
Our contestants
Ladies and gentlemen…
In the blue corner, we have the powerful-but-mysterious MusicLM from Google. MusicLM has composed many impressive songs, but you probably haven’t heard any of them personally. They, uh, only play them in Canada.
In the red corner is the up-and-coming MusicGen by Meta. Unlike MusicLM, MusicGen has nothing to hide and is willing to open its heart source to just about anyone. And it takes requests: Whistle a tune, and MusicGen will turn it into a track.
In the green corner of this oddly triangular ring is our last contestant: Riffusion by Seth Forsgren and Hayk Martiros. What it lacks in star power, Riffusion more than makes up for with its unique talent: It doesn’t just hear music, it sees it!
The contest
Let me make one thing clear right off the bat: This won’t be a fair fight.
The game is very much rigged in favor of MusicLM.
You see, even though I’m on the waitlist, I still don’t have access to Google’s AI Test Kitchen. This means I can’t feed MusicLM new prompts of my own.
So for this comparison, I’ll be relying on the existing demo samples from the MusicLM research paper. In other words, I’ll be comparing how MusicGen and Riffusion perform against handpicked MusicLM tracks that Google researchers deemed worthy of including in their model demo.
For MusicGen, I’ll be using a combination of this Hugging Face demo and this Google Colab to generate the samples. Finally, I’ll be generating the Riffusion samples directly on their site.
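(For anyone who wants to reproduce the MusicGen side at home, Meta’s open-source audiocraft library handles it in a few lines. Here’s a minimal sketch; the model size, duration, and prompts are just illustrative choices, not anything this comparison depends on.)

```python
# Minimal MusicGen sketch using Meta's open-source audiocraft library
# (pip install audiocraft). Model size and duration are arbitrary choices.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio per prompt

prompts = ["dream pop", "breakbeat", "minimal house"]
wavs = model.generate(prompts)  # one waveform tensor per prompt

for prompt, wav in zip(prompts, wavs):
    # Saves e.g. "dream pop.wav", loudness-normalized
    audio_write(prompt, wav.cpu(), model.sample_rate, strategy="loudness")
```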
This is by no means a robust test of each model’s true capabilities. But it is a fun demo of their output and individual features.
Our contestants will compete in the following five categories:
Short genre prompts
Settings prompts
Simple instrument prompts
Rich caption prompts
Melody conditioning (Riffusion excluded)
Without further ado, let’s get to it!
Put those headphones on and enjoy the ride.
The results
Below are the chosen prompts per category, followed by the three models’ output. (Riffusion limits downloadable samples to just 5 seconds, so I also link to their site for anyone who wants to hear more of each track.)
1. Short genre prompts
These are very broad prompts to see how the models handle different music genres.
“Dream pop”
MusicLM:
MusicGen:
Riffusion (more here):
Honestly, I think all three samples are generally close to the definition of dream pop. Riffusion doesn’t quite match the production quality of the other two models, but that’s going to be a recurring theme due to the model’s inherent limitations.
“Breakbeat”
MusicLM:
MusicGen:
Riffusion (more here):
MusicLM and MusicGen are pretty close on this one, though MusicGen has more of a Crystal Method vibe to it. Riffusion is too slow and halting for a breakbeat sample, if you ask me. (Which nobody did.)
“Minimal house”
MusicLM:
MusicGen:
Riffusion (more here):
Same as above: Points to MusicLM and MusicGen, but not so much to Riffusion.
2. Settings prompts
These prompts are far more vague and open to interpretation, so there’s no way to evaluate them objectively. I’ll go ahead and be subjective then.
“Escaping prison”
MusicLM:
MusicGen:
Riffusion (more here):
I could sort of see myself escaping prison to the first two samples. Riffusion’s take would make me fall asleep in the tunnel under my cell. Not a good escape track.
“Street performance”
MusicLM:
MusicGen:
Riffusion (more here):
Holy shit. I never thought I’d say this, but my vote goes to Riffusion on this one.
Sure, MusicLM’s sample might technically be both “street” and “performance,” but that’s not a performance I want to be anywhere close to. Sounds like a bunch of clowns aggressively tuning their toy instruments at the same time. I don’t know what the hell MusicGen has going on, but that ain’t no street performance.
As for Riffusion? I can actually see a street pianist playing something like that. Maybe that was an accident, but hey, it still counts.
“Underground rave”
MusicLM:
MusicGen:
Riffusion (more here):
Oh man, MusicGen nailed this one! That’s exactly what I’d expect to hear in any Hollywood movie during a nightclub scene.
In contrast, most of MusicLM’s sample is straight out of a knock-off sidescroller called Superb Marko Siblings you’d find in a bargain bin at your local supermarket.
As for Riffusion, thanks for participating, I guess?
3. Simple instrument prompts
Okay, this should be easy. Just get the instrument’s sound approximately right. Right?
“Harp”
MusicLM:
MusicGen:
Riffusion (more here):
Both MusicLM and MusicGen do a good job, with maybe a slight edge to MusicLM?
Riffusion, you’ll get there. Someday.
“Trumpet”
MusicLM:
MusicGen:
Riffusion (more here):
MusicLM sounds the most authentic, but I feel MusicGen conveys the “spirit” of the trumpet better. Also, this time we have Riffusion giving us what actually sounds like a trumpet…played by the guy behind Shittyflute.
“Xylophone”
MusicLM:
MusicGen:
Riffusion (more here):
MusicLM = xylophone.
MusicGen = toy xylophone?
Riffusion = it’s the thought that counts!
4. Rich captions
This challenge added yet another limitation for Riffusion, as its input field couldn’t accommodate the entire prompt. But we soldier on nonetheless!
“A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable.”
MusicLM:
MusicGen:
Riffusion (more here):
An overwhelming win for MusicLM on this one! The track not only follows instructions but is actually pretty great in its own right (except perhaps for the odd chants at the end).
I might even hand second place to Riffusion here for getting the vibe right-ish.
“Meditative song, calming and soothing, with flutes and guitars. The music is slow, with a focus on creating a sense of peace and tranquility.”
MusicLM:
MusicGen:
Riffusion (more here):
Yay! Everyone (even Riffusion, kind of) understood the assignment. I’d actually like to hear more of the MusicGen track beyond the 8-second sample.
“Epic soundtrack using orchestral instruments. The piece builds tension, creates a sense of urgency. An a cappella chorus sing in unison, it creates a sense of power and strength.”
MusicLM:
MusicGen:
Riffusion (more here):
Frankly, I’m not hearing orchestral instruments in any of these. But MusicLM does get the rising tension, the choir, and the sense of urgency right. Points to MusicLM!
5. Melody conditioning
I could only do this test for MusicLM and MusicGen as Riffusion doesn’t support melody prompting. To get as close to MusicLM’s results as possible, I used the exact input files from the MusicLM demo site to condition MusicGen.
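(For the tinkerers: audiocraft exposes melody conditioning through its musicgen-melody checkpoint. Below is a rough sketch of how the reference audio goes in; the filename and prompt are stand-ins for the actual MusicLM demo files, not the exact values I used.)

```python
# Hedged sketch of melody conditioning with audiocraft's melody checkpoint.
# "bella_ciao.wav" is a placeholder for an input file from the MusicLM demo.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=8)

melody, sr = torchaudio.load("bella_ciao.wav")
# The text prompt steers the style; the waveform conditions the melody.
wavs = model.generate_with_chroma(
    descriptions=["tribal drums and flute"],
    melody_wavs=melody[None],  # add a batch dimension
    melody_sample_rate=sr,
)
audio_write("tribal_bella_ciao", wavs[0].cpu(), model.sample_rate, strategy="loudness")
```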
“Tribal drums and flute + Bella Ciao”
MusicLM:
MusicGen:
MusicLM sure parrots the melody faithfully, but damn, that deep tribal drum beat on the MusicGen track is fire!
“Electronic synth lead + Twinkle Twinkle Little Star”
MusicLM:
MusicGen:
Again, I like the MusicGen track a whole lot more, even though MusicLM’s is technically a more accurate reproduction of the input melody.
“Jazz with saxophone + Jingle Bells”
MusicLM:
MusicGen:
I know what it is!
MusicLM always goes all in on mimicking the melody at the expense of the general vibe. MusicGen treats the input melody as a broad suggestion but is actually better at capturing the rest of the prompt’s intent.
Observations
Phew!
That’s quite a lot of music samples. (Exactly 42, but who’s counting?!)
What have we learned today?
For what it’s worth, here’s what I have:
Sound quality
Both MusicLM and MusicGen are solid when it comes to the quality of audio. The chosen instruments sound authentic most of the time, too. Riffusion is definitely weaker on this front, with generally muffled, low-quality output.
Prompt adherence
With a few exceptions, MusicLM and MusicGen successfully honor any given prompt. This even holds for longer prompts with lots of context and instructions. (MusicLM nudges out MusicGen slightly on these lengthy prompts.)
Riffusion really struggles to stick to the task. Much of it is probably due to the nature of the model. Riffusion uses a starting “seed image” for its generations, which adds a baseline sound to every track, resulting in unwanted artifacts.
Sometimes, playing around with the seed image gets you closer to the intended feel, but it’s mostly hit-and-miss.
Melody reproduction
This is a funny one.
There’s no doubt that MusicLM is better at accurately reproducing the input melody.
But I personally prefer the way MusicGen handles melody conditioning. The input melody influences the final outcome but is incorporated in a more subtle way.
To borrow Tim Urban’s “the cook and the chef” analogy, MusicLM is a great cook: It follows the recipe to the letter, so you always know exactly which dish you’ll get.
MusicGen, on the other hand, is a chef. It draws inspiration from the recipe but then remixes it into fusion cuisine with its own unique flavor.
Final thoughts
MusicLM and MusicGen are robust models, capable of impressive output. I can’t wait to be let into the AI Test Kitchen so I can test MusicLM with new prompts.
Riffusion sure has a fun gimmick going for it: It doesn’t generate music directly. Instead, the model first creates a spectrogram image using Stable Diffusion, which it then converts into an audio clip.
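(If you’re wondering what “converting a spectrogram image into an audio clip” actually involves: the image only stores magnitudes, so the phase has to be estimated, which Riffusion does with the Griffin-Lim algorithm. Here’s a bare-bones illustration of that step; it’s not Riffusion’s actual code, and the filename, FFT size, and scaling are all placeholder assumptions.)

```python
# Bare-bones sketch: recovering audio from a grayscale spectrogram image
# via Griffin-Lim phase reconstruction. Not Riffusion's actual code.
import numpy as np
import torch
import torchaudio
from PIL import Image

N_FFT = 1024          # assumed FFT size; image height must be N_FFT // 2 + 1
SAMPLE_RATE = 44100   # assumed sample rate

# Pixel brightness stands in for magnitude at each (frequency, time) bin.
img = Image.open("spectrogram.png").convert("L")
magnitudes = np.asarray(img, dtype=np.float32) / 255.0

# Undo whatever compression the image encoder applied (placeholder scaling).
magnitudes = magnitudes ** 2

# Griffin-Lim iteratively estimates the phase the image threw away.
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=N_FFT, n_iter=32)
waveform = griffin_lim(torch.from_numpy(magnitudes))

torchaudio.save("output.wav", waveform.unsqueeze(0), SAMPLE_RATE)
```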
But Riffusion clearly lags behind MusicGen and MusicLM when it comes to the quality and accuracy of output.
Over to you…
What do you think? Do you agree with my subjective opinions about each track? Have you tried playing more closely with either MusicGen or MusicLM? Do you know of any other recent text-to-music models I could try?
Leave a comment on the site or shoot me an email (reply to this one).