I'm visiting my folks and can't really listen to horror sounds right now, but I enjoyed the descriptions of the fails!
Once you hear them, you'll see just how accurate, insightful, and genius my descriptions are. In my humble opinion.
You might wanna tag (or just email) Rudy - I know he played with music generation early on, and it was a lot of work (and also kind of uncanny-valley terrifying).
Yeah, I read his piece on Goatfury back when you first linked to it after my first "Battle of the Bands." Music AI isn't quite there yet, but it has its moments, and I think MusicFX generally ends up with something listenable. But that's a far cry from being steerable into delivering exactly what you need.
If we were told 30 years ago that we could just say words and have orchestral sounds generated, we'd be blown away. It wouldn't even matter how much the output sucked, sort of like very early Midjourney or maybe GPT-3. Just awe-inspiring if you step back a little.
For sure. No need to even go that far.
It's insane how quickly we went from having our minds blown by the original Stable Diffusion (objectively crappier than anything out there now) and ChatGPT (with GPT-3.5) to looking at something like Sora and going "Meh, it's not 100% perfect!"
It took us about 10 years to become angered by the slowness of downloading a video on the internet, right?
Attitude of gratitude 🙏
I'm blocked from trying some of these in the Netherlands. I'm really hoping to find some decent "Beatbox psytrance" one day.
I know the pain. I'm in Denmark myself. VPN is your savior, if you want to bother with it. Works like a charm for accessing MusicFX via the AI Test Kitchen. Keep me posted on how it goes!
Wow, you made my day with your comments 😂😂😂😂😂. I certainly do agree with you. Since I listened to your last battle, things have really gotten worse. I was just wondering why no one out there is interested in doing something decent. I would however say that I almost liked the last Stable Diffusion one 😁. What about music generation AI? They do quite good stuff.
Happy to make you laugh!
I actually think there are many companies working on music AI, but I guess it isn't the easiest concept to crack. Then again, I feel like MusicFX is getting there.
When you say "music generation AI" - are you referring to some specific programs? There are of course places like Mubert and Soundraw that provide beats etc., but they're also quite limited: they usually do tracks with beats in them and can't faithfully do slower genres or standalone instruments.
Then you have Riffusion and Suno that actually do a pretty good job of making believable songs complete with lyrics and instrumentals.
I found myself both nodding in agreement and bursting into laughter at your commentary. Bravo! What a great post.
Thanks Suzi, happy you found it entertaining. It was nice to try something a little different and poke fun at the poor music models.
The generative music space is going terribly. On an ACX thread, someone told me that music is hard for the same reason video is hard (many frames vs. a single image). I think video is probably advancing more quickly because of the theory that video data could produce a "General World Model".
That's an interesting insight. I can't say I've followed the music AI scene too closely; almost every other generative AI field is getting more attention. I would still expect music to somehow be easier than (silent) video, since the need for coherence in flow and motion isn't as strict. You can e.g. get away with a somewhat random melody as long as the note sequence is vaguely pleasing, but you can't have a person's right foot move forward twice and still have the video look natural. Plus there are far more moving parts in a given video frame, especially with multiple characters. But I admittedly know very little about the complexities involved.
The Bluegrass options are interesting. Bluegrass techno!!! These tools are definitely getting better. When you let a track play for a long time, is there significant variation, or does the track repeat after a while? Say, the Rainy Cafe track, for instance?
The public-facing tools I used are limited to short snippets:
- MusicFX generates exactly 30 seconds.
- MusicGen on HuggingFace only does 15.
- Stable Audio can be set to anywhere between 0 and 45 seconds, but I mostly kept it at 30. (The paid pro version allows up to 90 seconds.)
But we know for a fact that:
1) MusicFX is capable of generating tracks at least 5 minutes long.
2) They are not loops but complete, non-repeating compositions.
Check out the "Long Generation" section of the original MusicLM paper (precursor to MusicFX): https://google-research.github.io/seanet/musiclm/examples/
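For what it's worth, the 15-second cap on MusicGen seems to be a limit of the HuggingFace demo rather than the model itself. Here's a minimal sketch (untested on my end) of running MusicGen locally via Meta's open-source audiocraft library, where the duration is a parameter you set yourself; the checkpoint choice and the rainy-cafe prompt are just illustrative:

```python
# Minimal sketch: running MusicGen locally with Meta's audiocraft library.
# Assumes `pip install audiocraft` and a reasonably beefy GPU.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# 'facebook/musicgen-small' is the lightest checkpoint; medium/large trade speed for quality.
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Duration is configurable here, unlike the fixed demo limits. A single pass
# tops out around 30 seconds; longer pieces need continuation tricks.
model.set_generation_params(duration=30)

# One prompt in, one clip out (illustrative prompt).
wavs = model.generate(['sounds of a rainy cafe with soft jazz piano'])

for i, wav in enumerate(wavs):
    # Writes clip_0.wav etc., with loudness normalization applied.
    audio_write(f'clip_{i}', wav.cpu(), model.sample_rate, strategy="loudness")
```

Not exactly a five-minute composition either, but at least the knob is in your hands rather than the demo's.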
Great post! It's hysterical how far off the rails Stable Audio became. Poor chap.
I liked your test criteria, though I wonder how they would have responded with a specific artist as a prompt. Could you have tested them with something like "Michael Jackson" or "Metallica" or do these models have safeguards against mimicking artists?
Originally, the idea of testing them with artists didn't even cross my mind. Each one of them is billed specifically as an instrumental music model, and none of the dozens of demo prompts and tracks showcase their ability to render vocals and lyrics. (That's why I distinguish between them and Riffusion and Suno.)
But after your comment, I went ahead and tried a few artist prompts in MusicFX, and yup: That's explicitly against their content policy, so they gave an error. Here's what it says: "MusicFX features precautions to protect artist voices and styles so certain queries that mention specific artists or include vocals will not be generated."
I gave it a try myself, and while they seem to have a library of excluded artist names, song names were another story. Put "Yesterday" into MusicFX and it clearly sounds like a reconstituted MIDI version of the Beatles song. Makes you wonder if anything was off limits in the training data...
That's interesting. I'm guessing it's similar to training LLMs, in that they use whatever's available for pre-training and add all the guardrails afterwards in fine-tuning, etc. I just gave "Yesterday" a shot, and I can't hear much resemblance to the original, apart from the fact that it's a soft piano track. (MusicFX generates two versions at a time, and neither of them is recognizable as the melody from "Yesterday.")