
Wow. I’m guessing it’s generating MIDI or something rather than synthesizing audio from scratch? Even so, the quality of the score is leaps and bounds better than any of the long-form audio on the Stable Audio demo page (either Stable Audio itself or the other models).

The audio model outputs seem to take a sequence of 1 to 3 chords, add a barebones melody on top, and basically loop this over and over. When they deviate from the pattern, it feels unplanned and chaotic, and they often just snap back to the pattern without resolving the idea added by the deviation. (Either that or they completely change course and forget what they were doing before.) Yes, EDM in particular often has repetitive chord structures and basic melodies, but it’s not that repetitive.

In comparison, from listening to a few suno.ai outputs, they reliably have complex melodies and reasonable chord progressions. They do tend to be repetitive and formulaic, but the repetition comes on a longer time scale and isn’t as boring. And they do sometimes get confused and randomly set off in a new direction, but not as often. Most of the time, the outputs sound like real songs. Which is not something I knew AI could do in 2024.


My understanding is that they are using a side effect of the Bark model. The comment https://news.ycombinator.com/item?id=35647569 from JonathanFly explains it well: if you train a model on a massive amount of audio mixes of lyrics+music, then prompting with lyrics alone pulls the music along with it, much like that comment's suggestion that prompting with context-correlated text pulls in the background noises typical of that context. Already while writing this, I can imagine training on a huge set of publicly performed poetry, which would let you generate novel performances by artificial poets from novel prompts. This is different from the riffusion.com approach, where the genius idea is to more or less feed spectrograms as images to Stable Diffusion.
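Roughly, the trick looks like this. A minimal sketch using the public suno-ai/bark API (generate_audio, preload_models and SAMPLE_RATE are from that repo's README, and the ♪ markers are the README's own hint for singing); how Suno's hosted product actually works is of course unknown:

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the Bark models

    # Lyrics only, no description of instruments or style: the ♪ marks
    # nudge Bark toward singing, and backing music often comes along
    # "for free" because the training audio mixed vocals and music.
    lyrics = "♪ In the jungle, the mighty jungle, the lion barks tonight ♪"
    audio = generate_audio(lyrics)

    write_wav("sung_lyrics.wav", SAMPLE_RATE, audio)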


I don't have any special insight into how it works, but I suspect it is largely synthesizing audio from scratch. The more I've thought about it, the more the task of generating music feels like the task of text-to-speech with realistic intonation, so it seems the same techniques would be applicable.

Suno do have an open source repo here that presumably uses similar tech: https://github.com/suno-ai/bark

> Bark was developed for research purposes. It is not a conventional text-to-speech model but instead a fully generative text-to-audio model, which can deviate in unexpected ways from provided prompts. Suno does not take responsibility for any output generated. Use at your own risk, and please act responsibly.
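For reference, the repo's basic usage is roughly the following; the bracketed [laughs] cue and the v2/en_speaker_6 history_prompt preset are taken from the README's examples, not from anything Suno has published about the hosted product:

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()

    # Plain text in, raw waveform out; non-speech cues like [laughs]
    # are rendered as audio rather than read aloud, which is where the
    # "can deviate in unexpected ways" warning comes from.
    prompt = "Hello, my name is Suno. [laughs] And I like pizza."
    audio = generate_audio(prompt, history_prompt="v2/en_speaker_6")

    write_wav("bark_generation.wav", SAMPLE_RATE, audio)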

I've generated probably >200 songs now with Suno, of which perhaps 10 have been any good, and I can't detect any pattern in terms of the outputs.

Here's another one which is pretty good. I accidentally copied and pasted the prompt and lyrics, and it's amazing to me how 'musically' it renders the prompt:

https://app.suno.ai/song/d7bad82b-3018-4936-a06d-8477b400aae...

Here are a couple more which are pretty good (I use it primarily for making fun songs for my kids):

https://app.suno.ai/song/a308ca8a-9971-47a3-8bb3-a95126ff1a8...

https://app.suno.ai/song/3b78a631-b52a-4608-a885-94f2edc190b...

And this one's kind of interesting in that it can render 'Gregorian chant' (I mean, it's not very good): https://app.suno.ai/song/0da7502b-73cf-4106-88e8-26f4f465a5f...

But this is one reason it feels like these models are very similar to text-to-speech, just with a different training set.



