
Wow. I’m guessing it’s generating MIDI or something rather than synthesizing audio from scratch? Even so, the quality of the score is leaps and bounds better than any of the long-form audio on the Stable Audio demo page (either Stable Audio itself or the other models).

The audio model outputs seem to take a sequence of 1 to 3 chords, add a barebones melody on top, and basically loop this over and over. When they deviate from the pattern, it feels unplanned and chaotic, and they often just snap back to the pattern without resolving the idea added by the deviation. (Either that or they completely change course and forget what they were doing before.) Yes, EDM in particular often has repetitive chord structures and basic melodies, but it’s not that repetitive.

In comparison, from listening to a few suno.ai outputs, they reliably have complex melodies and reasonable chord progressions. They do tend to be repetitive and formulaic, but the repetition comes on a longer time scale and isn’t as boring. And they do sometimes get confused and randomly set off in a new direction, but not as often. Most of the time, the outputs sound like real songs. Which is not something I knew AI could do in 2024.


My understanding is that they are using a side effect of the Bark model. The comment https://news.ycombinator.com/item?id=35647569 from JonathanFly explains it well: if you train a model on a massive amount of audio mixes of lyrics+music, then prompting with lyrics alone pulls the music along with it, much like that comment's suggestion that prompting with context-correlated text pulls in the background noises typical of that context. Already while writing this, I can imagine training on a huge set of publicly performed poetry, which would let you generate novel performances by artificial poets from novel prompts. This is different from the riffusion.com approach, where the genius idea is to more or less feed spectrograms as images to Stable Diffusion.
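Roughly, the trick looks like this. A minimal sketch using the public suno-ai/bark API (generate_audio, preload_models and SAMPLE_RATE are from that repo's README, and the ♪ markers are the README's own hint for singing); how Suno's hosted product actually works is of course unknown:

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the Bark models

    # Lyrics only, no description of instruments or style: the ♪ marks
    # nudge Bark toward singing, and backing music often comes along
    # "for free" because the training audio mixed vocals and music.
    lyrics = "♪ In the jungle, the mighty jungle, the lion barks tonight ♪"
    audio = generate_audio(lyrics)

    write_wav("sung_lyrics.wav", SAMPLE_RATE, audio)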


I don't have any special insight into how it works, but I suspect it is largely synthesizing audio from scratch. The more I've thought about it, the more the task of generating music feels like the task of text-to-speech with realistic intonation, so it seems the same techniques would be applicable.

Suno do have an open source repo here that presumably uses similar tech: https://github.com/suno-ai/bark

> Bark was developed for research purposes. It is not a conventional text-to-speech model but instead a fully generative text-to-audio model, which can deviate in unexpected ways from provided prompts. Suno does not take responsibility for any output generated. Use at your own risk, and please act responsibly.
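For reference, the repo's basic usage is roughly the following; the bracketed [laughs] cue and the v2/en_speaker_6 history_prompt preset are taken from the README's examples, not from anything Suno has published about the hosted product:

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()

    # Plain text in, raw waveform out; non-speech cues like [laughs]
    # are rendered as audio rather than read aloud, which is where the
    # "can deviate in unexpected ways" warning comes from.
    prompt = "Hello, my name is Suno. [laughs] And I like pizza."
    audio = generate_audio(prompt, history_prompt="v2/en_speaker_6")

    write_wav("bark_generation.wav", SAMPLE_RATE, audio)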

I've generated probably >200 songs now with Suno, of which perhaps 10 have been any good, and I can't detect any pattern in terms of the outputs.

Here's another one which is pretty good. I accidentally copied and pasted the prompt and lyrics, and it's amazing to me how 'musically' it renders the prompt:

https://app.suno.ai/song/d7bad82b-3018-4936-a06d-8477b400aae...

Here are a couple more which are pretty good (I use it primarily for making fun songs for my kids):

https://app.suno.ai/song/a308ca8a-9971-47a3-8bb3-a95126ff1a8...

https://app.suno.ai/song/3b78a631-b52a-4608-a885-94f2edc190b...

And this one's kind of interesting in that it can render 'Gregorian chant' (I mean, it's not very good): https://app.suno.ai/song/0da7502b-73cf-4106-88e8-26f4f465a5f...

But this is one reason it feels like these models are very similar to text-to-speech, just with a different training set.



