humbledrone's comments

Hey everyone, it worked, I had a super productive conversation with exactly the right person on the Metal team! Thanks for helping me get Apple's attention. I didn't at all expect this amount of support.

https://anukari.com/blog/devlog/productive-conversation-appl...


>While I can't share any technical details... The engineer provided some suggestions and hints that I can use right now to maybe — just maybe — get things working in the short term

Great that you have a workaround now, but the fact that you can't even share what the workaround is ironically speaks to the last line in https://news.ycombinator.com/item?id=43904921 about how Apple communicates:

>there’s this trick of setting it to this but then change to that and it’ll work. Undocumented but now you know

When you do implement the workaround, maybe you could put it in an overtly-named function spottable via disassembly, so that others facing the same latency-sensitive GPU constraints have some lead as to the magic incantation to use?
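For example, something like this (function name entirely hypothetical), so the symbol survives into the shipped binary:

    // Hypothetical: keep the workaround in an exported, non-inlined
    // symbol so it shows up by name in a disassembly or symbol table.
    extern "C" __attribute__((noinline))
    void AnukariMetalClockWorkaround() {
        // ... the magic incantation goes here ...
    }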


I enjoy the idea, but I probably won't do that. Anyone who actually has this same problem should certainly reach out to me, though, and I'll put them in touch with the right folks at Apple, who can share the info.


Once again, HN has fulfilled its true purpose: cutting through the red tape placed in front of every large corporation's customer support.

Congratulations and good luck with your project!


I attempted to preempt your question in the section of my blog post titled "Why don't you just pipeline the GPU code so that it saturates the GPU?" It's one of the less-detailed sections, though, so maybe you have further questions. The main thing is that since Anukari processes inputs like MIDI and audio in real time, it can't work ahead of the CPU, because those inputs don't exist yet.

Possibly what you describe is a bit more like double-buffering, which I also explored. The problem here is latency: any form of N-buffering introduces additional latency. This is one reason why some gamers don't like triple-buffering for graphics, because it introduces further latency between their mouse inputs and the visual change.

Furthermore, when the GPU clock rate is too low, double-buffering and pipelining don't help anyway, because fundamentally Anukari has to keep up with real time, and every block it processes depends on the previous one. With a fully-lowered GPU clock, the issue really does become one of throughput and not just latency.
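To make the serial dependency concrete, here's a minimal sketch of the shape of the problem (hypothetical names, not Anukari's actual code): block N+1 needs both the freshest MIDI and the simulation state left behind by block N, so there is nothing to compute ahead of time:

    #include <vector>

    struct MidiEvent { int sampleOffset; int note; float velocity; };
    struct SimState  { std::vector<float> pos, vel; };

    // Hypothetical stand-ins for the real DSP:
    void applyMidi(const std::vector<MidiEvent>&, int, SimState&) {}
    void stepPhysics(SimState&) {}
    float renderSample(const SimState&) { return 0.0f; }

    void audioCallback(const std::vector<MidiEvent>& midi,  // only exists *now*
                       float* out, int numSamples,
                       SimState& state) {                   // mutated every block
        for (int i = 0; i < numSamples; ++i) {
            applyMidi(midi, i, state);   // live input, sample-accurate
            stepPhysics(state);          // depends on the previous step
            out[i] = renderSample(state);
        }
        // 'state' is now the input to the next block; no future block
        // could have been computed before this point.
    }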


It's a real-time audio app: if it falls behind real time, there's no audio. You get clicks, pops, and the whole thing becomes unusable. If the user is running audio at 48 kHz, the budget is 1/48,000 seconds per sample on average, and realistically somewhat less than that to leave headroom for variance and overhead.
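To put rough numbers on it (buffer sizes here are just illustrative):

    48,000 samples/sec         -> ~20.8 µs per sample
     64-sample block @ 48 kHz  -> ~1.33 ms deadline per block
    512-sample block @ 48 kHz  -> ~10.7 ms deadline per block

Miss a deadline once and you get an audible glitch; miss it regularly and the instrument is unusable.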


I find it hard to believe that users would notice latency under 1ms. Probably not even under 5ms.

Have you tried buffering for 5 ms? Was the result bad? What about 1 ms?


1. Anukari runs up to 16 entire copies of the physics model for polyphony, so 16 * 1024 * 48K (I should update the blog post)

2. Users can arbitrarily connect objects to one another, so each object has to read connections and do processing for N other entities

3. Using the full CPU requires synchronization across cores at each physics step, which is slow

4. Processing per object is relatively heavy: lots of transcendentals (approximations are OK), but also just a lot of features (every parameter can be modulated, everything needs to be NaN-proof, and so on)

5. Users want to run multiple copies of Anukari in parallel for multiple tracks, effects, etc

Another way to look at it: 4 GHz / (16 voices * 1024 objects * 4 connections * 48,000 samples/sec) ≈ 1.3 cycles per thing

The GPU eats this workload alive; it's absolutely perfect for it. All 16 voices * 1024 objects can be processed fully in parallel, with trivial synchronization at each step and a user-managed L1 cache (rough sketch below).
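For flavor, here's a hand-wavy Metal-style kernel showing the shape of that mapping (all names and structure invented for illustration; MSL is C++-based, and this is not Anukari's actual shader). One threadgroup per voice, one thread per object, barriers as the per-step sync, threadgroup memory as the user-managed L1:

    #include <metal_stdlib>
    using namespace metal;

    constant uint SAMPLES_PER_BLOCK = 64;   // arbitrary for the sketch

    struct Obj { float pos; float vel; uint link[4]; };

    float springForce(Obj a, Obj b) { return b.pos - a.pos; }  // toy force

    // Dispatch: 16 threadgroups (voices) of 1024 threads (objects).
    // Audio output generation is omitted for brevity.
    kernel void physicsBlock(device Obj* objs [[buffer(0)]],
                             uint tid   [[thread_position_in_threadgroup]],
                             uint voice [[threadgroup_position_in_grid]])
    {
        threadgroup Obj objsLocal[1024];          // user-managed L1
        objsLocal[tid] = objs[voice * 1024 + tid];

        for (uint s = 0; s < SAMPLES_PER_BLOCK; ++s) {
            float force = 0.0f;
            for (uint c = 0; c < 4; ++c)          // read N connections
                force += springForce(objsLocal[tid],
                                     objsLocal[objsLocal[tid].link[c]]);
            threadgroup_barrier(mem_flags::mem_threadgroup);  // reads done
            objsLocal[tid].vel += force;          // toy "integration"
            objsLocal[tid].pos += objsLocal[tid].vel;
            threadgroup_barrier(mem_flags::mem_threadgroup);  // writes done
        }
        objs[voice * 1024 + tid] = objsLocal[tid];
    }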


For anyone seeing this post late: I need a bit of help from someone inside Apple who works on Metal. If you know someone, it would be great if you could connect us:

https://news.ycombinator.com/item?id=43901619

https://anukari.com/blog/devlog/an-appeal-to-apple


Some folks may have seen my Show HN post for Anukari here: https://news.ycombinator.com/item?id=43873074

In that thread, the topic of macOS performance came up. Basically, Anukari works great for most people on Apple silicon, including base-model M1 hardware. I've done all my testing on a base M1 and it works wonderfully. The hardware is incredible.

But to make it work, I had to implement an unholy abomination of a workaround to get macOS to increase the GPU clock rate for the audio processing to be fast enough. The normal heuristics that macOS uses for the GPU performance state don't understand the weird Anukari workload.

Anyway, I finally had time to write down the full situation, in terrible detail, so that I could ask for help getting in touch with the right person at Apple, probably someone who works on the Metal API.

Help! :)


> This is going to be a VERY LONG HIGHLY TECHNICAL post, so either buckle your seatbelt or leave while you still can.

Well, I read it all and found it not too long, extremely clear and well-written, and informative! Congrats on the writing.

I've never owned a Mac and my PC is old and without a serious GPU, so it's unlikely I'll get to use Anukari soon, but I regret that very much, as it looks sooo incredibly cool.

Hope this gets resolved fast!


Did you try this entitlement? https://developer.apple.com/documentation/bundleresources/en...

I wonder if com.apple.developer.sustained-execution also goes the other way around...


Thanks for the thought. Unfortunately, when running as a plugin, Anukari is subject to whatever entitlements the host application was signed with. I think I did try that with the standalone binary at one point, but I don't appear to have taken notes, which probably means I didn't have success.


Very cool work... and frustrating running into walls imposed by manufacturers, I imagine! I've also been working on GPU-based audio plugins for a long time and have published some material on the subject.

Just my two cents: have you considered using a server/daemon process that runs separately, and therefore more controllably, outside the DAW (i.e., a client-server approach for your plugin instances)? It could give you a little more OS-level control.
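A minimal sketch of the shape of that split (all names hypothetical; a real transport would be lock-free shared memory, since blocking syscalls on the audio thread are risky):

    // Hypothetical shape of the client/server split, not working IPC.
    struct BlockRequest  { int instance; int numSamples; /* MIDI, params */ };
    struct BlockResponse { float samples[512]; };

    // Each plugin instance is a thin client; one long-lived daemon owns
    // the GPU, so its process priority / QoS / Metal state is under your
    // control instead of the DAW's.
    class GpuServerClient {
    public:
        BlockResponse render(const BlockRequest&) { return {}; }  // stub
    };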


Do you have a link to your stuff?

> have you considered using a server/daemon process that runs separately and therefore more controllably outside a DAW

I'm slowly coming to the same conclusion, for audio plugins on GPUs.


Interesting post & problem. I wonder if the idea of running the tasks on the same queue fails for the same reason you have a problem in the first place: the variable clock rate makes it impossible to schedule precisely, so you end up with aliasing between your spin-stop time and the ideal time, depending on how the OS decided to clock the GPU.

But that suggests that maybe your spin job isn't complex enough to run the GPU at the highest clock, because if it were running at max, you should be able to reliably time the end of the spin even without adding a software PLL (which may not be a bad idea). I didn't see a detailed explanation of how the spin is implemented, and I suspect a more thorough spin loop that consistently drives more of the GPU might be more effective at keeping the clock rate at max perf.
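To make that concrete: a naive spin kernel can be as small as the following (illustrative Metal, not the actual implementation), and something this tiny exercises so little of the chip that the clock governor may see no reason to raise the clock:

    #include <metal_stdlib>
    using namespace metal;

    kernel void spin(device atomic_uint* stop [[buffer(0)]],
                     device float* sink       [[buffer(1)]],
                     uint tid [[thread_position_in_grid]])
    {
        float x = (float)tid;
        // Dependent FMA chain keeps one ALU pipe busy, but only one.
        // (A real version would bound iterations to dodge the GPU watchdog;
        // the CPU flips 'stop' in a shared-storage buffer to end the spin.)
        while (atomic_load_explicit(stop, memory_order_relaxed) == 0u) {
            x = fma(x, 1.000001f, 0.5f);
        }
        sink[tid] = x;  // keep the compiler from deleting the loop
    }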


I missed the Show HN, but the first thing that came to mind after seeing it was that this looks like it would lend itself well to making some very creative ASMR soundscapes with immersive multidimensional audio. I selfishly hope you or one of your users will make a demo. Congrats on the project and I hope you receive help on your Apple issues.


Great post, I found the description clear and easy to understand. I've definitely run into the issue you're describing in other contexts.


[flagged]


It's technical to over half of programmers, who don't need to know these kinds of details about hw/sw interactions.


The claim was 'very technical'. If you can explain the problem in one basic sentence, it's not very technical.


Have you filed a feedback? Seems like the right next step.


The post opens with the following TL;DR (snipped for brevity):

> It would be great if someone can connect me with the right person inside Apple, or direct them to my feedback request FB17475838 as well as this devlog entry.


Feedbacks often go into a black hole unless either:

1. A bunch of people file effectively the same bug report (unlikely here)

2. An individual Apple employee champions the issue internally

3. Someone makes a fuss on Twitter/X and it starts to go viral

Sounds like the OP is trying to get #2 to happen, which is probably his best bet.


Another trick is to schedule some time with Apple engineers during the WWDC labs and plead your case.


I was going to recommend this. They may also have suggestions for how to improve things with the existing Metal API.


Filing feedback is about as effective as a change.org petition asking some politician to please stop doing crimes. You'll be lucky to get an acknowledgement that something's a real issue after months.


I really like doing deep optimization work, so probably the GPU stuff was more fun from an engineering geek perspective. But the physics was also super fun, experimenting with how to model each object, and so on. The physics is a little more playful, since changing a tiny bit of simple math often does fun/weird things.

RE strings/air, I have thought about it, but only a little! Down the road I really want to explore more physics objects. It's very fun to think about how I'd integrate strings into the world, especially w.r.t. how they'd interact and connect with the masses. It seems like there could possibly be some very cool ways to do it.


Glad to hear it. I have fun writing them; often I find it's a great way to clarify my thoughts, even just for myself. But I also enjoy reading other people's devlogs, so I'm glad to contribute. :)


This is a long story, which is still ongoing. The GPU code is very, very heavily optimized (though I do still have some ideas on how to go further). The main problem we're having on Mac hardware is that the OS heuristics for when to turn the clock rate up on the GPU work really poorly for the audio use case. If you want gory details, I've written about it:

https://anukari.com/blog/devlog/waste-makes-haste

If anyone can put me in touch directly with an OS/Metal person at Apple it would be EXTREMELY helpful. I've had limited success so far.


Some info about custom 3D models is here: https://anukari.com/support/faq#custom-skins

The animations have some subtlety, but the basic idea is that they all go from t=0 to t=1, and Anukari drives that t parameter. So for example, the mallet goes from resting at t=0 to fully extended at t=1, and Anukari animates it at the right speed from 0, up to some value based on the velocity, and back down to 0.

Some of the objects are cyclic, like the spinning oscillators, and those also use the t=0 to t=1 convention, but require that the animation is seamless (so it looks identical at t=0 and t=1, but does whatever you want in between).
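In code terms, the convention is roughly this (hypothetical sketch, not the actual engine):

    #include <cmath>

    // One-shot animation (e.g. a mallet strike): t rises from 0 toward a
    // velocity-dependent peak, then returns to 0 (rest pose).
    float malletT(float secondsSinceHit, float velocity01, float strikeDur) {
        float phase = secondsSinceHit / strikeDur;          // 0..1 over the strike
        if (phase >= 1.0f) return 0.0f;                     // back at rest
        return velocity01 * std::sin(phase * 3.14159265f);  // 0 -> peak -> 0
    }

    // Cyclic animation (e.g. a spinning oscillator): t wraps 0..1, and the
    // model must look identical at t=0 and t=1 for a seamless loop.
    float oscillatorT(float seconds, float hz) {
        return seconds * hz - std::floor(seconds * hz);     // fract()
    }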

I'm not aware of anyone fully customizing the 3D stuff yet, so the instructions may be a bit rough. If you try it, feel free to contact me at evan@anukari.com or on discord and I'd be happy to answer questions.

