humbledrone's comments

Hey everyone, it worked, I had a super productive conversation with exactly the right person on the Metal team! Thanks for helping me get Apple's attention. I didn't at all expect this amount of support.

https://anukari.com/blog/devlog/productive-conversation-appl...


>While I can't share any technical details... The engineer provided some suggestions and hints that I can use right now to maybe — just maybe — get things working in the short term

Great that you have a workaround now, but the fact that you can't even share what the workaround is ironically speaks to the last line in https://news.ycombinator.com/item?id=43904921 about how Apple communicates:

>there’s this trick of setting it to this but then change to that and it’ll work. Undocumented but now you know

When you do implement the workaround, maybe you could put it in an overtly-named function spottable via disassembly, so that others facing the same latency-sensitive GPU constraints have some lead as to the magic incantation to use?
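For example, something like this (function name entirely hypothetical), so the symbol survives into the shipped binary:

    // Hypothetical: keep the workaround in an exported, non-inlined
    // symbol so it shows up by name in a disassembly or symbol table.
    extern "C" __attribute__((noinline))
    void AnukariMetalClockWorkaround() {
        // ... the magic incantation goes here ...
    }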


I enjoy the idea, but I probably won't do that. Anyone who actually has this same problem should certainly reach out to me, though, and I'll put them in touch with the right folks at Apple, who can share the info.


Once again, HN has fulfilled its true purpose: cutting through the red tape placed in front of every large corporation's customer support.

Congratulations and good luck with your project!


I attempted to preempt your question in the section of my blog post titled "Why don't you just pipeline the GPU code so that it saturates the GPU?" It's one of the less-detailed sections, though, so maybe you have further questions. The main thing is that since Anukari processes inputs like MIDI and audio in real time, it can't work ahead of the CPU, because those inputs don't exist yet.

Possibly what you describe is a bit more like double-buffering, which I also explored. The problem here is latency: any form of N-buffering introduces additional latency. This is one reason why some gamers don't like triple-buffering for graphics, because it introduces further latency between their mouse inputs and the visual change.

Furthermore, when the GPU clock rate is too low, double-buffering and pipelining don't help anyway, because fundamentally Anukari has to keep up with real time, and every block it processes depends on the previous one. With a fully-lowered GPU clock, the issue really does become one of throughput and not just latency.
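To make the serial dependency concrete, here's a minimal sketch of the shape of the problem (hypothetical names, not Anukari's actual code): block N+1 needs both the freshest MIDI and the simulation state left behind by block N, so there is nothing to compute ahead of time:

    #include <vector>

    struct MidiEvent { int sampleOffset; int note; float velocity; };
    struct SimState  { std::vector<float> pos, vel; };

    // Hypothetical stand-ins for the real DSP:
    void applyMidi(const std::vector<MidiEvent>&, int, SimState&) {}
    void stepPhysics(SimState&) {}
    float renderSample(const SimState&) { return 0.0f; }

    void audioCallback(const std::vector<MidiEvent>& midi,  // only exists *now*
                       float* out, int numSamples,
                       SimState& state) {                   // mutated every block
        for (int i = 0; i < numSamples; ++i) {
            applyMidi(midi, i, state);   // live input, sample-accurate
            stepPhysics(state);          // depends on the previous step
            out[i] = renderSample(state);
        }
        // 'state' is now the input to the next block; no future block
        // could have been computed before this point.
    }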


It's a real-time audio app: if it falls behind real time, there's no audio. You get clicks, pops, and the whole thing becomes unusable. If the user is running audio at 48 kHz, the budget is 1/48,000 seconds per sample on average, and realistically somewhat less than that to leave headroom for variance and overhead.
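To put rough numbers on it (buffer sizes here are just illustrative):

    48,000 samples/sec         -> ~20.8 µs per sample
     64-sample block @ 48 kHz  -> ~1.33 ms deadline per block
    512-sample block @ 48 kHz  -> ~10.7 ms deadline per block

Miss a deadline once and you get an audible glitch; miss it regularly and the instrument is unusable.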


I find it hard to believe that users would notice latency under 1ms. Probably not even under 5ms.

Have you tried buffering for 5 ms? Was the result bad? What about 1 ms?


1. Anukari runs up to 16 entire copies of the physics model for polyphony, so 16 * 1024 * 48K (I should update the blog post)

2. Users can arbitrarily connect objects to one another, so each object has to read connections and do processing for N other entities

3. Using the full CPU requires synchronization across cores at each physics step, which is slow

4. Processing per object is relatively heavy: lots of transcendentals (approximations are OK), but also just a lot of features (every parameter can be modulated, everything needs to be NaN-proof, and so on)

5. Users want to run multiple copies of Anukari in parallel for multiple tracks, effects, etc

Another way to look at it: 4 GHz / (16 voices * 1024 objects * 4 connections * 48,000 samples/sec) ≈ 1.3 cycles per thing

The GPU eats this workload alive; it's absolutely perfect for it. All 16 voices * 1024 objects can be processed fully in parallel, with trivial synchronization at each step and a user-managed L1 cache (rough sketch below).
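For flavor, here's a hand-wavy Metal-style kernel showing the shape of that mapping (all names and structure invented for illustration; MSL is C++-based, and this is not Anukari's actual shader). One threadgroup per voice, one thread per object, barriers as the per-step sync, threadgroup memory as the user-managed L1:

    #include <metal_stdlib>
    using namespace metal;

    constant uint SAMPLES_PER_BLOCK = 64;   // arbitrary for the sketch

    struct Obj { float pos; float vel; uint link[4]; };

    float springForce(Obj a, Obj b) { return b.pos - a.pos; }  // toy force

    // Dispatch: 16 threadgroups (voices) of 1024 threads (objects).
    // Audio output generation is omitted for brevity.
    kernel void physicsBlock(device Obj* objs [[buffer(0)]],
                             uint tid   [[thread_position_in_threadgroup]],
                             uint voice [[threadgroup_position_in_grid]])
    {
        threadgroup Obj objsLocal[1024];          // user-managed L1
        objsLocal[tid] = objs[voice * 1024 + tid];

        for (uint s = 0; s < SAMPLES_PER_BLOCK; ++s) {
            float force = 0.0f;
            for (uint c = 0; c < 4; ++c)          // read N connections
                force += springForce(objsLocal[tid],
                                     objsLocal[objsLocal[tid].link[c]]);
            threadgroup_barrier(mem_flags::mem_threadgroup);  // reads done
            objsLocal[tid].vel += force;          // toy "integration"
            objsLocal[tid].pos += objsLocal[tid].vel;
            threadgroup_barrier(mem_flags::mem_threadgroup);  // writes done
        }
        objs[voice * 1024 + tid] = objsLocal[tid];
    }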


For anyone seeing this post late: I need a bit of help from someone inside Apple who works on Metal. If you know someone, it would be great if you could connect us:

https://news.ycombinator.com/item?id=43901619

https://anukari.com/blog/devlog/an-appeal-to-apple


Some folks may have seen my Show HN post for Anukari here: https://news.ycombinator.com/item?id=43873074

In that thread, the topic of macOS performance came up. Basically, Anukari works great for most people on Apple silicon, including base-model M1 hardware. I've done all my testing on a base M1 and it works wonderfully. The hardware is incredible.

But to make it work, I had to implement an unholy abomination of a workaround to get macOS to increase the GPU clock rate for the audio processing to be fast enough. The normal heuristics that macOS uses for the GPU performance state don't understand the weird Anukari workload.

Anyway, I finally had time to write down the full situation, in terrible detail, so that I could ask for help getting in touch with the right person at Apple, probably someone who works on the Metal API.

Help! :)


> This is going to be a VERY LONG HIGHLY TECHNICAL post, so either buckle your seatbelt or leave while you still can.

Well, I read it all and found it not too long, extremely clear and well-written, and informative! Congrats on the writing.

I've never owned a Mac and my PC is old and without a serious GPU, so it's unlikely I'll get to use Anukari soon, but I regret that very much, as it looks sooo incredibly cool.

Hope this gets resolved fast!


Did you try this entitlement? https://developer.apple.com/documentation/bundleresources/en...

I wonder if com.apple.developer.sustained-execution also goes the other way around...


Thanks for the thought. Unfortunately, when running as a plugin, Anukari is subject to whatever entitlements the host application was signed with. I think I did try that with the standalone binary at one point, but I don't appear to have taken notes, which probably means I didn't have success.


Very cool work... and frustrating running into walls imposed by manufacturers, I imagine! I've also been working on GPU-based audio plugins for a long time and have published some material on the subject.

Just my two cents: have you considered using a server/daemon process that runs separately, and therefore more controllably, outside the DAW (i.e., a client-server approach for your plugin instances)? It could give you a little more OS-level control.
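A minimal sketch of the shape of that split (all names hypothetical; a real transport would be lock-free shared memory, since blocking syscalls on the audio thread are risky):

    // Hypothetical shape of the client/server split, not working IPC.
    struct BlockRequest  { int instance; int numSamples; /* MIDI, params */ };
    struct BlockResponse { float samples[512]; };

    // Each plugin instance is a thin client; one long-lived daemon owns
    // the GPU, so its process priority / QoS / Metal state is under your
    // control instead of the DAW's.
    class GpuServerClient {
    public:
        BlockResponse render(const BlockRequest&) { return {}; }  // stub
    };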


Do you have a link to your stuff?

> have you considered using a server/daemon process that runs separately and therefore more controllably outside a DAW

I'm slowly coming to the same conclusion, for audio plugins on GPUs.


Interesting post & problem. I wonder if the idea of running the tasks on the same queue fails for the same reason you have a problem in the first place: the variable clock rate makes it impossible to schedule precisely, so you end up with aliasing between your spin-stop time and the ideal time, depending on how the OS decided to clock the GPU.

But that suggests that maybe your spin job isn't complex enough to run the GPU at the highest clock, because if it were running at max, you should be able to reliably time the end of the spin even without adding a software PLL (which may not be a bad idea). I didn't see a detailed explanation of how the spin is implemented, and I suspect a more thorough spin loop that consistently drives more of the GPU might be more effective at keeping the clock rate at max perf.
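To make that concrete: a naive spin kernel can be as small as the following (illustrative Metal, not the actual implementation), and something this tiny exercises so little of the chip that the clock governor may see no reason to raise the clock:

    #include <metal_stdlib>
    using namespace metal;

    kernel void spin(device atomic_uint* stop [[buffer(0)]],
                     device float* sink       [[buffer(1)]],
                     uint tid [[thread_position_in_grid]])
    {
        float x = (float)tid;
        // Dependent FMA chain keeps one ALU pipe busy, but only one.
        // (A real version would bound iterations to dodge the GPU watchdog;
        // the CPU flips 'stop' in a shared-storage buffer to end the spin.)
        while (atomic_load_explicit(stop, memory_order_relaxed) == 0u) {
            x = fma(x, 1.000001f, 0.5f);
        }
        sink[tid] = x;  // keep the compiler from deleting the loop
    }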


I missed the Show HN, but the first thing that came to mind after seeing it was that this looks like it would lend itself well to making some very creative ASMR soundscapes with immersive multidimensional audio. I selfishly hope you or one of your users will make a demo. Congrats on the project and I hope you receive help on your Apple issues.


Great post, I found the description clear and easy to understand. I've definitely run into the issue you're describing in other contexts.


[flagged]


It's technical to over half of programmers, who don't need to know these kinds of details about hw/sw interactions.


The claim was 'very technical'. If you can explain the problem in one basic sentence, it's not very technical.


Have you filed a feedback? Seems like the right next step.


The post opens with the following TL;DR (snipped for brevity):

> It would be great if someone can connect me with the right person inside Apple, or direct them to my feedback request FB17475838 as well as this devlog entry.


Feedbacks often go into a black hole unless either:

1. A bunch of people file effectively the same bug report (unlikely here)

2. An individual Apple employee champions the issue internally

3. Someone makes a fuss on Twitter/X and it starts to go viral

Sounds like the OP is trying to get #2 to happen, which is probably his best bet.


Another trick is to schedule some time with Apple engineers during the WWDC labs and plead your case.


I was going to recommend this. They may also have suggestions for how to improve things with the existing Metal API.


Filing feedback is about as effective as a change.org petition asking some politician to please stop doing crimes. You'll be lucky to get an acknowledgement that something's a real issue after months.


I really like doing deep optimization work, so probably the GPU stuff was more fun from an engineering geek perspective. But the physics was also super fun, experimenting with how to model each object, and so on. The physics is a little more playful, since changing a tiny bit of simple math often does fun/weird things.

RE strings/air, I have thought about it, but only a little! Down the road I really want to explore more physics objects. It's very fun to think about how I'd integrate strings into the world, especially w.r.t. how they'd interact and connect with the masses. It seems like there could possibly be some very cool ways to do it.


Glad to hear it. I have fun writing them; often I find it's a great way to clarify my thoughts, even just for myself. But I also enjoy reading other people's devlogs, so I'm glad to contribute. :)


This is a long story, which is still ongoing. The GPU code is very, very heavily optimized (though I do still have some ideas on how to go further). The main problem we're having on Mac hardware is that the OS heuristics for when to turn the clock rate up on the GPU work really poorly for the audio use case. If you want gory details, I've written about it:

https://anukari.com/blog/devlog/waste-makes-haste

If anyone can put me in touch directly with an OS/Metal person at Apple it would be EXTREMELY helpful. I've had limited success so far.


Some info about custom 3D models is here: https://anukari.com/support/faq#custom-skins

The animations have some subtlety, but the basic idea is that they all go from t=0 to t=1, and Anukari drives that t parameter. So for example, the mallet goes from resting at t=0 to fully extended at t=1, and Anukari animates it at the right speed from 0, up to some value based on the velocity, and back down to 0.

Some of the objects are cyclic, like the spinning oscillators, and those also use the t=0 to t=1 convention, but require that the animation is seamless (so it looks identical at t=0 and t=1, but does whatever you want in between).
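In code terms, the convention is roughly this (hypothetical sketch, not the actual engine):

    #include <cmath>

    // One-shot animation (e.g. a mallet strike): t rises from 0 toward a
    // velocity-dependent peak, then returns to 0 (rest pose).
    float malletT(float secondsSinceHit, float velocity01, float strikeDur) {
        float phase = secondsSinceHit / strikeDur;          // 0..1 over the strike
        if (phase >= 1.0f) return 0.0f;                     // back at rest
        return velocity01 * std::sin(phase * 3.14159265f);  // 0 -> peak -> 0
    }

    // Cyclic animation (e.g. a spinning oscillator): t wraps 0..1, and the
    // model must look identical at t=0 and t=1 for a seamless loop.
    float oscillatorT(float seconds, float hz) {
        return seconds * hz - std::floor(seconds * hz);     // fract()
    }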

I'm not aware of anyone fully customizing the 3D stuff yet, so the instructions may be a bit rough. If you try it, feel free to contact me at evan@anukari.com or on discord and I'd be happy to answer questions.

