
> a significant amount of memory goes into the KV cache

Is there a good paper (or talk) on how inference looks at scale? (Kinda like ELI-using-single-GPUs)



The PagedAttention paper is a good starting point: it describes vLLM, the first major open source inference engine with "pretty good" batch performance for transformers.

https://arxiv.org/pdf/2309.06180
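
For a rough sense of why the KV cache eats so much memory, here's a back-of-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the function name are assumptions picked for illustration, not numbers taken from the paper, and the formula ignores fragmentation and any engine-specific overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size for a decoder-only transformer.

    Each token stores one key vector and one value vector per layer,
    hence the factor of 2. bytes_per_elem=2 assumes fp16/bf16.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape, chosen only to make the arithmetic concrete.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
batch_total = kv_cache_bytes(32, 32, 128, seq_len=2048, batch_size=16)

print(per_token / 2**20, "MiB per token")
print(batch_total / 2**30, "GiB for 16 sequences of 2048 tokens")
```

Under those assumptions it comes to 0.5 MiB per token, so a batch of 16 sequences at 2048 tokens each needs 16 GiB for the cache alone, before counting the weights.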



