> a significant amount of memory goes into the KV cache Is there a good paper (o... | Hacker News

yk on May 14, 2024 | on: Gemini Flash

> a significant amount of memory goes into the KV cache

Is there a good paper (or talk) on what inference looks like at scale? (Kind of like an ELI-using-single-GPUs.)

AaronFriel on May 15, 2024

The PagedAttention paper is a good starting point: it introduces vLLM, the first major open-source inference engine that had "pretty good" batch performance for transformers.

https://arxiv.org/pdf/2309.06180
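
For a rough feel of why the KV cache dominates memory and why paging helps with batching, here is a minimal back-of-the-envelope sketch. It is not vLLM's implementation. The model shape (32 layers, 32 KV heads, head dimension 128, fp16), the 2048-token reservation, and the block size of 16 tokens are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 covers one key vector and one value vector per head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def contiguous_reserved_tokens(seq_lens, max_seq_len):
    """Token slots reserved if every request gets a contiguous max-length buffer."""
    return len(seq_lens) * max_seq_len


def paged_reserved_tokens(seq_lens, block_size=16):
    """Token slots reserved if memory is handed out in fixed-size blocks.

    Each sequence wastes at most block_size - 1 slots (in its last block).
    """
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


# Assumed model shape, for illustration only.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Three in-flight requests with different current lengths (hypothetical).
seq_lens = [100, 700, 33]
contiguous = contiguous_reserved_tokens(seq_lens, max_seq_len=2048)
paged = paged_reserved_tokens(seq_lens, block_size=16)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"contiguous reservation: {contiguous * per_token / 2**30:.2f} GiB")
print(f"paged reservation:      {paged * per_token / 2**30:.2f} GiB")
```

Under these assumptions each token costs 512 KiB of cache, so reserving a full-length buffer per request ties up several times more memory than the tokens actually in use. Allocating in small blocks keeps the reservation close to real usage, which leaves room for a larger batch on the same GPU.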
