Technical Glossary: Generative AI and LLMs
Paged Attention
A memory-management technique for attention that stores the KV cache in small fixed-size blocks ("pages") rather than one contiguous buffer per request, reducing fragmentation and improving resource use when serving many requests concurrently.
Paged attention is important for LLM serving efficiency, especially under high concurrency. Instead of reserving a large contiguous KV-cache buffer sized for each request's maximum length, it maps a request's keys and values to fixed-size pages through a per-request block table, much as an operating system maps virtual memory to physical frames. Pages are allocated only as tokens are generated and returned to a shared pool when a request finishes, which cuts fragmentation and over-allocation and enables more balanced resource use in long-context and multi-user scenarios. It is a good example of how deeply systems engineering and model behavior are intertwined in large-model deployment.
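The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not a real serving framework's API: the names `PagedKVCache`, `BLOCK_SIZE`, `append_token`, and `free` are assumptions made for this example, and real implementations store actual key/value tensors on the GPU rather than just slot indices.

```python
BLOCK_SIZE = 16  # tokens per page (illustrative; real systems pick this per kernel)

class PagedKVCache:
    """Toy page allocator: maps each request's tokens to fixed-size physical blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> tokens written so far

    def append_token(self, req_id: str) -> tuple[int, int]:
        """Reserve a (physical block, offset) slot for a request's next token."""
        table = self.block_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:  # current page is full: allocate a fresh one on demand
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real server would preempt or swap")
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1
        return table[-1], n % BLOCK_SIZE

    def free(self, req_id: str) -> None:
        """Return a finished request's pages to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=4)
slots = [cache.append_token("req-A") for _ in range(20)]
# 20 tokens occupy only ceil(20/16) = 2 physical blocks, wherever they happen to be
print(len({block for block, _ in slots}))  # 2
cache.free("req-A")
print(len(cache.free_blocks))  # 4, all pages reclaimed for other requests
```

The key point the sketch shows: memory is granted one page at a time as tokens arrive, so a short request never pins memory sized for the maximum context, and finished requests' pages are recycled across the whole batch.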
