Hugging Face’s Post

Hugging Face reposted this

Ahsen Khaliq

ML @ Hugging Face

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

paper page: https://buff.ly/45RdSmP

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling for long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be applied directly to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and on models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
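
To make the mechanics concrete, here is a minimal PyTorch sketch of the three mask families the abstract names: A-shape (attention sinks plus a local window), Vertical-Slash (a few key columns plus a few diagonals), and Block-Sparse (a subset of blocks). The function names, the parameter choices (sink size, window, slash offsets, block size, keep ratio), and the dense masked reference computation are illustrative assumptions only; the paper's speedup comes from optimized GPU kernels that skip the empty regions instead of masking them.

# Illustrative sketch, not the authors' kernels: build boolean masks for the three
# sparse patterns and run dense masked attention as a reference computation.
import torch

def a_shape_mask(n, sink=64, local=512):
    # Attend to the first `sink` tokens plus a local sliding window.
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    causal = j <= i
    return causal & ((j < sink) | (i - j < local))

def vertical_slash_mask(n, vertical_idx, slash_offsets):
    # Attend to a few "vertical" key columns and a few diagonal "slash" lines.
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    keep = torch.zeros(n, n, dtype=torch.bool)
    keep[:, vertical_idx] = True
    for off in slash_offsets:
        keep |= (i - j) == off
    return keep & (j <= i)

def block_sparse_mask(n, block=64, keep_ratio=0.1):
    # Keep a subset of blocks (chosen at random here purely for illustration).
    nb = (n + block - 1) // block
    block_keep = torch.rand(nb, nb) < keep_ratio
    block_keep |= torch.eye(nb, dtype=torch.bool)   # always keep the diagonal blocks
    mask = block_keep.repeat_interleave(block, 0).repeat_interleave(block, 1)[:n, :n]
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return mask & (j <= i)

def masked_attention(q, k, v, mask):
    # Dense reference with masking; real sparse kernels never compute the masked blocks.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
head_pattern = "vertical_slash"   # in MInference the pattern per head is decided offline
if head_pattern == "a_shape":
    mask = a_shape_mask(n)
elif head_pattern == "vertical_slash":
    mask = vertical_slash_mask(n, torch.arange(0, n, 128), slash_offsets=[0, 1, 2, 128])
else:
    mask = block_sparse_mask(n)
out = masked_attention(q, k, v, mask)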

Dynamic sparse attention reduces the latency of the pre-filling stage of long-context Large Language Models (LLMs) by up to 10x on an A100 GPU while maintaining accuracy. It does so by exploiting recurring structural patterns in long-context attention matrices to perform efficient sparse computation on GPUs.
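
The "dynamic" part means the sparse index is rebuilt for each prompt at inference time. One way this can be done for a Vertical-Slash head is sketched below: score only the last few queries against all keys, then keep the key columns and diagonal offsets that collect the most attention mass. The tail length and top-k sizes are illustrative assumptions, not the paper's exact estimator.

# Hedged sketch of per-prompt index estimation for a Vertical-Slash head;
# the estimator details here are assumptions for illustration.
import torch

def estimate_vertical_slash_index(q, k, last_q=64, top_vertical=128, top_slash=32):
    # q, k: (n, d) projections for a single attention head.
    n, d = q.shape
    i = torch.arange(n - last_q, n).unsqueeze(1)            # query positions of the tail
    j = torch.arange(n).unsqueeze(0)                        # key positions
    scores = q[-last_q:] @ k.transpose(-1, -2) / d ** 0.5   # (last_q, n)
    scores = scores.masked_fill(j > i, float("-inf"))       # causal: ignore future keys
    probs = torch.softmax(scores, dim=-1)
    # Key columns that collect the most attention mass become the "vertical" index.
    vertical_idx = probs.sum(dim=0).topk(top_vertical).indices
    # Attention mass summed per diagonal offset (i - j) ranks the "slash" lines.
    diag_mass = torch.zeros(n).scatter_add_(0, (i - j).clamp(min=0).reshape(-1), probs.reshape(-1))
    slash_offsets = diag_mass.topk(top_slash).indices
    return vertical_idx, slash_offsets

# Example with random data, for shape-checking only.
q, k = torch.randn(8192, 64), torch.randn(8192, 64)
vertical_idx, slash_offsets = estimate_vertical_slash_index(q, k)

The resulting indices would then parameterize a sparse kernel such as the Vertical-Slash mask in the earlier sketch.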

Oded Kalev

CTO and Co-Founder @ Converge Bio | Some men see things as they are and ask, "Why?" I dream things that never were and ask, "Why not?"

Great work, Ahsen Khaliq! In biology, a longer context is one of the critical factors. A human DNA sequence is ~3 billion letters long, so we must do better. Efficient tokenization is another critical factor. Compressing the knowledge with long tokens can reduce the number of tokens by 70% in some biological use cases and produce better models trained on 70% fewer tokens. (https://doi.org/10.1093/bioinformatics/btae196). These results mean a great deal of compute and accuracy is hiding in the tokenization phase.

Huiqiang Jiang

Research SDE at @MSRA SH

Thanks Ahsen Khaliq for sharing our work. Long-context inference is highly resource-intensive, yet inherently sparse and dynamic. We employ dynamic sparse attention to accelerate the pre-filling stage of 1M-token inference by up to 10x. For more details, visit our project page: http://aka.ms/MInference.
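
For readers who want to try this on an existing model, the project page describes patching a Hugging Face model in place so that pre-filling runs through the sparse kernels. The snippet below is only a sketch of that flow; the import path, the MInference(...) call, and the example checkpoint name are assumptions, so check http://aka.ms/MInference for the current API.

# Hedged sketch of patching a Hugging Face model with MInference; the MInference("minference",
# model_name) call and patching pattern are assumptions based on the project's description and
# may differ from the current API (see http://aka.ms/MInference).
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference  # assumption: package name and import path may differ

model_name = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"  # example long-context checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Patch the attention modules so pre-filling uses dynamic sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Pre-filling a long prompt now runs through the sparse kernels; decoding is unchanged.
inputs = tokenizer("A very long prompt ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))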
