Another start to the month, another Hugging Face Accelerate release!
We've been cooking, let's talk about how 👨‍🍳
* There's a new profiler in town that can help you collect performance metrics during training and inference, and then visualize the traces in tools like Chrome's tracing UI. Check out the new docs (linked below) for more info, and see the first sketch after this list!
* Thanks to Stas Bekman we were able to track down, identify, and fix a slowdown during `import accelerate` that helped reduce our import times by over 60%! We've taken steps to ensure such slowdowns can't go unnoticed again, and we hope you enjoy being able to accelerate a bit faster!
* We've added support for more complex PyTorch DDP communication hooks, letting you customize how gradients are communicated across workers (new docs linked below; see the second sketch after this list).
* With XPU support now native to PyTorch, we've upstreamed our integration so you can switch to the native implementation right away (note: this requires PyTorch >= 2.4)
* And so, so much more
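To make the profiler bullet concrete, here's a minimal sketch assuming the `ProfileKwargs` handler described in the new docs; the toy model, optimizer, and batches are stand-ins:
```
import torch
from accelerate import Accelerator
from accelerate.utils import ProfileKwargs  # assumed per the new profiler docs

# Toy setup so the sketch runs end to end.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
batches = [torch.randn(64, 512) for _ in range(10)]

# Collect CPU activity (add "cuda" on GPU) and dump a Chrome-viewable trace.
profile_kwargs = ProfileKwargs(
    activities=["cpu"],
    output_trace_dir="profile_traces",  # open the trace in chrome://tracing or Perfetto
)
accelerator = Accelerator(kwargs_handlers=[profile_kwargs])
model, optimizer = accelerator.prepare(model, optimizer)

with accelerator.profile() as prof:
    for x in batches:
        loss = model(x).pow(2).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```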
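And for the DDP communication hooks bullet, a sketch of registering an fp16 gradient-compression hook, assuming the `DistributedDataParallelKwargs(comm_hook=...)` interface from the new docs:
```
import torch
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.utils import DDPCommunicationHookType  # assumed per the new docs

# Compress gradients to fp16 during all-reduce to cut communication volume.
ddp_kwargs = DistributedDataParallelKwargs(comm_hook=DDPCommunicationHookType.FP16)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# Under a multi-process launch, prepare() wraps the model in DDP with the hook set.
model, optimizer = accelerator.prepare(model, optimizer)
```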
Enjoy the release!
We worked on a mini-project to show how to run SD3 DreamBooth LoRA fine-tuning on a free-tier Colab Notebook 🌸
The project is educational and is meant to serve as a template. Only good vibes here please 🫡
👉 https://lnkd.in/g_znevg3 👈
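If you just want to try out a trained LoRA afterwards, here's a minimal inference sketch with diffusers (the LoRA repo id below is a hypothetical placeholder):
```
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("your-username/your-sd3-dreambooth-lora")  # hypothetical repo id

image = pipe("a photo of sks flower, watercolor style", num_inference_steps=28).images[0]
image.save("out.png")
```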
Enjoy!
We need more knowledge sharing about running ML infrastructure at scale!
Here's the mix of AWS instances we currently run our serverless Inference API on.
For context, the Inference API is the infra service that powers the widgets on Hugging Face Hub model pages; PRO users and Enterprise orgs can also use it programmatically.
64 g4dn.2xlarge
48 g5.12xlarge
48 g5.2xlarge
10 p4de.24xlarge
42 r6id.2xlarge
9 r7i.2xlarge
6 m6a.2xlarge (control plane and monitoring)
–––
Total = 227 instances
This is a thread for AI Infra aficionados 🤓 What mix of instances do you run?
MInference 1.0
Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
paper page: https://buff.ly/45RdSmP
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
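For intuition, here's a toy, dense illustration of one such pattern (the A-shape: a few attention-sink columns plus a local causal window). The paper's actual contribution, per-head pattern selection and optimized sparse GPU kernels, is not implemented here, and the sink/window sizes are made up:
```
import torch

def a_shape_mask(seq_len: int, sink: int = 4, window: int = 64) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: each query attends to the first
    `sink` tokens plus a local causal window."""
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    causal = k <= q
    local = (q - k) < window
    sinks = k < sink
    return causal & (local | sinks)

mask = a_shape_mask(1024)
scores = torch.randn(1024, 1024)
scores = scores.masked_fill(~mask, float("-inf"))  # pattern applied densely for illustration
probs = torch.softmax(scores, dim=-1)
```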
RT-DETR is now supported in Hugging Face Transformers! 🙌
RT-DETR, short for “Real-Time DEtection TRansformer”, is a computer vision model developed at Peking University and Baidu, Inc. capable of real-time object detection. The authors claim better performance than YOLO models in both speed and accuracy. The model comes with an Apache 2.0 license, meaning people can freely use it for commercial applications. 🔥
RT-DETR is a follow-up to DETR, the model from AI at Meta that first successfully used Transformers for object detection. DETR has been in the Transformers library since 2020, and many improvements have since been made to speed up its convergence and inference. RT-DETR is a notable example, as it unlocks real-time inference at high accuracy!
Big congrats to Daniel Choi for contributing this model!
* Demo notebooks (fine-tuning + inference): https://lnkd.in/eA_WzsyE
* Demo Space: https://lnkd.in/ewzWTSHA
* Paper: https://lnkd.in/eR3Qg6dm
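A quick inference sketch with Transformers, assuming the PekingU/rtdetr_r50vd checkpoint on the Hub:
```
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = AutoModelForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

with torch.no_grad():
    outputs = model(**processor(images=image, return_tensors="pt"))

# Convert raw logits/boxes back to labels and pixel coordinates.
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```
#ai #artificialintelligence #objectdetection #huggingface #computervision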
🤗 Last week, Gemma 2 was released. Since then, open-source implementations have been tuned to match the model's intended performance:
```
pip install -U transformers==4.42.3
```
We saw reports that open-source implementations (transformers, llama.cpp) were not on par with Google's own deployment (Google AI Studio).
Why is that? ✨
The first and most important aspect is that Google implemented soft-capping of logits within the attention.
This change is meaningful: we tested with and without soft-capping, and while benchmark metrics show little difference, we see very significant changes at long context lengths.
The current implementations of Flash Attention kernels do not allow for this soft-capping (yet), so we cannot take advantage of the Flash Attention speed gains without losing performance.
Conclusion: soft-capping is required, especially for the 27B model, but it affects speed optimizations.
This is particularly the case for inference but seems to matter less during fine-tuning. We'd be interested in seeing whether fine-tuning with FA2 on and soft-capping off still produces correct fine-tunes, as that would significantly accelerate training on most cards.
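For reference, logit soft-capping is just a smooth tanh squash of the attention scores before the softmax. A minimal sketch (the 50.0 cap mirrors the attention-logit value reported in the released Gemma 2 config):
```
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squashes values into (-cap, cap) while keeping gradients well-behaved.
    return cap * torch.tanh(logits / cap)

scores = torch.randn(1, 8, 128, 128) * 30          # (batch, heads, q_len, kv_len)
probs = torch.softmax(soft_cap(scores, 50.0), -1)  # cap applied before the softmax
```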
---
Secondly, the model was trained in bfloat16 and seems to perform best at that precision.
The 27B model is sensitive to precision, and running it in fp16 results in incorrect outputs.
We've checked and confirmed that using bitsandbytes (BNB) to run the checkpoints in 4-bit and 8-bit works correctly.
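A loading sketch for the 4-bit path with bitsandbytes, keeping compute in bfloat16 per the note above:
```
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in bf16, not fp16
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it", quantization_config=quant_config, device_map="auto"
)
```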
---
If you're looking for presets to run with transformers, this is what we recommend for optimal performance (a loading sketch follows the list):
- Version v4.42.3
- Running with `attn_implementation='eager'` (so no FA/FA2)
- Running in bfloat16 to start with
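Put together, a loading sketch with those presets (using the 9B instruct checkpoint here; swap in the 27B one as needed):
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,    # bfloat16, not fp16
    attn_implementation="eager",   # soft-capping needs eager attention for now
    device_map="auto",
)
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```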
Scaling Synthetic Data Creation with 1,000,000,000 Personas
paper page: https://buff.ly/3XJ6Hek
We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
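The core mechanic is simple to emulate: prepend a persona to a synthesis instruction. A toy sketch in the spirit of the paper (the personas and template wording are made up; the LLM call is left as a stub):
```
personas = [
    "a high-school physics teacher who loves rock climbing",
    "a freight logistics coordinator at a busy container port",
]

def make_synthesis_prompt(persona: str, task: str = "math problem") -> str:
    # Persona-driven prompt; the exact wording here is illustrative, not the paper's.
    return (
        f"Create a challenging {task} that the following person might pose, "
        f"grounded in their daily experience.\nPersona: {persona}"
    )

for persona in personas:
    prompt = make_synthesis_prompt(persona)
    # completion = llm.generate(prompt)  # any LLM client; stubbed out here
    print(prompt)
```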
🤯 DiffIR2VR-Zero: zero-shot video restoration and super-resolution using pre-trained image restoration diffusion models.
- Video denoising and up to 8x super-resolution
- Framework outperforms trained models in generalizing across diverse datasets and extreme degradations
- Compatible with any 2D restoration model
#superresolution #zeroshot #diffir2vr