Hugging Face’s Post

Hugging Face reposted this

Lysandre Debut

Head of Open Source at Hugging Face

🤗 Last week, Gemma 2 was released. Since then, the implementations have been tuned to reflect the model's performance:

```
pip install -U transformers==4.42.3
```

We saw reports of tools (transformers, llama.cpp) not being on par with Google-led releases (Google AI Studio). Why is that?

✨ The first and most important aspect is that Google implemented soft-capping of the logits within the attention. This change is meaningful: we tested with and without soft-capping, and while metrics show little difference, we see very significant changes at long context. The current Flash Attention kernels do not support this soft-capping (yet), so we cannot take advantage of the Flash Attention speed gains without losing performance.

Conclusion: soft-capping is required, especially for the 27B model, but it limits speed optimizations. This is particularly the case for inference and seems to matter less during fine-tuning. We'd be interested in seeing whether toggling FA2 on / soft-capping off during fine-tuning still results in correct fine-tunes, as this would significantly accelerate training on most cards.

---

Secondly, the model was trained in bfloat16 and seems to perform best at that precision. The 27B model is sensitive to precision, and running it in fp16 results in incorrect outputs. We've checked and confirmed that using bitsandbytes (BNB) to run the checkpoints in 4-bit and 8-bit works correctly.

---

If you're looking for presets to run Gemma 2 with transformers, this is what we recommend for optimal performance (a minimal loading example follows below):
- Version v4.42.3
- Running with `attn_implementation='eager'` (so no FA/FA2)
- Running in bfloat16 to start with
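As a sketch of those presets in code (the model ID, prompt, and device placement below are illustrative assumptions, not part of the post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"  # assumption: the 27B instruction-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,      # trained in bfloat16; fp16 gives incorrect outputs
    attn_implementation="eager",     # FA/FA2 kernels don't support soft-capping yet
    device_map="auto",
)

inputs = tokenizer("Why does Gemma 2 soft-cap its attention logits?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Per the post, 4-bit and 8-bit bitsandbytes quantization also work; passing a `BitsAndBytesConfig(load_in_4bit=True)` as `quantization_config` to `from_pretrained` is the usual way to enable it.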

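For readers unfamiliar with the soft-capping mentioned above, here is a minimal sketch of the idea, not the transformers implementation: attention scores are squashed smoothly into (-cap, +cap) with a scaled tanh before the softmax. The cap value of 50.0 mirrors Gemma 2's `attn_logit_softcapping` config entry; the tensor shapes and scale are illustrative.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Squash logits into (-cap, +cap) via a scaled tanh (Gemma 2-style soft-capping)."""
    return cap * torch.tanh(logits / cap)

# Toy attention scores with an exaggerated scale to show the effect at long context.
scores = torch.randn(1, 8, 16, 16) * 100.0   # (batch, heads, q_len, kv_len)
capped = soft_cap(scores, cap=50.0)          # values now lie strictly in (-50, 50)
attn = torch.softmax(capped, dim=-1)         # softmax no longer saturates as hard
```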
Allen Roush

Senior AI Researcher - Specialize in LLM, NLP, Argument Mining, and Generative Art

1w

Ravid Shwartz Ziv This is an example of new models requiring time for inference providers to work through the pain points, and why Llama 3 took a while to get decent fine-tunes.

Emmanuel Batt

Driving Innovation with Digital Technology

1w

Did I dream it, or was it available a few hours ago in Hugging Chat and then removed shortly after?

Ingo Villnow

Data Scientist, Machine Learning Engineer, Customer Insights Specialist at Vattenfall Europe Sales GmbH

1d

And how about using FA3 with soft cap?

