Hugging Face’s Post

Hugging Face reposted this

Lysandre Debut

Head of Open Source at Hugging Face

🤗 Last week, Gemma 2 was released. Since then, the implementations have been tuned to reflect the model's performance:

```
pip install -U transformers==4.42.3
```

We saw reports of tools (transformers, llama.cpp) not being on par with Google-led releases (Google AI Studio). Why is that?

✨ The first and most important aspect is that Google implemented soft-capping of the logits within the attention. This change is meaningful: we tested with and without soft-capping, and while metrics show little difference, we see very significant changes at long context. The current Flash Attention kernels do not support this soft-capping (yet), so we cannot take advantage of the Flash Attention speed gains without losing performance.

Conclusion: soft-capping is required, especially for the 27B model, but it limits speed optimizations. This is particularly the case for inference and seems to matter less during fine-tuning. We'd be interested in seeing whether toggling FA2 on / soft-capping off during fine-tuning still results in correct fine-tunes, as this would significantly accelerate training on most cards.

---

Secondly, the model was trained in bfloat16 and seems to perform best at that precision. The 27B model is sensitive to precision, and running it in fp16 results in incorrect outputs. We've checked and confirmed that using bitsandbytes (BNB) to run the checkpoints in 4-bit and 8-bit works correctly.

---

If you're looking for presets to run Gemma 2 with transformers, this is what we recommend for optimal performance (a minimal loading example follows below):
- Version v4.42.3
- Running with `attn_implementation='eager'` (so no FA/FA2)
- Running in bfloat16 to start with
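As a sketch of those presets in code (the model ID, prompt, and device placement below are illustrative assumptions, not part of the post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"  # assumption: the 27B instruction-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,      # trained in bfloat16; fp16 gives incorrect outputs
    attn_implementation="eager",     # FA/FA2 kernels don't support soft-capping yet
    device_map="auto",
)

inputs = tokenizer("Why does Gemma 2 soft-cap its attention logits?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Per the post, 4-bit and 8-bit bitsandbytes quantization also work; passing a `BitsAndBytesConfig(load_in_4bit=True)` as `quantization_config` to `from_pretrained` is the usual way to enable it.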

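For readers unfamiliar with the soft-capping mentioned above, here is a minimal sketch of the idea, not the transformers implementation: attention scores are squashed smoothly into (-cap, +cap) with a scaled tanh before the softmax. The cap value of 50.0 mirrors Gemma 2's `attn_logit_softcapping` config entry; the tensor shapes and scale are illustrative.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Squash logits into (-cap, +cap) via a scaled tanh (Gemma 2-style soft-capping)."""
    return cap * torch.tanh(logits / cap)

# Toy attention scores with an exaggerated scale to show the effect at long context.
scores = torch.randn(1, 8, 16, 16) * 100.0   # (batch, heads, q_len, kv_len)
capped = soft_cap(scores, cap=50.0)          # values now lie strictly in (-50, 50)
attn = torch.softmax(capped, dim=-1)         # softmax no longer saturates as hard
```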
Allen Roush

Senior AI Researcher - Specialize in LLM, NLP, Argument Mining, and Generative Art

1w

Ravid Shwartz Ziv This is an example of new models requiring time for inference providers to work through the pain points, and why Llama 3 took a while to get decent fine-tunes.

Emmanuel Batt

Driving Innovation with Digital Technology

1w

Did I dream it, or was it available a few hours ago in Hugging Chat and then removed shortly after?

Ingo Villnow

Data Scientist, Machine Learning Engineer, Customer Insights Specialist at Vattenfall Europe Sales GmbH

1d

And how about using FA3 with soft cap?

