Hugging Face

About us

The AI community building the future.

Website
https://huggingface.co
Industry
Software Development
Company size
51-200 employees
Type
Privately Held
Founded
2016
Specialties
machine learning, natural language processing, and deep learning

Updates

  • Hugging Face reposted this

    Zachary Mueller

    Technical Lead for Accelerate at HuggingFace

    Another start to the month, another Hugging Face Accelerate release! We've been cooking, let's talk about how 👨‍🍳

    * There's a new profiler in town that can help you collect performance metrics during training and inference, and you can then visualize it in tools like Chrome. Check out the new docs (linked below) for more info!
    * Thanks to Stas Bekman we were able to track down, identify, and fix a slowdown during `import accelerate` that reduced our import times by over 60%! We've taken steps to ensure such slowdowns can't go unnoticed again, and we hope you enjoy being able to accelerate a bit faster!
    * We've added support for more complex PyTorch DDP communication hooks, allowing you to customize how gradients are communicated across workers. (New docs linked below)
    * With XPU now in native PyTorch, we've upstreamed the support so you can continue right away using the native implementation (note: this requires PyTorch >= 2.4).
    * And so, so much more.

    Enjoy the release!
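
    Below is a minimal sketch of the kind of profiling flow the release describes, built on the plain `torch.profiler` API to export a Chrome-viewable trace around an Accelerate training loop; the model, optimizer, and dataset are placeholders, and Accelerate's new profiler docs (linked in the post) cover its dedicated integration.

    ```python
    # Sketch: profile a few Accelerate-prepared training steps and export a
    # Chrome trace (open it in chrome://tracing or Perfetto). The model,
    # optimizer, and dataset below are placeholders for illustration.
    import torch
    from torch.profiler import ProfilerActivity, profile
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    accelerator = Accelerator()

    model = torch.nn.Linear(128, 10)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    dataset = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
    dataloader = DataLoader(dataset, batch_size=32)

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        for inputs, targets in dataloader:
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()

    prof.export_chrome_trace("trace.json")
    ```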

  • Hugging Face reposted this

    Julien Chaumond

    CTO at Hugging Face

    We need more knowledge sharing about running ML infrastructure at scale! Here's the mix of AWS instances we currently run our serverless Inference API on. For context, the Inference API is the infra service that powers the widgets on Hugging Face Hub model pages; PRO users and Enterprise orgs can also use it programmatically.

    64 g4dn.2xlarge
    48 g5.12xlarge
    48 g5.2xlarge
    10 p4de.24xlarge
    42 r6id.2xlarge
    9 r7i.2xlarge
    6 m6a.2xlarge (control plane and monitoring)

    Total = 229 instances

    This is a thread for AI Infra aficionados 🤓 What mix of instances do you run?

  • Hugging Face reposted this

    Ahsen Khaliq

    ML @ Hugging Face

    MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

    Paper page: https://buff.ly/45RdSmP

    The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
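
    To make the attention patterns above concrete, here is a purely illustrative sketch (not the paper's kernels) of dense boolean masks for two of them in PyTorch: A-shape (first tokens plus a local causal window) and Vertical-Slash (a few global key columns plus diagonal stripes at fixed offsets); the function names and default sizes are invented for the example.

    ```python
    # Illustrative dense-mask versions of two sparse attention patterns named
    # above. The real method selects a pattern per head offline and runs
    # optimized sparse GPU kernels rather than materializing dense masks.
    import torch

    def a_shape_mask(seq_len, n_initial=64, window=256):
        """Attend to the first `n_initial` keys plus a local causal window."""
        q = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
        k = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
        causal = k <= q
        return causal & ((k < n_initial) | ((q - k) < window))

    def vertical_slash_mask(seq_len, vertical_cols, slash_offsets):
        """Attend to a few global key columns plus diagonal 'slash' stripes."""
        q = torch.arange(seq_len).unsqueeze(1)
        k = torch.arange(seq_len).unsqueeze(0)
        causal = k <= q
        vertical = torch.isin(k, torch.as_tensor(vertical_cols))
        slash = torch.isin(q - k, torch.as_tensor(slash_offsets))
        return causal & (vertical | slash)

    mask = a_shape_mask(1024)
    kept = mask.sum().item() / (1024 * 1025 // 2)  # fraction of causal entries kept
    print(f"A-shape keeps {kept:.1%} of the causal attention entries")
    ```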

  • Hugging Face

    Transformers v4.42 includes a new Transformer-based model capable of real-time object detection. See below for more info:

    Niels Rogge

    Machine Learning Engineer at ML6 & Hugging Face

    RT-DETR is now supported in Hugging Face Transformers! 🙌

    RT-DETR, short for "Real-Time DEtection TRansformer", is a computer vision model developed at Peking University and Baidu, Inc. capable of real-time object detection. The authors claim better performance than YOLO models in both speed and accuracy. The model comes with an Apache 2.0 license, meaning people can freely use it for commercial applications. 🔥

    RT-DETR is a follow-up work to DETR, a model developed by AI at Meta that was the first to successfully use Transformers for object detection. The latter has been in the Transformers library since 2020. Since then, many improvements have been made to enable faster convergence and inference; RT-DETR is an important example of that, as it unlocks real-time inference at high accuracy!

    Big congrats to Daniel Choi for contributing this model!

    * Demo notebooks (fine-tuning + inference): https://lnkd.in/eA_WzsyE
    * Demo Space: https://lnkd.in/ewzWTSHA
    * Paper: https://lnkd.in/eR3Qg6dm

    #ai #artificialintelligence #objectdetection #huggingface #computervision
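
    As a rough sketch of how this looks in code, the transformers object-detection pipeline can load an RT-DETR checkpoint directly (the checkpoint id below is an assumption; the exact names are in the demo notebooks above):

    ```python
    # Rough sketch: running RT-DETR through the transformers pipeline
    # (requires transformers >= 4.42). The checkpoint id is an assumption;
    # see the demo notebooks linked above for the exact model names.
    from transformers import pipeline

    detector = pipeline("object-detection", model="PekingU/rtdetr_r50vd")
    results = detector(
        "http://images.cocodataset.org/val2017/000000039769.jpg",  # sample COCO image
        threshold=0.5,
    )

    for det in results:
        print(f"{det['label']}: {det['score']:.2f} at {det['box']}")
    ```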

  • Hugging Face reposted this

    Lysandre Debut

    Head of Open Source at Hugging Face

    🤗 Last week, Gemma 2 was released. Since then, implementations have been tuned to reflect the model's performance:

    ```
    pip install -U transformers==4.42.3
    ```

    We saw reports of tools (transformers, llama.cpp) not being on par with Google-led releases (Google AI Studio). Why is that?

    ✨ The first and most important aspect is that Google implemented soft-capping of logits within the attention. This change is meaningful: we tested with and without this soft-capping, and while metrics show little difference, we see very significant changes in long contexts. The current implementations of Flash Attention kernels do not allow for this soft-capping (yet), so we cannot take advantage of the Flash Attention speed gains without losing performance.

    Conclusion: soft-capping is required, especially for the 27B model, but it affects speed optimizations. This is particularly the case for inference but seems to be less relevant during fine-tuning. We'd be interested in seeing whether toggling FA2 on/soft-capping off during fine-tuning results in correct fine-tunes, as this would significantly accelerate training on most cards.

    ---

    Secondly, the model was trained in bfloat16 and seems to perform best with that precision. The 27B model is sensitive to precision, and running it in fp16 results in incorrect outputs. We've checked and confirmed that using BNB to run the checkpoints in 4-bit + 8-bit works correctly.

    ---

    If you're looking for presets to run with transformers, this is what we recommend for optimal performance:
    - Version v4.42.3
    - Running with `attn_implementation='eager'` (so no FA/FA2)
    - Running in bfloat16 to start with
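
    Put together, those presets translate to roughly the following sketch (the gated instruction-tuned checkpoint id `google/gemma-2-27b-it` is assumed here):

    ```python
    # Sketch of the recommended Gemma 2 presets: transformers v4.42.3,
    # eager attention (no FA/FA2), and bfloat16.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/gemma-2-27b-it"  # gated repo; assumed checkpoint id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,       # trained in bf16; fp16 gives incorrect outputs
        attn_implementation="eager",      # FA2 kernels don't support soft-capping yet
        device_map="auto",
    )

    inputs = tokenizer("Why does attention soft-capping matter?", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```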

  • Hugging Face reposted this

    Ahsen Khaliq

    ML @ Hugging Face

    Scaling Synthetic Data Creation with 1,000,000,000 Personas

    Paper page: https://buff.ly/3XJ6Hek

    We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs, and tools (functions) at scale, we demonstrate that persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
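
    As a purely hypothetical illustration of the persona-driven idea (the template and personas below are invented for the example, not taken from Persona Hub), the same synthesis task can be conditioned on different personas:

    ```python
    # Hypothetical sketch of persona-driven data synthesis: one task template
    # conditioned on different personas yields diverse synthesis prompts.
    # Personas and template are illustrative, not from Persona Hub.
    personas = [
        "a high-school physics teacher who loves everyday analogies",
        "a competitive programmer obsessed with edge cases",
        "a nurse scheduling shifts across three hospital wards",
    ]

    TEMPLATE = (
        "You are {persona}.\n"
        "Write one challenging math word problem grounded in your daily work, "
        "then give a step-by-step solution."
    )

    def build_prompts(persona_list):
        """Turn each persona into a distinct synthesis prompt for an LLM."""
        return [TEMPLATE.format(persona=p) for p in persona_list]

    for prompt in build_prompts(personas):
        # In a real pipeline each prompt would be sent to an LLM and the
        # generated problem/solution pair stored as synthetic training data.
        print(prompt, end="\n---\n")
    ```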


Funding

Hugging Face: 7 total rounds
Last round: Series D