Leandro von Werra’s Post

Chief Loss Officer at Hugging Face

We just released the 🌸 BigCodeBenchmark, which tests LLMs on harder, more realistic coding tasks that involve tool usage. 🛠 While benchmarks like HumanEval are saturating, even GPT-4o and DeepSeekCoder-v2 land at only around 50% here, while humans get 97%!

A few highlights 🚀:
- 🛠 Tasks use diverse function calls from 139 popular Python libraries.
- 🤓 Complex, user-oriented instructions for each task.
- 📊 Includes verified examples and high test coverage.
- 🙋‍♂️ Comes in a standard function-completion form as well as an instruction version.

Resources:
- 🤗📊 HF Leaderboard: https://lnkd.in/eqxnEAPE
- 🤗🗂️ HF Dataset: https://lnkd.in/esc8MRVD
- 🤗🔍 HF Data Viewer: https://lnkd.in/esp_ZaTC
- 💻 Code: https://lnkd.in/eTgzMWRv
- 📝 Paper: https://lnkd.in/e4MYx3CK

Leandro von Werra

Chief Loss Officer at Hugging Face

1mo

Awesome work led by Terry Yue Zhuo!

Utkarsh Priyadarshi

CS @UW Madison | Ex Founder & CEO Toonication.com | AI/ML Researcher | Harvard Innovation Fellow

1mo

Love this! I have some questions about it and would love to discuss them on a call.

Partha Pratim Ray

Top 2% Scientist by Stanford University, Founder of Indian Knowledge Forum, IoT Researcher, Generative AI Enthusiast, Indian Knowledge Bearer, FIETE, Technical Evangelist, Promoter of Indian Knowledge

1mo

I'll keep this in mind

Claudio Spiess

Research Intern @ IBM | CS PhD student @ UC Davis | ML for Software Engineering, LLMs for code, AI4SE

1mo

Super exciting work! There's definitely a need for harder/more realistic benchmarks than HumanEval, MBPP, etc.

Saurav Nanda

Generative AI | NLP | Cloud

1mo

This was much needed! Thanks, Leandro and the Hugging Face team!
