We just released 🌸 BigCodeBench: testing LLMs on harder, more realistic coding tasks involving tool usage. 🛠

While benchmarks like HumanEval are saturating, even GPT-4o and DeepSeekCoder-V2 only land around 50% on BigCodeBench, while humans score 97%!

A few highlights 🚀:
- 🛠 tasks use diverse function calls from 139 popular Python libraries
- 🤓 complex, user-oriented instructions for each task
- 📊 verified examples and high test coverage
- 🙋‍♂️ comes in a standard function-completion form as well as an instruction version

Resources:
- 🤗📊 HF Leaderboard: https://lnkd.in/eqxnEAPE
- 🤗🗂️ HF Dataset: https://lnkd.in/esc8MRVD
- 🤗🔍 HF Data Viewer: https://lnkd.in/esp_ZaTC
- 💻 Code: https://lnkd.in/eTgzMWRv
- 📝 Paper: https://lnkd.in/e4MYx3CK
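If you want to poke at the tasks programmatically, here's a minimal sketch using the 🤗 `datasets` library. The dataset ID (`bigcode/bigcodebench`) and the field layout are assumptions based on this post; check the HF Dataset link above for the exact repo, splits, and columns.

```python
# Minimal sketch of pulling the benchmark tasks locally.
# The dataset ID is an assumption -- verify it against the HF Dataset
# link in the post before relying on it.
from datasets import load_dataset

dataset = load_dataset("bigcode/bigcodebench")
print(dataset)  # inspect the available splits and columns

# Each task should expose both variants mentioned above: a standard
# function-completion prompt and an instruction-style prompt.
first_split = next(iter(dataset.values()))
print(first_split[0])
```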
Love this! I have a few questions about it; would you be open to a quick call to discuss?
I'll keep this in mind
Super exciting work! There's definitely a need for harder/more realistic benchmarks than HumanEval, MBPP, etc.
WOWOWOW
This was much needed! Thanks Leandro and the Hugging Face team!
Awesome work led by Terry Yue Zhuo!