Today is a huge day for open source AI: Argilla is joining Hugging Face 🤗 🚀 It's time to double down on community, good data for AI, product features, and open collaboration. We're thrilled to continue our path with the wonderful Argilla team and a broader team and vision, with shared values and culture! Thanks to our investors Zetta Venture Partners (James Alcorn), Criteria Venture Tech (Roma Jelinskaite, Albert Morro, Aleix Pérez), Eniac Ventures (Hadley Harris, Dan Jaeck, Monica Lim), and many others, so lucky to have worked with you! https://lnkd.in/dfxvgpsT
Argilla
Software development
Madrid, MADRID · 9,026 followers
The Platform where experts improve AI models
About us
Build robust NLP products through faster data labeling and curation. Argilla empowers teams with the easiest-to-use human-in-the-loop and programmatic labeling features.
- Website
- https://www.argilla.io
- Industry
- Software development
- Company size
- 11-50 employees
- Headquarters
- Madrid, MADRID
- Type
- Privately held
- Founded
- 2017
- Specialties
- NLP, artificial intelligence, Data science, and Open Source
Products
Argilla
Data labeling platforms
The feedback layer for enterprise LLMs. Build robust language models with human and machine feedback. Argilla empowers data teams, from fine-tuning and RLHF to continuous model improvement.
Locations
-
Primary
Calle de Vandergoten, 1
Madrid, MADRID 28005, ES
-
Moli Canyars, 7
Carpesa, Valencia 46132, ES
Employees at Argilla
Updates
-
ZenML has a new integration with Argilla 🤟 Synthetic data lovers, we will also discuss distilabel ⚗️ Alex S. will show you how to use it in our upcoming community meetup. Sara Han Díaz Lorenzo is putting the final touches on the PR for Argilla 2.0 support in the integration 😎
We finally have a ZenML and Argilla collab 😍 Building a data flywheel for your RAG applications is critical for the successful deployment of your LLM. In the latest Argilla community meetup, Alex S. will showcase how you can use synthetic data generated by distilabel to bootstrap embedding model fine-tuning, and then use human feedback in Argilla to iteratively and continuously improve model performance. Thank you Daniel Vila Suero and David Berenstein for the invite <3! You don't want to miss this! The event is on Thursday, August 8, 5 PM-6 PM GMT+2. Sign up for free here. 👉 👉👉https://lu.ma/4b5ick1e
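The flywheel idea above (bootstrap with synthetic data, then improve with human feedback) can be sketched as a loop. All function names here are hypothetical stand-ins, not the actual ZenML, distilabel, or Argilla APIs:

```python
# Illustrative sketch of a "data flywheel": synthetic data bootstraps the
# first fine-tune, human feedback grows the dataset on every iteration.
from typing import Callable, List


def data_flywheel(
    generate_synthetic: Callable[[], List[dict]],      # distilabel-style bootstrap
    fine_tune: Callable[[List[dict]], object],         # e.g. embedding fine-tuning
    collect_feedback: Callable[[object], List[dict]],  # Argilla-style human review
    iterations: int = 3,
) -> object:
    """Fine-tune repeatedly, folding human corrections back into the data."""
    dataset = generate_synthetic()
    model = None
    for _ in range(iterations):
        model = fine_tune(dataset)
        dataset = dataset + collect_feedback(model)  # grow the training set
    return model
```

With stub functions, each iteration trains on a strictly larger dataset, which is the whole point of the flywheel.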
-
Argilla shared this
Dropping magpie-ultra-v0.1, the first open synthetic dataset built with Llama 3.1 405B. Created with distilabel, it's our most advanced and compute-intensive pipeline to date. https://lnkd.in/ecXn_Gbi Almost two months ago, Magpie by the University of Washington and Ai2 was released. It described a simple mechanism to generate instruction-response pairs with no system prompt or seed data, taking advantage of the autoregressive capabilities of LLMs and the SFT fine-tuning done with a chat template. They released two new datasets: Magpie-Air, generated with Llama 3 8B Instruct, and Magpie-Pro, generated with Llama 3 70B Instruct. As mentioned, no system prompt or seed data is needed to generate the instruction-response pairs: Magpie is essentially a hack that extracts instruction-response pairs similar to the ones used during the SFT phase of an LLM. As you may know, Argilla joined Hugging Face, and a few weeks later the new Llama 3.1 family of models by AI at Meta was released! It came with a big, big model: Llama 3.1 405B. We saw this as an opportunity and decided to replicate the Magpie recipe with the chunky boy to create Magpie Ultra v0.1, the first public synthetic dataset created with Llama 3.1 405B. The dataset contains 50K unfiltered rows of instruction-response pairs across different categories: Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming, or Others. It contains all the columns needed for proper filtering, ensuring a leaner final dataset with more difficult, high-quality, safe, and diverse instructions. We will be working on a filtered version in the coming days. The dataset can be used for SFT, but it can also be used for RLAIF, as we generated two responses: one with the instruct model and one with the base model.
As described in the Llama 3 paper, the models that will probably get the most out of fine-tuning on it are, of course, small models! You can explore the dataset in Argilla: https://lnkd.in/eK3dahJE I'm very excited about this dataset, as I was able to make the GPUs of the science-cluster go brrrrrr. It also helped me a lot to test the upcoming features of distilabel 1.3.0, which will be released next Tuesday. Only thing I can say is that scaling synthetic dataset
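The Magpie trick described above can be sketched as two prompt builders, assuming a Llama-3-style chat template (this is illustrative, not the distilabel pipeline itself): stop the prompt right after the user header, and the instruct-tuned model autoregressively "completes" a plausible instruction, since that is what followed this prefix during SFT.

```python
# Minimal sketch of the Magpie pre-query prompt, assuming the Llama 3
# chat template's special tokens.

def magpie_instruction_prompt() -> str:
    # Prompt ends where a user message would begin, so the model
    # generates an instruction instead of answering one.
    return "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"


def magpie_response_prompt(instruction: str) -> str:
    # Feed the extracted instruction back to get the paired response.
    return (
        magpie_instruction_prompt()
        + instruction
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
```

Sampling a completion from the first prompt yields the instruction; the second prompt then yields the matching response, producing instruction-response pairs with no seed data.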
argilla/magpie-ultra-v0.1 · Datasets at Hugging Face
huggingface.co
-
Argilla shared this
Play with Argilla 2.0 UI on Hugging Face 🚀 https://lnkd.in/eY-83EBe The new demo is full of nice datasets, thanks to the awesome David Berenstein!
-
Argilla shared this
🦙 Mixture of Llamas 🦙 Sharing a new synthetic data generation example on Colab with Llama 3.1 and distilabel, implementing the recent Mixture of Agents method. What's Mixture of Agents for LLMs? A new approach that leverages the collective strengths of multiple LLMs to produce better outputs, as follows: 👩🎓 Several proposer LLMs generate outputs for a given input multiple times, improving responses by including previous outputs in the system prompt (70B and CodeLlama in the example) 👩🏫 An aggregator LLM combines these outputs into a high-quality final response (the 405B model in the example) Free Colab we made with Gabriel Martín Blázquez: https://lnkd.in/dmdVfegM
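The proposer/aggregator flow above can be sketched in a few lines. The signatures here are hypothetical (in the actual example the models are Llama 3.1 models run through distilabel), but the control flow matches the method's description:

```python
# Minimal sketch of the Mixture of Agents pattern: proposers answer in
# rounds, seeing the previous round's outputs in their system prompt;
# an aggregator merges the final round into one response.
from typing import Callable, List

LLM = Callable[[str, str], str]  # (system_prompt, user_prompt) -> response


def mixture_of_agents(proposers: List[LLM], aggregator: LLM,
                      prompt: str, rounds: int = 2) -> str:
    previous: List[str] = []
    for _ in range(rounds):
        # Each round, proposers try to improve on earlier candidates.
        system = ("Previous candidate responses:\n" + "\n---\n".join(previous)
                  if previous else "")
        previous = [llm(system, prompt) for llm in proposers]
    # The aggregator combines the last round into a final answer.
    agg_system = ("Synthesize these candidates into one high-quality "
                  "answer:\n" + "\n---\n".join(previous))
    return aggregator(agg_system, prompt)
```

Swapping the callables for real model clients gives the same topology: several smaller proposers, one larger aggregator.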
-
High quality data makes models go from their model to your model Read the full post announcing Argilla 2.0: https://lnkd.in/dGJJefHU
-
Argilla shared this
Argilla 2.0 is out! 🥳 Over the last months, the Argilla team has been working to create this new version, which unifies what were known as the "old datasets" (TextClassificationDataset, TokenClassificationDataset, etc.) and the `FeedbackDataset`s into a new class called `Dataset`, along with a new Python SDK that is super nice and super easy to use! The `Dataset` class comes with all the ingredients required for a good annotation job: - Highly configurable, allowing multiple fields and multiple questions to be displayed to annotators. - Easy to filter using metadata, semantic search, or text search. - And... this is 🆕... task distribution! This version ships with the task distribution feature, which lets you define how many annotations (annotator overlap) are required per record! This is only the first strategy; we will be adding more soon 🤗 If you don't know Argilla or want to start working with it, you can start today by creating a 🆓 space on Hugging Face: https://lnkd.in/d5K6weJF
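The task distribution idea (a minimum annotator overlap per record) can be sketched in plain Python. This is illustrative only, not Argilla's implementation:

```python
# Sketch of overlap-based task distribution: each record is assigned
# round-robin until `min_submitted` distinct annotators have it.
from typing import Dict, List


def distribute_tasks(record_ids: List[str], annotators: List[str],
                     min_submitted: int = 2) -> Dict[str, List[str]]:
    """Assign each record to `min_submitted` distinct annotators."""
    assert min_submitted <= len(annotators)
    assignments: Dict[str, List[str]] = {a: [] for a in annotators}
    i = 0
    for rid in record_ids:
        for _ in range(min_submitted):
            # Consecutive indices mod len(annotators) are distinct
            # as long as min_submitted <= len(annotators).
            assignments[annotators[i % len(annotators)]].append(rid)
            i += 1
    return assignments
```

Round-robin keeps the workload balanced while still guaranteeing the required overlap on every record.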
-
Argilla shared this
🌐 Contribute to building a truly multilingual benchmark for LLMs in collaboration with Cohere For AI. Help review some MMLU translations in your language! 🔥 The progress - More than 22K contributions, ~150 contributors - 8 languages are completed (Russian, Hindi, Telugu, Arabic, Spanish, Korean, French, Ukrainian) If you have 10 min, join this Hugging Face Argilla Space and start reviewing translations in your language: https://lnkd.in/dTyHaPEF Vietnamese, Portuguese, Amharic, German, and Indonesian are almost complete; help finish them so they can be included in the benchmark. Many other languages need contributions too. Watch the progress for your language: https://lnkd.in/dqzg6KQq
-
Argilla shared this
Since Argilla joined Hugging Face, I've mainly worked on making it crazy easy to improve datasets with feedback. This before-and-after highlights how the SDK contributes to this by natively integrating with packages like datasets, making it easy to get feedback on changing data. 👯♀️ Changes from before to now: - You don't need to define records all the time. Just align your dataset and feedback task, and Argilla will match fields to fields and questions to questions. - Argilla can use the identifiers in your dataset, so you don't need to define external ids. Later on, you can use these to edit, delete, or update records. - You don't need to create new datasets for each version. Argilla can update only the records with actual changes. - Log is back! Seasoned users of Argilla will remember the log function from the early days. Because Argilla supports updating records, it now makes sense again to use a dynamic log method to create or update records based on their IDs. 🛣 Why should I care? (Re-)sharing datasets with a team of experts is a crucial step in the ML lifecycle. Good engineers will aim to do this as much as possible, so a good feedback SDK should make it super simple. 🐉 A lot has changed in the background to support this abstraction, so it's refreshing to share the development succinctly like this. If you want details, check out this how-to guide on managing records: https://lnkd.in/er-dfRuM #ai #llm #datasets #opensourceai
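The log-style upsert behavior described above can be sketched in plain Python (illustrative, not the Argilla SDK): records are created or updated based on their ids, and identical records are left untouched, so you never need a new dataset per version.

```python
# Sketch of id-based upsert semantics for a record store.
from typing import Dict, List


def log_records(store: Dict[str, dict], records: List[dict]) -> Dict[str, int]:
    """Create new records, update changed ones, skip identical ones."""
    stats = {"created": 0, "updated": 0, "unchanged": 0}
    for rec in records:
        rid = rec["id"]
        if rid not in store:
            store[rid] = rec              # new id -> create
            stats["created"] += 1
        elif store[rid] != rec:
            store[rid] = rec              # same id, new content -> update
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1       # identical -> nothing to do
    return stats
```

Calling it twice with the same data is a no-op on the second pass, which is exactly what makes a dynamic log method safe to re-run.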
-
Argilla shared this
I couldn't be more excited to announce Argilla 2.0 👻 What makes it even more special? We're doing it with the Hugging Face team, which means: ✌️ More talent, more feedback, more learning, and this is awesome for the future of Argilla! 👉 A real opportunity to say hello to the rest of the OSS AI community who haven't heard about Argilla before! So basically, to keep introductions short: Argilla is designed for anyone who values data and wants high-quality AI projects. Big 💗 for the entire Argilla team, who always focus on quality, listening to the community's needs, and both thinking about and building great things! https://lnkd.in/dGw54zKK
Funding
Last round
Seed: US$5,500,000.00