Today marks an inflection point for open-source AI with the launch of Meta's Llama 3.1 405B, the largest openly available foundation model, which rivals the best closed-source models and will rapidly accelerate the adoption of open-source AI among developers and enterprises. We are excited to partner with Meta to bring all the Llama 3.1 models (8B, 70B, 405B, and LlamaGuard) to Together Inference and Together Fine-tuning.

Together Inference delivers horizontal scalability with industry-leading performance: up to 80 tokens per second for Llama 3.1 405B and up to 400 tokens per second for Llama 3.1 8B, 1.9x to 4.5x faster than vLLM while maintaining full accuracy against Meta's reference implementation across all models. Together Turbo endpoints are available at $0.18 for 8B and $0.88 for 70B, a cost 17x lower than GPT-4o. This empowers developers and enterprises to build generative AI applications at production scale in their chosen environment: Together Cloud (serverless or dedicated endpoints) or private clouds. As the launch partner for the Llama 3.1 models, we're thrilled for customers to get the best performance, accuracy, and cost for their generative AI workloads on the Together Platform while keeping ownership of their models and their data secure.

Function calling is supported natively by each of the models, and JSON mode is available for the 8B and 70B models (coming soon for the 405B model).

Together Turbo endpoints empower businesses to prioritize performance, quality, and price without compromise. They provide the most accurate quantization available for Llama 3.1 models, closely matching full-precision FP16 models. These advancements make Together Inference the fastest engine for NVIDIA GPUs and the most cost-effective solution for building with Llama 3.1 at scale.

https://lnkd.in/gFwBNQhJ
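To make the JSON mode mentioned above concrete, here is a minimal sketch of how a request body for Together's OpenAI-compatible chat completions endpoint might look with JSON mode enabled. The endpoint URL, the model id, and the `response_format` field are assumptions based on the OpenAI-compatible API convention; check the Together documentation for the exact names before use.

```python
import json

# Assumed endpoint URL for Together's chat completions API (verify in docs).
TOGETHER_CHAT_URL = "https://api.together.xyz/v1/chat/completions"


def build_json_mode_request(
    prompt: str,
    # Assumed model id for the Llama 3.1 8B Turbo endpoint (verify in docs).
    model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
) -> dict:
    """Build a chat-completion request body with JSON mode enabled."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer only with valid JSON."},
            {"role": "user", "content": prompt},
        ],
        # response_format asks the endpoint to constrain output to valid JSON.
        "response_format": {"type": "json_object"},
        "max_tokens": 256,
    }


payload = build_json_mode_request("List three Llama 3.1 model sizes as a JSON array.")
# Actually sending the request needs an API key, e.g. with the requests library:
#   requests.post(TOGETHER_CHAT_URL,
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 json=payload)
print(json.dumps(payload, indent=2))
```

The same payload shape works for the 70B Turbo endpoint by swapping the model id; per the post, JSON mode for the 405B model is coming soon.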