Grok-1.5 Officially Released
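Source: x.ai

xAI has released the Grok-1.5 large model, with significantly improved reasoning and coding capabilities, support for a 128K context window, and markedly better results on math and code benchmarks. It is now available to Premium+ users on the X platform.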
Introducing Grok-1.5, our latest model capable of long context understanding and advanced reasoning. Grok-1.5 will be available to our early testers and existing Grok users on the 𝕏 platform in the coming days.
By releasing the model weights and network architecture of Grok-1 two weeks ago, we presented a glimpse into the progress xAI had made up until last November. Since then, we have improved reasoning and problem-solving capabilities in our latest model, Grok-1.5.

Capabilities and Reasoning
One of the most notable improvements in Grok-1.5 is its performance in coding and math-related tasks. In our tests, Grok-1.5 achieved a 50.6% score on the MATH benchmark and a 90% score on the GSM8K benchmark, two math benchmarks covering a wide range of grade school to high school competition problems. Additionally, it scored 74.1% on the HumanEval benchmark, which evaluates code generation and problem-solving abilities.
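To put the gains over Grok-1 in concrete terms, the short sketch below computes the percentage-point improvement on each benchmark. The Grok-1 and Grok-1.5 scores are copied from this post's comparison table; the names `SCORES` and `point_gains` are illustrative only and are not part of any xAI tooling.

```python
# Scores in percent as (Grok-1, Grok-1.5), copied from the comparison table.
# Both models use the same shot setting per benchmark, so the rows are comparable.
SCORES = {
    "MMLU": (73.0, 81.3),
    "MATH": (23.9, 50.6),
    "GSM8K": (62.9, 90.0),
    "HumanEval": (63.2, 74.1),
}


def point_gains(scores):
    """Return the percentage-point gain (new minus old) per benchmark, rounded to 0.1."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    for name, gain in point_gains(SCORES).items():
        print(f"{name}: +{gain} points")
```

By this calculation the largest gains are on the two math benchmarks (MATH and GSM8K), which is consistent with the post's emphasis on math-related tasks.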
| Benchmark | Grok-1 | Grok-1.5 | Mistral Large | Claude 2 | Claude 3 Sonnet | Gemini Pro 1.5 | GPT-4 | Claude 3 Opus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | 73% 5-shot | 81.3% 5-shot | 81.2% 5-shot | 75% 5-shot | 79% 5-shot | 83.7% 5-shot | 86.4% 5-shot | 86.8% 5-shot |
| MATH | 23.9% 4-shot | 50.6% 4-shot | — | — | 40.5% 4-shot | 58.5% 4-shot | 52.9% 4-shot | 61% 4-shot |
| GSM8K | 62.9% 8-shot | 90% 8-shot | 81% 5-shot | 88% 0-shot CoT | 92.3% 0-shot CoT | 91.7% 11-shot | 92% 5-shot | 95% 0-shot CoT |
| HumanEval | 63.2% 0-shot | 74.1% 0-shot | 45.1% 0-shot | 70% 0-shot | 73% 0-shot | 71.9% 0-shot | 67% 0-shot | 84.9% 0-shot |

Long Context Understanding