MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Published on 2024-06-27


AI Summary

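This paper is titled "MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data". It studies how large language models (LLMs) perform on mathematical problem solving, using the newly developed "MathOdyssey" dataset. The dataset contains diverse mathematical problems at high school and university levels, created by experts from notable institutions, and is intended to rigorously test LLMs in advanced problem-solving scenarios while covering a wider range of subject areas. The researchers benchmarked open-source models (such as Llama-3 and DBRX-Instruct) as well as closed-source models from the GPT series and Gemini. The results show that although LLMs perform well on routine and moderately difficult tasks, they face significant challenges on Olympiad-level problems and complex university-level problems. The study emphasizes the continuing need for research to strengthen the mathematical reasoning of LLMs. The dataset, results, and code are all publicly available.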


Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.

Last updated on 2024-08-02