trishullab / PutnamBench

An evaluation benchmark for undergraduate competition math in Lean4, Isabelle, Coq, and natural language.


PutnamBench

PutnamBench is a benchmark for evaluating theorem-proving algorithms on competition mathematics problems sourced from the William Lowell Putnam Mathematical Competition, years 1962 through 2023. Our formalizations currently span three formal languages: Lean 4, Isabelle, and Coq. PutnamBench comprises 1697 manually crafted formalizations, aggregated over all languages.
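Each formalization is a standalone theorem statement with the proof left unfilled; a prover is tasked with completing it. As a rough sketch only (the problem name, statement, and Mathlib dependency below are invented for illustration and are not an actual benchmark entry), a Lean 4 entry has the following general shape:

    import Mathlib

    -- Hypothetical entry for illustration; files follow the naming scheme
    -- putnam_<year>_<problem>, with the proof left as `sorry` for the
    -- prover to fill in.
    theorem putnam_9999_a1 (n : ℕ) (hn : 1 ≤ n) :
        n ≤ n ^ 2 := by
      sorry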

PutnamBench aims to support research in automated mathematical reasoning by providing a multilingual benchmark for evaluating theorem-proving algorithms. It is released under permissive licenses (Apache 2.0 for Lean 4 and Isabelle, MIT for Coq). The informal statements are also available with permission from the MAA.

PutnamBench includes factored solutions for problems which require exhibiting a numerical answer in addition to a proof of its correctness. For these problems, one can attempt two tasks: proving the theorem with the numerical answer already substituted into its statement, or producing the answer itself along with the proof.
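As a minimal sketch of the factored-solution idea (the names, statement, and answer below are invented for illustration, assuming Mathlib), the numerical answer is pulled out into its own definition, which the theorem statement then references. A prover can either be handed the answer and prove the theorem as stated, or be asked to synthesize the definition's body as well:

    import Mathlib

    -- Hypothetical factored-answer entry: the answer lives in a separate
    -- definition, so the "answer known" task proves the theorem as stated,
    -- while the "answer unknown" task must also produce the value below.
    abbrev putnam_9999_b2_solution : ℕ := 4

    theorem putnam_9999_b2 :
        IsLeast {n : ℕ | 2 < n ∧ Even n} putnam_9999_b2_solution := by
      sorry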

We are hosting a leaderboard and welcome evaluation results that are accompanied by a preprint or publication. Please do not post proofs publicly as confirmation of your results. Reach out privately at george.tsoukalas@utexas.edu with any requests for additions to the leaderboard.

We strongly encourage community feedback! Please let us know if you have any comments for improving PutnamBench. If you notice any mistakes, please raise an issue on the repository and we will address it. To reduce contamination, we kindly ask that you do not write formal proofs for any of the problems.

Statistics

Language   Count
Lean 4     640
Isabelle   640
Coq        417

We also report the number of problems in each category. Note that some problems fall under multiple categories. While the categories are intended to capture general features of the problems, there is high variance among the problems within an individual category.

Category          Count
Algebra           253
Analysis          226
Number Theory     107
Geometry          68
Linear Algebra    51
Abstract Algebra  28
Combinatorics     26
Probability       9
Set Theory        8

Versioning

  • Version: v0
  • Preliminary release, intended to gather initial community feedback. We plan to release an official first version after several weeks of community discussion.

Citation

The associated paper for PutnamBench is available on arXiv at https://arxiv.org/abs/2407.11214. Please consider including the following citation if you find PutnamBench useful.

@misc{tsoukalas2024putnambenchevaluatingneuraltheoremprovers,
      title={PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition}, 
      author={George Tsoukalas and Jasper Lee and John Jennings and Jimmy Xin and Michelle Ding and Michael Jennings and Amitayush Thakur and Swarat Chaudhuri},
      year={2024},
      eprint={2407.11214},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.11214}, 
}

Languages

  • Isabelle: 44.6%
  • Lean: 31.5%
  • Coq: 21.9%
  • Python: 2.0%
  • Shell: 0.1%