mit-plv / engine-bench

Benchmarks for various proof engines

Systematic evaluation of proof engine performance

CI (Coq) CI (LaTeX)

The proof engine is the component of a proof assistant that maintains the state of a partial proof or construction and provides an interface of fine-grained but individually meaningful (and checkable!) steps. For orientation: the state of the proof engine usually features a list of remaining goals, and a proof step might consist of specializing a function to an argument (beta reduction) or creating one new expression node. Our goal is threefold: first, to identify a minimal interface in terms of which higher-level operations (for example, rewriting all occurrences of a pattern using an equality lemma) can be expressed; second, to benchmark the implementation of these operations in existing proof engines to highlight asymptotic performance trends; finally, to specify the amortized complexity each proof engine operation would need to have to implement representative high-level tactics without asymptotic slowdown.
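As a concrete illustration in Coq (not the primitive interface itself, but the closest built-in approximation), a single beta-reduction step can be requested from the reduction machinery:

```coq
(* One beta-reduction step: specialize the function to its argument. *)
(* `cbv beta` performs only beta steps, so `2 + 2` stays unreduced.  *)
Eval cbv beta in ((fun x : nat => x + x) 2).
(* = 2 + 2 : nat *)
```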

Currently, this repository contains benchmarks and quick demonstrations for Coq 8.11.

Interface requirements

Our choice of primitive operations to be benchmarked as the canonical interface of a proof assistant is based on several criteria. We require each operation to be individually checkable, and that after an erroneous step the proof engine can immediately return an error that can be explained from first principles, as opposed to failing arbitrarily long thereafter with a generic error message like "bad proof" or "ill-typed term". For this reason, we consider expressions as a part of some particular context in the proof engine rather than as "free-floating" objects which the user can attempt to use in any context. We treat expression construction as a part of the proof engine API, even if internally the proof engine is further decomposed into an expression language implementation and proof state code that uses it. Based on our experience that (performance-)engineering on top of heuristic procedures is a bad time, we further require the proof engine operations to be deterministic and have fully specified behavior and running time. In particular, conversion- or type-checking with dependent types is not a primitive operation; each rule individually is. The same desire for predictability leads us to specify primitive operations with explicit sharing in their input and output, and to measure their performance in the presence of explicit let binders.
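For instance, in Coq an explicit let binder records sharing directly in the term, and zeta reduction is the step that destroys it by substituting the bound expression (a small illustration, not part of the benchmark suite itself):

```coq
(* A term with explicit sharing: `x` is bound once and used twice. *)
Definition shared : nat * nat := let x := 2 * 2 in (x, x).

(* Zeta reduction substitutes the let-bound expression, duplicating it. *)
Eval cbv zeta in (let x := 2 * 2 in (x, x)).
(* = (2 * 2, 2 * 2) : nat * nat *)
```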

We treat tactic language features such as exceptions, backtracking, multiple resumption, non-proof state, and debug prints as orthogonal to the core challenge of representation and manipulation of the proof state (it seems reasonable to expect that non-trivial interactions and synergies do exist, but they are outside the scope of this work). Still, we require that the proof engine supports operations on any open goal so that multi-goal support can be implemented in the tactic language. We similarly do not investigate features whose sole purpose is to present the proof state to the user, for example generating human-readable names for binders and converting the internal representation to nicely formatted strings.

Operations

  • creating a proof goal given context and statement / creating a new expression hole given context and type
  • creating an application node given function and arguments (using only syntactic conversion for type checking)
  • creating a let binder given bound expression (new hole for continuation)
  • creating a lambda/quantifier given bound type (new hole/subproof for continuation)
  • creating a type annotation given term (using only syntactic conversion for type checking)
  • one-step reductions: lambda application, constructor elimination, definition unfolding
  • using an expression/subproof from a smaller context to fill a hole in a larger context
  • replacing an expression with a reference to an alpha-equivalent variant
  • extracting subterms of a given term*
  • optional: reducing an expression to normal form (with performance expectations with reference to a non-proof-assistant interpreter)
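Several of these operations have rough analogues in Coq's `refine`, which creates an application node whose missing arguments become new holes (goals); a minimal sketch:

```coq
Goal nat * bool.
Proof.
  (* Create an application node `pair _ _` with two fresh holes; *)
  (* each hole becomes a new goal in the ambient context.        *)
  refine (pair _ _).
  - exact 0.     (* fill the first hole *)
  - exact true.  (* fill the second hole *)
Qed.
```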

As the job of the proof engine is to maintain partial proofs, we consider all of these operations with terms and contexts which may contain holes and unfinished subproofs. We do not specify a particular representation of partial proofs and constructions, anything that can be worked with using the operations described above is fair game; when matching existing operations to our specifications we will interpret the operation descriptions more loosely than the overarching requirements discussed earlier. Ideally, the amortized overhead due to operating in a rich context would be asymptotically polylogarithmic (or at least otherwise sublinear) and fast in practice. Needless to say, we are not aware of any proof engine that comes close to achieving this.

* This operation may seem obvious, and, indeed, we expect it to be trivial in any proof engine in which it is valid. However, in extensional type theories, just because λ _ : False, f is well-typed doesn't mean that f is well-typed in the empty context, even if it doesn't mention the binder. Furthermore, it's possible for f x to be well-typed without f and/or x being individually well-typed in some proof assistants such as Nuprl. (For example, (λ _, true) (0 = ℕ) may be well-typed because it reduces to true, even though 0 = ℕ is not well-typed when = is heterogeneous. TODO: Is this example actually correct?)

Performance Graphs

A compilation of autogenerated performance graphs for Coq is available here.

Adding a new let binder underneath n lets (should be Õ(1))

Adding a new binder underneath n binders (should be Õ(1))

Typechecking an application of a function to n arguments (very convenient if this is Õ(n))

Fast alpha-equivalence check (should be Õ(term size))

One-step reductions:

  • one_step_reduction
  • one_step_reduction_with_abstract
  • iota_reduction_fact8
  • iota_reduction_fact9
  • iota_reduction_with_abstract_fact8
  • iota_reduction_with_abstract_fact9
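For orientation, an iota step eliminates a match applied to a constructor; in Coq this can be requested (though not as a guaranteed single step) via the `cbv` flags:

```coq
(* Iota reduction: a match whose scrutinee is a constructor. *)
(* `2` is `S (S O)`, so the `S n` branch fires with `n := 1`. *)
Eval cbv iota beta in (match 2 with 0 => 0 | S n => n end).
(* = 1 : nat *)
```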

Inserting a cast node should not have overhead over type checking

Evar creation should be Õ(1)

Evar context management in Coq

Sample unification problem (context-changing):

  • eq_refl : (fun y => y) = ((fun e y => e y) ?e)
  • eq_refl : (fun y => y) = (fun y => ?e@{no y} y)
  • User: inside ?e, do intro, i.e., ?e@{no y} := fun z => ?e2@{z}
  • eq_refl : (fun y => y) = (fun y => (fun z => ?e2@{z, no y}) y)
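The `?e@{...}` notation above is Coq's display of an evar together with its instance substitution. The context-changing flavor of the problem can be reproduced in a small Coq script (a sketch using standard tactics, not the benchmark code):

```coq
Goal exists f : nat -> nat, forall y : nat, f y = y.
Proof.
  eexists.           (* create evar ?f in the current context       *)
  intro y.           (* y is NOT in ?f's context: goal is ?f y = y  *)
  instantiate (1 := fun z => z).  (* ?f := fun z => z, closed term  *)
  reflexivity.       (* (fun z => z) y beta-reduces to y            *)
Qed.
```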

Lifting identity evar substitution should really be Õ(1)

Composing identity evar substitution should also be Õ(1)

Lifting non-identity evar substitutions (should be Õ(size of subst)):

Optional operations:

Full reduction on a function of complexity O(f) should be Õ(f + input term size + output term size)

Performance Criterion: pattern on k variables should be Õ(term size + k + cost of retypechecking the output term (only if input term is not simply typed))

Performance Criterion: pattern should be Õ(term size * size of thing being patterned + cost of retypechecking the output term (only if input term is not simply typed))

Tactics to consider implementing on this API

  • pattern
    • pattern-without-conversion should be fast
    • pattern-with-conversion should be as slow as needed (because of conversion during typechecking) but not gratuitously slow
    • neither needs to be a primitive operation.
  • destruct / induction
  • assert
  • rewrite
  • setoid_rewrite (rewrite with binders)
  • autorewrite
  • rewrite_strat / autosetoid_rewrite
  • intro / intros / revert / generalize / set / clear / clearbody
  • refine / exact / assumption
  • change / refine + unification
  • unify: roughly, how a unification algorithm can do its job by calling instantiate / refine-in-evar / reduction-step-with/in-evar
  • constr:(⋯) / open_constr:(⋯)
  • match goal / match constr (readback problem)
  • moving constrs / out-of-band-things across contexts?
  • multi-step-reduction / cbv / lazy / vm_compute / ⋯ (probably not much detail, either use the steps above or use a skyhook vm)
  • pose / pose proof
  • congruence or some arithmetic tactic?
  • non-reflective ring tactic? try to establish a correspondence between time spent in a reflective and in a non-reflective implementation, to show that our proof engine can do okay on things previously pushed to reflection
  • apply (rapply!) / apply ⋯ in ⋯ / auto
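For reference, Coq's `pattern` tactic performs exactly the beta-expansion discussed above, on a toy goal:

```coq
Goal forall x : nat, x + 0 = x.
Proof.
  intro x.
  (* Abstract all occurrences of x: the goal becomes the beta-redex *)
  (* (fun n => n + 0 = n) x, ready for e.g. induction or rewriting. *)
  pattern x.
  (* `exact` checks the term up to conversion, so the redex is fine. *)
  exact (eq_sym (plus_n_O x)).
Qed.
```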

Miscellaneous notes

Jason (Sun 6:03pm): [benchmarking one-step reductions in Coq] is hard because Coq doesn't expose a satisfactory one-step reduction. (But maybe you claim the thing to do is to just bench the version that we can hack up in Coq, even when we know most of the time isn't spent in the single step of reduction?) I think it's hard to construct them in a way where you're actually benching the single step. If we do it via Ltac match + constr hacks, I expect we incur overhead in retypechecking and Ltac matching (I suppose I might be wrong, but we'd have to be dealing with truly enormous terms before we expect one-step reduction to take more than 0.0004 seconds, and Coq can only measure down to 0.001). Alternatively, we could do a non-one-step reduction when there's only one step to do, but it's not clear to me to what extent this is benching what we want to bench. Alternatively, we could try to bench a conversion problem where there's just one step of reduction to do, but again I think we'll end up just measuring the conversion overhead.

Backtracking? Maybe discuss that this is really about a functional interface / persistence, and the cost of providing that.

Discuss how proof by reflection is like replacing the proof engine with a special-purpose reflective implementation

We might also want alpha-variation as a fast primitive (even if the original term took arbitrarily long to typecheck).

License: MIT