Shinmera / parachute

An extensible and cross-compatible testing framework.

Home Page: https://shinmera.github.io/parachute

Improve generated and random testing support

Shinmera opened this issue · comments

Parachute should support testing where the tests are runtime (possibly randomly) generated better. In particular, this would involve:

  • Making it possible to avoid printing information about test forms when there's a huge amount.
  • Some way to note the context and history of a test form to provide better information about where and how the failure occurred.
  • Possibly some scaffolding to generate the tests, though I'm not yet sure what exactly that would involve.
  • Maybe a system to define a grammar that can construct random operations to be tested, though this could be out of scope.

Added a largescale report to address point 1.

Ah! I came here to create this issue :P

I'll come back to add details, ideas and a plan.


Ok, quick look at the readme. I think a new type of test makes sense (I'll confirm once I've looked at the implementation).

Here's the current behaviour, to help highlight what needs improvement:

(defpackage #:randomized-testing
  (:documentation "Scratch file to explore randomized tests with parachute")
  (:use #:cl)
  (:import-from #:parachute
                #:define-test
                #:define-test+run
                #:is
                #:true
                #:false))

(in-package #:randomized-testing)

;; Define a dummy function to test
(defun buggy (x)
  "Like cl:identity, but returns the wrong value for 2."
  (if (= 2 x) 3 x))

;; Define a dummy PRNG
(defun make-fake-prng ()
  (let ((x 0))
    (lambda ()
      (incf x))))

;; This is what a randomized test would look like _with the current implementation_
(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10                 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (is = x (buggy x)))))

;; Here's what the output looks like:

#|
? RANDOMIZED-TESTING::BUGGY
0.000 ✔   (is = x (buggy x))
0.000 ✘   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))         ; These are not useful
0.000 ✔   (is = x (buggy x))         ; They are not descriptive either
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.004 ✘ RANDOMIZED-TESTING::BUGGY

;; Summary:
Passed:     9                        ;  Should we count them differently?
Failed:     1
Skipped:    0

;; Failures:
1/  10 tests failed in RANDOMIZED-TESTING::BUGGY  ; What about this count?
The test form   (buggy x)            ; For more complex tests, this would not be
evaluated to    3                    ; enough to know which inputs failed the test
when            2
was expected to be equal under =.

#<PARACHUTE:PLAIN 11, FAILED results>
((IS = X (BUGGY X)))
|#

The user can improve the situation a bit by supplying a description to the test result:

(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
-     :do (is = x (buggy x)))))
+     :do (is = x (buggy x) "Failed with input ~a" x))))

Then the output would be slightly improved:

[...]

;; Summary:
Passed:     9
Failed:     1
Skipped:    0

;; Failures:
1/  10 tests failed in RANDOMIZED-TESTING::BUGGY
The test form   (buggy x)
evaluated to    3
when            2
was expected to be equal under =.
+Failed with input 2

[...]

Edit based on Shinmera's comment about uax-9

If we expand the is macro, the code looks like this:

(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (eval-in-context *context*
                 (make-instance 'comparison-result
                                :expression '(is = x (buggy x))
                                :value-form '(buggy x)
                                :body (lambda () (buggy x))
                                :expected x
                                :comparison '=)))))

And if we tweak it just a little bit:

(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (eval-in-context *context*
                 (make-instance 'comparison-result
-                               :expression '(is = x (buggy x))
-                               :value-form '(buggy x)
+                               :expression `(is = x (buggy ,x))
+                               :value-form `(buggy ,x)
                                :body (lambda () (buggy x))
                                :expected x
                                :comparison '=)))))

We get a much more useful output:

? RANDOMIZED-TESTING::BUGGY
  0.000 ✔   (is = x (buggy 1))
  0.000 ✘   (is = x (buggy 2))
  0.000 ✔   (is = x (buggy 3))
  0.000 ✔   (is = x (buggy 4))
  0.000 ✔   (is = x (buggy 5))
  0.000 ✔   (is = x (buggy 6))
  0.000 ✔   (is = x (buggy 7))
  0.000 ✔   (is = x (buggy 8))
  0.000 ✔   (is = x (buggy 9))
  0.000 ✔   (is = x (buggy 10))
  0.004 ✘ RANDOMIZED-TESTING::BUGGY

;; Summary:
Passed:     9
Failed:     1
Skipped:    0

;; Failures:
   1/  10 tests failed in RANDOMIZED-TESTING::BUGGY
The test form   (buggy 2)
evaluated to    3
when            2
was expected to be equal under =.

Still too verbose for now.

In uax-9 I manually construct test results to avoid the eager capture of the default macros: https://github.com/Shinmera/uax-9/blob/master/test.lisp#L74

This leads to much better reports, but is obviously quite unwieldy.

My current thought is that we need some way to communicate variable parts in a test form to the test macro, so that it knows which parts to take from the runtime environment, and which parts to take from the expression. That would at least solve the reports issue. How exactly that could be worked out, I'm not sure. It is very likely though that we'll need separate analogues of is, etc. for this purpose.
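To make the idea a bit more concrete, here is a minimal sketch building on the comparison-result expansion shown earlier in this thread; the name is/reporting and its argument convention are made up for illustration and are not part of parachute:

(defmacro is/reporting (comparison expected form report-form)
  "Like IS, but REPORT-FORM is evaluated at runtime and used in place of FORM
in the report, so unquoted parts carry runtime values instead of variable names."
  `(eval-in-context *context*
                    (make-instance 'comparison-result
                                   :expression (list 'is ',comparison ',expected ,report-form)
                                   :value-form ,report-form
                                   :body (lambda () ,form)
                                   :expected ,expected
                                   :comparison ',comparison)))

;; In the loop above, (is/reporting = x (buggy x) `(buggy ,x)) would report
;; (is = x (buggy 2)) and "The test form (buggy 2)" on failure; the backquote
;; is what tells the macro which parts come from the runtime environment.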

Another thought: for instance, I would like to randomly generate trees, then perform a random sequence of interactions on the tree, checking an invariant at each step. Ideally the test report on failure would say something like:

The test form   (is-consistent tree)
evaluated to    NIL
when            T
was expected to be equal under geq
after operation (frob 5 tree)
with arguments
  5
  (a b (c d (e)) (((h))))
returned
  (a b c d . e)

Meaning: it captures the i/o on each operation, and then stores them in the failure report so we know how to replicate the results.

In cases where global state is modified we might also need to instruct it to capture that global state, too.

Generalising this: it seems useful to be able to attach some contextual operation to another test form. Maybe something like (true* (is-consistent tree) :after (frob 5 tree)) which could be generated from a macro like (is-invariant (is-consistent tree) (frob 1 tree) (frob 2 tree) ...)
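As a rough sketch of just the surface shape of is-invariant (simplified to reuse the plain description argument of true instead of a dedicated true* result type, so everything beyond the macro name is an assumption):

(defmacro is-invariant (invariant &rest operations)
  "Run each operation in turn and re-check INVARIANT after each one,
attaching the operation form and its return value to the report."
  (let ((value (gensym "VALUE")))
    `(progn
       ,@(loop for op in operations
               collect `(let ((,value ,op))
                          (true ,invariant
                                "Invariant broken after ~s, which returned ~s"
                                ',op ,value))))))

;; e.g. (is-invariant (is-consistent tree) (frob 1 tree) (frob 2 tree))
;; checks the invariant after every frob and names the offending call on failure.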

To avoid macroising it too much, how about a call-context storage object which keeps the form, arguments, and result, and a macro (capture-call-context form context) to capture that data into such an object, which can then be passed as an argument to a normal result construction.
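For the record, a minimal sketch of what such an object and macro could look like; the names call-context and capture-call-context come from the sentence above, while the slot names and exact behaviour are assumptions:

(defclass call-context ()
  ((form :initarg :form :accessor call-context-form)
   (arguments :initarg :arguments :accessor call-context-arguments)
   (result :initarg :result :accessor call-context-result))
  (:documentation "One captured call: its form, evaluated arguments, and return value."))

(defmacro capture-call-context (form context)
  "Evaluate FORM, record its literal form, evaluated arguments, and return
value into CONTEXT, then return that value."
  (destructuring-bind (operator &rest args) form
    (let ((arg-vars (loop repeat (length args) collect (gensym)))
          (value (gensym "VALUE")))
      `(let* (,@(mapcar #'list arg-vars args)
              (,value (,operator ,@arg-vars)))
         (setf (call-context-form ,context) ',form
               (call-context-arguments ,context) (list ,@arg-vars)
               (call-context-result ,context) ,value)
         ,value))))

;; (capture-call-context (frob 5 tree) ctx) leaves CTX holding (frob 5 tree),
;; the evaluated arguments, and frob's return value, ready to be passed along
;; when constructing a result.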

Again, I'll come back later to add details, but I was able to hack something up (as in "it works, but the code is very horrible and I didn't think about all the implications yet"):

(define-test+run buggy
  (randomize ((x (gen-integer 0 3)))
    (is = x (identity x))
    (is = x (buggy x))))

And it outputs something like this:

;; Summary:
Passed:     0
Failed:     1
Skipped:    0

;; Failures:
   1/   1 tests failed in RANDOMIZED-TESTING::BUGGY
Randomized test failed after 11 passed and 0 skipped tests.
Seed: 16819773938075529869
With 
  X = 2
The test form   (buggy x)
evaluated to    3
when            2
was expected to be equal under =.

Note to myself: also print the failed result's expression.

Here are the things I've tried:

breeze/scratch-files/randomized-testing.lisp

It's low quality, because I just wanted to try anything and because I needed to learn a bit more about how parachute works. Lastly, the code is in my project "breeze" because the whole point of that project is to try stuff.

So, I would like some feedback before I try making something cleaner.

Oh! I also thought about the DSL for generating more complex data (like trees), and I think it might not be that hard. Using a function that looks like an eval, but that returns a function that takes a PRNG as input... Something along those lines.
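To make that last idea a bit more concrete, here is a rough sketch that assumes a generator "spec" is just a sexp compiled down to a function of the PRNG (the same kind of zero-argument function as make-fake-prng above); compile-generator and the tiny spec grammar are hypothetical:

(defun compile-generator (spec)
  "Turn a generator SPEC like (list (integer 0 9) (integer 0 9)) into a
function of a PRNG that produces a random piece of data."
  (if (atom spec)
      (lambda (prng) (declare (ignore prng)) spec)  ; literals generate themselves
      (ecase (first spec)
        (integer (destructuring-bind (low high) (rest spec)
                   (lambda (prng)
                     (+ low (mod (funcall prng) (1+ (- high low)))))))
        (list (let ((parts (mapcar #'compile-generator (rest spec))))
                (lambda (prng)
                  (mapcar (lambda (g) (funcall g prng)) parts)))))))

;; (funcall (compile-generator '(list (integer 0 3) (integer 0 3)))
;;          (make-fake-prng))
;; => (1 2) with the fake PRNG from earlier; a tree spec would just be another
;; ECASE branch.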

Sorry, I'm having trouble gleaning your thinking from the lisp file, so I'm not entirely sure what, concretely, you're proposing. Let me try to outline my current understanding, and please correct me if I'm wrong or missing something:

We want to be able to do three things:

  1. Provide context to a result, so that the report can tell us more about what was going on leading up to the failing test form. It should be convenient for the user to capture this "context", whether that be a single prior call, or a trace/sequence of calls leading up to the failure.
  2. Remove noise from the report somehow, either by not recording certain results at all, or by just not printing them when they're not relevant.
  3. Provide some way to generate randomised operations.

I'm not talking about data at all here, as 3. seems like a much more general and powerful concept than explicitly generating data structures. If we can support randomising operations, we can use that to generate the random data, too, and also test protocols that don't primarily rely on data structures, or rely on data structures that are highly private and annoying to construct via anything but these operations we want to test, anyhow.

I think 3. could also be left as a separate issue, as with 1. we can also leave it up to the user to decide how to "generate" the operations, as long as the system can take care of capturing the context we care about to analyse the failure.

Finally, 2. should be rather trivial and more a question of taste. I can see either approach working, the latter probably just with some kinda flag to mark results as "unimportant", which are then skipped in the report.

Sorry, I'm having trouble gleaning your thinking from the lisp file

That's totally fine and I was pretty much expecting it.

We're on the same page about the 3 points.

But I think there's one thing we might not be exactly on the same page about: I'm trying to implement two kinds of randomized tests, PBT and MBT (more on that in a bit), whereas you seem (to me) to be more focused on the MBT kind. I'm also trying to make sure that everything is reproducible by using PRNGs that can be seeded, and to make sure that it will be possible to add shrinking later on.

Some background info

  • PBT stands for Property-Based Testing
    • PBT is basically: generate random data, test against that data, rinse and repeat
  • MBT stands for Model-Based Testing
    • The "Model" is a simplified version of the thing you want to test (in my prototype for example, I modeled a list by just a number corresponding to the number of items in the list)
    • You apply operations (that I called actions in my code) to both the model and the real thing, then you check that they are both in agreement; if they're not, then the test fails.
    • You can use multiple models to test different aspects of the thing being tested.
    • It's pretty good for testing stateful things.
  • Shrinking is also known as "Test Case reduction"
    • Said otherwise: when a randomized test fails, shrinking tries to find a simpler test case that reproduces the same issue. For example, it could reduce a failing test from 1000 operations down to just a few (a toy sketch of the general idea follows after this list).
    • In the general case, it is hard.
    • In practice, it works well and is very much worth the effort
    • I want to implement internal shrinking because it's much easier and better for everyone.
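As a toy illustration of the general idea only (external shrinking of a single integer input, not the internal shrinking mentioned above; the function name is made up):

(defun shrink-integer (failing-input fails-p)
  "Greedily look for a smaller integer on which FAILS-P still returns true,
repeatedly trying 0, half of the current best, and the current best minus one."
  (let ((best failing-input))
    (loop
      (let ((next (find-if fails-p (list 0 (floor best 2) (1- best)))))
        (if (and next (< next best))
            (setf best next)
            (return best))))))

;; If every input >= 2 makes a test fail,
;; (shrink-integer 937 (lambda (x) (>= x 2)))
;; walks 937 -> 468 -> 234 -> ... -> 2 and reports 2 as the reduced counterexample.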

PBT was popularized by a Haskell library named QuickCheck, and now there are "quick/fast/rapid" test libraries everywhere xD (see the Wikipedia page). I'm trying to use the same vocabulary where it makes sense.

What I tried so far

Reporting and PBT

I tried two things to help with reporting:

  • modifying the is macro to recognize some special syntax, and generate several comparison-results with the right expressions.
    • that worked... but oh god the code is unwieldy and overall not nice to use
  • wrapping the calls to "testers" (is, true, etc.) with a macro that overrides *parent* and *context*.
    • I added a randomized-context that prevents tests from being reported (to reduce the noise).
    • I added a randomized-result that keeps track of the actual number of test results
    • with that I was able to make it report only when there was a failure
    • the result also contains the generated values that triggered the failure
    • when formatting the randomized-result, it prints out the seed and the generated values, and then prints the actual test result (a comparison-result in my prototype).
    • this is what I was showing in my previous comment

Furthermore, the macro I use to rebind the *context* and *parent* looks like a let:

(randomize ((x (gen-integer 0 3)))
  (is = x (identity x)))

It expands into a loop that generates the data for the tests.
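For readers following along, here is a stripped-down sketch of what that expansion could look like; the fixed repetition count, the use of cl:random instead of a seedable PRNG, and the omission of the *context*/*parent* handling are all simplifications, not the prototype's actual code:

(defun gen-integer (low high)
  "Return a zero-argument generator of integers in [LOW, HIGH]."
  (lambda () (+ low (random (1+ (- high low))))))

(defmacro randomize (bindings &body body)
  "Evaluate each generator once, then run BODY repeatedly with every variable
bound to a freshly drawn value."
  (let ((gens (loop for (nil gen) in bindings
                    collect (list (gensym "GEN") gen))))
    `(let ,gens
       (loop :repeat 10
             :do (let ,(loop for (var nil) in bindings
                             for (gvar nil) in gens
                             collect `(,var (funcall ,gvar)))
                   ,@body)))))

;; (randomize ((x (gen-integer 0 3)))
;;   (is = x (identity x)))
;; expands into roughly:
;; (let ((#:gen (gen-integer 0 3)))
;;   (loop :repeat 10
;;         :do (let ((x (funcall #:gen)))
;;               (is = x (identity x)))))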

Implementation aside, I think I quite like this syntax. What about you?

Almost forgot the thing I like the most about it: it's easy to modify existing tests to use generated data.

MBT

For model-based testing, I did a very quick and very, very dirty prototype.

  • the SUT (system under test) is a stack (which I just called list in the code...)
  • the model is a number that is supposed to match the length of the list
  • there are 3 operations
    • push, always applicable
    • pop, only applicable when the length of the list is positive
    • length, always applicable
  • so the operations (that I called actions in my code) are modeled by 2 things:
    • a function to test if the operation is applicable
    • a function to apply the operation

Finally, there's a loop (a rough sketch follows below) that

  • chooses an applicable operation
  • applies the operation
  • makes sure that all invariants hold (in my example, there's one invariant: the length of the list should be = to the model)
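A self-contained sketch of that loop, with every name invented for illustration (the prototype's actual code lives in the breeze scratch file linked above): each operation pairs an applicability predicate with an apply function, the SUT is a list used as a stack, and the model is its expected length:

(defparameter *operations*
  (list (cons (lambda (model) (declare (ignore model)) t)  ; push: always applicable
              (lambda (sut model) (values (cons (random 100) sut) (1+ model))))
        (cons (lambda (model) (plusp model))                ; pop: needs a non-empty stack
              (lambda (sut model) (values (rest sut) (1- model))))
        (cons (lambda (model) (declare (ignore model)) t)   ; length: read-only
              (lambda (sut model) (values sut model)))))

(defun run-mbt (&key (steps 100))
  "Apply STEPS randomly chosen applicable operations to both the SUT and the
model, checking the invariant (= (length sut) model) after each one."
  (let ((sut '()) (model 0))
    (dotimes (i steps t)
      (let* ((applicable (remove-if-not (lambda (op) (funcall (car op) model))
                                        *operations*))
             (op (elt applicable (random (length applicable)))))
        (multiple-value-setq (sut model) (funcall (cdr op) sut model))
        (unless (= (length sut) model)
          (return (values nil i)))))))

;; In the prototype the invariant check would of course go through a parachute
;; tester (true/is) instead of returning NIL.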

My questions so far

  1. What do you think about showing the randomized-results as only one result, even though it actually ran many more tests?
  2. What do you think about stopping each randomized test on the first failure?
  3. What do you think about the shenanigans I did with randomized-result and randomized-context? Is that the right direction, in your opinion?
  4. I think I should be able to get away with only one class that does the job of both randomized-result and randomized-context. Do you see anything wrong with that?

I have more, but let's start with that.

Roadmap

Here's my high-level plan:

  1. Make the reporting not horrible when generating thousands of tests
  2. Make it easy for people to generate data
  3. Make it easy to do MBT

I swear I didn't mean to write a novel, but I just thought about something (that is super-evident): I could split the randomize macro in two:

  • A part that handles the "report only the failed children's results; and print this extra context when such a failure occurs" behaviour
  • Another part that handles data generation

Both parts would be bundled together in one randomize macro.

P.S. I didn't re-read all of this... Knowing myself, it's probably full of typos and weird syntax. I'll try to re-read it later when I have slept 😅

Good night!