davidrmiller / biosim4

Biological evolution simulator

Planning a testing methodology

venzen opened this issue · comments

I generally adhere to the Python philosophy that it is better to ask forgiveness than permission. However, in this case it makes sense to gauge the opinions of @davidrmiller and other active contributors. Before I start coding on new or additional functionality, it will be useful to consider implications, best practice and David's vision for this project.

I have a few proposals to facilitate the user experience and to put a shoulder to the wheel on some TODO items I noticed in the code:

  1. Hard-coded filenames such as epoch-log.txt (mentioned by David in analysis.cpp) can be specified in the config file. Additional parameters will need to be added.
  2. Optional automation of tool scripts. As is currently the case with graphlog.gp, it is also possible to automate the genome plotting done by graph-nnet.py. It would entail a similar workflow: Indiv::printGenome() outputs genome plotting instructions to ./logs/net.txt (and to the console, if desired), and endOfGeneration() (for example) can call Python to draw a sample genome. This will also require a parameter in the config file - to enable or disable the functionality, and to specify a filename and system command. It is, therefore, possible to provide hooks for arbitrary tool scripts the user may choose to run during a simulation.
  3. Parameter records. When running simulations systematically for research it will be valuable to keep a record of the parameters being used for successive or specific simulations. I propose a config file parameter that enables/disables output of parameters to a user specified file. This file should also record any changes to parameters during the course of a simulation - value change, sim generation at time of change, and a date and timestamp.
  4. Convert asserts to useful output. assert() statements are useful during development but are misleading for users. Some people have complained about "segmentation faults" when these are simply asserts that correctly (and meaningfully) trigger. Unless critical, asserts should be replaced with useful console output that users can report but that allows sim execution to continue. Errors encountered during a run can be collected in a log and dumped to the console when execution ends.
  5. Baseline params for a barebones simulation. When running biosim4 without any config file (with biosim4.ini absent as well), it should execute a short, minimal simulation that, ideally, never fails. The current set of hard-coded parameters seems to achieve this. I only had to remove an assert() statement in the genome comparator function to achieve a reliable, repeatable minimal simulation. I mention this because future code changes should not break this baseline state.

I have a few other ideas, such as coding a Python matplotlib tool and genome editor, but the above list of items will provide the foundation for that, while resolving some of the self-evident current issues.

Let's hear some feedback and opinions.

@venzen, these are great ideas. Thank you for presenting them for discussion.

I'd like to put a high priority on thinking about some sort of testing methodology. Before accepting each PR, I'd like to make sure that the simulator still works in single-threaded and multi-threaded mode, with two or three different biosim4.ini configurations. Some of your ideas are relevant in moving in that direction.

Since the results of simulations are not deterministic, we can't just blindly compare the simulator output with a reference. And manual testing would require volunteer effort because my available time is limited. I have some vague ideas of how to partially automate a testing process, but would like to hear other ideas.

Once we have a testing process in place and documented, we can more safely attend to those other excellent ideas for cleaning up and extending the code.

Testing is important and I think it is prudent to get some baseline checks and balances in place before extending functionality. I want to point out that my skill in Python is orders of magnitude better than my C++, so where possible I'd prefer writing Python unit tests.

Let's discuss the testing methodology. I have some hours every week to contribute to that.

I'll wait for you to give your thoughts on the most important areas/functionality to test. Automating the testing process makes sense, although the initial unit tests will have to be manual until we achieve some stability and a familiar process.

Go ahead and share your thoughts. I'm interested to know what needs testing and what process you envision, and I'll spend a few days thinking about that.

Python is awesome. A Python program could construct a biosim4.ini config file, kick off a simulation that runs for a minute, then inspect logs/epoch-log.txt to see if the final numbers are within sensible ranges. It could also capture stdout and check for anything suspicious. We could start with a small number of biosim4.ini configurations that have extreme, stressful, or problematic parameters, then empirically determine what sort of output to expect from them.

Couldn't the simulation be made deterministic by seeding all of the random number generation it uses with a fixed seed value? The program could take the seed value as a parameter and would always produce the same output for the same seed (provided the other inputs are the same as well, of course). Multithreaded retrieval of random values is the only problem I can see with this approach, but it could at least make single-threaded tests very simple. Am I missing something?

@apatho, that's a good observation. A small modification to the random number generation would let us make a nice, deterministic single-threaded smoke test. That could be a valuable component in a testing process. The advantage of testing with non-deterministic execution is that, over time, uncommon edge-case errors will appear and inform us of problems we would otherwise miss.

@davidrmiller @apatho A "deterministic single-threaded smoke test" can be one of the tests. A few non-deterministic stress tests will reveal edge cases, as David mentions. Agree that both are required. To proceed I suggest:

  1. A tests directory with subdir configs containing a few .ini files with extreme params. @davidrmiller you should probably create these since you have a better idea of what "extreme" params you want to test.
  2. Provide a modified random number generator for crafting a deterministic baseline test.
  3. I will start work on a Python script that does what David suggests above (a rough sketch follows this list):
  • construct a biosim4.ini config file
  • kick off a simulation that runs for a minute (approx).
  • inspect logs/epoch-log.txt to see if the final numbers are within sensible ranges
  • capture stdout and check for anything suspicious (assert failures, error output, etc)
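
To make item 3 concrete, here is a rough sketch of how such a script might look. The binary path, the way the simulator is invoked, the parameter values and the epoch-log column layout are assumptions for illustration only, not a final design:

#!/usr/bin/env python3
# Rough sketch of the proposed test driver. Binary path, invocation,
# parameter values and epoch-log column layout are assumptions.
import subprocess

def run_quick_test():
    # 1. Construct a minimal biosim4.ini for the test run
    with open("biosim4.ini", "w") as f:
        f.write("stepsPerGeneration = 100\n")
        f.write("maxGenerations = 50\n")

    # 2. Kick off a short simulation and capture its output
    #    (raises subprocess.TimeoutExpired if the sim hangs)
    proc = subprocess.run(["./bin/Release/biosim4", "biosim4.ini"],
                          capture_output=True, text=True, timeout=300)

    # 3. Inspect the last line of the epoch log for sensible ranges
    with open("logs/epoch-log.txt") as log:
        last = log.readlines()[-1].split()
    survivors = int(last[1])            # column layout assumed

    # 4. Check stdout/stderr for anything suspicious
    return (proc.returncode == 0
            and "Assertion" not in proc.stderr
            and survivors > 0)

if __name__ == "__main__":
    print("quicktest passed" if run_quick_test() else "quicktest FAILED")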

I will take a few days to think about the testing process and methodology. In the meantime it will be useful to ask questions and bounce ideas toward consensus.

BTW, having biosim4 internal time for outputting timestamps will be useful. I've activated lastModTime in params.cpp for each config file read. Shall I PR this?

While running and monitoring a series of fast sims, an issue emerged: sometimes the pace of sim execution is faster than the monitoring can keep up with. This may be a result of disk I/O or Python's inherent lag.

Having a param that specifies a wait time between generations (at spawnNewGeneration() or at genomeAnalysisStride) could be the solution.

I use a wait time like that on a particular laptop to keep it from overheating :-) But maybe there's another way. There's a parameter in biosim4.ini that sets the maximum number of generations to simulate, after which the simulator exits. Can the python program invoke the simulator with stdout and stderr redirected to a file? That would allow the OS to capture the output while the simulator runs full speed. The python program can wait quietly until the simulator exits, then peek at the captured stdout, stderr, and log files. Perhaps a timeout is also needed so the python program can force-kill the simulator if it doesn't terminate within a reasonable time.

Yes, Python's subprocess.Popen() allows stdout and stderr to be captured and I think this is the best way to run tests.
What is causing issues is the updating of graphlog.gp when running fast sims. However, probably best not to draw graphs for these tests (because we're not testing graphing right now). We'll cross that bridge when we get there.
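
For example, a minimal sketch of running the simulator the way David describes - output redirected to files, with a timeout and force-kill - might look like this (the binary path and output file names are assumptions):

import subprocess

# Sketch only: binary path and output file locations are assumptions.
with open("logs/test-stdout.txt", "w") as out, \
     open("logs/test-stderr.txt", "w") as err:
    proc = subprocess.Popen(["./bin/Release/biosim4", "biosim4.ini"],
                            stdout=out, stderr=err)
    try:
        # The OS captures the output while the simulator runs full speed
        proc.wait(timeout=300)      # seconds; tune per test
    except subprocess.TimeoutExpired:
        proc.kill()                 # force-kill a simulator that hangs
        proc.wait()
# Afterwards the script can peek at the captured files and logs/epoch-log.txt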

A small modification to the random number generation would let us make a nice, deterministic single-threaded smoke test.

The option already exists: RandomUintGenerator takes a bool for whether to run deterministically or not, so maybe just add an option to biosim4.ini to decide which mode it uses? When I've been profiling, setting that and numThreads = 1 has been enough to get a reproducible output. I think allowing a user-set seed would be good; that would just mean replacing std::mt19937 generator(time(0)); with std::mt19937 generator(seed); on line 38 of random.cpp in the case that the user passes a seed.

@Asa-Hopkins Thanks for the explanation. Is there any specification for the format of the seed?

The specification says that either a result_type or a Sseq& is viable. I'm not entirely sure what those are, but the examples give both an unsigned int and a string as possible seeds, so it seems fairly versatile.

Sure, I'm reading the PDF linked in random.cpp and it mentions:

Avoid setting the seeds to zero or small numbers in general –
try to choose large “complex” seed values (see discussion below on warming up RNGs).

I'll read further.

I like where this is going. There is even a slight possibility for deterministic multi-threaded program execution (with some performance penalty), but this will require investigation into the options for configuring OpenMP thread scheduling. I can elaborate more later.

Thinking about the config file, here is an initial stab at how the parameters might look in the default biosim4.ini file with descriptions that match the language and style of the other parameters:

# If true, then the random number generator (RNG) will be seeded
# by the value in RNGSeed, causing each thread to receive a
# deterministic sequence from the RNG. If false, the RNG will be
# randomly seeded and program output will be non-deterministic.
deterministic = false 

# If deterministic is true, the random number generator will be seeded
# with this value. If deterministic is false, this value is ignored.
RNGSeed = 1234567

@davidrmiller I'm working on a Python unit test of RNG functionality. This requires a separate compilation of random.cpp in the tests directory where all tests can be run.

The config file options that you posted, above, make sense and can be included in biosim4.ini once we are satisfied with testing.

On this same topic, I propose that we use the tried-and-tested ConfigParser module in the Python unit tests. It already contains all the functionality to read, write and validate .ini files, with a lot of future flexibility. The caveat is that biosim4.ini will have to contain at least one [section] declaration. Like this:

[DEFAULT]
# numThreads must be 1 or greater. Best value is less than or equal to
# the number of CPU cores.
numThreads = 10

# sizeX, sizeY define the size of the 2D world. Minimum size is 16,16.
# Maximum size is 32767, 32767.
sizeX = 128
sizeY = 128 

Similar to # comments, we tell updateFromConfigFile() to ignore lines starting with a square bracket. This allows Python to manipulate config files for testing. You may choose to group params in logical sections - or not. The single [DEFAULT] section declaration is sufficient.

Here is the configparser documentation:

https://docs.python.org/3/library/configparser.html
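
To illustrate, a minimal sketch of how a test script could read and rewrite biosim4.ini with configparser (the parameter values shown are placeholders):

import configparser

# Sketch: assumes biosim4.ini starts with a [DEFAULT] section header,
# as proposed above.
cfg = configparser.ConfigParser()
cfg.optionxform = str       # preserve parameter-name case when writing
cfg.read("../biosim4.ini")

# Override parameters for a test run (values are placeholders)
cfg["DEFAULT"]["numThreads"] = "1"
cfg["DEFAULT"]["maxGenerations"] = "2"

# Write the modified config for the test (note: comments in the
# original file are not preserved by configparser)
with open("configs/quicktest.ini", "w") as f:
    cfg.write(f)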

Waiting for your feedback before proceeding with this.

It's reasonable to add a section declaration at the top of the config file to facilitate Python parsing.

Also, as you suggested earlier, I'm creating a few biosim4.ini config files for future use in non-deterministic testing. Each one runs a simulation that lasts a minute or two. I'm trying to characterize their results by running each one 500 times. So far, the best way I can see to determine if a run is successful is to check that there is no output on stderr and that the numbers in the last line of logs/epoch-log.txt are within certain ranges.
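
For what it's worth, the characterization runs can be scripted along these lines (a rough sketch; the binary path and the epoch-log column layout are assumptions):

import subprocess

# Sketch: run one .ini file many times and collect the final epoch-log
# numbers, to derive sensible pass/fail ranges for that configuration.
def characterize(config, runs=500):
    survivors, diversity = [], []
    for _ in range(runs):
        result = subprocess.run(["./bin/Release/biosim4", config],
                                capture_output=True, text=True)
        if result.returncode != 0 or result.stderr:
            print("run failed or produced stderr output")
            continue
        last = open("logs/epoch-log.txt").readlines()[-1].split()
        survivors.append(int(last[1]))      # column layout assumed
        diversity.append(float(last[2]))
    print("survivors:", min(survivors), "..", max(survivors))
    print("diversity:", min(diversity), "..", max(diversity))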

I believe we'd need to add MPI into the Dockerfile to accomplish parallel testing. How can I verify that parallelism works? I've messed with the thread options but assume nothing happens because MPI is missing?

PS +100 for this discussion, I need to turn on notifs for this repo...

I have extensive experience with designing tests for nondeterministic processes, and the range-checking approach proposed by @davidrmiller is definitely suitable and common practice; just expect the occasional run to fail and need a re-run of the GitHub Action.

@Asa-Hopkins I'm trying to compile a standalone executable of random.cpp. This is for local testing and profiling of the RNG. However, my lack of knowledge of C has me stumbling in the dark. Perhaps you can assist if it is an easy fix...

So, I have left random.h unchanged, and have added a main() function after (and outside) namespace BS{} in random.cpp that allows reading of command line arguments:

// Interpret "true" or "1" as true; anything else as false
bool getBoolVal(const std::string &s)
{
    return (s == "true" || s == "1");
}

int main(int argc, char **argv)
{
    bool thisbool = getBoolVal(argv[1]);
    BS::RandomUintGenerator(thisbool);
    std::cout << "RNG " << BS::randomUint() << std::endl;

    return 0;
}

The objective is to pass a bool as a command line arg that enables deterministic or non-deterministic RNG. The next step would be to pass a seed in argv[2] to randomize() as you previously described. However, for now, attempting only the bool in argv[1], I am getting compilation errors.

Excuse my inexperience with this basic skill. Perhaps the solution is simple and self-evident.

@venzen, I just peeked inside random.h and random.cpp and it's quite a mess. It's got remnants of code from several experimental implementations before I settled on the Jenkins algorithm (it's quite fast). The system's built-in std::mt19937 generator is used only to randomly seed Jenkins. Marsaglia is no longer used. And in random.cpp, it looks as though "omp.h" is included for no reason. Good luck hacking on that mess and let me know if you have any questions.

I also noticed that initializing the global RNG object with a deterministic seed would be a little tricky because it had to be done only after the parameters are read instead of at object construction time. So I went ahead and created a branch to do that. Deterministic seeding seems to work in single-threaded mode. Maybe we should output a warning on stderr if deterministic == true and numThreads > 1. Comments are welcome.

Yes, I tried to port RandomUintGenerator to Python using the pybind11 module. I got a big headache that lasted all night 😆 no joy.
Look forward to reviewing the branch this evening...

This is for local testing and profiling of the RNG.

Personally what I've been doing is just editing main.cpp for those kinds of tests, rather than trying to make a new main function elsewhere. PR #30 is more along the lines of what I was thinking, with the values being read from biosim4.ini rather than from the command line.

@Asa-Hopkins, sure, and David's latest PR, 'deterministic', implements this in a similarly good way. My attempt to compile the RNG in isolation is a personal exercise, a way to test and interrogate the RNG directly. I'll give it another try with the new code.

Thinking about a test script. That little example shell script above (or a Python equivalent) needs to allow the user to select a test from among a collection of tests. I'm wondering how to store the test parameters and expected results for a collection of tests.

The possibilities fall into two categories: store the data inside the test script, or store them in a separate file.

To illustrate the latter, here is a possible JSON format for specifying tests. Just one test is shown, but more can be added. The test script could take the name of the test as an argument, e.g., ./tests/test.py quicktest. Specifying the test config in a separate file seems cleaner in some sense.

On the other hand, encoding the same information inside the test script in data structures has advantages: it avoids a dependency on a parser and consolidates the testing framework into a single file. Comments are welcome.

{
  "tests": {
    "quicktest": {
      "description": "Quick test that only runs a few seconds",
      "parameters": [
        "stepsPerGeneration=100",
        "maxGenerations=2"
      ],
      "results": {
        "generations": 2,
        "survivors": {
          "min": 980,
          "max": 1050
        },
        "diversity": {
          "min": 0.98465,
          "max": 0.999
        }
      }
    }
  }
}
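
For illustration, a rough sketch of how a test script might load this JSON and select a test by name (the file name and the way parameters are applied are assumptions, not a final design):

#!/usr/bin/env python3
# Sketch: load the JSON test spec above and pick a test by name.
import json
import sys

def load_test(name, path="tests/tests.json"):
    with open(path) as f:
        spec = json.load(f)["tests"]
    if name not in spec:
        sys.exit("unknown test '%s'; available: %s" % (name, ", ".join(spec)))
    return spec[name]

if __name__ == "__main__":
    test = load_test(sys.argv[1])        # e.g. ./tests/test.py quicktest
    print(test["description"])
    for param in test["parameters"]:     # e.g. "stepsPerGeneration=100"
        key, value = param.split("=")
        print("would set", key, "=", value, "in the test biosim4.ini")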

@davidrmiller the above outline makes sense and I will provide a Python example in a few days when I have more time.

@davidrmiller I discovered that in Python 3 the configparser module is part of the standard library - it does not need to be installed. This is good because it means we can easily read, write, parse and filter .ini files, biosim4.ini included.

Regarding the design you outlined above, my personal preference would be to store and retrieve data from a file. It is both easier to update/manage and more flexible in case of format/parameter/test-scenario changes. If we choose this route then this is how it might work:

The Python test app has a config file for both its own internal settings and the various test cases. It can read the default biosim4.ini config file in the parent directory ../ as well as its own (or any arbitrary) config file in tests/configs/. It can also write/update params in those files using standard .ini format.

We should choose the structure of the test app's config file. Two potential options are shown further down the page. The same constraint I mentioned before applies: there should be at least one section header at the top of each file. Of course, there can be any number of sections, and these allow us to structure the file contents. There is no support for sub-sections such as those in the JSON format you describe above, but this is not necessarily an obstacle, because the config file keys can have descriptive names that allow the code to discern sub-structures.

Here are 2 options, created by the test proto-app from the JSON you provided:

Option 1: multiple sections to order data hierarchy

$ cat configs/biosim4test.ini 
[internal]
name = biosim4-test
version = 21.12.03
url = https://github.com/davidrmiller/biosim4/

[quicktest]
description = Quick test that only runs a few seconds

[quicktest-params]
stepspergeneration = 100
maxgenerations = 2

[quicktest-results]
generations = 2

[quicktest-results-survivors]
min = 980
max = 1050

[quicktest-results-diversity]
min = 0.98465
max = 0.999

[slowtest]
description = Slow test that runs for years

[slowtest-params]
stepspergeneration = 10000

Option 2: few sections with descriptive key names:

$ cat configs/biosim4test.ini 
[internal]
name = biosim4
version = 21.12.03
github_url = https://github.com/davidrmiller/biosim4/

[quicktest]
description = Quick test that only runs a few seconds
param-stepspergeneration = 100
param-maxgenerations = 2
result-generations = 2
result-survivors-min = 980
result-survivors-max = 1050
result-diversity-min = 0.98465
result-diversity-max = 0.999

[slowtest]
description = Slow test that runs for years
param-stepspergeneration = 10000

I prefer option 2, and it follows nicely from your requirement that the user be able to specify a test name as an argument to the script. Passing quicktest would tell the script to look for a section with that name, and it would then parse the config params accordingly. Given how configparser is geared to the task of filtering and manipulating keys, option 2 is easier to implement and more efficient.
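
To show what I mean, a minimal sketch of consuming option 2 with configparser (the file name and the helper below are hypothetical):

import configparser

# Sketch: split option 2's descriptive key names into params and
# expected-result ranges. File name is an assumption.
cfg = configparser.ConfigParser()
cfg.read("configs/biosim4test.ini")

def load_test(name):
    params, results = {}, {}
    for key, value in cfg[name].items():
        if key.startswith("param-"):
            params[key[len("param-"):]] = value
        elif key.startswith("result-"):
            results[key[len("result-"):]] = value
    return cfg[name].get("description", ""), params, results

description, params, results = load_test("quicktest")
# params  -> {'stepspergeneration': '100', 'maxgenerations': '2'}
# results -> {'generations': '2', 'survivors-min': '980', ...}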

Feedback and thoughts?

Very nice. I like option 2 as well. It's visually easy to parse.