pkalivas / radiate

A genetic programming engine which evolves solutions through asynchronous speciation.

Make population training more flexible, and an idea for a more powerful GA server.

Neopallium opened this issue · comments

Right now the Population has full control over training/testing each member in a generation. I would like to have more control over how each member of a generation is tested against the problem.

The reason is to be able to make a server that just manages the genomes/populations, with many worker computers doing the testing of each generation's members against the problem. The workers would request a member/problem pair from the server, evaluate that member against the problem, and return the fitness score to the server. The server could keep track of Genomes & Problems in a database. Similar problems (same basic code but different inputs) could be trained faster by re-using genes from high-scoring Genomes.

The current logic is like this:

  1. Create Population for a set of Genome, Environment and Problem.
  2. Set evolution parameters.
  3. Run evolution over population with callback for when each generation is finished.

I would like to split the Population::run() and Population::train() methods so an application can do the following:

  1. Create/setup population like steps 1 & 2 above.
  2. Instead of calling run() get the current list of members from the population.
  3. Process those members against the Problem (this might include offloading the work to workers over the network).
  4. Once all members have been evaluated and a fitness score has been collected, call into the Population with the fitness scores to end the current generation. This would be like the Population::train() method, just skipping the call to Generation::optimize() and setting each member's fitness.
  5. Get the end of generation results and decide if another generation needs to be tested (loop to step 2).
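The split loop above could look something like the following sketch. The method names (`members`, `end_generation`, `best_fitness`) and the `Member`/`Population` shapes are placeholders for illustration, not radiate's actual API:

```rust
// Hypothetical sketch of the proposed split run() loop. Names are
// illustrative, not part of the current radiate API.
struct Member {
    genome_id: usize,
    fitness: Option<f32>,
}

struct Population {
    members: Vec<Member>,
    generation: usize,
}

impl Population {
    fn new(size: usize) -> Self {
        Population {
            members: (0..size)
                .map(|id| Member { genome_id: id, fitness: None })
                .collect(),
            generation: 0,
        }
    }

    // Step 2: hand the current members to the caller instead of
    // evaluating them internally.
    fn members(&self) -> &[Member] {
        &self.members
    }

    // Step 4: accept externally computed fitness scores and close the
    // generation (crossover/mutation of the next generation would
    // happen here).
    fn end_generation(&mut self, scores: &[f32]) {
        for (m, &s) in self.members.iter_mut().zip(scores) {
            m.fitness = Some(s);
        }
        self.generation += 1;
    }

    fn best_fitness(&self) -> f32 {
        self.members
            .iter()
            .filter_map(|m| m.fitness)
            .fold(f32::MIN, f32::max)
    }
}

fn main() {
    let mut pop = Population::new(4);
    for _ in 0..3 {
        // Step 3: evaluate members (possibly offloaded to remote workers).
        let scores: Vec<f32> = pop
            .members()
            .iter()
            .map(|m| m.genome_id as f32 * 0.5) // dummy fitness function
            .collect();
        pop.end_generation(&scores);
        // Step 5: inspect results and decide whether to loop again.
        if pop.best_fitness() > 1.0 {
            break;
        }
    }
    println!("gen {} best {}", pop.generation, pop.best_fitness());
}
```

The key point is that evaluation happens entirely outside the Population, so a server can ship members to workers between steps 2 and 4.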

I agree, this has been on my mind for a bit now, but I haven't gotten around to thinking through an implementation. I'd be more than happy to work through this and update the crate as features are added. I think decoupling the genome from the population, so genomes can be added, updated, or report general information to the larger population from outside the run() loop, is probably the best way to go.

I have already made the minimal changes to radiate to allow a custom run() loop.
Now I am in the middle of updating the neat-web example. I am thinking about adding a TrainingSetDto to radiate-web crate so the neat-client can send the inputs/answers to the neat-server with the population setup.

The basic idea is to allow clients to post a "simulation" to the server, then poll the server for the status (gen. #, best fitness, last gen. runtime). Once that is in place I will add worker support.
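As a rough sketch, the payloads involved might look like the structs below. The field names are guesses for illustration, not the actual radiate-web types, and a real implementation would derive serde's Serialize/Deserialize for the HTTP round trip:

```rust
// Hypothetical shape for the TrainingSetDto mentioned above; field
// names are illustrative, not the actual radiate-web types.
#[derive(Debug, Clone)]
struct TrainingSetDto {
    inputs: Vec<Vec<f32>>,  // one row per training sample
    answers: Vec<Vec<f32>>, // expected outputs, aligned with `inputs`
}

// Status payload a client might poll for.
#[derive(Debug, Clone)]
struct SimStatus {
    generation: usize,
    best_fitness: f32,
    last_gen_ms: u128,
}

fn main() {
    // XOR training set, the classic NEAT smoke test.
    let dto = TrainingSetDto {
        inputs: vec![
            vec![0.0, 0.0],
            vec![0.0, 1.0],
            vec![1.0, 0.0],
            vec![1.0, 1.0],
        ],
        answers: vec![vec![0.0], vec![1.0], vec![1.0], vec![0.0]],
    };
    assert_eq!(dto.inputs.len(), dto.answers.len());
    println!("{} samples", dto.inputs.len());
}
```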

Also while looking into the Population and Generation code, I was thinking it would be nice and easy to remove the Problem code from those structs. Then make a PopulationSolver (or some other name like Simulation) struct that would handle solving a problem using a Population instance. It would be easy to do, but break the current API. Right now I have only added methods to Population and Generation to support custom run() loops.

It might be possible to create a generic set of Genome/Gene types (Phenotype?). Instead of having Neat and NeatEnvironment compiled directly into the server (and having to recompile the server for each new Genome). Some generic Gene types could be created for nodes in a tree, or layers in a neural network (with parameters: # inputs, # outputs, # hidden, list of weights). Then the server only needs code for the generic Genome and Gene types. The generic Genome could then be used by workers to create the Neat network or other model.

Great. I think decoupling the Problem from the Population and Genome would be pretty ideal. In my experience using the current API, that has been a pain point. Abstracting that out to a Simulation of sorts sounds like it makes a lot of sense to me, especially in regards to a distributed type system.

In regards to your third point, I'm not sure I 100% understand. Using Neat as an example, would you mean some sort of "Genome Builder" to take the parameters of the model (weights, connections, etc) and turn them into a generic "Genome" type?

With the Tree and Neat being pretty generic, I think it makes sense to move the "models" folder out into its own lib/logical home (radiate_models or something like that). That way the core engine can stay generic, with the models only being pre-built problem solvers.

I mean that right now there is a Genome<Neat, NeatEnvironment> that does crossover/distance GA logic for Neat models. I wanted to keep that GA logic separate from the models.

Models like Neat would only define Gene types (one for each layer type Dense/LSTM/GRU). Each Gene type could have a fixed/variable set of parameters (input size, output size, hidden size, optional weights). The GA logic crossover/distance would only happen on the Gene types. The models would only need to be able to take a Genome (set of Genes) to create the model (Neat, etc..) from it.
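A minimal sketch of what those generic, model-agnostic Gene types could look like, where each layer kind carries its shape parameters and optional weights. All names here are assumptions for illustration, not anything in radiate today:

```rust
// Sketch of generic Gene types as discussed: one variant per layer
// kind, each carrying its parameters. Names are illustrative only.
#[derive(Debug, Clone)]
enum LayerGene {
    Dense { inputs: usize, outputs: usize, weights: Option<Vec<f32>> },
    Lstm  { inputs: usize, outputs: usize, hidden: usize, weights: Option<Vec<f32>> },
    Gru   { inputs: usize, outputs: usize, hidden: usize, weights: Option<Vec<f32>> },
}

// A generic Genome is just an ordered set of Genes; a worker would walk
// this list to construct the concrete Neat (or other) model locally.
type Genome = Vec<LayerGene>;

fn param_count(gene: &LayerGene) -> usize {
    match gene {
        LayerGene::Dense { inputs, outputs, .. } => inputs * outputs,
        // 4 gates for LSTM, 3 for GRU; counting only input->hidden
        // weights here for simplicity.
        LayerGene::Lstm { inputs, hidden, .. } => 4 * inputs * hidden,
        LayerGene::Gru { inputs, hidden, .. } => 3 * inputs * hidden,
    }
}

fn main() {
    let genome: Genome = vec![
        LayerGene::Dense { inputs: 2, outputs: 8, weights: None },
        LayerGene::Gru { inputs: 8, outputs: 1, hidden: 4, weights: None },
    ];
    let total: usize = genome.iter().map(param_count).sum();
    println!("{} params", total); // 2*8 + 3*8*4 = 112
}
```

Since the server only ever sees `LayerGene` values, its crossover/distance code never needs to know about Neat itself.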

I think a set of common Gene types could be shared/reused between models, or just provide a few simple ones with the core engine. This might also make it easier to work with larger/complex neural networks. Also it would be interesting to have a UI for viewing a Genome.

If this is possible, it would allow the server to just have GA code for those common Gene types. Then the workers just download a Genome (not the model) convert it to a model and train it against the problem.

I just pushed what I have done so far:

  • Support for custom run() loops.
  • Add TrainingSetDto to radiate_web
  • Update neat-web client/server to support polling for simulation status.
  • Client posts a new simulation to the server and gets back a uuid. Then it keeps polling the status of that simulation.
  • Right now the server is just doing one generation each time the client requests the status.

I should be able to add worker support tomorrow.

What pain points have you run into with the current API?

Ah, I see. Yes that actually makes a lot of sense, that could also lead to an easier way to construct a layered Neat network or even a Tree. Something like sending some sort of model building dto to the server then constructing the starting network accordingly. Right now the initial model construction has to be hard-coded which could be improved.

The pain points I've had with the Problem mostly revolve around it being so tightly coupled with the Genome and Population. It honestly isn't that bad, but it has been a point of annoyance even though I haven't gotten around to fixing it. As a rough example, I currently use the API to solve a problem where the struct implementing the Problem trait holds quite a bit of other information/structs. It requires a lot of code I'd rather not have there, mostly because of the need for the empty() function in the Problem trait. Although it sounds like going forward the Problem will be separate enough from the rest of the API that it wouldn't be an issue anymore.

That empty() method could be removed from the Problem trait by requiring all problems to impl Default

One way would be to use Default as a supertrait of Problem (I haven't tried it, but it should work):
trait Problem: Default { ... }
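A minimal sketch of that supertrait idea, with a simplified fitness signature (not radiate's actual Problem trait) and a made-up Xor problem type:

```rust
// Sketch: Default as a supertrait replaces a hand-written empty()
// constructor. The trait shape here is simplified for illustration.
trait Problem: Default {
    fn fitness(&self, member: &[f32]) -> f32;
}

#[derive(Default)]
struct Xor {
    // Extra state the problem carries; Default builds the "empty"
    // instance, so no empty() method is needed.
    attempts: usize,
}

impl Problem for Xor {
    fn fitness(&self, member: &[f32]) -> f32 {
        // Dummy scoring for illustration.
        member.iter().sum()
    }
}

// The engine can now create a fresh problem generically:
fn fresh<P: Problem>() -> P {
    P::default()
}

fn main() {
    let p: Xor = fresh();
    assert_eq!(p.fitness(&[0.25, 0.75]), 1.0);
    println!("ok, attempts = {}", p.attempts);
}
```

Structs that hold a lot of extra state can still opt out of `#[derive(Default)]` and write a manual `impl Default` where needed.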

My current use-case for radiate is a simple game AI. Right now the AI agents are only being trained to explore a 2D map and don't interact with each other. But in the future I want to train them in a shared "game instance", which is another reason I needed more control over the run() loop: to allow them to interact at each time-step of the game simulation.

I am porting the game AI from this book AI Techniques for Game Programming, which has a NEAT implementation in C++.

We could move all of the Problem code out of the core engine code (Population/Generation) into a different module (maybe radiate::sim).

I think requiring the Default trait would clear most of those troubles up, should have thought of that!

Radiate should fit that problem pretty well and this will definitely speed it up. If your AI needs any sort of memory, depending on the number of worker nodes set up, training LSTM/GRU layers could be viable in real time then.

Just got worker support working in the neat-web example. rocket or reqwest often have some connection issues (connection resets, or incomplete-message errors), so I had to add some retry logic. Those connection issues cause lost work units, but the server will allow another worker to retry old work units after an expire timeout.
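The expire-and-retry behavior could be sketched roughly as below. The names and the 30-second timeout are assumptions for illustration, not the neat-server code:

```rust
// Illustrative sketch of work-unit expiry: a unit handed to a worker
// that never completes becomes eligible for re-issue after a timeout.
use std::time::{Duration, Instant};

struct WorkUnit {
    id: u64,
    issued_at: Option<Instant>, // None = never handed out
    done: bool,
}

struct Server {
    units: Vec<WorkUnit>,
    expire: Duration,
}

impl Server {
    // Hand out the next unit that is unfinished and either unissued
    // or whose previous issue has expired.
    fn next_work(&mut self) -> Option<u64> {
        let now = Instant::now();
        for u in self.units.iter_mut() {
            let expired = u.issued_at.map_or(true, |t| now - t > self.expire);
            if !u.done && expired {
                u.issued_at = Some(now);
                return Some(u.id);
            }
        }
        None
    }

    // A worker reported a fitness score for this unit.
    fn complete(&mut self, id: u64) {
        if let Some(u) = self.units.iter_mut().find(|u| u.id == id) {
            u.done = true;
        }
    }
}

fn main() {
    let mut srv = Server {
        units: vec![
            WorkUnit { id: 1, issued_at: None, done: false },
            WorkUnit { id: 2, issued_at: None, done: false },
        ],
        expire: Duration::from_secs(30), // assumed timeout
    };
    let a = srv.next_work().unwrap(); // unit 1 goes to a worker
    srv.complete(a);                  // worker returns its result
    let b = srv.next_work().unwrap(); // unit 2 goes out next
    assert_eq!((a, b), (1, 2));
    println!("done");
}
```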

TODO:

  1. Work batching. Allow workers to request more than one work unit. Might only be useful for simple XOR-like problems.
  2. Better error handling and/or retry logic in neat-worker and neat-client.
  3. Allow neat-client to be restarted later with a simulation uuid for long running simulations.
  4. Allow workers to upload work results and get a new work unit with just one HTTP request.
  5. Try to refactor the "Work" structs. Right now there are too many structs with similar names.
  6. Remove some dead code from neat-server
  7. More flexible work unit expire timeout, or make it dynamic based on how long other work units take to finish.

Once I have cleaned it up I will submit a PR. The work doesn't break the existing API, so it would be nice to get it included before doing the other work we talked about.

Great, looks cool! I'll merge the PR whenever I see it.

I've actually run into an issue much like that: timeouts with the client taking too long to train, causing the request to time out. It is especially relevant while training Neat networks with LSTM/GRU layers evolving. The way I got around it was with something I call a 'training node', using redis-async pubsub. I've been using a slightly modified Tree for an application where I want multiple Trees for prediction, think Random Forest. My server publishes a struct holding all the training data/population settings, and the nodes pick it up and enter a loop that runs until a request to the server says to stop. After each loop iteration they evolve a population and post their final evolved models back to the server. Once the server has the number of Trees it needs for a forest, it returns false and the nodes break out of their loop and wait for the next publish. That way I don't really care about requests timing out, and for large training sets I don't have to worry about too much data being pushed around.

Not sure if this would be overkill for something like this, not to mention the addition of holding a redis db somewhere, but could be a possible solution/something to think about.

PR #4 submitted. Also I fixed the radiate_matrix_evtree crate to make it work with the latest Arc changes to radiate.

Great, thanks. Just merged your pull request and updated the crates on crates.io to their latest versions.