Performance speed-up options?
yxie20 opened this issue · comments
Hello Miles! Thank you for open-sourcing this powerful tool! I am working on including PySR in my own research, and running into some performance bottlenecks.
I found regressing a simple equation (e.g. the quick-start example) takes roughly 2 minutes. Ideally, I am aiming to reduce that time to ~30 seconds. Would you give me some pointers on this? Meanwhile, I will try to break the challenge down into several pieces:
- Activating a new environment at each API call: I noticed that a new Julia (?) environment is created each time I call the `pysr()` API (see terminal output below). Could we keep the environment up so we can skip this process on subsequent calls?
```
Running on julia -O3 /tmp/tmpe5qmgemh/runfile.jl
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
    Updating registry at `~/.julia/registries/General`
  No Changes to `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  No Changes to `~/anaconda3/envs/rw/lib/python3.7/site-packages/Manifest.toml`
Activating environment on workers.
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
Importing installed module on workers...Finished!
Started!
```
- If the above wouldn't work, then allowing `y` to be vector-valued (as mentioned in #35) would be a second-best option! Even better, if we could create a "batched" version of the `pysr(X, y)` API, `pysr_batched(X, y)`, such that `X` and `y` are Python lists and the results are returned in a list as well, then we would only generate one Julia script and call `os.system()` once, keeping the Julia environment up.
- Multi-threading: I noticed that increasing `procs` from 4 to 8 resulted in a slightly longer running time. I am running on an 8-core, 16-thread CPU. Did I do something dumb?
- I went into `pysr/sr.py` and added a `runtests=false` flag on lines 438 and 440. That saved ~20 seconds.
Hi @yxie20,
Thanks for trying out PySR! Your suggestions are very good - I think having some batched call so that processes don't need to start for each dimension would be really nice.
- The workers aren't creating an environment; rather, they are activating a Julia environment. This is done within each process. It is expensive at startup, but should be negligible for long-running jobs. It is required for multi-node computation, since the workers are entirely separate processes. For single-node runs it would be nice to have multiple threads instead, but I think having a single interface makes things easier to maintain. For very short jobs, you can pass `procs=0`, which turns off multiprocessing and avoids this expensive startup. That may actually be a good short-term solution for multi-dimensional output?
- Good suggestion; this would be nice to add, and I'd be very interested in having it too! It will take a few code changes in the backend, but I think there's a smart way to do it that would incur very few structural changes.
- This is probably just because of startup time. More procs means more work for the head node. In the limit of large runtimes, more procs will be better (assuming you have `populations > procs`), but for very short runtimes it will indeed hurt. You can turn off multiprocessing with `procs=0`. On a 16-thread CPU you could do 16 procs, and set `populations=2*16` so each thread is always occupied. Also, if you have many procs, you might want to increase `ncyclesperiteration` so that the processes take longer between sending results back to the head node; that way it doesn't get saturated.
- `runtests=True` runs some tests on the backend before execution. This includes things like testing user operators for bad definitions (e.g., `sqrt` instead of `sqrt_abs`, although I automatically swap these now), whether user-defined operators are successfully copied to processes on other nodes, and also whether the whole pipeline works. I think it's good to have as a default to flag issues that are difficult to debug once the pipeline is actually running, but indeed, if you know your setup already works and want speed, you can turn it off.
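Putting the tuning advice above together, here is a hedged sketch for a 16-thread CPU. The keyword names follow the 0.6-era `pysr()` API discussed in this thread and may differ in your installed version, so the actual call is left commented out:

```python
import numpy as np

# Toy dataset: 100 samples, 5 features
X = np.random.randn(100, 5)
y = 2.0 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5

# Settings from the discussion above (names unverified against your version):
search_kwargs = dict(
    procs=16,                  # one process per hardware thread
    populations=32,            # populations > procs keeps every thread busy
    ncyclesperiteration=1000,  # longer cycles -> less traffic to the head node
    runtests=False,            # skip startup self-tests on a known-good setup
)

# from pysr import pysr
# equations = pysr(X, y, **search_kwargs)
```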
Hopefully this helps!
Cheers,
Miles
FYI I just added multi-output capabilities to the backend! It's on the multi-output branch of SymbolicRegression.jl, and will work its way into PySR soon enough.
Cheers,
Miles
Thank you Miles! I'm excited to give it a try! Now a basic question: how can I update the Julia backend so that PySR can use the multi-output branch?
Thanks again!
It will be in v0.6.0 of PySR. Not ready yet; I'll write when it is.
Cheers,
Miles
Release candidate is up:
```
pip install --upgrade pysr==0.6.0rc1
```

It will allow `y` to be a matrix.
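For example, a multi-output call might then look like this sketch (the exact shape convention and `pysr()` signature should be checked against the release candidate, so the call is left commented out):

```python
import numpy as np

X = np.random.randn(100, 5)

# With multi-output support, y can be a matrix: one column per output,
# and PySR searches for a separate set of equations for each column.
y = np.stack([
    2.0 * np.cos(X[:, 0]),   # output 1
    X[:, 1] ** 2 - X[:, 2],  # output 2
], axis=1)

# from pysr import pysr
# equations_per_output = pysr(X, y)  # one hall of fame per output column
```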
Let me know how this works!
Cheers,
Miles
Looks like we got an error:
```
Importing installed module on workers...Finished!
Testing module on workers...Finished!
Testing entire pipeline on workers...Finished!
Started!
ERROR: LoadError: SystemError: opening file "out2_/tmp/tmpjk5ery5w/hall_of_fame.csv": No such file or directory
Stacktrace:
  [1] systemerror(p::String, errno::Int32; extrainfo::Nothing)
    @ Base ./error.jl:168
  [2] #systemerror#62
    @ ./error.jl:167 [inlined]
  [3] systemerror
    @ ./error.jl:167 [inlined]
  [4] open(fname::String; lock::Bool, read::Nothing, write::Nothing, create::Nothing, truncate::Bool, append::Nothing)
    @ Base ./iostream.jl:293
  [5] open(fname::String, mode::String; lock::Bool)
    @ Base ./iostream.jl:355
  [6] open(fname::String, mode::String)
    @ Base ./iostream.jl:355
  [7] open(::SymbolicRegression.var"#47#73"{Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos), typeof(exp), typeof(sin)}, L2DistLoss}, Vector{PopMember}, SymbolicRegression.../Dataset.jl.Dataset{Float32}}, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./io.jl:328
  [8] open
    @ ./io.jl:328 [inlined]
  [9] EquationSearch(datasets::Vector{SymbolicRegression.../Dataset.jl.Dataset{Float32}}; niterations::Int64, options::Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos), typeof(exp), typeof(sin)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/8HpEO/src/SymbolicRegression.jl:398
 [10] EquationSearch(X::Matrix{Float32}, y::Matrix{Float32}; niterations::Int64, weights::Nothing, varMap::Vector{String}, options::Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos), typeof(exp), typeof(sin)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/8HpEO/src/SymbolicRegression.jl:144
 [11] top-level scope
    @ /tmp/tmpjk5ery5w/runfile.jl:7
in expression starting at /tmp/tmpjk5ery5w/runfile.jl:7
Traceback (most recent call last):
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 759, in get_hof
    all_outputs = [pd.read_csv(f'out{i}_' + str(equation_file) + '.bkup', sep="|") for i in range(1, nout+1)]
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 759, in <listcomp>
    all_outputs = [pd.read_csv(f'out{i}_' + str(equation_file) + '.bkup', sep="|") for i in range(1, nout+1)]
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 1867, in __init__
    self._open_handles(src, kwds)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 1368, in _open_handles
    storage_options=kwds.get("storage_options", None),
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/common.py", line 647, in get_handle
    newline="",
FileNotFoundError: [Errno 2] No such file or directory: 'out1_/tmp/tmpjk5ery5w/hall_of_fame.csv.bkup'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "batch_test.py", line 20, in <module>
    temp_equation_file=True,
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 365, in pysr
    equations = get_hof(**kwargs)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 763, in get_hof
    raise RuntimeError("Couldn't find equation file! The equation search likely exited before a single iteration completed.")
RuntimeError: Couldn't find equation file! The equation search likely exited before a single iteration completed.
```
Sorry, this looks like a bug. I missed the case where both `temp_equation_file=True` and `multioutput=True` are set. Will fix now.
Fixed in 0.6.3+
Thank you Miles! It seems like 0.6.3 is even slower than before.
- Did you change any default argument values?
- How come, even if I set `niterations` to 5, PySR still runs 200 iterations? (I think this is the main reason why 0.6.3 runs much slower.)
- Is there an early-stop option, perhaps based on MSE, to further speed up performance?
- I am also getting a weird printout like this during the first few iterations:
```
==============================
Cycles per second: 0.000e+00
Head worker occupation: 0.0%
Progress: 0 / 200 total iterations (0.000%)
==============================
Best equations for output 1
Hall of Fame:
-----------------------------------------
Complexity  Loss  Score  Equation
==============================
```
- Yes, the default arguments now do more iterations and use a larger number of populations, so the default run takes longer. `annealing` has also been set to false, which makes simple equations take longer but makes more complex equations achievable. For simple equations, you could probably set `annealing=True` again.
- The `niterations` argument is iterations per population, so the progress bar shows `populations*niterations`.
- No, but you can set the `timeout` argument or hit `<ctrl-c>`. It might be nice to have a user-set condition for early stopping though; that's a good point.
- What do you mean by "first few iterations"? Do you mean it has 0 equations, but several iterations have passed?

By the way, if you want a smaller startup time, you could set `julia_optimization=0`. That will turn off the optimizing compiler for the Julia code, which should let it start faster.
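For short runs inside a loop, those flags might combine into something like this. This is only a sketch: the keyword names follow the 0.6-era `pysr()` API discussed in this thread, and the `timeout` units should be verified against your installed version, so the call itself is left commented out:

```python
# Hypothetical settings for a fast, short run, based on the advice above.
fast_kwargs = dict(
    annealing=True,        # simpler equations converge faster with annealing
    timeout=30,            # wall-clock budget (check the units in your version)
    julia_optimization=0,  # skip Julia's optimizing compiler -> faster startup
    procs=0,               # no multiprocessing -> no worker startup cost
)

# from pysr import pysr
# equations = pysr(X, y, **fast_kwargs)
```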
Thank you for 1) and 2)! Looking forward to 3)!
For 4), exactly as you said. I get a few of those empty (zero-equation) printouts for several iterations. It always says:

```
Cycles per second: 0.000e+00
Head worker occupation: 0.0%
Progress: 0 / <however much> total iterations (0.000%)
```

with no equations listed. After about 2 minutes of hanging, the normal printout appears. The hang is much longer when `populations` is set to a large number.
The results are fine! Maybe it's nothing to worry about!
It doesn't say "Progress: 1 / ...", right? It's stuck at "Progress: 0"? This is expected behaviour, although maybe I should wait for some equations before starting the printing.
By the way - on PySR 0.6.5, which will be up later today - I added a patch which boosts performance by nearly 2x. It turns out the optimization library I was using (main bottleneck) did not require a differentiable function, so I implemented a faster non-differentiable version.
One other idea. The backend of PySR is in Julia, and Julia has a bit of a slow startup time, hence the slow startup of PySR.
There's a way to avoid the startup time, by using this package: https://github.com/dmolina/DaemonMode.jl. It would probably let you execute PySR runs in quick succession. The idea would be to start up a Julia daemon when first running PySR, pre-compile SymbolicRegression in that daemon, then execute each new script within that daemon. Thus, you wouldn't need to restart Julia every time you call PySR.
Edit: just tried it; it doesn't really help.
More ideas, which would probably help quite a bit:
- Use `Threads` instead of `Distributed`. That would cut down on startup time quite a bit, since you would be using a single Julia process instead of one per `procs`.
- Get PySR to start Julia with workers via `julia -p {procs}`, rather than creating them dynamically and copying in user definitions.
- Make `EquationSearch` specialize to the type of parallelism, rather than having it as a variable.
Following up: if early stopping (based on MSE) can be implemented, that would be super helpful for speeding up PySR on my end, where I have PySR running inside a large for loop. Do you think this is possible? Thank you!
Sure; what sort of things would you want to trigger the stop? An absolute error reached, or relative error, or something like no error improvement for N iterations?
I think both 2) relative error and 3) convergence make sense! I can work with 1) absolute error as well. For my purposes, having both 2) and 3), or 1) and 3), will be sufficient.
Thank you, Miles!
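As a user-side illustration of the stopping conditions being discussed (absolute error, relative improvement, and convergence), here is a minimal sketch in plain Python. None of this is PySR API; it only shows the logic one might apply to the best loss recorded at each iteration:

```python
def should_stop(loss_history, abs_tol=1e-6, rel_tol=1e-3, patience=10):
    """Return True if a search can stop early, given per-iteration best losses.

    Stops when the best loss reaches abs_tol, or when the relative
    improvement over the last `patience` iterations falls below rel_tol.
    """
    if not loss_history:
        return False
    best = min(loss_history)
    # 1) absolute-error criterion
    if best <= abs_tol:
        return True
    # 3) convergence criterion: negligible relative improvement
    #    over the last `patience` iterations
    if len(loss_history) > patience:
        old_best = min(loss_history[:-patience])
        improvement = (old_best - best) / old_best if old_best > 0 else 0.0
        if improvement < rel_tol:
            return True
    return False
```

For example, `should_stop([0.5] * 12)` returns True (no improvement for more than 10 iterations), while `should_stop([1.0, 0.5, 0.25])` returns False (still improving, and not yet at the absolute tolerance).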
Just a note on multiple outputs (output `y` being multi-dimensional) with early exit:
Right now, it seems like we compute each output dimension sequentially. To make early exit work correctly, we should probably early-exit on each dimension, so we don't exit when the first few dimensions have finished while the rest haven't started.
Thanks again!
FYI, `multithreading` is now an option in PySR v0.6.11. That should help with startup time.
> Right now, it seems like we compute each output dimension sequentially.
Actually, each output is computed at the same time asynchronously. One particular batch of computation may finish earlier than another, which might make it seem that it is done sequentially.
> we should probably early-exit on each dimension, so we don't exit when the first few dimensions have finished while the rest haven't started.
This is a really good idea to do early stopping on each output separately. That would free up more cores for the remaining outputs.