Performance speed-up options?
yxie20 opened this issue · comments
Hello Miles! Thank you for open-sourcing this powerful tool! I am working on including PySR in my own research, and running into some performance bottlenecks.
I found regressing a simple equation (e.g. the quick-start example) takes roughly 2 minutes. Ideally, I am aiming to reduce that time to ~30 seconds. Would you give me some pointers on this? Meanwhile, I will try to break the challenge down into several pieces:
- Activating a new environment at each API call: I noticed that a new Julia (?) environment is created each time I call the `pysr()` API (see terminal output below). Could we keep the environment up so we can skip this process on subsequent calls?
```
Running on julia -O3 /tmp/tmpe5qmgemh/runfile.jl
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
    Updating registry at `~/.julia/registries/General`
  No Changes to `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  No Changes to `~/anaconda3/envs/rw/lib/python3.7/site-packages/Manifest.toml`
Activating environment on workers.
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
Importing installed module on workers...Finished!
Started!
```
- If the above wouldn't work, then allowing `y` to be vector-valued (as mentioned in #35) would be a second-best option! Even better, if we could create a "batched" version of the `pysr(X, y)` API, `pysr_batched(X, y)`, such that `X` and `y` are Python lists and the results are returned in a list as well, then we would only generate one Julia script and call `os.system()` once, keeping the Julia environment up.
- Multi-threading: I noticed that increasing `procs` from 4 to 8 resulted in a slightly longer running time. I am running on an 8-core, 16-thread CPU. Did I do something dumb?
- I went into `pysr/sr.py` and added a `runtests=false` flag on lines 438 and 440. That saved ~20 seconds.
Hi @yxie20,
Thanks for trying out PySR! Your suggestions are very good - I think having some batched call so that processes don't need to start for each dimension would be really nice.
- The workers aren't creating an environment; rather, they are activating a Julia environment. This is done within each process. It is expensive at startup, but should be negligible for long-running jobs. It is required for multi-node computation, since the workers are entirely separate processes. For single-node runs it would be nice to have multiple threads instead, but I think having a single interface makes things easier to maintain. For very short jobs, you can pass `procs=0`, which turns off multiprocessing and avoids this expensive startup. That may actually be a good short-term solution for multi-dimensional output?
- Good suggestion; this would be nice to add, and I'd be very interested in having it too! It will take a few code changes in the backend, but I think there's a smart way to do it that would incur very few structural changes.
- This is probably just because of startup time. More procs means more work for the head node. In the limit of large runtimes, more procs will be better (assuming you have `populations > procs`), but for very short runtimes it will indeed hurt. You can turn off multiprocessing with `procs=0`. On a 16-thread CPU you could do 16 procs, and set `populations=2*16` so each thread is always occupied. Also, if you have many procs, you might want to increase `ncyclesperiteration` so that the processes take longer between sending results back to the head node; that way it doesn't get saturated.
- `runtests=True` runs some tests on the backend before execution. This includes things like testing user operators for bad definitions (e.g., `sqrt` instead of `sqrt_abs`, although I automatically swap these now), whether user-defined operators are successfully copied to processes on other nodes, and also whether the whole pipeline works. I think it's good to have as a default to flag issues that are difficult to debug once the pipeline is actually running, but indeed, if you know your setup already works and want speed, you can turn it off.
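Putting the tuning advice above together, here is a hedged sketch for a 16-thread CPU. The keyword names follow the 0.6-era `pysr()` API discussed in this thread and may differ in your installed version, so the actual call is left commented out:

```python
import numpy as np

# Toy dataset: 100 samples, 5 features
X = np.random.randn(100, 5)
y = 2.0 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5

# Settings from the discussion above (names unverified against your version):
search_kwargs = dict(
    procs=16,                  # one process per hardware thread
    populations=32,            # populations > procs keeps every thread busy
    ncyclesperiteration=1000,  # longer cycles -> less traffic to the head node
    runtests=False,            # skip startup self-tests on a known-good setup
)

# from pysr import pysr
# equations = pysr(X, y, **search_kwargs)
```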
Hopefully this helps!
Cheers,
Miles
FYI I just added multi-output capabilities to the backend! It's on the multi-output branch of SymbolicRegression.jl, and will work its way into PySR soon enough.
Cheers,
Miles
Thank you Miles! I'm excited to give it a try! Now a basic question: how can I update the Julia backend so that PySR can use the multi-output branch?
Thanks again!
It will be in v0.6.0 of PySR. Not ready yet; I'll write when it is.
Cheers,
Miles
Release candidate is up:
```
pip install --upgrade pysr==0.6.0rc1
```

It will allow `y` to be a matrix.
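For example, a multi-output call might then look like this sketch (the exact shape convention and `pysr()` signature should be checked against the release candidate, so the call is left commented out):

```python
import numpy as np

X = np.random.randn(100, 5)

# With multi-output support, y can be a matrix: one column per output,
# and PySR searches for a separate set of equations for each column.
y = np.stack([
    2.0 * np.cos(X[:, 0]),   # output 1
    X[:, 1] ** 2 - X[:, 2],  # output 2
], axis=1)

# from pysr import pysr
# equations_per_output = pysr(X, y)  # one hall of fame per output column
```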
Let me know how this works!
Cheers,
Miles
Looks like we got an error:
```
Importing installed module on workers...Finished!
Testing module on workers...Finished!
Testing entire pipeline on workers...Finished!
Started!
ERROR: LoadError: SystemError: opening file "out2_/tmp/tmpjk5ery5w/hall_of_fame.csv": No such file or directory
Stacktrace:
  [1] systemerror(p::String, errno::Int32; extrainfo::Nothing)
    @ Base ./error.jl:168
  [2] #systemerror#62
    @ ./error.jl:167 [inlined]
  [3] systemerror
    @ ./error.jl:167 [inlined]
  [4] open(fname::String; lock::Bool, read::Nothing, write::Nothing, create::Nothing, truncate::Bool, append::Nothing)
    @ Base ./iostream.jl:293
  [5] open(fname::String, mode::String; lock::Bool)
    @ Base ./iostream.jl:355
  [6] open(fname::String, mode::String)
    @ Base ./iostream.jl:355
  [7] open(::SymbolicRegression.var"#47#73"{Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos), typeof(exp), typeof(sin)}, L2DistLoss}, Vector{PopMember}, SymbolicRegression.../Dataset.jl.Dataset{Float32}}, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./io.jl:328
  [8] open
    @ ./io.jl:328 [inlined]
  [9] EquationSearch(datasets::Vector{SymbolicRegression.../Dataset.jl.Dataset{Float32}}; niterations::Int64, options::Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos), typeof(exp), typeof(sin)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/8HpEO/src/SymbolicRegression.jl:398
 [10] EquationSearch(X::Matrix{Float32}, y::Matrix{Float32}; niterations::Int64, weights::Nothing, varMap::Vector{String}, options::Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos), typeof(exp), typeof(sin)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/8HpEO/src/SymbolicRegression.jl:144
 [11] top-level scope
    @ /tmp/tmpjk5ery5w/runfile.jl:7
in expression starting at /tmp/tmpjk5ery5w/runfile.jl:7
Traceback (most recent call last):
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 759, in get_hof
    all_outputs = [pd.read_csv(f'out{i}_' + str(equation_file) + '.bkup', sep="|") for i in range(1, nout+1)]
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 759, in <listcomp>
    all_outputs = [pd.read_csv(f'out{i}_' + str(equation_file) + '.bkup', sep="|") for i in range(1, nout+1)]
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 1867, in __init__
    self._open_handles(src, kwds)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/parsers.py", line 1368, in _open_handles
    storage_options=kwds.get("storage_options", None),
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pandas/io/common.py", line 647, in get_handle
    newline="",
FileNotFoundError: [Errno 2] No such file or directory: 'out1_/tmp/tmpjk5ery5w/hall_of_fame.csv.bkup'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "batch_test.py", line 20, in <module>
    temp_equation_file=True,
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 365, in pysr
    equations = get_hof(**kwargs)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 763, in get_hof
    raise RuntimeError("Couldn't find equation file! The equation search likely exited before a single iteration completed.")
RuntimeError: Couldn't find equation file! The equation search likely exited before a single iteration completed.
```
Sorry, this looks like a bug. I missed the case where both `temp_equation_file=True` and `multioutput=True` are set. Will fix now.
Fixed in 0.6.3+
Thank you Miles! It seems like 0.6.3 is even slower than before.
- Did you change any default argument values?
- How come, even if I set `niterations` to 5, PySR still runs 200 iterations? (I think this is the main reason why 0.6.3 runs much slower.)
- Is there an early-stop option, perhaps based on MSE, to further speed up performance?
- I am also getting a weird printout like this during the first few iterations:
```
==============================
Cycles per second: 0.000e+00
Head worker occupation: 0.0%
Progress: 0 / 200 total iterations (0.000%)
==============================
Best equations for output 1
Hall of Fame:
-----------------------------------------
Complexity  Loss  Score  Equation
==============================
```
- Yes, the default arguments now do more iterations and use a larger number of populations, so the default run takes longer. `annealing` has also been set to false, which makes simple equations take longer but makes more complex equations achievable. For simple equations, you could probably set `annealing=True` again.
- The `niterations` argument is iterations per population, so the progress bar shows `populations*niterations`.
- No, but you can set the `timeout` argument or hit `<ctrl-c>`. It might be nice to have a user-set condition for early stopping though; that's a good point.
- What do you mean by "first few iterations"? Do you mean it has 0 equations, but several iterations have passed?

By the way, if you want a smaller startup time, you could set `julia_optimization=0`. That will turn off the optimizing compiler for the Julia code, which should let it start faster.
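For short runs inside a loop, those flags might combine into something like this. This is only a sketch: the keyword names follow the 0.6-era `pysr()` API discussed in this thread, and the `timeout` units should be verified against your installed version, so the call itself is left commented out:

```python
# Hypothetical settings for a fast, short run, based on the advice above.
fast_kwargs = dict(
    annealing=True,        # simpler equations converge faster with annealing
    timeout=30,            # wall-clock budget (check the units in your version)
    julia_optimization=0,  # skip Julia's optimizing compiler -> faster startup
    procs=0,               # no multiprocessing -> no worker startup cost
)

# from pysr import pysr
# equations = pysr(X, y, **fast_kwargs)
```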
Thank you for 1) and 2)! Looking forward to 3)!
For 4), exactly as you said. I get a few of those empty (zero-equation) printouts for several iterations. It always says:

```
Cycles per second: 0.000e+00
Head worker occupation: 0.0%
Progress: 0 / <however much> total iterations (0.000%)
```

with no equations listed. After about 2 minutes of hanging, the normal printout appears. The hang is much longer when `populations` is set to a large number.
The results are fine! Maybe it's nothing to worry about!
It doesn't say "Progress: 1 / ...", right? It's stuck at "Progress: 0"? This is expected behaviour, although maybe I should wait for some equations before starting the printing.
By the way - on PySR 0.6.5, which will be up later today - I added a patch which boosts performance by nearly 2x. It turns out the optimization library I was using (main bottleneck) did not require a differentiable function, so I implemented a faster non-differentiable version.
One other idea. The backend of PySR is in Julia, and Julia has a bit of a slow startup time, hence the slow startup of PySR.
There's a way to avoid the startup time, by using this package: https://github.com/dmolina/DaemonMode.jl. It would probably let you execute PySR runs in quick succession. The idea would be to start up a Julia daemon when first running PySR, pre-compile SymbolicRegression in that daemon, then execute each new script within that daemon. Thus, you wouldn't need to restart Julia every time you call PySR.
Edit: just tried it; it doesn't really help.
More ideas, which would probably help quite a bit:
- Use `Threads` instead of `Distributed`. That would cut down on startup time quite a bit, since you would be using a single Julia process instead of one per `procs`.
- Get PySR to start Julia with workers via `julia -p {procs}`, rather than creating them dynamically and copying in user definitions.
- Make `EquationSearch` specialize to the type of parallelism, rather than having it as a variable.
Following up: if early stopping (based on MSE) can be implemented, that would be super helpful for speeding up PySR on my end, where I have PySR running inside a large for loop. Do you think this is possible? Thank you!
Sure; what sort of things would you want to trigger the stop? An absolute error reached, or relative error, or something like no error improvement for N iterations?
I think both 2) relative error and 3) convergence make sense! I can work with 1) absolute error as well. For my purposes, having both 2) and 3), or 1) and 3), will be sufficient.
Thank you, Miles!
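As a user-side illustration of the stopping conditions being discussed (absolute error, relative improvement, and convergence), here is a minimal sketch in plain Python. None of this is PySR API; it only shows the logic one might apply to the best loss recorded at each iteration:

```python
def should_stop(loss_history, abs_tol=1e-6, rel_tol=1e-3, patience=10):
    """Return True if a search can stop early, given per-iteration best losses.

    Stops when the best loss reaches abs_tol, or when the relative
    improvement over the last `patience` iterations falls below rel_tol.
    """
    if not loss_history:
        return False
    best = min(loss_history)
    # 1) absolute-error criterion
    if best <= abs_tol:
        return True
    # 3) convergence criterion: negligible relative improvement
    #    over the last `patience` iterations
    if len(loss_history) > patience:
        old_best = min(loss_history[:-patience])
        improvement = (old_best - best) / old_best if old_best > 0 else 0.0
        if improvement < rel_tol:
            return True
    return False
```

For example, `should_stop([0.5] * 12)` returns True (no improvement for more than 10 iterations), while `should_stop([1.0, 0.5, 0.25])` returns False (still improving, and not yet at the absolute tolerance).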
Just a note on multiple outputs (output `y` being multi-dimensional) with early exit:
Right now, it seems like we compute each output dimension sequentially. To make early exit work correctly, we should probably early-exit on each dimension, so we don't exit when the first few dimensions have finished while the rest haven't started.
Thanks again!
FYI, `multithreading` is now an option in PySR v0.6.11. That should help with startup time.
> Right now, it seems like we compute each output dimension sequentially.
Actually, each output is computed at the same time asynchronously. One particular batch of computation may finish earlier than another, which might make it seem that it is done sequentially.
> we should probably early-exit on each dimension, so we don't exit when the first few dimensions have finished while the rest haven't started.
This is a really good idea to do early stopping on each output separately. That would free up more cores for the remaining outputs.