# PySR: High-Performance Symbolic Regression in Python

PySR is built on an extremely optimized pure-Julia backend, and uses regularized evolution, simulated annealing, and gradient-free optimization to search for equations that fit your data.


(pronounced like *py* as in python, and then *sur* as in surface)

If you find PySR useful, please cite it using the citation information given in CITATION.md. If you've finished a project with PySR, please submit a PR to showcase your work on the Research Showcase page!


Check out SymbolicRegression.jl for the pure-Julia backend of this package.

Symbolic regression is a very interpretable machine learning algorithm for low-dimensional problems: these tools search equation space to find algebraic relations that approximate a dataset.

One can also extend these approaches to higher-dimensional spaces by using a neural network as a proxy, as explained in arXiv:2006.11287, where we apply it to N-body problems. Here, one essentially uses symbolic regression to convert a neural net to an analytic equation. Thus, these tools simultaneously present an explicit and powerful way to interpret deep models.

*Backstory:*

Previously, we used Eureqa, which is a very efficient and user-friendly tool. However, Eureqa is GUI-only, doesn't allow for user-defined operators, has no distributed capabilities, and has become proprietary (and was recently merged into an online service). Thus, the goal of this package is to provide an open-source symbolic regression tool as efficient as Eureqa, while also exposing a configurable Python interface.

# Installation

| pip | conda |
|---|---|
| 1. Install Julia manually (see downloads)<br>2. `pip install pysr`<br>3. `python -c 'import pysr; pysr.install()'` | 1. `conda install -c conda-forge pysr`<br>2. `python -c 'import pysr; pysr.install()'` |

This last step will install and update the required Julia packages, including `PyCall.jl`.

Common issues tend to be related to Python not finding Julia.
To debug this, try running `python -c 'import os; print(os.environ["PATH"])'`.
If none of these folders contain your Julia binary, then you need to add Julia's `bin` folder to your `PATH` environment variable.
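The same check can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `find_julia` and `path_entries` are made up for this illustration and are not part of PySR.

```python
import os
import shutil


def path_entries(path=None):
    """Split a PATH-style string (default: the current PATH) into its folders."""
    raw = os.environ.get("PATH", "") if path is None else path
    return [entry for entry in raw.split(os.pathsep) if entry]


def find_julia(path=None):
    """Return the full path to a `julia` executable, or None if none is found.

    `path=None` searches the current PATH, just like the shell would.
    """
    return shutil.which("julia", path=path)


julia = find_julia()
if julia is None:
    # Mirror the manual check: list every folder that was searched.
    print("No julia executable found in any of these folders:")
    for entry in path_entries():
        print("  ", entry)
else:
    print("Found julia at:", julia)
```

If the script reports that no executable was found, add the folder containing the Julia binary to `PATH` and run it again.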

# Introduction

Let's create a PySR example. First, let's import numpy to generate some test data:

```
import numpy as np
X = 2 * np.random.randn(100, 5)
y = 2.5382 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5
```

We have created a dataset with 100 datapoints, with 5 features each. The relation we wish to model is $2.5382 \cos(x_3) + x_0^2 - 0.5$.

Now, let's create a PySR model and train it. PySR's main interface is in the style of scikit-learn:

```
from pysr import PySRRegressor
model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "*"],
    unary_operators=[
        "cos",
        "exp",
        "sin",
        "inv(x) = 1/x",  # Custom operator (julia syntax)
    ],
    model_selection="best",
    loss="loss(x, y) = (x - y)^2",  # Custom loss function (julia syntax)
)
```

This will set up the model for 40 iterations of the search code, which contains hundreds of thousands of mutations and equation evaluations.

Let's train this model on our dataset:

`model.fit(X, y)`

Internally, this launches a Julia process which will do a multithreaded search for equations to fit the dataset.

Equations will be printed during training, and once you are satisfied, you may quit early by hitting 'q' and then <enter>.

After the model has been fit, you can run `model.predict(X)` to see the predictions on a given dataset.

You may run:

`print(model)`

to print the learned equations:

```
PySRRegressor.equations = [
   pick     score  equation                                                loss  complexity
0        0.000000  4.4324794                                          42.354317           1
1        1.255691  (x0 * x0)                                           3.437307           3
2        0.011629  ((x0 * x0) + -0.28087974)                           3.358285           5
3        0.897855  ((x0 * x0) + cos(x3))                               1.368308           6
4        0.857018  ((x0 * x0) + (cos(x3) * 2.4566472))                 0.246483           8
5  >>>>       inf  (((cos(x3) + -0.19699033) * 2.5382123) + (x0 *...   0.000000          10
]
```
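As a sanity check, the visible part of the selected equation (row 5) can be compared with the relation used to generate the data. The sketch below uses only the constants shown in the printout and makes no assumption about the truncated tail of the equation; the function names are illustrative.

```python
import math

# Constants copied from the visible part of row 5 in the printout above.
OFFSET = -0.19699033
SCALE = 2.5382123


def recovered_cos_term(x3):
    """Visible cosine part of row 5: (cos(x3) + OFFSET) * SCALE."""
    return (math.cos(x3) + OFFSET) * SCALE


def true_cos_term(x3):
    """Cosine part and constant of the generating relation: 2.5382*cos(x3) - 0.5."""
    return 2.5382 * math.cos(x3) - 0.5


# Expanding the product gives SCALE*cos(x3) + SCALE*OFFSET,
# so SCALE*OFFSET plays the role of the constant -0.5.
implied_constant = SCALE * OFFSET
print(f"implied constant: {implied_constant:.6f}")

# Largest disagreement between the two forms on a grid over [-6, 6].
max_gap = max(
    abs(recovered_cos_term(t / 10) - true_cos_term(t / 10)) for t in range(-60, 61)
)
print(f"max gap on grid: {max_gap:.2e}")
```

The implied constant comes out very close to -0.5, so the search has recovered the cosine coefficient and the offset in a factored form.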

This arrow in the `pick` column indicates which equation is currently selected by your `model_selection` strategy for prediction.
(You may change `model_selection` after `.fit(X, y)` as well.)
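To make the `score` and `pick` columns more concrete, here is a rough sketch in plain Python. It rests on two assumptions that are not stated above: that the score is the drop in log-loss per unit of added complexity relative to the previous row (the printed numbers are consistent with this), and that `"best"` picks the highest score among equations whose loss is within 1.5x of the lowest loss. The `"accuracy"` and `"score"` strategy names, and the helpers `scores` and `select`, are likewise illustrative rather than a description of PySR's internals.

```python
import math

# (complexity, loss) pairs copied from the table printed by print(model).
rows = [
    (1, 42.354317),
    (3, 3.437307),
    (5, 3.358285),
    (6, 1.368308),
    (8, 0.246483),
    (10, 0.0),
]


def scores(rows):
    """Assumed score: -(change in log loss) / (change in complexity).

    The first row has no predecessor, so its score is taken to be 0.
    A loss of exactly 0 is treated as an infinite improvement.
    """
    out = [0.0]
    for (c_prev, l_prev), (c, l) in zip(rows, rows[1:]):
        if l == 0.0:
            out.append(math.inf)
        else:
            out.append(-(math.log(l) - math.log(l_prev)) / (c - c_prev))
    return out


def select(rows, strategy="best"):
    """Return the index of the row chosen under the (assumed) strategy."""
    s = scores(rows)
    losses = [loss for _, loss in rows]
    indices = range(len(rows))
    if strategy == "accuracy":  # lowest loss wins
        return min(indices, key=lambda i: losses[i])
    if strategy == "score":  # highest score wins
        return max(indices, key=lambda i: s[i])
    if strategy == "best":  # highest score among near-optimal losses
        threshold = 1.5 * min(losses)
        candidates = [i for i in indices if losses[i] <= threshold]
        return max(candidates, key=lambda i: s[i])
    raise ValueError(f"unknown strategy: {strategy!r}")


for i, value in enumerate(scores(rows)):
    print(i, f"{value:.6f}")
print("picked row:", select(rows, "best"))
```

Under these assumptions, the computed scores match the printed `score` column, and the picked row is the one marked with the arrow.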

`model.equations` is a pandas DataFrame containing all equations, including callable format (`lambda_format`), SymPy format (`sympy_format`, which you can also get with `model.sympy()`), and even JAX and PyTorch format (both of which are differentiable, and which you can get with `model.jax()` and `model.pytorch()`).

Note that `PySRRegressor` stores the state of the last search, and will restart from where you left off the next time you call `.fit()`. This will cause problems if significant changes are made to the search parameters (like changing the operators). You can run `model.reset()` to reset the state.

There are several other useful features, such as denoising (e.g., `denoising=True`) and feature selection (e.g., `select_k_features=3`).
For examples of these and other features, see the examples page.
For a detailed look at more options, see the options page.
You can also see the full API at this page.

# Docker

You can also test out PySR in Docker, without installing it locally, by running the following command in the root directory of this repo:

`docker build --pull --rm -f "Dockerfile" -t pysr "."`

This builds an image called `pysr`. If you have issues building (for example, on Apple Silicon), you can emulate an architecture that works by including: `--platform linux/amd64`.
You can then run this with:

`docker run -it --rm -v "$PWD:/data" pysr ipython`

which will link the current directory to the container's `/data` directory and then launch ipython.