BYU-PRISM / GEKKO

GEKKO Python for Machine Learning and Dynamic Optimization

Home Page: https://machinelearning.byu.edu

Default to remote = False

abe-mart opened this issue

The more I've seen, the more I've realized that it's very unusual for a computational Python package to make calls to a remote server by default, especially calls that share the data being processed. This is especially relevant with the rising concern around malware in the Python ecosystem. I would propose that it would make more sense (and be safer) to default to remote = False, and leave remote = True as an optional feature for users who desire it. This would have the additional benefit of relieving the load on the APM server and providing a better experience for users who need the additional solvers available there.
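
For reference, opting out of the remote solve is already a one-argument change; the proposal is just to flip which of these is the default. A minimal sketch using the current GEKKO API:

    from gekko import GEKKO

    # current default: the model and data are sent to the public APMonitor server
    m = GEKKO()                  # equivalent to GEKKO(remote=True) today

    # current opt-out (and the proposed default): solve on the local machine
    m = GEKKO(remote=False)

    x = m.Var(value=1)
    m.Equation(x**2 == 4)
    m.solve(disp=False)
    print(x.value)               # e.g. [2.0]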

The downside is that this change would break backward compatibility, so it shouldn't be made lightly. Per semantic versioning, it would need to be a new major version number.

Thoughts? Is this a good idea?

I personally completely agree with this. I think it does erode confidence in our package when people discover it has been making network calls with their data without realizing it. I think there are even people who try benchmarking solutions before realizing that the solve is not even happening on their machine and is subject to unpredictable network delays. That said, there is the issue of which linear solvers are compiled with the APM executable.

Thanks for bringing up this great point. The decision to use a remote solve by default was originally made because compiled APM executables were not available for every computing platform. Now there are executables for Windows, MacOS, Linux, and ARM Linux (Raspberry Pi). @dhill2522 also brings up the point that the local solve is limited on some platforms. IPOPT is only available for local solves on Windows, not on MacOS, Linux, or ARM Linux. IPOPT is the default solver, and with the switch it would no longer be the default on anything but Windows. The local IPOPT executable built with the gcc compilers on those platforms is, in my opinion, too large to distribute at about 50 MB. Gekko is currently about 10 MB, and bundling IPOPT for every platform would increase the size of the distribution to 160 MB, unless a runtime download were implemented.

One of the advantages with the remote solve is that the anonymized data and models can be used as benchmarks to improve the solvers (see APMonitor Terms and Conditions). Gekko sends the variables as v1, v2, ... and parameters as p1, p2, ... with no code comments. When I've had conversations with companies about this, they don't want to send data to the server and either switch to local solves or else set up a local server that they manage.
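
For anyone who wants to verify what would be transmitted, the generated model text can be inspected after a local solve; a sketch that assumes the run directory is exposed as the internal attribute m._path and contains the generated .apm file (true in current releases, but this may change):

    from gekko import GEKKO
    import glob, os

    # build and solve a small model locally, then read the anonymized model file
    # that GEKKO generates (variables appear as v1, v2, ... and parameters as p1, p2, ...)
    m = GEKKO(remote=False)
    p = m.Param(value=4)
    x = m.Var(value=1)
    m.Equation(x**2 == p)
    m.solve(disp=False)

    # m._path is the temporary run directory (an internal attribute in current
    # releases; m.open_folder() opens the same folder interactively)
    apm_file = glob.glob(os.path.join(m._path, '*.apm'))[0]
    print(open(apm_file).read())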

Another advantage of the remote solve is that a local executable is not started on a person's local computer. Personally, I'd feel better about text information transmitted back and forth to a compute server versus a local executable running on my computer. The executable is written in Fortran so there isn't much risk, but it is still a potential concern.

Overall, the remote server is probably better for those just trying Gekko with sample code on APMonitor or other websites. Anyone using it for business applications probably wants to switch to remote=False or set up a local server.

I'm open to switching the default to remote=False, but I just wanted to bring up the points about cross-platform accessibility of the default solver and the potential improvement to the solvers with a more comprehensive benchmark set.

I had not realized that packaging with IPOPT would be such a drastic increase in distribution size. Is that size even with completely pre-compiled wheels?

Maybe it would be possible to have a GEKKO-IPOPT package that added IPOPT support for those that require it and have the required disk capacity? This could be done as a replacement for or add-on to the existing GEKKO package.

Another option to consider could be to distribute a GEKKO Docker/container image of the normally remote server that contained everything nicely compiled and packaged. GEKKO could then easily be set to search for and use a "remote" server on the local machine. This could provide remote server capabilities with minimal installation difficulty and minimal cross-platform wrangling. If a two-stage Docker build is used, the resulting container may well be almost as small as the compiled binaries.
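
If such a container existed, pointing GEKKO at it would only require the existing server argument; a sketch, where the localhost URL and port are placeholders for however the container exposes the APM service:

    from gekko import GEKKO

    # "remote" solve, but against an APM server container running on this machine;
    # the URL and port are illustrative and depend on how the container is published
    m = GEKKO(remote=True, server='http://localhost:8080')
    x = m.Var(value=1)
    m.Equation(x**2 == 4)
    m.solve(disp=True)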

I agree that there are a lot of benefits to the remote=True solve, and it should be kept around, but my opinion is that it should be an opt-in feature, not an opt-out. My main concern is that it is unexpected behavior. I would venture that many new users don't realize that their models and data are being transmitted offsite by default, let alone that they are being harvested for benchmarking! If I download scipy and call fsolve for example, I would expect the computation to be done locally, and I would be taken aback if instead my problem was sent off somewhere for processing without my explicit instruction. The current operation is a little sneaky, and people as a rule don't like sneaky software...

I think that downloading compiled binaries as part of a package is pretty well-established practice in scientific Python, though most packages allow you to build from source if you're paranoid.

I think that many of the issues with remote=False as the default could be addressed with some good error messages. For example "The default IPOPT solver is not supported on your platform. Please use the APOPT solver (options.SOLVER=1), or use the GEKKO(remote=True) option to solve on the APMonitor server (see terms and conditions)." That way people know what they are getting into, but the option is still plainly accessible for those who need it.
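
A sketch of what that guard might look like in the local-solve path; the function name and platform check are hypothetical, not current GEKKO code:

    import platform

    # hypothetical guard for the local-solve path (names and checks are illustrative)
    def check_local_solver(solver):
        """Fail early with an actionable message when the requested solver
        is not shipped with the local executable for this platform."""
        ipopt_platforms = {'Windows'}   # assumption: local IPOPT only on Windows
        if solver == 3 and platform.system() not in ipopt_platforms:
            raise Exception(
                'The default IPOPT solver is not supported on your platform. '
                'Please use the APOPT solver (m.options.SOLVER=1), or use the '
                'GEKKO(remote=True) option to solve on the APMonitor server '
                '(see terms and conditions).')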

I just can't keep track of how many times I've said things over the years like "Go download this cool optimization package, but make sure you set this specific option in your code every single time you use it or you'll violate our organization's data policies!"

I agree with greater transparency and default options that are business friendly. I'm trying to think about the overall experience of most users. Most of the 5k-10k downloads/mo are likely from people running an example problem such as the Gekko examples, Optimization Introductions, or many of the other example problems that are online. There is also a large group of students using it for class projects or research projects. Switching to GEKKO(remote=False) would change the behavior of many of the example problems, unless they were switched to remote=True to retain the expected behavior. I can change the examples on the APMonitor site, but there are many other GitHub repos that would require pull requests. The IPOPT solver is the default because it generally works better over a wider class of problems. Before switching to remote=False by default, the local IPOPT option needs to be resolved for Linux and MacOS. Switching to remote=False now would be too much of a breaking change for most users, especially those who are just "kicking the tires".

The public server gets so many jobs that I had to set up a cron job to delete all jobs each day, otherwise it was filling up the storage after a few days. There is a large potential to improve solvers, optimize hyper-parameters, or otherwise help users navigate errors in models with the anonymized models. Nothing has been done with that yet, but I have been approached by those interested in training LLMs for optimization. I could update the terms to mention that jobs are deleted daily from the server.

Here's a potential idea to resolve the competing interests: how about a gekko wrapper or fork such as gekko-local that is a much larger distribution with the IPOPT binaries, a remote=False default, or even remote=True removed entirely? The gekko package could then be a small (50 kB) distribution without local solvers, or keep just APOPT and BPOPT for a 5 MB distribution. In the immediate term, a better message such as "Solving on Remote Server, switch to GEKKO(remote=False) to solve locally" is needed.

Added to gekko.py

            #solve remotely
            if disp:
                print("Solving on APM Server, switch to GEKKO(remote=False) to solve locally")
            response = cmd(self._server, self._model_name, 'solve', disp, debug)
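
For reference, the wrapper side of a gekko-local package could be as small as re-exporting GEKKO with a different default; a hypothetical sketch (the package and class layout are illustrative, not an existing distribution):

    # hypothetical gekko_local package: identical interface, local solves by default
    from gekko import GEKKO as _GEKKO

    class GEKKO(_GEKKO):
        """Same API as gekko.GEKKO, but remote defaults to False."""
        def __init__(self, remote=False, **kwargs):
            super().__init__(remote=remote, **kwargs)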

A gekko-local package may be an acceptable compromise if the difference is made clear up front.

Switching to remote=False would be a breaking change, and would probably need to wait until the next major version if it were done, as it would require some time investment. I'd guess that the majority of student users run Windows and wouldn't be impacted, though the Colab examples would need to be updated, and external repos might need to be pinned to a previous version or updated.
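
For external repos, pinning would be a one-line constraint until their examples are updated; a hypothetical requirements.txt entry (the version bound assumes the change lands in a 2.x release, which has not been decided):

    # keep the old remote=True default until this repo's examples are updated
    gekko<2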

The Terms and Conditions may need a refresh, as reading through them I don't see anything about retaining models and data for benchmarking except possibly Section 5:

APS welcomes any feedback, suggestions, modifications, data, mathematical models, or any reports on bugs or errors (collectively "Feedback") regarding the Site. Therefore, you agree that any and all such Feedback you provide to APS shall be of a non-confidential nature. Further, you grant APS and its affiliates an irrevocable license under all necessary rights (including, but not limited to, all intellectual property rights) to use, modify, copy, distribute, sublicense, display, perform and prepare derivative works of such Feedback for any purpose without payment or accounting to you of any kind, in perpetuity.

Referring to any problem solved with Gekko as "Feedback" seems a stretch, and if so, the rights claimed are very broad. Collecting user-created data for AI training, especially without opt-in consent, is a touchy subject these days. Not trying to be antagonistic here, just trying to head off any trouble before it starts.

I also have a hard time suggesting gekko for commercial use because of the remote solves.

We should be able to create platform-specific wheels to only provide the necessary executables. Would that keep each download to <50 MB? That seems like a reasonable size.
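
A rough sketch of the idea, assuming the solver executables live in a gekko/bin/ directory; the file names and mapping below are illustrative, and each wheel would also need a platform tag from the build tooling (e.g. cibuildwheel) rather than being pure-Python:

    # setup.py sketch (hypothetical): bundle only the solver binary for the
    # platform this wheel is built on, keeping each platform-specific wheel small
    import platform
    from setuptools import setup

    # illustrative mapping; the real binary names/layout in gekko/bin/ may differ
    local_binaries = {
        'Windows': ['bin/apm.exe'],
        'Darwin':  ['bin/apm_mac'],
        'Linux':   ['bin/apm_arm' if 'arm' in platform.machine().lower()
                    else 'bin/apm'],
    }

    setup(
        name='gekko',
        packages=['gekko'],
        package_data={'gekko': local_binaries.get(platform.system(), [])},
    )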

That does seem like a good option. 50 mb is pretty insignificant for the large majority of systems these days.

I totally agree that binaries up to 50mb should be fine for most cases. Even a lot of the ARM SBCs could likely handle this fine if they are already running Python.

How difficult would it be to compile wheels like Logan is suggesting?

Compiling for additional platforms isn't hard, but the mixed F90 / C++ code base doesn't work with the latest gcc compilers. I have to use Intel Fortran / Intel C++ compilers for IPOPT and those have been harder to deploy on platforms where Intel is not supported. A pure F90 gcc deploy isn't hard, but the mixed language compile has been challenging with IPOPT and the many supporting libraries that it requires.