postgresml / postgresml

The GPU-powered AI application database. Get your app to market faster using the simplicity of SQL and the latest NLP, ML + LLM models.

Home Page: https://postgresml.org

Cannot use `pgml.activate_venv()` to set environment for parallel workers.

higuoxing opened this issue · comments

Currently, pgml provides a UDF called pgml.activate_venv() [1]. However, when a query uses parallel workers, the venv is not set for those workers. This is not very easy to reproduce, but parallel queries are not rare in PostgreSQL. We could probably remove this UDF, since we already have the GUC parameter 'pgml.venv' [2] to control the venv path.

Steps to reproduce:

  1. The package xgboost exists in my venv environment.

  2. Remove the IMMUTABLE qualifier from pgml.validate_python_dependencies, so that this UDF can be executed on parallel workers multiple times.

    diff --git a/pgml-extension/src/api.rs b/pgml-extension/src/api.rs
    index ad952e48..440df23d 100644
    --- a/pgml-extension/src/api.rs
    +++ b/pgml-extension/src/api.rs
    @@ -27,7 +27,7 @@ pub fn activate_venv(venv: &str) -> bool {
     }
    
     #[cfg(feature = "python")]
    -#[pg_extern(immutable, parallel_safe)]
    +#[pg_extern(parallel_safe)]
     pub fn validate_python_dependencies() -> bool {
         unwrap_or_error!(crate::bindings::python::validate_dependencies())
     }
  3. Construct a query that involves parallel workers.

    CREATE TABLE t1(i int);
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    
    pgml=# select pgml.activate_venv('/tmp/virtualenv');
    activate_venv
    ---------------
     t
    (1 row)
    
    pgml=# explain (analyze) select count(pgml.validate_python_dependencies()) 
    from t1;
    INFO:  Python version: 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
    INFO:  Scikit-learn 1.3.0, XGBoost 2.0.1, LightGBM 4.1.0, NumPy 1.26.1
    INFO:  Python version: 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
    ERROR:  The xgboost package is missing. Install it with `sudo pip3 install xgboost`
    ModuleNotFoundError: No module named 'xgboost'
    CONTEXT:  parallel worker
    

Footnotes

  1. https://github.com/postgresml/postgresml/blob/785815d47698551cfc59634e889b564b156e6a3e/pgml-extension/src/api.rs#L25

  2. https://github.com/postgresml/postgresml/blob/785815d47698551cfc59634e889b564b156e6a3e/pgml-extension/src/bindings/python/mod.rs#L12

Is this related to #1146? Can we kill 2 birds with one stone?

Probably not? #1146 adds support for storing the cache in a custom location. This issue is that pgml.activate_venv() isn't implemented correctly: when a query involves parallel workers, the UDF cannot set up the venv for those workers.

One possible solution is to remove the pgml.activate_venv() UDF and tell users to use the pgml.venv GUC parameter to set up the virtual environment.

cc @liuxueyang

I think it would be great if we can also remove pgml.activate_venv() in favor of setting pgml.venv at boot, since that covers our known use cases for venvs, without any runtime performance impact or complexity.

@montanalow Do we need to refactor the pgml.venv in #1146 or create another separate PR?

I think #1146 will remove the need to call pgml.activate_venv() altogether. We'll leave that function as is for now: dropping it would be a breaking API change, so we'll need to wait for a major version bump.

Sorry, I didn't get it. activate_venv is for activating Python's virtualenv. #1146 is for caching models from Hugging Face, right? Why does #1146 remove the need to call pgml.activate_venv()?

@higuoxing I guess what he said is that we can keep pgml.activate_venv() as is for now. There is no need to remove the UDF now since it would be a breaking change.

Right, sorry, I confused comments between #1147 & #1146. #1146 establishes a better pattern for gathering env vars in _PG_init, rather than calling a function at runtime. To fix this issue (#1147), I think we should adopt a similar approach that sets the Python venv at server start only, but we don't need to hold up #1146 for this additional fix. cc @levkk
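
To illustrate the difference between the two approaches discussed in this thread, here is a toy model in plain Rust. It is not pgml or PostgreSQL code: `Process`, `ServerConfig`, `start`, and this `activate_venv` method are hypothetical names invented for the sketch. It assumes only that a parallel worker is a separate process that does not see process-local state set by a runtime call in the leader backend, while configuration read at process start is visible to every process.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for server-wide configuration (e.g. a GUC such as pgml.venv).
type ServerConfig = HashMap<String, String>;

/// Hypothetical model of one backend or parallel worker process.
#[derive(Debug)]
struct Process {
    venv: Option<String>,
}

impl Process {
    /// Models an init hook: every newly started process reads the venv from configuration.
    fn start(config: &ServerConfig) -> Self {
        Process {
            venv: config.get("pgml.venv").cloned(),
        }
    }

    /// Models a runtime call like pgml.activate_venv(): only this one process changes.
    fn activate_venv(&mut self, path: &str) {
        self.venv = Some(path.to_string());
    }
}

fn main() {
    // Runtime activation: the leader has a venv, a freshly started worker does not.
    let empty = ServerConfig::new();
    let mut leader = Process::start(&empty);
    leader.activate_venv("/tmp/virtualenv");
    let worker = Process::start(&empty);
    assert_eq!(leader.venv.as_deref(), Some("/tmp/virtualenv"));
    assert_eq!(worker.venv, None);

    // Init-time configuration: leader and worker both pick up the same venv.
    let mut config = ServerConfig::new();
    config.insert("pgml.venv".to_string(), "/tmp/virtualenv".to_string());
    let leader = Process::start(&config);
    let worker = Process::start(&config);
    assert_eq!(leader.venv, worker.venv);
    println!("leader: {:?}, worker: {:?}", leader.venv, worker.venv);
}
```

In this model, the first half corresponds to the failure reported above (the worker cannot find xgboost because its venv was never activated), and the second half corresponds to the proposal to set the venv once at start-up.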