paketo-buildpacks / cpython

Prevent `__pycache__` directories in `/layers` at build time

fg-j opened this issue

Describe the Enhancement

Per the investigation in paketo-buildpacks/python#507, when pip installs dependencies, it caches precompiled Python bytecode in __pycache__ directories. The bytecode isn't reproducible because it contains timestamps. Currently, those __pycache__ directories end up in /layers, which makes Python builds non-reproducible.

Possible Solution

As #352 indicated, setting PYTHONPYCACHEPREFIX changes the location where these cached files are stored. The environment variable should be set to some temporary location at build time so __pycache__ directories don't pollute otherwise-reproducible layers.
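To illustrate the redirection, here is a minimal sketch, assuming Python 3.8 or later (where `sys.pycache_prefix` is the in-process counterpart of PYTHONPYCACHEPREFIX). The helper name `compile_with_prefix` and the temporary directories standing in for a layer and a scratch location are hypothetical; this is not what the buildpack does, only a way to observe the effect.

```python
import os
import py_compile
import sys
import tempfile


def compile_with_prefix(source_path, prefix):
    """Compile source_path with bytecode redirected under prefix; return the .pyc path."""
    previous = sys.pycache_prefix
    # Setting sys.pycache_prefix mirrors exporting PYTHONPYCACHEPREFIX for this process.
    sys.pycache_prefix = prefix
    try:
        return py_compile.compile(source_path, doraise=True)
    finally:
        sys.pycache_prefix = previous


# Hypothetical stand-ins: "layer" plays the role of a directory under /layers,
# "scratch" plays the role of the temporary location (e.g. /tmp).
layer = tempfile.mkdtemp(prefix="layer-")
scratch = tempfile.mkdtemp(prefix="pycache-")

module = os.path.join(layer, "example.py")
with open(module, "w") as f:
    f.write("VALUE = 1\n")

pyc = compile_with_prefix(module, scratch)

# The bytecode lands under the scratch prefix, and no __pycache__ appears in the layer.
layer_is_clean = not os.path.exists(os.path.join(layer, "__pycache__"))
```

With the prefix set, the compiled file is written into a mirror of the source tree under the scratch directory, so the layer directory keeps only the source file.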

Motivation

Exploration in paketo-buildpacks/python#507 indicates that setting PYTHONPYCACHEPREFIX to /tmp at build time is sufficient to make Python + pip-install image builds reproducible!

@fg-j Can you clarify what you did to set PYTHONPYCACHEPREFIX to /tmp at build time?

Did you run pack build --env PYTHONPYCACHEPREFIX=/tmp? Did you modify the code of the cpython buildpack? Did you modify other buildpacks?

How did you modify this variable only for build time, and not for launch time?

I see in paketo-buildpacks/conda-env-update#189 that you modified the conda-env-update buildpack. I assume you did that for all buildpacks to test this.

I assume that, because you are setting this environment variable only for the buildpack process itself and not on the layer, it does not interfere with the launch-time modifications in #352.

@robdimsdale For most of my investigation, I used the --env flag of pack to set the environment variable at build time (see this comment for an example pack build).

In paketo-buildpacks/conda-env-update#189, I experimented with setting the environment variable in the buildpack code itself, since I needed to splice out the conda-meta/history file anyway. I saw the same effect on conda builds when setting PYTHONPYCACHEPREFIX with --env and when setting it inside the buildpack.

Throughout my investigation, I set the environment variable only at build time. The change should not have had any impact on the value of PYTHONPYCACHEPREFIX that the buildpack sets at launch time.

Great, super helpful, thanks.

I constantly find myself confused about when the lifecycle will set an env var for build vs launch. Can you confirm (or correct) my understanding:

  • If I set the env var via pack build --env FOO=bar then FOO=bar will only be set at build time
  • If I modify the buildpack to use os.Setenv("FOO","bar") (or the equivalent in languages other than Go) it will only modify the environment variable for the duration of the process (i.e. while the buildpack is running in the build phase)
  • If I set the env var on the layer, then the env var will be persisted on the layer.
    • If I set the layer to build: true then the env var will be present on all future uses of the layer during the build phase.
      • The env var won't be read by downstream buildpacks during their build phase unless they explicitly read it from the layer and choose to set it during their own process.
    • If I set the layer to launch: true then the env var will be present on the layer during launch.
      • The env var will be present in the environment of the running app, assuming the layer makes it onto the container
  • If I set the env var during a pre-launch process (e.g. the env process in Cpython) it will only be present during launch, and is set dynamically at launch time (vs statically at build time)
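The process-scoped case in the second bullet can be sketched as follows. This is a rough illustration in Python rather than Go, under the assumption that a later buildpack or the launched app starts from an environment the setting process never touched; `DEMO_FOO` and `value_seen_by_child` are hypothetical names.

```python
import os
import subprocess
import sys

VAR = "DEMO_FOO"  # hypothetical variable name, for illustration only


def value_seen_by_child(env):
    """Spawn a fresh interpreter with the given env and return its view of VAR."""
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({VAR!r}, ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Snapshot of the environment before this process changes anything. It stands in
# for the environment handed to a later, unrelated process.
baseline = {k: v for k, v in os.environ.items() if k != VAR}

# Equivalent in spirit to os.Setenv in a Go buildpack: affects this process only.
os.environ[VAR] = "bar"

inherited = value_seen_by_child(None)   # child of this process inherits the change
later = value_seen_by_child(baseline)   # process started from the untouched environment
```

Children spawned by the process (such as a pip invocation) inherit the value, while anything started from the original environment does not see it.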

I appreciate that this isn't really for you to answer, but you've been spending time in this space so hopefully you can help me get an accurate mental model.

Also cc @ryanmoran - is my model above of environment variables accurate?

@robdimsdale Everything you laid out there looks correct to me. One bit of nuance is that it's possible to set the BuildEnv, LaunchEnv, ProcessLaunchEnv and SharedEnv for a given layer (see packit). So, if you want, you can decide when an environment variable should apply regardless of when the layer is present. A layer needs to be present for its environment variables to impact the build/launch env. But when a layer is present, each associated environment variable may/may not impact the environment depending on how it's been set.

To give a concrete example:

In the dotnet-core-sdk buildpack, we put the layer containing the SDK on the $PATH only in the build environment because we only want to use the binary installed in that layer at build-time. Some other buildpack might request the SDK layer at launch time. The SDK buildpack doesn't control that. Regardless of whether the layer is present at launch, it won't be on the $PATH.

Ah, ok. That's helpful context.

Again, mostly for my own notes:

It sounds like there is a slight difference in what I was describing about setting the environment variable in the layers:

  • All env vars set by a buildpack on a layer with "Build Env" will be automatically loaded by the lifecycle for future buildpacks that use that layer. Concretely, if we set a variable with "build env" in the Cpython buildpack, then all downstream buildpacks will have that env var loaded without having to explicitly read it from the layer.

I think that what that means for the Python __pycache__ directory is as follows:

  • we can set PYTHONPYCACHEPREFIX=/tmp (or PYTHONDONTWRITEBYTECODE to any non-empty value) in the layer generated by the CPython buildpack
  • all downstream buildpacks will pick up that environment variable automatically because they depend on the CPython buildpack and its layers.
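As a rough sketch of what "setting it in the layer" means on disk: my understanding of the buildpack spec is that build-only variables live as files under an `env.build/` directory inside the layer, with a suffix such as `.default` or `.override` selecting how the value is applied. Treat that layout as an assumption to check against the spec. The helper `write_build_env` and the temporary directory are hypothetical; the CPython buildpack itself would do this through packit's BuildEnv rather than by writing files by hand.

```python
import os
import tempfile


def write_build_env(layer_dir, name, value, suffix="default"):
    """Write a build-time-only env var file into the layer (assumed spec layout)."""
    env_dir = os.path.join(layer_dir, "env.build")
    os.makedirs(env_dir, exist_ok=True)
    path = os.path.join(env_dir, f"{name}.{suffix}")
    with open(path, "w") as f:
        f.write(value)
    return path


# Hypothetical stand-in for the layer directory the buildpack would own under /layers.
layer_dir = tempfile.mkdtemp(prefix="cpython-layer-")
env_file = write_build_env(layer_dir, "PYTHONPYCACHEPREFIX", "/tmp")

# Nothing is written for launch, so a launch-time value set elsewhere is unaffected.
has_launch_env = os.path.exists(os.path.join(layer_dir, "env.launch"))
```

The layer would also need to be marked as available at build time for the lifecycle to load the variable for downstream buildpacks.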

For posterity, the official spec is here.