[spec: webpack 5] - A module disk cache between build processes


mzgoddard opened this issue 6 years ago

Current Problems & Scenarios

Users get fast webpack builds on large code bases by running continuous processes that watch the file system with webpack-dev-server or webpack's watch option. Starting those continuous processes can take a long time, because every module in the code base must be built anew to fill the memory cache that makes webpack's rebuilds fast. Some uses of webpack get no benefit from a continuous process, such as running tests on a Continuous Integration instance that is stopped after the run, or making a production build for staging or release, which is needed less frequently than development builds. The production build simply uses more resources, while the development build completes faster because it does not use optimization plugins.

Community solutions and workarounds help remedy this: cache-loader, DllReferencePlugin, auto-dll-plugin, thread-loader, happypack, and hard-source-webpack-plugin. Many workarounds also include option tweaks that trade small losses in file size or feature power for larger improvements to build time. All of these require a lot of knowledge about webpack and its community, or finding really good articles on what others have already figured out. webpack itself does not have a simpler option to turn on or have on by default.

Alongside the module memory cache there is a second cache in webpack that is important for build performance: the resolver's unsafe cache. The unsafe cache is memory-only too, and is an example of a performance workaround that is on by default in webpack's core. It trades resolving accuracy for fast repeated resolutions. That trade means continuous webpack processes need to be restarted to pick up changes to file resolutions. The option can be disabled, but given how rarely resolutions change that way, restarting saves more time overall than leaving the option off.

Proposed Solution

Freeze all modules in a build at the needed stages during compilation and write them to disk. Later iterative builds (the first build of a continuous process that uses an existing on-disk module cache) read the cache, validate the modules, and thaw them during the build. The graph relations between modules are not explicitly cached, so the module relations also need to be validated. Validating the relations is equivalent to rebuilding them through webpack's normal dependency tracing behaviour.

The resolver's cache can also be frozen and validated with saved missing paths. The validated resolver's "safe" cache allows retracing dependencies to execute quickly. Any resolutions that were invalidated will be run through the resolver normally, allowing file path changes to be discovered in iterative builds and rebuilds.

Plain JSON data is the easiest to write to and read from disk, and it also provides a state the module's data can be in during validation. Fully thawing that data into its original shape will require a running Compilation, so that the Module, Dependency, and other webpack types can be created according to how that Compilation is configured, producing a copy of the past Module indistinguishable from the one in the last build.

Creating this data will likely involve two sets of APIs. The first creates the duplicates and constructs thawed instances from the duplicates read from disk. The second uses the first to handle the variation in subclassed types in webpack. As an example, webpack 3 has 49 Dependency subclasses that can be used by the core of webpack and core plugins. The first API, when duplicating a NormalModule, doesn't handle the Dependency instances in the module's dependencies list; it calls the second API to create duplicates of those values. The second API in turn uses the first to create those duplicates. To keep this from running in a cycle, uses of the first API are responsible for not duplicating cyclical references and for recreating them while thawing, using passed state information as webpack's Parser does.

The first data API will likely be a library used to implement a schema of a Module or Dependency. The second data API may use webpack's dependencyFactories strategy or Tapable hooks. A Tapable or similar approach may present opportunities to let plugin authors cache plugin information that is not tracked by default.
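As a rough illustration of how the two data APIs could fit together, here is a minimal TypeScript sketch. Everything in it is hypothetical: SketchModule, SketchDependency, the handlers registry, freezeAny and thawAny are stand-ins invented for this example and are not webpack types or APIs. The per-type handlers play the role of the first API, the registry lookup plays the role of the second, and the owning-module reference is left out of the frozen data and restored from passed state while thawing.

```typescript
// Hypothetical stand-ins for webpack's Module and Dependency types.
class SketchModule {
  resource: string;
  dependencies: SketchDependency[] = [];
  constructor(resource: string) {
    this.resource = resource;
  }
}

class SketchDependency {
  request: string;
  owner: SketchModule | undefined; // cyclical reference back to the module
  constructor(request: string, owner?: SketchModule) {
    this.request = request;
    this.owner = owner;
  }
}

interface Frozen { type: string; data: unknown }
interface ThawState { module?: SketchModule }
interface Handler<T> {
  freeze(value: T): unknown;
  thaw(data: unknown, state: ThawState): T;
}

// "Second API": a registry that picks the handler for a given type name.
const handlers = new Map<string, Handler<any>>();

function freezeAny(type: string, value: unknown): Frozen {
  const handler = handlers.get(type);
  if (!handler) throw new Error(`No handler registered for ${type}`);
  return { type, data: handler.freeze(value) };
}

function thawAny(frozen: Frozen, state: ThawState): unknown {
  const handler = handlers.get(frozen.type);
  if (!handler) throw new Error(`No handler registered for ${frozen.type}`);
  return handler.thaw(frozen.data, state);
}

// "First API": per-type schemas. The dependency handler skips the cyclical
// owner reference when freezing and restores it from the passed state.
handlers.set("SketchDependency", {
  freeze: (dep: SketchDependency) => ({ request: dep.request }),
  thaw: (data, state) =>
    new SketchDependency((data as { request: string }).request, state.module),
});

handlers.set("SketchModule", {
  freeze: (mod: SketchModule) => ({
    resource: mod.resource,
    dependencies: mod.dependencies.map((dep) => freezeAny("SketchDependency", dep)),
  }),
  thaw: (data) => {
    const frozen = data as { resource: string; dependencies: Frozen[] };
    const mod = new SketchModule(frozen.resource);
    mod.dependencies = frozen.dependencies.map(
      (dep) => thawAny(dep, { module: mod }) as SketchDependency,
    );
    return mod;
  },
});

// Round trip through plain JSON, as a disk cache would do.
const original = new SketchModule("./src/index.js");
original.dependencies.push(new SketchDependency("./util", original));
const onDisk = JSON.stringify(freezeAny("SketchModule", original));
const restored = thawAny(JSON.parse(onDisk), {}) as SketchModule;
```

A value whose type has no registered handler throws in this sketch; a real implementation could instead warn and leave that module uncached.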

A file system API is needed to write and read the duplicates. This API organizes them and uses systems and libraries to operate efficiently, or to provide an important point for debugging to loader authors, plugin authors, and core maintainers. It may also act as a layer that separates out some information with a common shape so that its storage strategy can change. Asset objects may be treated this way if they are found to be best stored and loaded with a different mechanism than the rest of the module data.

This must be a safe cache. Any cached information must be able to be validated.

Modules validate their build and rendered source through timestamps and hashes. Timestamps cannot always be validated: either a file changed in a way that didn't change its timestamp, or the timestamp decreased, as when a file is deleted in a context dependency or a file is renamed to the path of the old file. Hashes of the content, like those the rendered source and chunk source use, can be validated. All timestamp checks in modules and elsewhere must be replaced with hash or other content-representative comparisons instead of filesystem metadata comparisons. File dependency timestamps can be replaced with hashes of their original content. Context dependency timestamps can be replaced with hashes of all the sorted relative paths deeply nested under them.
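The two hash replacements described above could look roughly like the following sketch. The function names hashFile, listRelativePaths and hashContext are made up for illustration, SHA-1 is an arbitrary choice of digest, and synchronous fs calls are used only to keep the example short.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// File dependency: hash the file's content instead of reading its mtime.
function hashFile(file: string): string {
  return createHash("sha1").update(fs.readFileSync(file)).digest("hex");
}

// Collect every path nested under root, relative to root, with "/" separators
// so the result does not depend on the operating system.
function listRelativePaths(root: string, dir: string = root): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    found.push(path.relative(root, full).split(path.sep).join("/"));
    if (entry.isDirectory()) found.push(...listRelativePaths(root, full));
  }
  return found;
}

// Context dependency: hash the sorted relative paths, not the file contents.
function hashContext(root: string): string {
  const hash = createHash("sha1");
  for (const relative of listRelativePaths(root).sort()) {
    hash.update(relative + "\0"); // separator keeps adjacent names from merging
  }
  return hash.digest("hex");
}
```

With this shape, rewriting a file with identical content leaves hashFile unchanged even though its timestamp moves, and adding, deleting or renaming a file anywhere under a context directory changes hashContext.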

The cached resolver information needs to validate the filesystem shape and can do that by stat()ing the resolved path and all tested missing paths. If the resolved path is now missing, the resolution is invalid. If a path that was previously missing now exists, the resolution is also invalid.
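A minimal sketch of that check follows. The CachedResolution shape and the function names are assumptions made for this example; they are not taken from enhanced-resolve.

```typescript
import * as fs from "node:fs";

// Hypothetical record saved for each resolution in the cache.
interface CachedResolution {
  request: string;      // what was asked for, e.g. "./index"
  resolvedPath: string; // the file the resolver settled on
  missing: string[];    // paths the resolver tried that did not exist
}

function pathExists(candidate: string): boolean {
  try {
    fs.statSync(candidate);
    return true;
  } catch {
    return false;
  }
}

// Valid only if the resolved file is still there and every path that was
// missing at resolve time is still missing.
function isResolutionValid(entry: CachedResolution): boolean {
  if (!pathExists(entry.resolvedPath)) return false;
  return !entry.missing.some(pathExists);
}
```

An invalid entry would simply be handed back to the normal resolver, which then records a fresh resolved path and missing set.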

Two larger "validations" also need to be performed.

webpack's build configuration needs to be the same as in the previous build. Instead of invalidating when the build configuration differs, though, a separate cache is stored adjacent to the caches of modules built under other configurations. Webpack configurations can switch frequently, as when alternating between webpack and webpack-dev-server, which turns on hot module replacement. Hot module replacement means the configuration is different and needs a separate cache, as the modules will have different output due to the additional plugin. One way to compare configurations is a hash. The configuration can be stringified, including the source of any passed functions, and then hashed. An iterative build will check the new hash to choose its cache. Smarter configuration hashes could be developed to account for options that do not modify the already built modules.
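One way to read "stringified including any passed function's source and then hashed" is the sketch below. stringifyConfig and configCacheId are hypothetical helper names, and the approach has known gaps that are noted in the comments.

```typescript
import { createHash } from "node:crypto";

// Serialize a configuration object, keeping the source text of functions and
// regular expressions, which plain JSON.stringify would drop or flatten.
// Assumptions and gaps: key order matters, and class instances such as
// plugins are reduced to their own enumerable properties.
function stringifyConfig(config: unknown): string {
  return JSON.stringify(config, (_key: string, value: unknown) => {
    if (typeof value === "function") return `[function] ${value.toString()}`;
    if (value instanceof RegExp) return `[regexp] ${value.toString()}`;
    return value;
  });
}

// The hex digest can serve as the id (for example a directory name) of the
// cache that belongs to this configuration.
function configCacheId(config: unknown): string {
  return createHash("sha1").update(stringifyConfig(config)).digest("hex");
}
```

Two configurations that differ only in `mode`, or only in the body of a function option, end up with different ids and therefore separate adjacent caches.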

The second larger validation is ensuring that dependencies stored in folders like node_modules have not changed. yarn and npm 5 can help here: trust them to do this check and hash their content. A backup can hash the combined content of all package.json files in the first level of directories under node_modules. webpack will track the content of built modules, but it does not track the source of loaders, plugins, and the dependencies used by those and by webpack. A change to those may affect how a built module looks. Currently, any change to these not-tracked-by-webpack files means the entire cache is no longer valid. A sibling cache could be created instead, but only if keeping the old cache can be shown to be regularly useful.
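A sketch of such an environment hash, under the assumption that "their content" means the lock file written by yarn or npm 5: the function below prefers a lock file when one exists and otherwise falls back to the first-level package.json files. environmentHash is a hypothetical name, and the fallback deliberately goes no deeper than the first level, exactly as described above.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

function environmentHash(projectRoot: string): string {
  const hash = createHash("sha1");

  // Preferred: trust the package manager and hash its lock file.
  for (const lockName of ["yarn.lock", "package-lock.json"]) {
    const lockFile = path.join(projectRoot, lockName);
    if (fs.existsSync(lockFile)) {
      return hash.update(lockName).update(fs.readFileSync(lockFile)).digest("hex");
    }
  }

  // Backup: combine the package.json of every first-level directory
  // under node_modules, in a stable (sorted) order.
  const modulesDir = path.join(projectRoot, "node_modules");
  const names = fs.existsSync(modulesDir) ? fs.readdirSync(modulesDir).sort() : [];
  for (const name of names) {
    const manifest = path.join(modulesDir, name, "package.json");
    if (fs.existsSync(manifest)) {
      hash.update(name).update(fs.readFileSync(manifest));
    }
  }
  return hash.digest("hex");
}
```

If the value differs from the one stored with a cache, that cache would be treated as invalid in full.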

User Stories (That speak in solving spirit of these problem areas)

Priority | Story
1 | As a plugin or loader author, I can use a strategy or provided tools to test with the cache. In addition, I have a strategy or means to have the cache invalidate entirely, or invalidate specific modules, as I am editing a loader or plugin.
1 | As a user, I can rely on the cache to speed up iterative builds and notify me when an uncached build is starting. I can also turn off the notifications if I desire. I should never need to personally delete the cache for some performance trade-off. The cache should reset itself as necessary without my input. I understand I may need to do this because of bugs; such bugs are best squashed quickly.
1 | As a user, I should be able to use loaders and plugins that don't work with the cache. Modules with uncacheable loaders will not be cached. Modules with nested objects that cannot be duplicated or thawed, because they contain values that are not registered in the second data API, will produce a warning about their cacheability status and will be allowed to build in the normal uncached fashion.
1 | As a core maintainer, I can test and debug other webpack core features and core plugins in use with the cache to make sure it can validate and verify itself for use.

Non-Goals

This RFC will not look into using a cache built with different node_modules dependencies than those last installed. This would be a large effort on its own likely involving trade offs and may best be its own RFC.

This cache will be portable: reusable on different CI instances or in different repo clones on the same or different computers. However, this RFC will not figure out the specifics of sharing a cache between multiple systems and leaves this for users to work out.

This spec can be bridged into other proposed new features with its module caching behaviour. This document and issue do not intend to make those leaps.

Requirements

An API or library to create duplicates of specific webpack types and later thaw those back into the specific types, given some helper state such as the compilation and related module. Uses of this API must avoid duplicating cyclical references, like a dependency's reference to its owning module, and must restore the reference while thawing from the given helper state.

A data relation API whose duplication/thaw handlers are registered by some predicate, in a way similar to dependencyFactories, or through Tapable hooks.

A (disk) cache organization API that creates objects to handle writing to and reading from disk, somewhat like the FileSystem types. This API is for reading and writing the duplicate objects. Its shape needs to support writing only changed objects. This might be done as a batched, database-like operation, letting the cache system send a list of changed items to write so the cache organization API doesn't need to redo work to discover what did and did not change. It will likely need to read all of the cached objects from disk during an iterative build. Core implementations of this API will likely need to be, first, a debug implementation and, second, a space- and time-efficient implementation.
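To make the batched shape concrete, here is one possible sketch of such an API together with a "debug" implementation that stores one pretty-printed JSON file per item. CacheStore, CacheChange and DebugDirectoryStore are names invented for this example, and treating a null value as a deletion is an assumption of the sketch.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// One changed item; a null value means "remove this item from the cache".
interface CacheChange { key: string; value: unknown }

interface CacheStore {
  readAll(): Map<string, unknown>;            // used by an iterative build
  writeBatch(changes: CacheChange[]): void;   // only the changed items
}

// Debug implementation: easy to inspect by hand, not space or time efficient.
class DebugDirectoryStore implements CacheStore {
  private readonly dir: string;

  constructor(dir: string) {
    this.dir = dir;
    fs.mkdirSync(dir, { recursive: true });
  }

  // Hash the key so any module identifier maps to a safe file name.
  private fileFor(key: string): string {
    const name = createHash("sha1").update(key).digest("hex");
    return path.join(this.dir, `${name}.json`);
  }

  readAll(): Map<string, unknown> {
    const items = new Map<string, unknown>();
    for (const name of fs.readdirSync(this.dir)) {
      if (!name.endsWith(".json")) continue;
      const text = fs.readFileSync(path.join(this.dir, name), "utf8");
      const item = JSON.parse(text) as { key: string; value: unknown };
      items.set(item.key, item.value);
    }
    return items;
  }

  writeBatch(changes: CacheChange[]): void {
    for (const { key, value } of changes) {
      if (value === null) {
        fs.rmSync(this.fileFor(key), { force: true });
      } else {
        fs.writeFileSync(this.fileFor(key), JSON.stringify({ key, value }, null, 2));
      }
    }
  }
}
```

An efficient implementation could keep the same interface and swap the per-file layout for an append-only log or a database, since callers only ever hand it the list of changes.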

JSON is at least the starting resting format written to disk. The organization API might be used to wrap the actual disk implementation. The wrapping implementation will turn the JSON objects into strings or buffers and back for the wrapped implementation. That can be JSON.stringify and JSON.parse or some other means of doing this work quickly, as this step is a lot of work. Beating JSON.parse performance is pretty tricky.

Either in watchpack or in another module, timestamps need to be replaced with hashes for file and context dependencies, or hashes can be added to the callback arguments. With a disk cache, timestamps will not be a useful comparison for deciding whether a build needs to be redone. Timestamps are not guaranteed to represent changes to file or directory content.

Use file and context dependency hashes in needRebuild instead of timestamps.

Hash a representative value of the environment: dependencies in node_modules and the like. A different value from the last time a cache was used means no items in the cache can be used, and they must be destroyed and replaced by freshly built items.

Hash webpack's compiler configuration and use it as a cache id so multiple adjacent caches are stored. The right cache needs to be selected early on, at some point while plugins are being applied to the compiler, after defaults are set and configuration changes are made by tools like webpack-dev-server.

These adjacent caches should be automatically cleaned up by default to keep the total cache size from running away as each one adds to a larger sum. This might happen automatically if, say, there are more than 5 caches including the one in use and cumulatively they use more than 500 MB. The oldest ones are deleted first until the cumulative size comes under the 500 MB threshold. As an alternative to the cumulative size, if there are more than 5 caches and some are older than 2 weeks, the caches older than 2 weeks are deleted.
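Both cleanup policies can be written as small pure functions over cache metadata. The sketch below uses the example numbers from the paragraph above (5 caches, 500 MB, 2 weeks) as defaults; the CacheInfo shape and function names are hypothetical, and never deleting the cache currently in use is an assumption added here.

```typescript
interface CacheInfo {
  id: string;         // e.g. the configuration hash
  sizeBytes: number;
  lastUsedMs: number; // time of last use, in milliseconds since the epoch
}

const MB = 1024 * 1024;
const DAY_MS = 24 * 60 * 60 * 1000;

// Size policy: with more than maxCount caches using more than maxBytes in
// total, delete the oldest first until the total no longer exceeds maxBytes.
function selectBySize(
  caches: CacheInfo[],
  currentId: string,
  maxCount = 5,
  maxBytes = 500 * MB,
): string[] {
  let total = caches.reduce((sum, cache) => sum + cache.sizeBytes, 0);
  if (caches.length <= maxCount || total <= maxBytes) return [];
  const oldestFirst = caches
    .filter((cache) => cache.id !== currentId) // assumption: keep the one in use
    .sort((a, b) => a.lastUsedMs - b.lastUsedMs);
  const toDelete: string[] = [];
  for (const cache of oldestFirst) {
    if (total <= maxBytes) break;
    toDelete.push(cache.id);
    total -= cache.sizeBytes;
  }
  return toDelete;
}

// Age policy: with more than maxCount caches, delete those unused for longer
// than maxAgeMs.
function selectByAge(
  caches: CacheInfo[],
  currentId: string,
  nowMs: number,
  maxCount = 5,
  maxAgeMs = 14 * DAY_MS,
): string[] {
  if (caches.length <= maxCount) return [];
  return caches
    .filter((cache) => cache.id !== currentId && nowMs - cache.lastUsedMs > maxAgeMs)
    .map((cache) => cache.id);
}
```

Keeping the selection separate from the actual deletion makes the policy easy to test and lets the stats output report what would be removed.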

Replace the resolver's unsafe cache with a safe cache that validates a resolution by stat()ing every resolved file and every originally attempted check. Doing this in bulk skips the logic flow the resolver normally executes. Very little time is spent doing this, as it doesn't rely on JS logic to build the paths; the paths are already built. The resolver's cached items may be stored with their respective module, consolidating all of the data for a cached module into one object for debugging and cleanup. If a module is no longer used in builds, removing it also removes the resolutions that would lead to it, and less information will need to be read from disk.

Questions

How are loaders that load an external configuration (babel-loader, postcss-loader) treated in regards to the cache configuration hash/id? Any method to do this needs to be done before a Compilation starts.
What enhanced-resolve cases exist that may not be recorded in the missing set?
How do loader and plugin authors work on their code and test it with the cache?
JSON is a good resting format. Should we look at others? Beating JSON.parse performance is pretty tricky. protobufjs implementations improve on it in many cases because they store the keys as integers in the output. The protobuf schema defines the key-to-integer relationship explicitly, so it's easy to go back and forth.
Are the node version or operating system values that should be included in the environment (node_modules and other third-party dependencies) comparison? Should they be part of the configuration hash?

Fundamentals

0CJS

The disk cache should be on by default.
Each build with a different webpack configuration should store a unique copy of its cache versus another webpack configuration. E.g. A development build and a production build must have distinct caches due to them having different options set.
Once N caches using M MB in total exist, any caches older than W weeks that fall beyond the N-cache and M MB limits should be deleted.
Some disk cache information should be in webpack stats. E.g. Root compilation cache id, disk space used by cache, disk space used by all caches, ...
Any change to node_modules dependencies or other third-party dependency directories must invalidate a saved cache.
The cache must be portable, reusable by CI, or between members of a project team as long as no node_modules or other third-party dependency directories change.
Use an efficient and flexible resting format and disk organization implementation.

Speed

Iterative builds (builds with a saved cache) should complete significantly faster than an uncached build. An uncached build that saves a cache will be a small margin slower than one not writing a cache, as writing the cache is an additional task webpack does not yet perform. A rebuild (a build in the same process that ran an uncached or iterative build) should be slower by a hard-to-measure amount, as it saves only the changed cache state and not the whole cache.

Build Size

No change.

Security

Security considerations are similar to those for how third-party dependencies are fetched for a project.

Success Metric

webpack iterative builds (builds with a cache) should be significantly faster than uncached builds.
Rebuild performance should be minimally impacted.
Iterative build output should match an uncached build given no changes to the sources.
Cache sharing: a cache should be usable in the next CI run in a different CI instance, or a common updated cache could be pulled from some arbitrary store by team members and used instead of needing to run an uncached build first. (Given that the configuration is not different than the stored caches and that node_modules contains the same dependencies and versions.)

Z Goddard commented 6 years ago

I think the node version and OS would be answered separately. It would probably be wise to consider the node version to be comparable to npm/yarn/bower installed dependencies and be part of the environment hash. The OS I think gets to a deeper aspect of why the environment and configuration hashes are needed.

If we could transform the webpack object information into a general shape when saving a cache and transform it back into any possible specific shape decided by all of the installed dependencies, node versions, and webpack configuration, we wouldn't need the hashes. We would have a hermetic cache like the webpackGraph spec conversation is talking about. Since we can't, we can represent that idea as hashes and instead verify that the stored specific shape is usable in the executing webpack instance.

There is some representation of this in the above cache spec that may need some expansion, but we can reduce the surface area of the hash comparison in areas that can do that specific-to-generic-and-back transformation. There may be something I'm overlooking, but I think the OS difference can be handled through such a transformation, as webpack records do by generalizing the file paths. I'd figure that if a webpack project used multiple drives on, say, Windows, parts of the cache would not be usable on Mac or Linux, but the parts that can be transformed would be usable. The missing parts would just be ignored, since Mac and Linux would never resolve Windows-like paths or be able to make comparable ones.

For the OS I think we only need the file system to be able to fs.stat for file existence, fs.readFile for file content to hash, and fs.readdir for directory content to hash. Mentioning those does remind me how git (and other VCS) optionally translates \n to \r\n on Windows. Such files will result in a new hash, so a cache brought over from a machine (Windows or otherwise) that used \n to a system that used \r\n would build new hashes and likely ignore most if not all of the existing cache, as there would be new hashes for every module.

I haven't tested a Mac cache on a Windows machine with hard-source-webpack-plugin@alpha yet but in theory its transform on the module info from absolute paths to relative paths and changing the directory separator to a standard one when saving and the reverse when loading a cache should support this inter-OS cache sharing.
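Roughly, the kind of path transform meant here could be sketched as follows. This is an illustration only and not hard-source-webpack-plugin's actual code; PathApi, toPortable and fromPortable are hypothetical names, and the path implementation is passed in explicitly so both operating-system flavours can be exercised on one machine.

```typescript
import * as path from "node:path";

// The small part of Node's path API this sketch relies on.
interface PathApi {
  sep: string;
  relative(from: string, to: string): string;
  join(...parts: string[]): string;
}

// Saving: absolute path -> path relative to the project root, "/" separated.
function toPortable(absolute: string, root: string, api: PathApi): string {
  return api.relative(root, absolute).split(api.sep).join("/");
}

// Loading: portable path -> absolute path on the current machine.
function fromPortable(portable: string, root: string, api: PathApi): string {
  return api.join(root, ...portable.split("/"));
}

// A path saved on Windows and loaded again on Mac or Linux.
const saved = toPortable("C:\\proj\\src\\a.js", "C:\\proj", path.win32);
const loaded = fromPortable(saved, "/home/me/proj", path.posix);
```

Paths on a second Windows drive would come out of toPortable still carrying a drive letter, which matches the expectation above that those parts of a cache would not be usable elsewhere.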

AllNamesRTaken · Answer 3 · Tue Mar 20 2018 16:10:02 GMT+0800 (China Standard Time)

Maybe this is not relevant but the caching solution should be compatible with other performance measures such as HappyPack?

Salem · Answer 4 · Thu May 03 2018 02:30:19 GMT+0800 (China Standard Time)

I mostly lurk here, so take these with a grain of salt:

This sort of change would be a huge win. I'm working to replace an old in-house build system with webpack for a large, sprawling codebase with many entrypoints. It feels like this spec addresses the specific painpoints we're running into.
To the HappyPack point, I believe HappyPack solves a separate problem (parallelizing actual loader work) and doesn't necessarily need to conflict. In general, I feel like designing around existing plugins could end up leading to passing on better solutions. For example, the need for parallelism is a problem that HappyPack solves really well, but I don't think that HappyPack's existence should preclude native parallelism support any more than an existing caching plugin should preclude building a disk module cache.

Andreas Lubbe · Answer 5 · Mon Jun 04 2018 22:53:43 GMT+0800 (China Standard Time)

This would be a huge win for the ecosystem, and the number of third-party solutions shows the high demand for a feature like this. Seeing as most of them rely on internals, it seems wise to provide a solution that is built into webpack itself.

Regarding happypack, I would consider compatibility a nice-to-have, not a hard requirement. In other words, if the optimal solution is not compatible and we can't find a secondary solution that is only slightly slower, then maybe we shouldn't go for it.

In any case, what are the next steps for this spec? How can we build some momentum for it? Do the developers behind the community solutions (like @amireh) know about it? Anything that I or others not familiar with the internals can do to help?

Filipe Silva · Answer 6 · Sun Jul 08 2018 17:01:13 GMT+0800 (China Standard Time)

@mzgoddard just wanted to say I think you did a great job listing all the possible sources of cache invalidation. I had some concerns in #6386 (comment) but I feel your proposal addresses them.

> After N caches using M MBs total exist any caches older than W weeks past N caches and M MBs total should be deleted.

Regarding the cache size, I expect it to be quite big on big projects with lots of chained loaders, which are also the projects that would most benefit from this cache. Deleting the cache after it reaches a certain size can lead to thrashing, where caches keep getting deleted and re-created. To avoid this a user would need to go check individual caches and try to increase the size limit to what he believes would allow a few extra caches to be stored.

For this reason I don't see cache size as being a useful criteria in determining if caches should be deleted. The size of a cache is a function of the configuration and sources, and cannot be determined by a user in advance. I think it's best to keep to number of caches, and cache age.

> Are the version of node or operating system values that should be included in the environment (node_modules and other third party dependencies) comparison? Should they be part of the configuration hash?

Including the node or operating system in either hash would greatly diminish its portability. It's very common for team members to use different OSs or node versions.

> The second larger validation is ensuring that dependencies stored in folders like node_modules have not changed. yarn and npm 5 can help here by trusting them to do this check and hashing their content. A back up can hash the combined content of all package.json files under the first depth of directories under node_modules. webpack will track the content of built modules, but it does not track the source of loaders, plugins, and dependencies used by those and webpack. A change to those may have an effect on how a built module looks. Any changes to these not-tracked-by-webpack files currently will mean the entire cache is no longer valid. A sibling cache could be created but if that can be determined to be regularly useful to keep the old cache.

Also worth mentioning that, IIRC, yarn can produce different folder structures for the same lockfile. At least I remember people having different deduping behaviours, which meant packages were in different places.

When hoisting is taken into account, I don't think the backup described here (hash first level package.jsons) is enough. Things can change at any level, and that can affect loader/plugin behaviour. For this reason the backup method needs to take into consideration the resolved packages at any level.

Z Goddard · Answer 7 · Sat Jul 28 2018 01:13:11 GMT+0800 (China Standard Time)

@filipesilva

> For this reason I don't see cache size as being a useful criteria in determining if caches should be deleted. The size of a cache is a function of the configuration and sources, and cannot be determined by a user in advance. I think it's best to keep to number of caches, and cache age.

Agreed. I made a change to hard-source recently to auto-prune that relies on only cache age.

> Also worth mentioning that, IIRC, yarn can produce different folder structures for the same lockfile. At least I remember people have different deduping behaviours, which meant packages where in different places.

I think as long as yarn guarantees the same versions of dependency dependencies for each dependency a different folder structure should be fine.

> When hoisting is taken into account, I don't think the backup described here (hash first level package.jsons) is enough. Things can change at any level, and that can affect loader/plugin behaviour. For this reason the backup method needs to take into consideration the resolved packages at any level.

You're right. I think projects using hoisting will need to customize their cache configuration so it uses the right yarn.lock or package-lock.json. Lock files would be the best. Otherwise they would need a custom list of top-level node_modules directories to hash the first level of. The best strategy in hoisting situations may need to be plugins or optional values; I'm not sure a strategy that checks for hoisting would make a good default.

Marcin Szczepanski · Answer 8 · Wed Oct 10 2018 07:09:53 GMT+0800 (China Standard Time)

I would like to see another user story for this kind of change which is to reduce the memory requirement for large builds, especially in a CI / cold cache scenario.

At Atlassian we recently tried to upgrade from Webpack 3 to Webpack 4 for Jira's frontend build but were unable to, because either a) we couldn't even get the build to complete due to out-of-memory errors, or b) the build would take anywhere from 150% to 800%+ of the Webpack 3 time, depending on source map / optimisation options (Webpack 3 currently takes about 15 minutes for a full production build).

Significant memory usage (we believe due to Webpack storing the sources, bundles, and source maps in memory the whole time) was the cause. If the build completed, it was slower due to significant GC thrashing. Our build produces some 50 separate bundles (multiple entry points + code splitting), and these issues occur even with an 8GB heap.

One similar existing issue I found was #7703.

It'd be interesting if it was an advanced tuning thing: you could trade off disk for memory. For smaller builds using memory is fine, and makes things faster, where using disk would slow it down. For a build the size of ours, offloading to disk and freeing memory would probably actually make things faster due to the reduction in GC thrashing.

Sibelius Seraphini · Answer 9 · Thu Nov 08 2018 01:57:51 GMT+0800 (China Standard Time)

I think it is better to create an extension point in webpack to enable users to build their own cache strategies using plugins.

Alexander Akait · Answer 10 · Thu Nov 08 2018 01:58:57 GMT+0800 (China Standard Time)

@sibelius it will be built-in in webpack@5

Steven Hargrove · Answer 11 · Thu Nov 15 2018 08:56:18 GMT+0800 (China Standard Time)

@marcins
Were you using uglifyjs-webpack-plugin's or terser-webpack-plugin's parallel option? That would dramatically inflate perceived resource usage.

Sibelius Seraphini · Answer 12 · Thu Jan 03 2019 02:18:25 GMT+0800 (China Standard Time)

you can play with this using webpack 5, check this

https://github.com/webpack/webpack/releases/tag/v5.0.0-alpha.3

Nate Wienert · Answer 13 · Thu Aug 15 2019 11:13:39 GMT+0800 (China Standard Time)

I'd just like to voice support for keeping the DLL system as well, in some form.

We are using it very heavily, in that it's useful for more than just caching. It allows you to share code between different projects, and if you get DLLs hot reloading they are significantly faster (100ms vs 2s+).

I didn't see anything about taking out DLL support, but just wanted to voice support for them. They of course would work well with caching. I see two benefits over caching: the precise ability to bundle packages for use later (being able to use one DLL across different apps), and the much improved HMR time within a single DLL (ability to run multiple webpack processes, each focused on its own small bundle).

Tobias Koppers · Answer 14 · Fri Aug 16 2019 03:03:57 GMT+0800 (China Standard Time)

There are no plans to remove DLLs.

Roman Usherenko · Answer 15 · Fri Jun 12 2020 17:00:06 GMT+0800 (China Standard Time)

hey @sokra! What's the best way to track the webpack 5 milestone? Looks like this issue is open but it is in the DONE column in here

Alexander Akait · Answer 16 · Fri Jun 12 2020 21:01:46 GMT+0800 (China Standard Time)

Done for webpack@5, feel free to test it
