rust-lang / mdBook

Create book from markdown files. Like Gitbook but implemented in Rust

Home Page: https://rust-lang.github.io/mdBook/


Preprocessor output files are deleted when hbs renderer is called (from mdBook > 0.3.1)

sytsereitsma opened this issue

Hi,

I have a preprocessor (mdbook-plantuml) which creates files in the book/img directory. The changes from issue #985 remove these files again (the destination.exists() if clause in Renderer::render(...) in src/renderer/html_handlebars/hbs_renderer.rs).

This kind of breaks my preprocessor :-(

Any ideas on how this could/should be fixed?

Sytse

@Michael-F-Bryan can you take a look at this?

I think a solution could be to extend the Renderer trait with a prepare function, which would be called before running the preprocessors.
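Roughly something like this (just a sketch; the method name and the exact signatures are only meant to illustrate the idea):

    use mdbook::errors::Result;
    use mdbook::renderer::RenderContext;

    // Illustrative only: the existing trait plus a lifecycle hook that
    // mdBook would call before running any preprocessor. The default
    // no-op keeps existing renderers compiling unchanged.
    pub trait Renderer {
        fn name(&self) -> &str;

        fn prepare(&self, _ctx: &RenderContext) -> Result<()> {
            Ok(())
        }

        fn render(&self, ctx: &RenderContext) -> Result<()>;
    }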

I wouldn't expect a preprocessor to be touching anything within another backend's output directory. That could end up being quite brittle because you're making assumptions about how the backend will render things.

Your idea of lifecycle methods (e.g. before_preprocess()) is a good one, although we'd need to update the 3rd party plugin mechanism to take it into account. One way this could be propagated to 3rd party backends would be to append an argument to the executed command, with no arguments meaning "run the normal rendering code". Another idea is to use environment variables (e.g. MDBOOK_PHASE=render or MDBOOK_PHASE=before_preprocess). Both could be implemented in a backwards-compatible way.
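As a sketch of the environment-variable idea (MDBOOK_PHASE is hypothetical here, not something mdBook sets today), a third-party backend's main could look like:

    use std::env;

    fn main() {
        // MDBOOK_PHASE is hypothetical; no variable (or an unknown value)
        // means "run the normal rendering code", which keeps existing
        // backends working unchanged.
        match env::var("MDBOOK_PHASE").as_deref() {
            Ok("before_preprocess") => prepare_output_dir(),
            _ => render_book(),
        }
    }

    fn prepare_output_dir() {
        // e.g. create or clean this backend's output directory
    }

    fn render_book() {
        // read the RenderContext from stdin and render as usual
    }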

A second idea is to add a section to the [build] config to tell mdbook "before running the preprocessors for backend X, delete its output directory":

[build]
delete-output-directory-before-build = ["html"]

The latter would probably be easier to implement, but I'm not sure if it would be scalable/extensible or considered hacky.

Otherwise, instead of generating image files, mdbook-plantuml could use data URIs. That way when you do your transformation, instead of replacing the original text with a normal <img src="some-file.png"/> you could emit <img src="data:image/png;base64,R0lGXoGAiqh..." />.
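Building the data URI would be something like this (a sketch assuming the base64 crate's 0.21+ Engine API and an already-rendered PNG on disk):

    use base64::Engine;
    use std::fs;

    // Sketch: read an already-rendered PNG and emit an <img> tag with an
    // embedded data URI instead of a file reference.
    fn png_to_img_tag(path: &str) -> std::io::Result<String> {
        let bytes = fs::read(path)?;
        let encoded = base64::engine::general_purpose::STANDARD.encode(&bytes);
        Ok(format!("<img src=\"data:image/png;base64,{}\" />", encoded))
    }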

@sytsereitsma what are your thoughts? Do any of those solutions sound easier/better to you? I'm happy to make PRs for any mdbook changes.

@Michael-F-Bryan I'll change my preprocessor to use data URIs to avoid the issue altogether. You make a good point that my preprocessor should be independent of the renderer.

I do think the lifecycle methods can avoid the problem (e.g. for other preprocessors). It is very difficult to troubleshoot for a regular user because it just seems as if the preprocessor does not work properly.

Didn't know about data URIs, thanks for the tip 👍

It is very difficult to troubleshoot for a regular user because it just seems as if the preprocessor does not work properly.

See sytsereitsma/mdbook-plantuml#7 (comment). You can write error messages to STDERR and return a non-zero exit code from the preprocessor to let users know about errors.
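For example (a minimal, mdBook-agnostic sketch):

    use std::process;

    fn main() {
        if let Err(e) = run() {
            // Anything written to stderr shows up in mdBook's output, and a
            // non-zero exit code makes the build fail loudly instead of
            // silently producing a broken book.
            eprintln!("mdbook-plantuml error: {}", e);
            process::exit(1);
        }
    }

    fn run() -> Result<(), Box<dyn std::error::Error>> {
        // ... the actual preprocessing ...
        Ok(())
    }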

I do think the lifecycle methods can avoid the problem (e.g. for other preprocessors).

If we were to go down the lifecycle method path, can you think of any other lifecycle methods we may want to add to either the preprocessor or renderer?

As far as my preprocessor is concerned everything is working fine. It's just that the output gets deleted once it is done :-)

Regarding the lifecycle methods, I do not see any other sensible method than before_preprocessor when looking at MDBook::execute_build_process (src/book/mod.rs).

I have one more suggestion. Since deleting a directory is a significant operation, I'd add a log message when performing the deletion (maybe even an info message).
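Something like this in the HTML renderer (just a sketch; the actual helper mdBook uses to clear the directory may differ):

    use std::fs;
    use std::path::Path;

    use log::info;

    // Sketch: log before wiping the output directory so users can tell why
    // files they (or a preprocessor) put there have disappeared.
    fn clean_output_dir(destination: &Path) -> std::io::Result<()> {
        if destination.exists() {
            info!("Deleting existing output directory: {}", destination.display());
            fs::remove_dir_all(destination)?;
        }
        Ok(())
    }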

Regarding fixing my preprocessor, I think I will copy the files to the source dir of the book to stay renderer agnostic. The maximum size of data URIs differs between browsers (according to MDN, Opera only allows 65k). Besides that, it is likely that another renderer would not support data URIs.

I ran into this issue as well with https://github.com/dylanowen/mdbook-graphviz. I "solved" it by putting my output in the src directory, which is pretty gross.

As for suggestions of what we could do: I like the model we have now, where we're always operating across stdin -> stdout, so chaining / modifying markdown works nicely.

What about extending BookItem to include a resource file, or adding Resources to the Book? Preprocessors that wanted to process resource files would have that option, or they could also generate them and pass them down the line. Then our entire src directory becomes something we're operating on, rather than just the markdown.

I'm not sure how we'd serialize the resource files across the json boundary, and I'm not sure about backwards compatibility but it would give preprocessors more power to interact with the various files without needing to modify the src or the output directories.

I like the idea of adding a "resource" section to the Book. In theory this should be backwards compatible because the JSON deserializer would interpret a missing "resource" key as an empty Vec<Resource>.

It wouldn't be backwards compatible if we were to add all non-markdown files as Resources to the Book, though (preprocessors/renderers not aware of Resources would silently drop them). That feels like a more general solution, because it means we load all files under src/ into memory, so files generated by a preprocessor aren't special.
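To make that concrete, here is a rough sketch of the shape I have in mind (the Resource type and field names are hypothetical):

    use serde::{Deserialize, Serialize};

    // Hypothetical resource type; the name and fields are illustrative only.
    #[derive(Serialize, Deserialize)]
    pub struct Resource {
        /// Path relative to the book root, e.g. "img/diagram.png".
        pub path: String,
        /// File contents, e.g. base64-encoded for the JSON boundary.
        pub content: String,
    }

    // Sketch of Book growing a resources field. #[serde(default)] means JSON
    // produced by an older mdBook (no "resources" key) still deserializes,
    // yielding an empty Vec.
    #[derive(Serialize, Deserialize)]
    pub struct Book {
        #[serde(default)]
        pub resources: Vec<Resource>,
        // ... existing fields (sections, etc.) ...
    }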

I agree that writing into the src/ or book/html/ directories feels wrong. It'd be like a build script writing to target/debug/ or src/ instead of the directory it's told to write to ($OUT_DIR, the equivalent of MDBook::build_dir_for()).

Something that came to mind: images can get rather large and plentiful, meaning the amount of data to transfer over stdio (and JSON) may get quite large. How about outputting to a 'preprocessor-resources' directory, which then gets copied into the book output dir by the renderer?

Hi Guys,

I'd like to see this resolved. With the workaround I created, the serve command gets into an infinite regeneration loop because the preprocessor modifies files in the src dir.

I'm more than happy to create a PR for this, but I'd like to have consensus on the path to follow.

As discussed above we have the following options (so far):

  1. The before_preprocessor lifecycle method (a quick, somewhat hacky fix)
  2. Extending the Book with a resources section and serializing the images over the stdin/stdout interface between the preprocessor and mdBook (is there a length limit for JSON strings in serde_json?).
    I'd suggest deflated, base64-encoded image data.
  3. Using a preprocessor-resources directory for staging (which is ignored by the serve command)

Sytse

That's a good point about transferring large amounts of data over stdio; I've been unable to find much info on the limits or performance hits for it, though.

My rationale for passing data through stdout/in was that it would give preprocessors more power to interact with each other.

For example we could create a "Thumbnail Preprocessor" that would take in images in the src directory and generate a thumbnails page to reference them. In the current model this is possible, but it'd be great if the output of mdbook-plantuml could feed into this Thumbnail preprocessor, without needing to stage anything in the src directory.

I'm not too worried about the stdio overhead compared to the whole sequence of events (i.e. generate a file, load it in memory, zip and encode it, transfer it, extract and save it). My main concern is the unneeded data serialization/deserialization, even though that probably is a negligible overhead compared to the graphviz rendering.

So time to put the assumptions to the test and generate some data to base the decisions on. I'll do some spikes on how serde_json handles large string values and see if I can collect some metrics on different strategies.
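Probably something as crude as this (sizes and the round-trip shape are arbitrary; just to get a feel for the numbers):

    use std::time::Instant;

    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize)]
    struct Payload {
        // Stand-in for a base64-encoded image embedded in the JSON.
        image: String,
    }

    fn main() {
        // Roughly 8 MB of base64-like data.
        let payload = Payload { image: "A".repeat(8 * 1024 * 1024) };

        let start = Instant::now();
        let json = serde_json::to_string(&payload).unwrap();
        let roundtrip: Payload = serde_json::from_str(&json).unwrap();
        println!(
            "round-tripped {} bytes in {:?}",
            roundtrip.image.len(),
            start.elapsed()
        );
    }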

How about a slightly different approach, where in the first step of the build process the src tree gets copied into an intermediate work directory (let's call it prep) that preprocessors can write to? Once they're done, the normal build takes over, but instead of building the files found in src, it builds whatever is found in prep. Once built, prep can safely be destroyed.
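A minimal sketch of that copy step, assuming plain std::fs and ignoring symlinks / incremental rebuilds:

    use std::fs;
    use std::io;
    use std::path::Path;

    // Recursively copy the book's src tree into a scratch "prep" directory
    // that preprocessors are allowed to write into.
    fn copy_tree(from: &Path, to: &Path) -> io::Result<()> {
        fs::create_dir_all(to)?;
        for entry in fs::read_dir(from)? {
            let entry = entry?;
            let dest = to.join(entry.file_name());
            if entry.file_type()?.is_dir() {
                copy_tree(&entry.path(), &dest)?;
            } else {
                fs::copy(entry.path(), &dest)?;
            }
        }
        Ok(())
    }

    fn main() -> io::Result<()> {
        copy_tree(Path::new("src"), Path::new("prep"))
    }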