Proposal: Use `rustdoc` JSON instead of parsing with `syn`

Question

Proposal: Use `rustdoc` JSON instead of parsing with `syn`

madsmtm opened this issue 7 months ago · comments

Parsing Rust code with syn is useful for macro authors, but has its limitations for whole-crate tools such as cbindgen, as is apparent in that it does not properly support namespacing since it has no access to type information and such (and the support for macro expansion is also a bit spotty).

In an ideal world, I think cbindgen should be implemented using rustc_driver, and distributed alongside the Rust compiler in rustup (perhaps with a cargo cbindgen subcommand included); this would give it all the power of rustc, and allow it to do full semantic analysis of the code; weird quirks begone!

But I recognize that that's a huge ask from several teams, and that it may not be desired for the Rust project to take ownership over this project (which is effectively a blocker for using rustc_driver, as doing that out-of-tree / without support from upstream is a huge hazzle).

Luckily, there exists something else we can use! Enter rustdoc's JSON output (documentation for the format here, can be access conveniently via. the rustdoc-types crate). This fully describes a crate's public API, which is more than enough for cbindgen's needs (it only needs to know all pub extern "C" fn, and their dependent types).

While the format itself is yet unstable, it is already depended on in the ecosystem, prominently cargo-semver-checks, so I'd say there is a high likelihood that it will not outright go away without some sort of replacement¹.

So: I propose that cbindgen transitions to using this instead of syn, which would solve all of the aforementioned problems with namespacing (and likely make the implementation of cbindgen smaller, and capable of more advanced semantic type analysis in the future (example of something: knowledge of auto traits)).

What do you think?

_{1: If cbindgen ends up going this route, we should of course state so on the tracking issue.}

Predrag Gruevski · Answer 1 · Wed Feb 14 2024 05:55:54 GMT+0800 (China Standard Time)

While the format itself is yet unstable, it is already depended on in the ecosystem, prominently cargo-semver-checks

cargo-semver-checks author and maintainer here 👋 It's important to note that cargo-semver-checks invested heavily in infrastructure to isolate us from breaking changes in the rustdoc format, which is why format changes are now not such a big deal for us.

The infrastructure we use is all open-source and based on the Trustfall query engine and the trustfall-rustdoc and trustfall-rustdoc-adapter crates. It's discussed more in this 15min talk I gave at the P99 CONF conference.

If you're interested in reusing some of that work, I'd be happy to point you in the right direction and help you get started.

Emilio Cobos Álvarez · Answer 2 · Mon Apr 15 2024 07:06:46 GMT+0800 (China Standard Time)

Luckily, there exists something else we can use! Enter rust-lang/rust#76578 (documentation for the format here, can be access conveniently via. the rustdoc-types crate). This fully describes a crate's public API, which is more than enough for cbindgen's needs (it only needs to know all pub extern "C" fn, and their dependent types).

Is this true? We also need to know #[repr()] attributes, and non-public members of structs etc as well.

If rustdoc provides all that information, that might be an interesting approach. I'd be happy to take a patch that prototypes it assuming no regressions...

cc #287 for things that would allow us to handle macros.

Predrag Gruevski · Answer 3 · Mon Apr 15 2024 07:57:51 GMT+0800 (China Standard Time)

cargo-semver-checks uses rustdoc JSON for all the data points you describe: it has lints that cover types' #[repr()] attributes, extern "C" and other ABIs on functions, and also needs to be able to see non-public members of structs since they are often semver-relevant in non-obvious ways.

We generate rustdoc JSON with the --document-private-items --document-hidden-items flags, since private and hidden items are not included in the JSON by default.

Emilio Cobos Álvarez · Answer 4 · Mon Apr 15 2024 08:47:42 GMT+0800 (China Standard Time)

Cool! Curious, do you know how fast / slow rustdoc happens to be compared to syn?

Also, thinking a bit about what cbindgen features we use in Firefox I came up with:

The ability to reuse the cargo metadata across cbindgen invocations to speed up the process. We run cargo metadata on the whole workspace, and then call cbindgen a bunch of times with the same metadata, rather than once per crate. This is presumably supported by rustdoc in a way?
- At least it seems we could use cargo doc --no-deps <crate> to avoid global work for each invocation. But ideally running cbindgen on let's say, 10 crates of the same workspace would be reasonably fast / not scale linearly with the number of crates.
The fact that it runs quite fast.
- At a glance rustdoc seems probably ok speed-wise? But would be useful to have some profiles if we get to implement this before shipping it.
The ability to track the dependencies (which files we read) to make incremental builds fast and correct when you haven't touched relevant files.
- I don't know if rustdoc does something like this. I guess we'd need to assume "any file touched in a crate makes the crate dirty" or something if it does not? That'd be somewhat unfortunate... cc @jschwe who contributed depfile support, iirc.
The fact that it doesn't need to compile. This is particularly useful when paired with bindgen (which does require its C++ to compile). This way, for tightly coupled rust / c++ components, you can run cbindgen first to generate the C++ header, and run bindgen on the C++ code.
- Not sure how well rustdoc deals with e.g. a missing module that's generated via build.rs... Presumably it does need to run build.rs scripts?
- This is kind of an abuse of cbindgen, but it is really really useful for us. If nobody else uses it we could drop this, I guess.

Predrag Gruevski · Answer 5 · Mon Apr 15 2024 09:42:17 GMT+0800 (China Standard Time)

Rustdoc is a component of rustc, so to my knowledge it behaves very similarly: it runs build scripts, resolves dependencies, requires the code to type-check, handles incremental builds, etc.

I've never benchmarked it versus syn but I'd expect it's often slower because syn isn't doing nearly as much work.

Flávio J. Saraiva · Answer 6 · Thu Jun 06 2024 01:15:21 GMT+0800 (China Standard Time)

I have no experience with trustfall, but a trustfall adapter for syn seems appropriate for the cbindgen case.
It could be used to generate the internal data instead of parsing syn directly.

At a later stage, the internal data could get a trustfall adapter that would be used by the language backends.
If needed, the language backends could also query extra information from syn.

Predrag Gruevski · Answer 7 · Thu Jun 06 2024 02:08:28 GMT+0800 (China Standard Time)

If you're interested in giving it a shot, I'd be happy to help you get started with Trustfall and answer any questions!

Flávio J. Saraiva · Answer 8 · Fri Jun 07 2024 09:12:10 GMT+0800 (China Standard Time)

If you're interested in giving it a shot, I'd be happy to help you get started with Trustfall and answer any questions!

I don't mind mind trying but only after someone that actually maintains cbindgen says that this path is acceptable for them. I'm just a random user/programmer. =~

Matt Campbell · Answer 9 · Thu Jun 20 2024 22:56:34 GMT+0800 (China Standard Time)

The issue with cbindgen only supporting macro expansion on nightly Rust is an issue for AccessKit's C bindings. In the short term, I think we'll have to manually generate our header file (still using nightly) and commit it to the repo, but obviously that's not idea. Anything we can do to help this get solved? cc @DataTriny

Andy Day · Answer 10 · Tue Jul 02 2024 11:55:48 GMT+0800 (China Standard Time)

I was looking into this, and trustfall looks like a super cool idea. I was looking at the rustdoc integration that appears to be used in cargo-semver-checks, and I'm stumped on two things.

How can I get the types of struct fields / enum struct variants?
How can I get the generics of a given struct or an enum?

I did successfully only pull out the repr C items in my codebase, so I could get that working, and I could choose structs or enums, and list the names of the types and their fields, but I couldn't get the type information that would be useful. I would like to do the sort of "copy-and-paste monomorphization" that cbindgen appears to do out of the box. @obi1kenobi could you point me in the correct direction for how to do these things?

Predrag Gruevski · Answer 11 · Tue Jul 02 2024 13:01:11 GMT+0800 (China Standard Time)

Currently, the Trustfall adapter for rustdoc doesn't expose type information for struct fields / enum variant components / function arguments / generics etc. It can, it will just take a bunch of work to model everything properly given Rust's very expressive type system (impl trait + generics + lifetimes etc.). I'm currently looking for funding to make this happen.

In the meantime, I recommend a hybrid approach: using Trustfall, output the id property of the item whose type info you want, then use it to look up the item directly in the index of the rustdoc JSON data. That way you can still inspect the type information even though it didn't appear in the Trustfall query.

How does that sound? Happy to help however I can, so let me know if anything is still confusing!

Andy Day · Answer 12 · Tue Jul 02 2024 13:45:33 GMT+0800 (China Standard Time)

Yes, that is very helpful thank you! I hope you receive your funding! I think for my use case (and probably for a lot of other people's use cases) I won't need the full expressiveness of the type system (not having generics across the FFI boundary drastically simplifies things), so this is a great solution, thank you for your detailed response!

Predrag Gruevski · Answer 13 · Tue Jul 02 2024 22:56:51 GMT+0800 (China Standard Time)

Awesome! I'd love to hear about what you build and your experience along the way, so please keep me in the loop :)