Support file imports in resymgen

Question

UsernameFodder opened this issue 3 years ago

Context

Currently, the resymgen YAML specification requires all the symbols within a given block to be in a single file. The specification allows some flexibility by supporting multiple independent blocks, which can either share the same file or be split across files. This fits naturally with the use case of researching multiple binaries, which may be related but are largely independent.

However, with a single large binary, putting all symbols in a single file can quickly become unwieldy, and the YAML file can become quite large. Currently, the best workaround is to split the symbol table across multiple files that all share the same block name, and use the merging capabilities of resymgen to combine the files back into a single symbol table when generating artifacts. Unfortunately, this comes with a few downsides:

- It becomes harder to work with the overall symbol table as a single entity, since resymgen commands will need to be run separately on each subtable file.
- Certain validity checks, such as non-overlapping symbols, cannot be enforced jointly across subtables.
- The need to be able to merge subtables back together also means each of the files must define a block with the same name, address bounds, and description (or leave the description empty in one or more subtables). On top of repetition, this also introduces the possibility of different subtables covering interleaved address spaces (which may or may not be desirable, but isn't very natural when dealing with a raw binary file).

Proposed solution

`subregions` field on blocks

In the style of the Rust module system, support an optional subregions field on blocks, like so:

block_name:
  versions:
    - v1
  address: 0x0
  length: 0x1000
  description: foo
  subregions:
    - "sub1.yml"
    - "sub2.yml"
  functions: []
  data: []

Within a file foo.yml, items in the subregions list should be names of files within the sibling foo/ directory. Files should be valid resymgen YAML files, which should define one or more subregions (as arbitrarily named blocks) contained within the address range of the parent block, with accompanying lists of symbols. This will allow large address spaces to be subdivided cleanly into collections of smaller (but still internally contiguous) regions, possibly spread over many different files, while still allowing a top-level file for metadata and symbols that don't belong in a specialized subregion.

Since the subregion directory name is implied by the file name, this means that a file with multiple blocks must put subregions for different blocks in the same directory. But if blocks are grouped into a single file, it makes sense that their subregions should go in the same directory. If further splitting is desired, the (separate) subregion files can themselves define additional subregions.
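
As a rough illustration of the path rule described above, here is a minimal Python sketch (not resymgen's actual implementation, which is written in Rust). The function name `resolve_subregions` is hypothetical, and rejecting entries that contain a directory component is an assumption about how "names of files within the sibling directory" might be enforced.

```python
from pathlib import Path


def resolve_subregions(parent_file, subregion_names):
    """Map subregion file names to paths inside the parent's sibling directory.

    For a parent file "foo.yml", the sibling directory is "foo/".
    """
    parent = Path(parent_file)
    # Drop the extension: "symbols/foo.yml" -> "symbols/foo"
    subregion_dir = parent.with_suffix("")
    resolved = []
    for name in subregion_names:
        # Assumption: only bare file names are accepted, so entries with a
        # directory part (or "..") are rejected rather than resolved.
        if Path(name).name != name or name == "..":
            raise ValueError(f"subregion must be a bare file name: {name!r}")
        resolved.append(subregion_dir / name)
    return resolved


paths = resolve_subregions("symbols/foo.yml", ["sub1.yml", "sub2.yml"])
print([p.as_posix() for p in paths])
# -> ['symbols/foo/sub1.yml', 'symbols/foo/sub2.yml']
```

Applying the same rule to `symbols/foo/sub1.yml` yields the directory `symbols/foo/sub1/`, which is how further nesting would fall out of the naming scheme.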

Considerations

- Arbitrary layers of nesting should be supported.
- The gen command should resolve subregions automatically, and include imported symbols as if they were part of the parent symbol table.
- For speed and flexibility, the fmt command should not resolve subregions by default, and should only sort the subregion list alphabetically (since order doesn't have meaning). It should omit the subregions field if the list is empty.
  - If the -r, --recursive option is provided, the formatter should run on all files in the import tree.
- For speed and flexibility, the check command should check the main file's contents but ignore the subregions field. The -u, --unique-symbols check should additionally ensure that none of the file names in the subregion list are repeated.
  - If the -r, --recursive option is provided, resymgen should:
    1. Recurse into subregion files and run the same checks on them as on the main file.
    2. Run some additional relationship checks between files if certain check flags were specified:
       - -V, --complete-version-list: Ensure the main file's version list contains all versions specified in the subregion files' version lists.
       - -b, --in-bounds-symbols: Ensure the bounds of subregions fall within the main file's address bounds.
       - -o, --no-overlap (note the name change from --no-function-overlap): Ensure that none of the subregion bounds overlap, and that none of the main file's symbols (functions or data) overlap with any of the subregion bounds.
       - -u, --unique-symbols: Ensure no symbol name is repeated between the main file and subregions, or across subregions.
- The merge command should recurse into subregions, and try to place incoming symbols in the appropriate subregion if one is present. Otherwise, it should fall back to merging into the main symbol table.
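
To make the relationship checks and the merge fallback more concrete, here is a small Python sketch of the subregion bounds check, the subregion overlap check, and the "place in a subregion, else fall back to the main table" rule. This is only an illustration under assumptions: the `Region` type and all function names are hypothetical, address ranges are treated as half-open (`address` up to but not including `address + length`), and the check of main-file symbols against subregion bounds is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a block's name and address bounds."""
    name: str
    address: int
    length: int

    @property
    def end(self):
        return self.address + self.length  # exclusive upper bound


def out_of_bounds(parent, subregions):
    """Names of subregions that are not fully contained in the parent."""
    return [s.name for s in subregions
            if s.address < parent.address or s.end > parent.end]


def overlapping_pairs(regions):
    """Report each region that starts inside an earlier, still-open region."""
    pairs = []
    furthest = None  # region seen so far that extends to the highest address
    for cur in sorted(regions, key=lambda r: r.address):
        if furthest is not None and cur.address < furthest.end:
            pairs.append((furthest.name, cur.name))
        if furthest is None or cur.end > furthest.end:
            furthest = cur
    return pairs


def place_symbol(address, subregions, fallback="main"):
    """Pick the first subregion containing the address, else the main table."""
    for s in subregions:
        if s.address <= address < s.end:
            return s.name
    return fallback


parent = Region("block_name", 0x0, 0x1000)
subs = [Region("sub1", 0x100, 0x200),   # 0x100..0x300
        Region("sub2", 0x280, 0x100),   # 0x280..0x380, overlaps sub1
        Region("sub3", 0xF80, 0x100)]   # 0xF80..0x1080, past the parent end
print(out_of_bounds(parent, subs))      # -> ['sub3']
print(overlapping_pairs(subs))          # -> [('sub1', 'sub2')]
print(place_symbol(0x150, subs))        # -> sub1
print(place_symbol(0x800, subs))        # -> main
```

Sorting by start address keeps the overlap check to a single pass; it flags any overlap that exists but does not enumerate every overlapping pair.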

Additional repository changes

- Update resymgen docs (resymgen.md and docstrings).
- Update pmdsky-debug symbols docs.
- Update GitHub Actions workflows as needed. This should only require updating any checks to use the --recursive flag; release package generation should be unaffected with the proposed change to the gen command.
- Update function headers and symbol_check.py to deal with symbol table import trees.
- Update symbols_vfill.py to deal with symbol table import trees.

Alternative solutions

C/C++ style `includes`

This would be like the main proposal, but with support for arbitrary file paths, which would look like this:

block_name:
  versions:
    - v1
  address: 0x0
  length: 0x1000
  description: foo
  includes:
    - "path/to/sub1.yml"
    - "path/to/sub2.yml"
  functions: []
  data: []

This seems like more flexibility than would be useful, especially since it's never expected that one would need to reuse the same subregion in multiple different parents. Arbitrary file path inclusion also introduces the risk of circular dependencies.
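
To illustrate the circular dependency risk, here is a minimal Python sketch of the kind of cycle detection a loader would need if arbitrary include paths were allowed. The include graph is a hand-written dictionary rather than parsed YAML, and the function name `find_cycle` is hypothetical.

```python
def find_cycle(includes, start):
    """Return one include cycle reachable from `start`, or None.

    `includes` maps a file name to the list of files it includes.
    """
    path = []        # chain of files currently being expanded
    on_path = set()  # same files as `path`, for fast membership tests
    done = set()     # files already fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            # The file includes one of its own ancestors: report the loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for child in includes.get(node, []):
            cycle = visit(child)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        done.add(node)
        return None

    return visit(start)


graph = {
    "main.yml": ["path/to/sub1.yml"],
    "path/to/sub1.yml": ["path/to/sub2.yml"],
    "path/to/sub2.yml": ["path/to/sub1.yml"],  # includes its own parent
}
print(find_cycle(graph, "main.yml"))
# -> ['path/to/sub1.yml', 'path/to/sub2.yml', 'path/to/sub1.yml']
```

With the `subregions` proposal, each file's children live in a directory derived from that file's own name, so a child cannot name one of its ancestors and this check would presumably be unnecessary.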

Directories as aggregate entities

This option would support running resymgen commands on a whole directory, automatically merging all contained files and treating the contents as one unified table. This would provide a simple way to split up a large file into multiple smaller files, and is probably less work to implement than adding a new block field. However, this approach has some disadvantages:

- It lacks an elegant way to define a parent-child relationship between a main file and subregions, which is particularly desirable for documentation purposes, since the existence of a main file provides a natural and obvious place to document top-level address bounds and overall notes about the binary. This also causes issues for the merge command, which benefits from having a main file as a fallback destination for symbols.
- It fits less cleanly with the current resymgen YAML spec, particularly when files contain multiple blocks. Splitting up a multi-block file would require either more nesting, or a single flat directory containing subregions of multiple different blocks at once (the main proposal has the same property, but is less confusing because a clear parent file exists).
- It relies on the merge functionality of resymgen, which was experimental and idiosyncratic to begin with.

New file format

Introducing a new file format would be similar to the imports proposal, except the imported files would be some simpler format rather than standalone resymgen YAML files. This could mean slightly less boilerplate around the main goal of splitting up long functions and data lists. But adding a whole new format is probably even more complex than using the existing one. It also loses some of the nice properties of subtables being standalone, such as being able to run checks and formatting on individual subtables, and trivial arbitrary nesting support.

YAML inclusion with custom tag handles

While standard YAML doesn't have any kind of "include" statement, some YAML loaders like PyYAML's support user-defined handlers for tag handles, which enables the use of constructs such as !include <filename> directly within a YAML file. Unfortunately, yaml-rust does not currently support this, and even if it did, such a construct wouldn't be reliably portable (which would make the symbol tables harder to use). Furthermore, direct textual includes wouldn't fit very well with resymgen because of the separation between the function and data lists; separating out a subregion containing both functions and data would require the use of two separate !includes in different places. It would also hinder error reporting, since inclusion would happen at the YAML loader layer, which would hide it from the resymgen layer.

Metalcape · Answer 1 · Mon May 09 2022 20:53:03 GMT+0800 (China Standard Time)

I agree that the includes solution would be the best one, all things considered. The only slightly annoying thing would be exported CSV symbols getting scattered across multiple files in case you need to do some manual edits (like descriptions, because Ghidra doesn't support exporting comments as far as I know). We could work around this by allowing merge to output onto a new .yml file while also checking the address range from the main file, so that you can do a first merge, then do manual edits, and finally a second merge to place symbols in the appropriate subregions.

Marco Köpcke · Answer 2 · Mon May 09 2022 21:31:20 GMT+0800 (China Standard Time)

It might be a bit overkill in this case but I wrote a library for merging nested YAML documents:

It's written in Rust and meant as a Python library. I could add a no-python build for resymgen:

https://github.com/theCapypara/configcrunch

UsernameFodder · Answer 3 · Mon May 09 2022 23:51:15 GMT+0800 (China Standard Time)

> I agree that the includes solution would be the best one, all things considered. The only slightly annoying thing would be exported CSV symbols getting scattered across multiple files in case you need to do some manual edits (like descriptions, because Ghidra doesn't support exporting comments as far as I know). We could work around this by allowing merge to output onto a new .yml file while also checking the address range from the main file, so that you can do a first merge, then do manual edits, and finally a second merge to place symbols in the appropriate subregions.

This isn't really the point of the merge command IMO. I think it's unlikely merge will give you bad data, so you shouldn't really have to check anything, and even if you want to, git diff can be used. The main strength of merge is that it does deduplication/conflict resolution, so you can, e.g., copy a whole symbol table from a Ghidra project and easily merge it into an existing YAML file, ignoring things that don't fit or are already in the YAML file. I think it'd be easier to add descriptions in-situ rather than needing to transplant each entry into different YAML files manually. Having a two-stage merge seems complicated; how would resymgen decide which stage to run when invoked? You could still accomplish the "separate file" thing by just merging into a blank YAML file anyway.

UsernameFodder · Answer 4 · Mon May 09 2022 23:57:25 GMT+0800 (China Standard Time)

> It might be a bit overkill in this case but I wrote a library for merging nested YAML documents:
>
> It's written in Rust and meant as a Python library. I could add a no-python build for resymgen:
>
> https://github.com/theCapypara/configcrunch

Hmm this does seem a bit much. And it seems like it might share some downsides with a !include approach, i.e. it would be hard to surface errors from the original file when running checks (which could only be done on the crunched file because the subfiles would no longer be standalone resymgen YAML). Though, I'm not familiar with what the API looks like, so maybe I'm misunderstanding the behavior.

UsernameFodder · Answer 5 · Tue May 10 2022 00:17:16 GMT+0800 (China Standard Time)

After more thought, I think a stricter, Rust-style module structure would make more sense here than C/C++-style file inclusion. Updated the issue description to reflect this.
