greghendershott / pdb

Multi-file check-syntax database

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

This is WIP exploring the idea of storing, for multiple source files, the result of running drracket/check-syntax, plus some more analysis.

The main motivation is to support multi-file flavors of things like "find references" and "rename".

The intent is this could enhance Racket Mode, as well as Dr Racket and other tools.

Database

For each analyzed source file:

  1. Fully expand, accumulating some information even if expansion fails (as used by e.g. Typed Racket):
  • direct calls to error-display-handler
  • online-check-syntax logger messages
  1. Run check-syntax, recording the values from various syncheck-annotations<%> methods.

After accumulating information in various fields of a struct, finally the struct is serialized, compressed, and stored in a sqlite table.

We extend the check-syntax analysis in various ways:

  • In addition to syncheck:add-definition-target, which identifies definitions, we identify and record exports from fully-expanded #%provide forms.

  • In addition to syncheck:add-arrow/name-dup/pxpy, which identifies lexical and import arrows, we identify and record some other flavors of arrows:

    • import-rename-arrows, as from rename-in etc.
    • export-rename-arrows, as from rename-out, etc.

    Also we enhance the check-syntax import-arrows to store the "from" and "nominal-from" information from identifier-binding. Following the nominal-from values to the exports in other files, and vice versa, is how we can identify rename-sites across multiple files.

  • We build a map from positions to submodule name paths (where () means no submodule, i.e. the outermost, file module) as well as whether the module sees its parent's bindings (as with module+). This map supports various functionality:

    • A tool can implement a "run/enter submodule at current position" command without assuming (as does "classic" Racket Mode) s-expression surface syntax to discover the submodule name path.

    • Knowing from which module(s) to itemize imported symbols for completion candidates (as described in the next bullet point).

  • For each module, we record each #%require in a normalized format (module path, whether it is the module language, any prefix, and any exceptions). Later this can be "cashed in" for a list of the imported symbols, to be used by a tool like a source code editor for completion candidates. In some cases we can obtain the export symbols from our own database; as a fallback we use module->exports.

You want to jump where, in what size steps?

In Racket a definition can be exported and imported an arbitrary number of times before it is used -- and can be renamed at each such step.

In general, the definition graph elides that and expresses "big, direct jumps" among files. Which is wonderful when you want to e.g. "visit/find/jump to definition" in another file.

By contrast the "name introduction and use" graph cares about the chain of exports and imports, and considers steps where a rename occurs. A motivation is to support multi-file rename commands. For that to work, every occurrence of the "same" name must be known, including uses in provide and require forms, and considering clauses like rename-out, prefix-ix, rename-in, prefix-out, and so on.

For example, if user wants foo to be renamed bar, then sites like (provide foo) must be changed. Furthermore, sites like (provide (rename-out [foo xxx])) are inflection points where the graph ends. If some other file does (require (rename-in mod [xxx foo])), that "foo" is not the same and should not be in the same set of sites to be renamed as the "foo" in the exporting file.

use->def vs. def->uses

For either type of graph, it is simple to proceed from a use to its source. When the source is in some other file, we know which other file: The identifier-binding "from" or "nominal-from" information always says in which other file to look. If that file isn't yet in the database (or is outdated), we analyze it, and so on transitively. Furthermore it is a 1:1 relation; even when there are multiple steps (such as hopping through a contract wrapper to the wrapped definition), each step is 1:1.

On the other hand, proceeding from a definition to its uses is a 1:many relation, transitively (each of the many uses may in turn have many uses). Furthermore we can't discover absolutely all uses -- unless absolutely all using files have already been analyzed. There exists only a set of known uses, which is limited by the set of already-analyzed files.

This is another motivation to save analysis results for multiple files in a database. One or more directory trees, each for some project the user cares about, can be analyzed proactively. (Thereafter a digest mismatch can trigger an automatic re-analysis of a changed file.) This enables discovering all uses, at least within the scope of those projects.

Disposition

Racket Mode

Status quo, Racket Mode's back end runs check-syntax and returns to the front end racket-xp-mode the full results for each file. The entire Emacs buffer is re-propertized. For example mouse-overs become help-echo text properties.

How exactly would Racket Mode's back end use this pdb project.

Roadmap step 1: Still all results at once

Initially, Racket Mode's back end could use this pdb project the same way: Get the full analysis results, and re-propertize the entire buffer.

That alone is no improvement. But we could add new Racket Mode commands that query the db, such as multi-file xref-find-references or renaming.

Furthermore, I think we could eliminate the back end's cache of fully expanded syntax. For example find-definition no longer needs to walk fully-expanded syntax looking for a site. We already did that, for all definitions, and saved the results; now it's just a db query.

(I'm not sure about find-signature: Maybe we could add a pass to walk pre-expanded surface syntax, finding all signatures, as the status quo back end does one by one.)


Status: Done as an initial sanity check, then discarded. I modified racket-xp-mode and the Racket Mode back end to use pdb when available, and use the same propertize-all-buffer approach. It performed about the same as before; having multi-file rneame was nice. Although that's still in the commit history, I wanted to move on past that to the next step.

Step 2: Query results JIT for spans

A bigger change: The front end would query just for various spans of the buffer, as-needed.

This would improve how we handle larger files like class-internal.rkt, not to mention eenormous files like the example provided by samth.

Status quo, Emacs doesn't block while the analysis is underway, but after it completes, for a sufficiently large buffer and analysis results, it takes a very long time to marshal the results and to re-propertize the entire buffer; Emacs can noticeably freeze.

Admittedly doing limited, JIT queries doesn't magically transform drracket/check-syntax itself to a "streaming" or incremental approach. The entire analysis would still need to complete (still taking about 10 seconds for class-internal.rkt, and 60 for the example provided by samth!) before any new results were available. However the results could be retrieved in vastly smaller batches. IOW there would still be a large delay until any new results were available, but no update freezes.


Status: Done. Still dog-fooding. I quickly realized that modifying racket-xp-mode to work in both the "classic" and new ways was going to be messy. Instead I made a fresh racket-pdb-mode. This works by doing a query to the db whenever point (Emacs jargon, a.k.a. the caret) moves. The back end and pdb return values only pertaining to point and the currently visible span (the window-start through window-end positions, in Emacs jargon). I'm still dog-fooding this, looking for problems or mis-features.

Other tools

Of course this could become a package to be used in various other ways.

We could offer any of:

  • A CLI (e.g. a new raco tool).

  • A stable API for Racket programs.

  • An equivalent API via HTTP.

One issue here is that some tools might prefer or need line:column coordinates instead of positions. [Effectively drracket/check-syntax and our own analysis use syntax-position and syntax-span, ignoring syntax-line and syntax-column.] Either we could try to store line:col-denominated spans, also, in the db when we analyze (at some cost in space). Or we could just synthesize these as/when needed by such an API, by running through the file using port-count-lines! (at some cost in time).

Known limitations and to-do

  • The #%provide clauses all-defined, all-defined-except, prefix-all-defined, and prefix-all-defined-except are not yet supported by our analysis that finds exports. (Note that provide clauses like all-defined-out do not actually expand into these, and are supported. So this limitation isn't as big as it seems. But if some handwritten code or other macro expansion uses these specific #%provide clauses, the exports won't be identified.)

  • The rename-sites command currently returns a hash-table value with all results. For renames involving a huge number of files and sites, a for-each flavor might be preferable.

Tire kicking

If you want to kick the tires on this in its current state, I recommend looking at the tests in example.rkt, as called from the tests submodule.

As the functions work in terms of 1-based positions, just like Racket syntax-position and Emacs buffer positions, it's annoying to keep typing C-x = to see the position at point while in the example files. You might find it handy to add something like the following to your Emacs mode-line-position definition:

(:propertize (:eval (format "%s" (point)))
             face (:slant italic))

Also remember that M-g c will let you jump to a position.


You probably want to avoid, however, the very-many-files-example submodule -- unless you want to wait hours for very many files to be analyzed:

  (require pdb)
  (for ([d (in-list (list* (get-pkgs-dir 'installation)
                           (get-pkgs-dir 'user)
                           (current-library-collection-paths)))])
    (when (directory-exists? d)
      (time (add-directory d #:import-depth 32767))))
  (require (submod pdb/store maintenance))
  (displayln (db-stats))

On my system -- with the non-minimal Racket distribution installed, and about a dozen other packages:

--------------------------------------------------------------------------
Analysis data for 8124 source files: 183.5 MiB.

596394 nominal imports of 149866 exports: 3.2 MiB.
7667 interned paths: 0.6 MiB.

Total: 187.2 MiB.
Does not include space for integer key columns or indexes.

/home/greg/.racket/pdb/pdb-main.sqlite: 219.4 MiB.
Actual space on disk may be much larger due to deleted items: see VACUUM.
-------------------------------------------------------------------------

Also, if you use Emacs, you could try the new pdb branch from the racket-mode repo. In this case you probably to change your racket-mode-hook to use racket-pdb-mode instead of racket-xp-mode. Be aware that sometimes you'll need to git pull from both this pdb repo as well as the pdb branch on the racket-mode repo -- in other words sometimes I'll make a breaking change that requires you to pull from both repos. At this stage things are still evolving, sometimes drastically, so unfortunately it's not yet worth preserving backward compatibility.

About

Multi-file check-syntax database

License:GNU General Public License v3.0


Languages

Language:Racket 99.5%Language:Makefile 0.5%