graphile / migrate

Opinionated SQL-powered productive roll-forward migration tool for PostgreSQL.

[Proposal] SQL preprocessing hooks

micimize opened this issue

It would be nice to have a preprocessors option that would allow the user to provide an array of (sql: string) => string transformers to be run before commit. The saved .sql files under committed/ would then contain the final rewritten result, avoiding the opaque errors that rewriting at migration time can produce. The API could be configured in .gmrc like so:

{
  "preprocessors": ["./directiveBoltons.js", "some-module", "@graphile/gm-always-idempotent"]
}

Use cases

  • bolt-on features such as parsing -- @description: blah comments or directives into actual COMMENT ON ___ IS 'blah' statements (POC)
  • bolt-on idiomatic grant generation via @directives
  • idempotency enforcement / sugar (FUNCTION => CREATE OR REPLACE FUNCTION; see the sketch after this list)
  • here are some "macros" I've made use of with a shell-based schema builder (namely "mixins" and function templates)
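
As a concrete sketch of the idempotency sugar above, a preprocessor under this proposal could be as small as the following (hypothetical module; the export shape is an assumption, not a settled API):

const alwaysIdempotent = (sql: string): string =>
  // Naive: a real implementation would need to avoid rewriting matches
  // inside string literals, dollar-quoted bodies, and comments.
  sql.replace(/\bCREATE\s+FUNCTION\b/gi, "CREATE OR REPLACE FUNCTION");

export default alwaysIdempotent;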

Drawbacks

While I think it'd have a relatively small surface area, and there is already some preprocessing in the form of :placeholders, a full rewriting system opens the door to users building custom DSLs, which arguably goes against the "Migrations should be written in SQL" principle:

@description: A task defined by our good user
TABLE task (
  @omit: read, write
  id                 UUID PRIMARY KEY DEFAULT uuid_generate_v1mc()

  @omit: read, write
  user_id            UUID NOT NULL

  @deprecated
  updated            finite_datetime NOT NULL DEFAULT NOW()

  @virtual
  created(task task) finite_datetime => cast(uuid_timestamp(task.id) AS finite_datetime)

  lifecycle          task_lifecycle default 'TODO'
  closed             finite_datetime

  title              TEXT CHECK (char_length(title) < 280)
  description        TEXT
)

I'm open to this idea; here are a few thoughts.

generatePlaceholderReplacement

The function with this name is effectively a built-in preprocessor that we already have. We could make it the default preprocessor list, but allow you to override that via configuration; if you do, it's up to you whether or not you re-include it in the list (if you don't, it will be skipped).

NOTE: this function runs both against current.sql and committed migrations, so it must be fast.
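
Conceptually, placeholder replacement is a pure text transform; a minimal sketch (not the actual implementation, and assuming placeholder keys include the leading colon, as in .gmrc's placeholders setting):

const replacePlaceholders = (
  sql: string,
  placeholders: { [key: string]: string },
): string =>
  Object.entries(placeholders).reduce(
    // Substitute every occurrence of each :PLACEHOLDER with its value.
    (memo, [name, value]) => memo.split(name).join(value),
    sql,
  );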

Async

The solution should support async callbacks so extra data can be read from files/network/etc. This, however, could make it (or allow it to become) slow.
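
In other words, the accepted shape would presumably broaden to something like the following sketch, where a chain of possibly-async preprocessors is awaited in order:

type Preprocessor = (sql: string) => string | Promise<string>;

async function applyAll(
  sql: string,
  preprocessors: Preprocessor[],
): Promise<string> {
  let result = sql;
  for (const preprocessor of preprocessors) {
    // `await` handles both sync and async return values.
    result = await preprocessor(result);
  }
  return result;
}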

Current vs committed

It may make sense that some of these transforms are applied at the commit stage so that the transforms don't need to run when migrating production, only when running current.sql. This would also mean that committed migrations would be unaffected by further changes to the transforms, which might be desirable.

Applying transforms against committed migrations could be useful though, e.g. to work around syntax differences if you upgrade your PostgreSQL version, or even to apply some kind of fix to an old (hashed) commit rather than having to re-hash the entire stack.

Cacheable

We should be able to skip calling preprocessors if the input value is unchanged.

THIS IS PROBLEMATIC. If you change the preprocessor, then you'd want graphile-migrate to run the new SQL. But running all the async preprocessors just to determine that nothing has changed is a little expensive.

Hash the raw input

To make hash checks fast we should hash the raw input rather than the result of performing the transforms. This is already how we do things w.r.t. generatePlaceholderReplacement, so it's natural to continue this way.
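
A sketch of that ordering using Node's crypto (graphile-migrate's committed migrations carry sha1 hashes; the helper name here is made up):

import { createHash } from "crypto";

// Hash the raw, pre-transform text so that hash verification never needs
// to execute the (possibly slow, possibly async) preprocessors.
const hashRaw = (rawSql: string): string =>
  "sha1:" + createHash("sha1").update(rawSql, "utf8").digest("hex");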

Inputs to the function

Clearly the function needs to receive the input text; but there's also a lot of other things that might be relevant:

  • config options, e.g. the placeholder values
  • the shadow connection string or similar so you could e.g. compare against the status quo
  • the previous text, e.g. if you're iterating current.sql you might want to undo changes in the previously saved current.sql that you no longer have in the current version
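
Pulling those together, the callback's context might look something like this (purely a sketch; none of these names are settled):

interface PreprocessorContext {
  /** Relevant config options, e.g. the placeholder values from .gmrc */
  placeholders?: { [key: string]: string };
  /** Shadow connection string, for comparing against the status quo */
  shadowConnectionString?: string;
  /** The previously saved text, e.g. the last written current.sql */
  previousText?: string;
}

type Preprocessor = (
  sql: string,
  context: PreprocessorContext,
) => string | Promise<string>;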

Lots to think about here. I don't think we'll move this forward until I've had time to let these ideas slosh around in my head for a bit. Let me know if you have further thoughts!

Hmm - there is indeed a lot I hadn't considered!

It seems we'd need a spec to differentiate between precache and runtime preprocessors. Maybe cacheOutput: false, or maybe they should be two different options, like { preprocessors: { precache, runtime } }. Two options make more sense to me because anything that runs after the first runtime preprocessor can't be cached.
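
In .gmrc that split might look like the following (hypothetical config shape; which module goes in which list is arbitrary here):

{
  "preprocessors": {
    "precache": ["./directiveBoltons.js"],
    "runtime": ["@graphile/gm-always-idempotent"]
  }
}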

Hashing will mostly depend on how flexible the preprocessors can be, i.e. the allowed inputs.
I'm thinking of a signature like

hash(inputs: { currentMigration: string, precache: Preprocessor[], runtime: Preprocessor[] })

Where Preprocessor is: { package: string, version: string } | { literalCode: string }. We'd want to record which runtime preprocessors we used in the header comment.
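
A sketch of how such a hash might be computed, folding the preprocessor identities into the digest so that changing a preprocessor invalidates cached output (names and shapes assumed):

import { createHash } from "crypto";

type PreprocessorId =
  | { package: string; version: string }
  | { literalCode: string };

function hashInputs(inputs: {
  currentMigration: string;
  precache: PreprocessorId[];
  runtime: PreprocessorId[];
}): string {
  // Note: a real implementation would need a canonical serialization,
  // since JSON.stringify is sensitive to key order.
  const serialized = JSON.stringify(inputs);
  return "sha1:" + createHash("sha1").update(serialized, "utf8").digest("hex");
}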

I think at least initially the precache input should be constrained to current.sql / current/ and possibly the previous schema. I think it'd be better to handle getting the previous schema outside of preprocessors, as it can be cached.
If a user wants non-SQL config / input for a precache processor, then it can go in current/, as it is a dependency of the current migration. If they need to configure it with sensitive info, it probably needs to be a runtime preprocessor anyway.

That's assuming precache processors are simple babel-like transforms, which is what I had in mind. Not sure anything complex enough to access network would be appropriate as a preprocessor - perhaps from a lack of domain exposure 😅

e.g. @dropFunctions foo might connect to the database, find all of the overloaded functions called foo, and insert the relevant DROP statements into the migration, e.g.:

DROP FUNCTION public.foo();
DROP FUNCTION app_public.foo(user_id int);
DROP FUNCTION app_public.foo(user_id int, bar text);

Of course this could be done synchronously if we were to inspect the database state before the transform ran, but I think doing it async opens up more possibilities.
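
For illustration, a sketch of such an async preprocessor using the pg client (the @dropFunctions directive and all names here are hypothetical):

import { Client } from "pg";

// Expand `-- @dropFunctions name` lines into DROP FUNCTION statements for
// every overload of that function currently present in the database.
async function dropFunctionsDirective(
  sql: string,
  connectionString: string,
): Promise<string> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    let result = sql;
    for (const [line, name] of [...sql.matchAll(/^-- @dropFunctions (\w+)$/gm)]) {
      // regprocedure::text renders each overload with its argument types,
      // e.g. "app_public.foo(integer, text)".
      const { rows } = await client.query(
        "select p.oid::regprocedure::text as signature from pg_proc p where p.proname = $1",
        [name],
      );
      result = result.replace(
        line,
        rows.map((r) => `DROP FUNCTION ${r.signature};`).join("\n"),
      );
    }
    return result;
  } finally {
    await client.end();
  }
}

Because the lookup hits the live database, this is exactly the kind of transform that would presumably need to be a runtime preprocessor under the precache/runtime split discussed above.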

To be clear, I'm not saying that we should implement this async functionality from the start; I'm saying that for a static preprocessor we should consider these features to make sure we add it in such a way that it's extensible later.

I'm going to close this for now; I think this can be achieved by having a separate pre-processor that writes its output to current.sql.