ValeLang / Vale

Compiler for the Vale programming language - http://vale.dev/

Potential improvements for perfect replayability

Verdagon opened this issue

To make replaying compliant with privacy requirements in production, an HTTP server could run without tracing by default. When a request comes in with a special flag (indicating "this user is trying to help reproduce a bug, and fully consents to data recording as part of that"), you enable recording from there: capture all known state, plus a trace from that point until the request is done. Then you stop recording and output the file, to be sent off to a logging server.

(will need to finish figuring out how to record certain areas / threads / time slices of programs)
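A minimal sketch of the opt-in flow in Python, to make the idea concrete. The `X-Record-Consent` header, the `Recorder` class, and the `uploads` list (a stand-in for the logging server) are all hypothetical names invented for illustration; in Vale this would presumably live in the runtime rather than in handler code.

```python
import json

CONSENT_HEADER = "X-Record-Consent"  # hypothetical flag, not a real standard header


class Recorder:
    """Hypothetical recorder: a snapshot of known state plus a trace of events."""

    def __init__(self, snapshot):
        self.snapshot = dict(snapshot)  # copy the state as it was when recording began
        self.events = []

    def log(self, event):
        self.events.append(event)

    def dump(self):
        return json.dumps({"snapshot": self.snapshot, "events": self.events})


def handle_request(headers, body, state, uploads):
    """Handle one request, recording only if the consent flag is present."""
    recorder = Recorder(state) if headers.get(CONSENT_HEADER) == "1" else None
    if recorder is not None:
        recorder.log({"kind": "request", "body": body})

    # The actual request handling: here, just a counter.
    state["count"] = state.get("count", 0) + 1
    response = f"hello #{state['count']}"

    if recorder is not None:
        recorder.log({"kind": "response", "body": response})
        uploads.append(recorder.dump())  # stand-in for sending to a logging server
    return response
```

Requests without the flag never touch the recorder, so the default path carries no tracing overhead and captures no user data.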

When replay mode lets a whitelisted FFI call go through, let's use a checksum to make sure that the incoming data is actually the same as it was during the recorded run.
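A rough sketch of that checksum check, in Python: store a checksum while recording, and compare against it when the whitelisted call runs again during replay. The function names and the choice of SHA-256 are assumptions for illustration, not anything Vale currently does.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when a whitelisted call returns different data than it did when recorded."""


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def record_call(recording: list, result: bytes) -> bytes:
    """While recording: keep only a checksum of what the whitelisted call returned."""
    recording.append(checksum(result))
    return result


def replay_call(recording: list, index: int, result: bytes) -> bytes:
    """While replaying: the call really ran, so verify its result matches the recording."""
    if checksum(result) != recording[index]:
        raise ChecksumMismatch(f"whitelisted call {index} diverged from the recording")
    return result
```

Storing only the checksum keeps the recording small, and a mismatch tells us right away that the replay can no longer be trusted to be deterministic.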

(Thanks to 5225225 for these ideas!)

If you're thinking about privacy, it's probably a good idea to check out this nice feature called "Sensitive Value Attribute" brought into PHP 8.2.

I can easily see a lot of other languages adopting their idea in the near future.

Example taken from: https://stitcher.io/blog/new-in-php-82#redact-parameters-in-back-traces-rfc.

function login(
    string $user,
    #[\SensitiveParameter] string $password
) {
    // …
    
    throw new Exception('Error');
}
 
login('root', 'root');

Stack Trace:

Fatal error: Uncaught Exception: Error in login.php:8
Stack trace:
#0 login.php(11): login('root', Object(SensitiveParameterValue))
#1 {main}
  thrown in login.php on line 8

Note that the $password variable appears as Object(SensitiveParameterValue) instead of its real value in the logs.

Huh! I've occasionally seen some special type-system usage for untrusted user input, but I've never seen it for sensitive data, and never considered making the language aware of it so it could do things like that. That's a really interesting notion!

<thinking out loud>

Marking them as sensitive would indeed let us strip them out of the logs, but that wouldn't quite let us do the recording. Some code paths might depend on the values of the sensitive data, I would think. But I might be wrong. Most sensitive data is things like SSNs, passwords, etc., which are just passed through most of the time.

There's a related idea (not sure where) for replayability which will let us represent "opaque types" in the language, whose values are only moved around but never read for any calculations (can't add them, hash them, etc.). Opaque types are elided from the recording, and because they never factor into any calculations, we can deterministically replay the entire recording without knowing their values.

So, we could have a special Sensitive type which basically acts like an opaque str.
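Roughly what I have in mind, sketched in Python since Vale doesn't have this yet. `Opaque` and `record` are hypothetical names, and Python can only approximate this at runtime (nothing stops someone reaching into `_value`), whereas a real version would be enforced by the type system.

```python
class Opaque:
    """Hypothetical opaque wrapper: can be moved around, never read for calculations."""

    __slots__ = ("_value",)

    def __init__(self, value):
        self._value = value

    def __repr__(self):
        # Never reveal the wrapped value, even in logs or stack traces.
        return "Opaque(<elided>)"

    def _refuse(self, *args, **kwargs):
        raise TypeError("opaque values cannot be read or used in calculations")

    # Can't add them, compare them, hash them, measure them, or stringify them.
    __eq__ = __add__ = __radd__ = __len__ = __str__ = __bool__ = _refuse
    __hash__ = _refuse


def record(args):
    """Elide opaque values from the recording; keep everything else as-is."""
    return ["<opaque>" if isinstance(a, Opaque) else a for a in args]
```

Because no operation can observe the wrapped value, replacing it with a placeholder in the recording cannot change which code paths a replay takes.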

It could be complicated to completely strip this, because normally replayability records anything crossing the FFI boundary, and things will probably cross the network boundary before we can even put them into a Sensitive. I wonder if we could mark the entire incoming buffer as Sensitive, to move the boundary slightly and record anything coming out of it.

Come to think of it, representing boundaries is literally what regions are for. Perhaps we can have a sensitive' region for anything sensitive. We would record anything that transmigrates across the boundary from sensitive' into anything else, similar to how we treated the FFI/host region. The user would just need to be careful to not explicitly bring anything from the sensitive' region into the other regions, but that's explicit so it should be auditable.

</thinking out loud>

I think we can do something with this!

In today's replayability, we record anything going from the host' region (FFI, from C/Rust/etc.) to our main regions.

Let's add a global "sensitive'" region, and adjust that rule a bit:

  • We don't record anything going from host' to sensitive'.
  • We record anything going from host' or sensitive' into our non-sensitive' main regions.
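The adjusted rule is small enough to sketch as a single predicate. This is Python rather than Vale, and the region names are just strings here; the `should_record` function and the `"main"` label are made up for illustration.

```python
HOST = "host'"
SENSITIVE = "sensitive'"
MAIN = "main"  # stands in for any non-sensitive main region


def should_record(src: str, dst: str) -> bool:
    """Record only data entering a non-sensitive main region from host' or sensitive'."""
    comes_from_outside = src in (HOST, SENSITIVE)
    lands_in_main = dst not in (HOST, SENSITIVE)
    return comes_from_outside and lands_in_main
```

So `should_record(HOST, SENSITIVE)` is false (nothing sensitive is captured on the way in), while `should_record(SENSITIVE, MAIN)` is true (anything explicitly pulled out of sensitive' gets recorded, which is the part the user has to audit).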

This could let us run recordings on the client without recording any PII. There are some open questions about how deserializing functions will move data from host' to a target region in a general way that can also move data into sensitive' regions... but this sounds like a really tractable problem now.

Thanks a bunch @spartanatreyu for the intel and inspiration!