Recording library runtime for bug analysis by developers

Question

yacovm opened this issue 3 years ago · comments

Background: While the consensus library is well tested, rare bugs are still bound to exist and to surface in user environments.
Running with logging at debug level greatly helps troubleshooting and helps the library developers understand what went wrong, but it may still not be enough to analyze some bugs.
The reason is that the library is complex enough that it is difficult to track the state changes of its internal components just by reading the logs.

A straightforward way for a library developer to reproduce a bug reported by a user is to have the user grant the developer access to their environment, or even ship a snapshot of that environment to the developer.
Clearly, both options are often not viable in a production environment, because the customer data that is part of the environment must remain confidential.

Goal: Describe a mechanism that, on the one hand, lets a user of the library easily convey to the developer how to reproduce a bug they are experiencing, and on the other hand does not leak any customer data to the developer.

Proposed idea: I propose to augment the library with the ability to record its runtime in a way that later allows an instance of the library to be run from the recording, so that bugs are reproduced. More specifically, the library shall record the initial state, the messages received, and even (indications of) the messages sent into a file. Later, that file can be used in a development environment to instantiate the library from the initial state, replay the messages, and reproduce the bug.

Proposed implementation technique: The Consensus object contains all interactions of the library's internal components with the outside world. All that is required is to record the initial state and to intercept and record the messages received and sent. The recording can then be read sequentially, and the events described in it (message reception, synchronization, etc.) can be returned to the library instead of calling the real implementation of the dependency's interface.
By keeping the interaction between the recording implementation and the rest of the library inside the Consensus object (instead of the internal BFT implementation), we minimize the impact of the code change and the risk of introducing bugs.
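
To make the record-then-replay flow concrete, here is a minimal sketch in Go. It is not the library's actual code: `Event`, `Recorder`, `Replay`, and `roundTrip` are hypothetical names, and the transcript layout (one JSON object per line) is an assumption made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Event is a hypothetical transcript entry; the field names are illustrative only.
type Event struct {
	Kind    string          `json:"kind"` // e.g. "message_received" or "message_sent"
	Peer    uint64          `json:"peer"`
	Payload json.RawMessage `json:"payload,omitempty"` // assumed to be sanitized already
}

// Recorder appends events to a writer, one JSON object per line.
type Recorder struct{ enc *json.Encoder }

func NewRecorder(w io.Writer) *Recorder { return &Recorder{enc: json.NewEncoder(w)} }

func (r *Recorder) Record(e Event) error { return r.enc.Encode(e) }

// Replay decodes events from r in order and passes each one to handle.
func Replay(r io.Reader, handle func(Event) error) error {
	dec := json.NewDecoder(r)
	for {
		var e Event
		err := dec.Decode(&e)
		if err == io.EOF {
			return nil // end of transcript
		}
		if err != nil {
			return err
		}
		if err := handle(e); err != nil {
			return err
		}
	}
}

// roundTrip records events into memory, replays them, and returns what the handler saw.
func roundTrip(events []Event) ([]Event, error) {
	var buf bytes.Buffer
	rec := NewRecorder(&buf)
	for _, e := range events {
		if err := rec.Record(e); err != nil {
			return nil, err
		}
	}
	var seen []Event
	err := Replay(&buf, func(e Event) error {
		seen = append(seen, e)
		return nil
	})
	return seen, err
}

func main() {
	seen, err := roundTrip([]Event{
		{Kind: "message_received", Peer: 2, Payload: json.RawMessage(`{"view":1}`)},
		{Kind: "message_sent", Peer: 3},
	})
	if err != nil {
		panic(err)
	}
	for _, e := range seen {
		fmt.Printf("%s peer=%d payload=%s\n", e.Kind, e.Peer, e.Payload)
	}
}
```

In a real integration the handler passed to `Replay` would feed each event back into the Consensus object, and the writer would be a file rather than an in-memory buffer.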

Confidentiality and data sanitization: Every object will be sanitized to remove sensitive fields prior to recording. The recording shall be in a human-readable format (JSON), so that a user can inspect it and verify that it indeed contains no sensitive data before passing it to a library developer.
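
As a rough illustration of the sanitization step, the sketch below replaces a sensitive payload with its size before serializing to indented JSON. The `Proposal` and `SanitizedProposal` types and their fields are hypothetical and do not mirror the library's real types; which fields count as sensitive is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Proposal is a hypothetical message that carries customer data in Payload.
type Proposal struct {
	View     uint64
	Sequence uint64
	Payload  []byte // sensitive: must never reach the recording
}

// SanitizedProposal is what gets written: the payload is replaced by its size.
type SanitizedProposal struct {
	View        uint64 `json:"view"`
	Sequence    uint64 `json:"sequence"`
	PayloadSize int    `json:"payload_size"`
}

// Sanitize copies only the non-sensitive fields and keeps the payload length.
func Sanitize(p Proposal) SanitizedProposal {
	return SanitizedProposal{
		View:        p.View,
		Sequence:    p.Sequence,
		PayloadSize: len(p.Payload),
	}
}

// ToRecordJSON renders the sanitized proposal as indented JSON for human inspection.
func ToRecordJSON(p Proposal) (string, error) {
	b, err := json.MarshalIndent(Sanitize(p), "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := ToRecordJSON(Proposal{View: 3, Sequence: 7, Payload: []byte("secret")})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Using a separate output type, rather than blanking fields on the original, means a newly added sensitive field is excluded from the recording unless someone explicitly copies it over.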

A simple demonstration of how the recording and playback capability can be applied is available here. The only dependency that is implemented there so far is the synchronization dependency.
As the implementation shows, when running in recording mode, the result of the sync is written to the file. Then, when the library runs in a mode that reads this recording transcript, it retrieves the result from the file and returns it instead of calling the real synchronization dependency.
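
The two modes could look roughly like the following sketch. It is not the linked demonstration: the `Synchronizer` interface, `SyncResult`, and the wrapper types are hypothetical, and an in-memory buffer stands in for the recording file.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// SyncResult and Synchronizer are hypothetical stand-ins for the sync dependency.
type SyncResult struct {
	LatestSequence uint64 `json:"latest_sequence"`
	View           uint64 `json:"view"`
}

type Synchronizer interface {
	Sync() (SyncResult, error)
}

// recordingSync calls the wrapped dependency and writes each result to the transcript.
type recordingSync struct {
	inner Synchronizer
	enc   *json.Encoder
}

func (s *recordingSync) Sync() (SyncResult, error) {
	res, err := s.inner.Sync()
	if err != nil {
		return res, err
	}
	return res, s.enc.Encode(res)
}

// replaySync never touches the real dependency; it returns results read from the transcript.
type replaySync struct{ dec *json.Decoder }

func (s *replaySync) Sync() (SyncResult, error) {
	var res SyncResult
	err := s.dec.Decode(&res)
	return res, err
}

// fakeNetworkSync plays the role of the production implementation in this sketch.
type fakeNetworkSync struct{ calls int }

func (f *fakeNetworkSync) Sync() (SyncResult, error) {
	f.calls++
	return SyncResult{LatestSequence: uint64(10 * f.calls), View: 1}, nil
}

// demo records two syncs, replays them, and reports how often the real dependency ran.
func demo() ([]SyncResult, []SyncResult, int, error) {
	var transcript bytes.Buffer
	prod := &fakeNetworkSync{}
	var rec Synchronizer = &recordingSync{inner: prod, enc: json.NewEncoder(&transcript)}
	var recorded, replayed []SyncResult
	for i := 0; i < 2; i++ {
		res, err := rec.Sync()
		if err != nil {
			return nil, nil, 0, err
		}
		recorded = append(recorded, res)
	}
	var rep Synchronizer = &replaySync{dec: json.NewDecoder(&transcript)}
	for i := 0; i < 2; i++ {
		res, err := rep.Sync()
		if err != nil {
			return nil, nil, 0, err
		}
		replayed = append(replayed, res)
	}
	return recorded, replayed, prod.calls, nil
}

func main() {
	recorded, replayed, calls, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(recorded, replayed, calls)
}
```

Because both wrappers satisfy the same interface, the rest of the library does not need to know which mode it is running in.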