Improve out-of-band change detection
dlqqq opened this issue · comments
Introduction
An out-of-band (abbrev. OOB) change is an edit of a collaborative file not done through JupyterLab's UI. For example, suppose user A runs git reset --hard <other-branch>
while another user B is editing a collaborative document within the same repository. A's git reset
deletes all edits made by B and sets the contents of that document to some fixed value. Then the YStore table containing B's edits needs to be dropped, and this event must be somehow communicated to all Yjs clients connected to the document. Here, we say that an out-of-band change occurred on B's document.
Description
The current implementation of OOB change detection results in more data loss than necessary. Fixing this issue will likely reduce instances of user data loss such as the one reported in #219. I've written this issue in two parts below.
Context: how OOB changes are detected currently
OOB change detection is implemented via polling in FileLoader
. Here are the relevant portions of its source:
class FileLoader:
"""
A class to centralize all the operation on a file.
"""
def __init__(
self, ...
) -> None:
...
self._subscriptions: dict[str, Callable[[], Coroutine[Any, Any, None]]] = {}
self._watcher = asyncio.create_task(self._watch_file()) if self._poll_interval else None
self.last_modified = None
def observe(self, id: str, callback: Callable[[], Coroutine[Any, Any, None]]) -> None:
"""
Subscribe to the file to get notified about out-of-band file changes.
Parameters:
id (str): Room ID
callback (Callable): Callback for notifying the room.
"""
self._subscriptions[id] = callback
async def _watch_file(self) -> None:
"""
Async task for watching a file.
"""
...
while True:
try:
await asyncio.sleep(self._poll_interval)
try:
await self.maybe_notify()
except Exception as e:
self._log.error(f"Error watching file: {self.path}\n{e!r}", exc_info=e)
except asyncio.CancelledError:
break
...
async def maybe_notify(self) -> None:
"""
Notifies subscribed rooms about out-of-band file changes.
"""
do_notify = False
async with self._lock:
# Get model metadata; format and type are not need
model = await ensure_async(self._contents_manager.get(self.path, content=False))
if self.last_modified is not None and self.last_modified < model["last_modified"]:
do_notify = True
self.last_modified = model["last_modified"]
if do_notify:
# Notify out-of-band change
# callbacks will load the file content, thus release the lock before calling them
for callback in self._subscriptions.values():
await callback()
Here, self.last_modified
is a value that is set by FileLoader
's' filesystem methods, FileLoader.load_content()
and FileLoader.maybe_save_content()
. FileLoader.maybe_notify()
is repeatedly called to check the latest last_modified
time (mtime) of the file against its previously recorded mtime. If the latest mtime is greater, then the file was probably modified directly by the server, and an OOB change probably occurred. Each parent is then notified of this OOB change by their callback, previously appended by calling FileLoader.observe()
.
In jupyter_collaboration
, DocumentRoom
is the parent initializing the FileLoader
. DocumentRoom
subscribes to FileLoader
with a callback after it is initialized:
class DocumentRoom(YRoom):
"""A Y room for a possibly stored document (e.g. a notebook)."""
def __init__(
self, ...
):
super().__init__(ready=False, ystore=ystore, log=log)
...
self._file.observe(self.room_id, self._on_outofband_change)
Issue part 1: Using mtime alone is unreliable
When using mtime alone, false negatives (i.e. OOB change going undetected) occur because:
-
A file's mtime can have an imprecision of up to a few seconds (depending on the filesystem). Within that interval, the server can modify a file without updating its mtime, resulting in a false negative.
-
The POSIX standard allows for some system calls like
mmap()
to update the content of a file without updating its mtime.
When using mtime alone, false positives (i.e. an OOB change event is emitted without an OOB change) occur because:
- Writing identical content to a file usually updates the
mtime
but leaves the content unaffected. If an OOB change is signaled here, then a false positive occurs, and collaboration is halted for no reason.- This can be triggered by anything that saves a file to disk. For example, if a user opens a file with
vim
and immediately exits by running:wq
, that would cause a OOB change to be wrongly signaled.
- This can be triggered by anything that saves a file to disk. For example, if a user opens a file with
Both false positives and false negatives result in data loss of at least one client's intent. If we consider the previous example with A and B, a false positive results in A's git reset
going ignored, and a false negative results in halting collaboration on B's notebook no reason. In both cases, either A or B's intended content is ignored.
False positives are especially dangerous because of how easily they can be triggered and how they completely halt collaboration on a document. I suspect that there are specific common user scenarios that trigger these OOB false positives, which contributes towards a higher frequency of data loss issues for users.
Issue part 2: Redundant way of signaling OOB changes
FileLoader
has a redundant way of signaling OOB changes to its parent object, but only when a file is saved through FileLoader.maybe_save_content()
:
async def maybe_save_content(self, model: dict[str, Any]) -> None:
"""
Save the content of the file.
Parameters:
model (dict): A dictionary with format, type, last_modified, and content of the file.
Raises:
OutOfBandChanges: if the file was modified at a latter time than the model
### Note:
If there is changes on disk, this method will raise an OutOfBandChanges exception.
"""
async with self._lock:
...
if self.last_modified == m["last_modified"]:
...
else:
# file changed on disk, raise an error
self.last_modified = m["last_modified"]
raise OutOfBandChanges
Here, it signals an OOB change having occurred by raising an OutOfBandChanges
exception. The redundant signaling of an OOB change causes DocumentRoom
to similarly provide a redundant handling of an OOB change:
async def _maybe_save_document(self, saving_document: asyncio.Task | None) -> None:
"""
Saves the content of the document to disk.
### Note:
There is a save delay to debounce the save since we could receive a high
amount of changes in a short period of time. This way we can cancel the
previous save.
"""
...
try:
...
await self._file.maybe_save_content(...)
...
except OutOfBandChanges:
... # <= the block here does not call `self._on_outofband_change()`
...
Proposed new implementation
- Use other file metadata from
os.stat()
to detect OOB changes more accurately. Changes in file size should be the deciding factor in whether a file was changed, not changes in mtime. - Replace the redundant signaling of OOB changes with the
FileLoader.observe()
signaling API.
- Use other file metadata from
os.stat()
to detect OOB changes more accurately. Changes in file size should be the deciding factor in whether a file was changed, not changes in mtime.
I agree that file modification time is not optimal to check if the file was saved by Jupyter or externally. But as I mentioned here, I think the best way to improve the situation would be to use the new hash
field of the contents model.
2. Replace the redundant signaling of OOB changes with the
FileLoader.observe()
signaling API.
Let me explain what you call "redundant signaling of OOB". There is a polling at regular intervals to check for file changes (regardless of whether Jupyter is saving), and a check that happens when Jupyter saves a file. The former is needed because we want users to have their document updated if an OOB happens. But since it's done by polling, we could easily miss an OOB change when saving. That's why the latter is also needed.
When you say "the block here does not call self._on_outofband_change()
", if you look closely the code here is the same as here, so it's equivalent. But I agree that code should not be duplicated.
Happy to review PRs for both points.
@davidbrochart Yes, I agree. Thanks for clarifying! 👍
I won't work on a PR for this just yet. I'll be focusing on opening issues to track our work, as requested by Brian & Sylvain.