mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.

Multiple Simultaneous Writers

gransom opened this issue · comments

I've been discussing with Jeff how MarFS handles multiple simultaneous writers to the same file. Our belief is that the second writer to touch the metadata would trash the object being written to by the first before writing to its own new object. That seems fine, in principle, but we're concerned about how FUSE (through pipetool) or pftool would behave if their object is trashed while they are writing to it. This could likely leave us in a number of ugly states.
At the very least, Jeff has seen pftool runs that were started against an object with an in-progress write appear to hang, doing nothing.

[Dang it. The UI ate my edits.]

In tmpfs, or even NFS, I can also create corrupt data by having two writers writing to the same file. It might be reasonable to promulgate the following advice:

"do not have two users writing to the same file."

Meanwhile, regarding MarFS, two writers that have the same file open are each writing to different objects, so, technically, they can't corrupt each other's "data". But they can seem to do so by virtue of race conditions on updates to the metadata.

By carefully selecting the order of operations in the debugger (tricky since gdb is crap for stepping two threads while allowing others to proceed), I can indeed get fuse to screw up two dueling closers. The elements that are updated without locks are:

  • truncation of the MD file to the proper size
  • installation of a sequence of system xattrs, including object-ID

It's possible for the final MD to end up with the size of one of the objects, but the object-ID of the other. Conceivably, we could even end up with a mix of xattrs. The crux is in fuse_flush -> marfs_flush, and the resulting call to save_xattrs().
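To make that interleaving concrete, here is a toy replay of one bad schedule. It is only a sketch: the dict stands in for the MD file, and the field names "size" and "objid" and the helper names are made up for illustration, not taken from the MarFS code.

```python
# Deterministic replay of one bad interleaving between two closers.
# The dict stands in for the MD file; "size" and "objid" are hypothetical
# field names, not MarFS's real xattr layout.

def truncate_md(md, writer):
    """Step 1 of a close: set the MD file size to this writer's object size."""
    md["size"] = writer["size"]

def save_objid(md, writer):
    """Step 2 of a close: record this writer's object-ID."""
    md["objid"] = writer["objid"]

def replay(schedule):
    """Apply (step, writer) pairs in order with no locking and return the MD."""
    md = {}
    for step, writer in schedule:
        step(md, writer)
    return md

writer_a = {"objid": "obj-A", "size": 100}
writer_b = {"objid": "obj-B", "size": 2500}

# A truncates, B runs both of its steps, then A installs its object-ID.
final_md = replay([
    (truncate_md, writer_a),
    (truncate_md, writer_b),
    (save_objid, writer_b),
    (save_objid, writer_a),
])
print(final_md)
```

The final MD carries B's size with A's object-ID, which is exactly the mismatch described above.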




FIX
One approach to fixing this would be to make the MD updates to a temp file, then rename it over the original. A downside is that the temp file will be briefly visible to users (though I suppose we could have fuse hide it). We'd also have to be careful to delete these temp files in the event that the rename fails, etc.
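A minimal sketch of the temp-file-then-rename pattern, assuming the temp file lives in the same directory (and so the same file system) as the MD file. The function name and the ".md-tmp-" prefix are hypothetical, and plain bytes stand in for the real MD state; in MarFS the temp file would need its size and xattrs set before the rename.

```python
import os
import tempfile

def replace_md_atomically(md_path, contents):
    """Write `contents` to a temp file beside md_path, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(md_path))
    # mkstemp creates the file with O_EXCL, so two closers never share a temp file.
    fd, tmp_path = tempfile.mkstemp(prefix=".md-tmp-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(contents)
            tmp.flush()
            os.fsync(tmp.fileno())
        # rename() within one file system is atomic: readers see old or new, never a mix.
        os.rename(tmp_path, md_path)
    except BaseException:
        # Do not leave the temp file behind if anything fails before the rename lands.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With two dueling closers, whichever rename lands last wins outright, and the loser's MD is replaced as a unit rather than field by field.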




FOLLOW-UP ISSUE
What I said about two different objects was true in my tests only because the files were opened at different times, and so got different object-IDs through their different "ctime" components. We have a "unq" component for precisely this reason: to assure that two objects that would otherwise have the same ID can be forced to have different IDs. It was unit-tested long ago. Let's make sure it still works.

We have recently addressed a similar issue in pftool, so my previous comment applies only to FUSE.

I believe this issue is broader than what was originally discussed here. Even with pftool performing its tmpfile/rename multi logic, I'm not sure that it completely avoids all issues. To first recap the original problem:

Original Problem (multiple file handles open for write)
This can result in races when updating metadata on the MDFS file. In the process of opening/writing/closing a marfs file, these metadata updates include operations such as truncate, stat, getxattr, setxattr, and (in some cases) actually writing chunk mappings into the metadata file. The gaps between all of these have the potential for race conditions, and it is easy to imagine this leading to unanticipated behavior. For example, suppose two writers, at least one of them writing a packed object, issue simultaneous marfs_close() ops. The file could end up with user.marfs_objid="packed_obj_of_writer1" and user.marfs_post="offset_of_writer2". It would then reference an inappropriate offset within a packed object, potentially exposing data to users who should not have permission to access it. It is also easy to imagine situations in which data objects would be leaked (xattrs overwritten without the old object being trashed).

Related Issue (path updates against open file handle)
A related issue involves races between readers or writers and unlink/rename/create operations. For example, if a client holds a file open for write and another renames over that file before the first issues marfs_close(), then the closing process will smash xattrs on the new destination file. If a file is unlinked and recreated while the unlinked version is open elsewhere, a similar situation occurs. This could result in a similar variety of situations to the original problem (improper data access, lost data objects).

Fixing?
As Jeff mentioned, performing all updates on temporary files is a potential solution, provided those files can never collide. The new implementation of marfs_rename() should prevent any leakage/problems during that final rename. However, I think there may still be edge cases this doesn't cover (overwrite of a file between stat/getxattr calls of a reader), though that doesn't sound nearly as problematic.

There might be a solution which would prevent any misuse of the marfs api. We can have every marfs_open() call, before issuing any other metadata ops, actually acquire a file descriptor for the MDFS file, with appropriate open() flags (via open_md(), or some such function). This helps address the original issue: provided all clients call marfs_open() with 'O_CREAT | O_EXCL', we can guarantee that a new file is being written by only a single client at a time.
Considering the wide variety of race conditions and edge cases involved in multiple simultaneous writers, and the fact that every marfs_open() for write of an existing file is implicitly a trash/overwrite of that file, I think it may be worth having marfs_open(O_WRONLY) always imply an open of the MDFS file with 'O_CREAT | O_EXCL', even if we are secretly calling marfs_unlink(path) beforehand under the hood (which is how I imagine O_TRUNC would have to be implemented). Every successful marfs_open(O_WRONLY) call would then imply the existence of a single writable file handle. I'm not yet sure if this is feasible, as it will depend on how pftool handles multi-files. If it is, the only potential issues would be in marfs_open_at_offset(), which must of course allow multiple writers to the same file. However, I think we can be comfortable in saying that it is the client's responsibility to use that functionality correctly by not having two writers performing simultaneous in-place updates. Pftool's rename mechanism for multis should protect it from any problems with open_at_offset, and FUSE is incapable of making use of that functionality.
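As a rough illustration of the exclusive-create guarantee, the sketch below shows a second create of the same MD file being refused. The helper name open_md_exclusive is hypothetical (the real open_md() may look quite different), and a scratch directory stands in for the MDFS.

```python
import os
import tempfile

def open_md_exclusive(md_path):
    """Create the MD file for writing; fail if any other writer already created it."""
    # O_CREAT | O_EXCL makes "check that it is absent, then create it" a single atomic step.
    return os.open(md_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)

md_path = os.path.join(tempfile.mkdtemp(), "md_file")

first_fd = open_md_exclusive(md_path)       # first writer wins
try:
    open_md_exclusive(md_path)              # second writer is refused
    second_writer_refused = False
except FileExistsError:
    second_writer_refused = True
os.close(first_fd)
print("second writer refused:", second_writer_refused)
```

The second caller gets EEXIST (FileExistsError in Python) instead of a second writable handle, so the loser finds out at open time rather than at close.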

Issuing an MDFS open at the start of each marfs_open() also provides a relatively simple mechanism for avoiding the 'Related Issue' mentioned above. As each open marfs file handle must now be holding an underlying MDFS handle, all metadata updates can be issued against that MDFS file handle rather than against path name. In other words, the implicit truncate()/stat()/getxattr()/setxattr() calls involved in reads/writes all become ftruncate()/fstat()/fgetxattr()/fsetxattr() calls. As the underlying metadata file system (GPFS in our case) associates file descriptors with inodes, rather than path, this approach will allow an open marfs file handle to exclusively operate on the specific file which it opened, even if that file has since been overwritten/renamed/etc. Essentially, this allows marfs file handles to behave more like file handles of fully-featured POSIX filesystems, being tied to a specific metadata inode for their entire existence.
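The descriptor-versus-path behavior can be seen on any local POSIX file system. This is only a sketch with made-up file names: it opens a file, has a different file renamed over the same path, and then compares what the held descriptor and the path each refer to.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "md_file")
other = os.path.join(workdir, "replacement")

# A writer opens the MD file and keeps the descriptor for the life of the handle.
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)

# Meanwhile another client renames a different file over the same path.
with open(other, "wb") as f:
    f.write(b"new file")
os.rename(other, path)

# A descriptor-based update still lands on the inode the writer opened...
os.ftruncate(fd, 4096)
size_via_fd = os.fstat(fd).st_size
# ...while the path now names the replacement, which is left untouched.
size_via_path = os.stat(path).st_size
same_inode = os.fstat(fd).st_ino == os.stat(path).st_ino
os.close(fd)

print(size_via_fd, size_via_path, same_inode)
```

The ftruncate() shows up only through the descriptor (4096 bytes), while the path still reports the replacement's 8 bytes on a different inode. The same would hold for fgetxattr()/fsetxattr(), which are left out here only because not every scratch file system supports user xattrs.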

Unfortunately, there is still more to discuss on this if we intend to allow data writes through FUSE. For example, 'echo "test" > existing_marfs_file' results in FUSE issuing marfs_open(path), marfs_truncate(path), marfs_write(original_file_handle)... which updates the trashed file. Maybe we can discuss in person. I have typed too much.

There is nothing new under the sun. See Gary's comment from issue #89:

"Files are associated with fds if Marfs is written to spec and thus most normal locking functions exist
When operating on a file, the mdfile should be open by the operating process even though you don't read or write into the mdfile
This gets us proper behavior for things like unlink"

... and rename/trashing/multiple-writers (potentially).
I live only to independently arrive at preexisting solutions.

Marfs open should always open the metadata file and keep it open the whole time.
Marfs file handle should be used to manipulate the metadata file.

Basically, the idea is to do all maintenance of xattrs, etc, through the MDFS file-handle, which is kept in the MarFS filehandle, rather than operating on the path. (I would've thought we'd already be doing that. No?)

Then, upon close, POSIX will take care of deciding who wins, and of applying all of the winner's MD changes atomically.