mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.
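
For context on that split: the metadata component holds ordinary POSIX files whose extended attributes point at the corresponding objects in the data store. Below is a minimal sketch of reading such a pointer, assuming a hypothetical xattr key `user.marfs_objid` (the real key name and layout may differ):

```c
/* Illustration only: MarFS keeps per-file system metadata in xattrs on
 * files in the POSIX metadata component, which point at the bulk data
 * in the data store. The key "user.marfs_objid" is a stand-in, not
 * necessarily the real xattr name. */
#include <stdio.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <metadata-file>\n", argv[0]);
        return 1;
    }

    char objid[1024];
    ssize_t len = getxattr(argv[1], "user.marfs_objid",
                           objid, sizeof(objid) - 1);
    if (len < 0) {
        perror("getxattr");
        return 1;
    }
    objid[len] = '\0';

    /* The object ID says where the file's data lives in the data store. */
    printf("data for %s lives in object: %s\n", argv[1], objid);
    return 0;
}
```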

Reconstruct MC object store files when an MC capacity unit fails

brettkettering opened this issue

When a capacity unit fails, we need to determine which files in which namespaces are affected and reconstruct the missing parts.

Garret finished adding support to the library, and Will has a new version that will rebuild all objects affected by a given component failure (the admin will have to specify which component failed). It now outputs failure statistics from the log-based rebuilds, telling admins where large numbers of degraded objects were found. It is also multi-threaded, so we should (hopefully) see decent performance with objects spread across multiple file systems/servers/JBODs.
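
As a rough illustration of that threading model (not the actual utility's code), the sketch below spins up 16 workers that pull degraded-object records off a shared list and keep a running repair count; `rebuild_object()` and the object names are hypothetical stand-ins:

```c
/* Minimal sketch of a multi-threaded, log-driven rebuild: workers pull
 * degraded-object records from a shared list, "repair" them, and tally
 * statistics. Not the real utility; rebuild_object() is a placeholder. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 16   /* matches the 16-thread runs mentioned below */

static const char *degraded[] = {   /* would be parsed from the GPFS logs */
    "pod0/block3/cap1/obj.0001",
    "pod0/block3/cap1/obj.0002",
    "pod1/block0/cap1/obj.0340",
};
static const int nobjects = sizeof(degraded) / sizeof(degraded[0]);

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_obj = 0;   /* shared work-queue cursor */
static int repaired = 0;   /* running statistics */

static void rebuild_object(const char *objid)
{
    /* placeholder: re-read surviving parts, regenerate the missing one */
    printf("rebuilding %s\n", objid);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        if (next_obj >= nobjects) { pthread_mutex_unlock(&lock); break; }
        const char *objid = degraded[next_obj++];
        pthread_mutex_unlock(&lock);

        rebuild_object(objid);

        pthread_mutex_lock(&lock);
        repaired++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    printf("repaired %d degraded objects\n", repaired);
    return 0;
}
```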

We may need a recursive checksum checker too. Check with the admin folks to see what else might be needed to ensure data integrity and to reconstruct data after a failure.
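
If we do build one, the recursive checker could be as simple as an nftw() walk that checksums every regular file. A hedged sketch with a toy checksum follows; a real tool would compare against stored checksums and queue mismatches for rebuild:

```c
/* Rough sketch of a recursive checksum checker: walk a tree with nftw()
 * and compute a simple checksum for every regular file. The checksum
 * here is a toy; a real tool would verify against stored values. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t sum_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    uint32_t sum = 0;
    int c;
    while ((c = fgetc(f)) != EOF)
        sum = (sum << 1 | sum >> 31) ^ (uint32_t)c;  /* toy rolling sum */
    fclose(f);
    return sum;
}

static int visit(const char *path, const struct stat *sb,
                 int type, struct FTW *ftwbuf)
{
    (void)sb; (void)ftwbuf;
    if (type == FTW_F)
        printf("%08x  %s\n", sum_file(path), path);
    return 0;  /* keep walking */
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <dir>\n", argv[0]);
        return 1;
    }
    return nftw(argv[1], visit, 16, FTW_PHYS) ? 1 : 0;
}
```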

We have a rebuild utility implemented and tested that handles the following situations:

  • If a capacity unit was down during a write, the objects that were written without a full stripe are logged to a file in GPFS. The utility will read these log files and repair the incomplete objects.
  • If a capacity unit goes down completely and all of its data is lost, the utility will scan the remaining components to identify and repair the objects damaged by the failure (see the sketch after this list).
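
To make the second case concrete, the sketch below shows single-part reconstruction with plain XOR parity. This is a simplification of what the multi-component storage actually does (N+E erasure coding), but the recovery idea is the same: any one missing part is a function of the surviving parts.

```c
/* Simplified repair step, assuming N data parts plus one XOR parity part
 * per stripe. The real multi-component code uses N+E erasure coding,
 * which generalizes this: here, any single missing part is the XOR of
 * all the surviving parts. */
#include <stdio.h>
#include <stddef.h>

#define NPARTS 4      /* 3 data parts + 1 parity part */
#define PARTSZ 8      /* bytes per part, toy-sized */

/* XOR all surviving parts together to regenerate the missing one. */
static void rebuild_part(unsigned char parts[NPARTS][PARTSZ], int missing)
{
    for (size_t b = 0; b < PARTSZ; b++) {
        unsigned char x = 0;
        for (int p = 0; p < NPARTS; p++)
            if (p != missing)
                x ^= parts[p][b];
        parts[missing][b] = x;
    }
}

int main(void)
{
    unsigned char parts[NPARTS][PARTSZ] = {
        "datapt1", "datapt2", "datapt3", { 0 }
    };
    /* build the parity part, as a write with a full stripe would */
    rebuild_part(parts, 3);

    /* simulate losing the capacity unit holding part 1, then repair it */
    for (size_t b = 0; b < PARTSZ; b++) parts[1][b] = 0;
    rebuild_part(parts, 1);

    printf("recovered part 1: %s\n", (char *)parts[1]);
    return 0;
}
```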

The rebuild utility can also be used to scan all of the objects in a component and verify their checksums, rebuilding damaged objects as it goes. It is unaware of the MarFS metadata, so it cannot identify which MarFS files are damaged, and I am not sure why we would need to link the damaged objects back to their namespaces.

f8a10b3 adds a work-around for the NFS root-squash problems. (NFS exports with no_root_squash still return permission denied for programs with real uid 0 and a non-zero effective uid.) With this fix the scatter directories should be owned by a storage admin user, and the rebuilder can be invoked with -u storageadmin to de-escalate and run the rebuild as a user that has read permission in the scatter dirs.
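
For reference, the de-escalation behind -u boils down to a getpwnam()/setgid()/setuid() sequence before any scatter-directory access. A minimal sketch, assuming the target user exists locally (the real rebuilder's option handling may differ):

```c
/* Minimal sketch of the de-escalation behind the -u flag: look up the
 * named user and drop group then user privileges before touching the
 * scatter directories, so NFS sees a non-root uid it will not squash. */
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void drop_to_user(const char *name)
{
    struct passwd *pw = getpwnam(name);
    if (!pw) {
        fprintf(stderr, "no such user: %s\n", name);
        exit(1);
    }
    /* drop the group first; setuid() last, or we lose the right to setgid() */
    if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0) {
        perror("failed to drop privileges");
        exit(1);
    }
}

int main(void)
{
    drop_to_user("storageadmin");  /* e.g., the -u storageadmin argument */
    printf("now running as uid %d, safe to read the scatter dirs\n",
           (int)getuid());
    return 0;
}
```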

With this change the rebuilder is functionally complete. We should discuss performance targets and whether they have been met before closing the issue.

For a single node doing a rebuild with 16 threads, we see 1 object per second, which corresponds to approximately 1.2 GB/s read across all zpools (i.e., roughly 1.2 GB of surviving stripe data read per rebuilt object). This is adequate performance for now.