Can you sync a directory?
fire opened this issue
Wondering if you can sync a directory.
So far, no.
The intent is more to provide a library for the algorithm on binary data than a final tool, though, and I believe it would be relatively simple to create one if you added metadata for multiple files (names, relative locations, index data, file properties, and whatever permissions you want to try to replicate).
Currently looking at how tar implements its metadata for files.
http://golang.org/src/archive/tar/common.go?s=1401:2114#L36
I don't know if it's possible to directly use the tar format, with zero-length files, as the manifest. It seems like it should be possible. Any idea whether that's a good approach?
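Mechanically it does look possible with archive/tar - here is a rough sketch (the output path and helper name are just illustrative), writing every header with Size forced to zero so the archive carries metadata only and no file contents:

```go
package main

import (
	"archive/tar"
	"os"
	"path/filepath"
)

// writeManifestTar writes a "manifest" tar whose entries carry only
// headers: Size is forced to 0, so no file bodies are stored.
func writeManifestTar(root, out string) error {
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()

	tw := tar.NewWriter(f)
	defer tw.Close()

	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = path // FileInfoHeader only records the base name
		hdr.Size = 0    // zero-length entry: metadata only, no body
		return tw.WriteHeader(hdr)
	})
}

func main() {
	if err := writeManifestTar(os.Args[1], "manifest.tar"); err != nil {
		panic(err)
	}
}
```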
I'd personally avoid a binary format initially and go for JSON for this - it would make it easier to extend and update while taking advantage of the marshalling support in golang (just using common-sense principles based on REST). It would also allow easier debugging, and it could always be served gzip-encoded from almost any proper web server.
The properties of the block mechanisms in r/zsync (and therefore gosync) are such that you really aren't going to benefit from it if you're moving a lot of small files (compared to your block size), so in that sort of case you would probably be better off compressing them into an archive and sending that anyway. Given that, I'm less inclined to be initially too worried about the extra space that JSON would take, and you could always add a binary format later on.
Taking inspiration from an existing tried and tested source like archive formats seems like a good approach though, rather than reinventing the wheel. I'm not sure, for example, how the userid/groupid of a file is reconciled across two different systems.
It appears that converting a Go struct to a JSON file is possible. All the elements of the tar Header are convertible to JSON, so we can just use a JSON file of Header structs.
From reading the specification, the tar format is several headers and contents concatenated together.
My plan is to walk through each folder from the root, pass each file's os.FileInfo to FileInfoHeader(), and convert the result to JSON. https://stackoverflow.com/questions/6608873/file-system-scanning-in-golang
From the documentation: Because os.FileInfo's Name method returns only the base name of the file it describes, it may be necessary to modify the Name field of the returned header to provide the full path name of the file.
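A minimal sketch of that plan might look like the following - walk the tree, build a tar.Header for each entry with FileInfoHeader(), fix up the Name field as the documentation suggests, and marshal the collected headers to JSON (this is an illustration of the idea, not the actual program):

```go
package main

import (
	"archive/tar"
	"encoding/json"
	"os"
	"path/filepath"
)

func main() {
	root := os.Args[1]
	var headers []*tar.Header

	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		// FileInfoHeader only records the base name, so store the
		// path relative to the manifest root instead.
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		hdr.Name = rel
		headers = append(headers, hdr)
		return nil
	})
	if err != nil {
		panic(err)
	}

	out, err := json.MarshalIndent(headers, "", "  ")
	if err != nil {
		panic(err)
	}
	os.Stdout.Write(out)
}
```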
I want to code an example that can take a tar archive and output a JSON document.
Referring to your question: tar just stores the username and group of the file. On extraction, it uses the username and group of the current user, or, when run as root, writes the files as the original user.
I've written an example program that walks through a directory and outputs a JSON document of all the tar headers.
This json output is of a real directory. https://gist.github.com/fire/574760be7bd153f0ed5d
So I generate manifests for both the source and target directories, then for each element in the target I check whether there's a matching element in the source. If there is a difference, I run go-sync on that file; otherwise I copy the target file to the final location.
Does this algorithm look reasonable?
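As a rough sketch of that loop - the Manifest type and the gosyncFile/copyFile helpers below are hypothetical placeholders, not part of the go-sync API:

```go
package main

// Manifest is a hypothetical stand-in: relative path -> whole-file checksum.
type Manifest map[string]string

// buildFinal sketches the loop described above: for every entry in the
// target manifest, either patch it with go-sync (when it differs from the
// source) or copy the existing file straight into the final directory.
func buildFinal(source, target Manifest, gosyncFile, copyFile func(path string)) {
	for path, tgtSum := range target {
		if srcSum, ok := source[path]; ok && srcSum != tgtSum {
			gosyncFile(path) // differs from the source: patch it
		} else {
			copyFile(path) // unchanged: copy it to the final location
		}
	}
}

func main() {
	source := Manifest{"a.bin": "111", "b.bin": "222"}
	target := Manifest{"a.bin": "111", "b.bin": "333"}
	buildFinal(source, target,
		func(p string) { println("gosync", p) },
		func(p string) { println("copy", p) })
}
```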
Another addition would be to use the xz format's integrated file integrity testing, or the ability to concatenate the gosync files into one binary.
Off the top of my head, an MD5 or equivalent over the whole file should be pretty fast for an initial comparison. File length and modification date would be other potential indicators.
There are three cases -
It's up to date in the place you want it: great!
It's not there at all: need to copy the whole file (preferably compressed, probably with multiple tcp connections to reduce the effect of latency)
It's there, but doesn't match: (potentially) use gosync to update the contents.
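The whole-file hash for that first check could be as simple as the following sketch (file length and modification time from os.Stat() would be the even cheaper indicators to try first):

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileMD5 computes a whole-file MD5 for the quick "is it already up to
// date?" comparison between a source and a target file.
func fileMD5(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := fileMD5(os.Args[1])
	if err != nil {
		panic(err)
	}
	fmt.Println(sum)
}
```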
At the moment, I would recommend either rsync or zsync for these things - they're tried and tested.
RSync is for pushing (source has access to target, maybe through SSH)
ZSync is for pulling (target has access to a source that is potentially just an http server)
In the longer term, if I spend a lot more time on it, the go-sync command-line tools could do these things too. At the moment, you're better off using tools that are thoroughly tested in production.
Research suggests that CDNs such as MaxCDN use SFTP rather than rsync for pushing to their network. The patch uploader could use https://godoc.org/github.com/pkg/sftp.
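A sketch of such an uploader with github.com/pkg/sftp might look like this - the address, credentials and paths are placeholders, and a real uploader should verify the host key rather than ignoring it:

```go
package main

import (
	"io"
	"os"

	"github.com/pkg/sftp"
	"golang.org/x/crypto/ssh"
)

func main() {
	// Placeholder credentials; a real uploader should check the host key.
	config := &ssh.ClientConfig{
		User:            "deploy",
		Auth:            []ssh.AuthMethod{ssh.Password(os.Getenv("SFTP_PASSWORD"))},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}
	conn, err := ssh.Dial("tcp", "origin.example.com:22", config)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Open an SFTP session over the SSH connection.
	client, err := sftp.NewClient(conn)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	src, err := os.Open("build/app.gosync")
	if err != nil {
		panic(err)
	}
	defer src.Close()

	dst, err := client.Create("/srv/cdn/app.gosync")
	if err != nil {
		panic(err)
	}
	defer dst.Close()

	// Stream the local file to the remote path.
	if _, err := io.Copy(dst, src); err != nil {
		panic(err)
	}
}
```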
It is possible for the launcher to use zsync at first and move to an implementation of go-sync later.
I would prefer a golang executable (with C/C++ libraries) with as few additional binaries as possible.
You have to be careful there - I have some experience with using FTP to populate CDN origins with data files, but I think it would probably be a poor way for them to distribute the files around the world, because FTP generally uses a single TCP connection and can suffer from the bandwidth-delay product. Doing multi-part uploads made an order-of-magnitude difference in time on one project I worked on, and some companies use WAN optimizers to speed up traffic.
This was one of the major reasons that go-sync is written to be able to use multiple simultaneous connections.
Note that I'm not using FTP; I'm using SSH's own protocol, SFTP. However, the performance impact is unknown to me.
My experience with Go libraries and FTP has been quite horrible.
The plan is that clients use HTTPS to access the files.
It's worth having a look at the latest changes (particularly noticeable in patch.go). Most of the changes shouldn't be breaking (or should be quick fixes), but they could make it significantly easier to use the library at a high level.
Can you sync a directory now?
I'm excited for that moment too.
RSync is for pushing (source has access to target, maybe through SSH)
ZSync is for pulling (target has access to a source that is potentially just an http server)
rsync is universal: while the most common use is for backup (push), it's also used in production for synchronising local outdated ISO images with updated remote ones (pull) (e.g., Debian CD/DVD weekly build ISOs).