BurntSushi / same-file

Cross platform Rust library for checking whether two file paths are the same file.

Interest check: adding a serialize, deserialize impl for `Handle`

sanmai-NL opened this issue · comments

I need to compare Handle values between program runs.

Constraints

  • I can’t store file paths but do store metadata as contained in a Handle value (perhaps except the file field when it is Some).
  • The program is cross-platform. Handle values need not be comparable across platforms. Handle values do need to be constructed using deserialization, across platforms.
  • The theoretical issue that Windows file handles are only guaranteed unique as long as both are open is not prohibitive for me, assuming they do ‘tend’ to be unique and same-file has coded around this already using the file size metadata.

Proposal

Add a serde feature to enable this functionality.

Plan

After finishing discussion and getting a go-ahead, I’ll file a PR.

No, sorry. The whole point of this crate is to manage the fact that handles must be live in order to be as close to correct as possible. Without that constraint, the design of the crate might be entirely different. Adding serialization not only completely thwarts that problem, but exacerbates the issue.

@BurntSushi: I think there’s a bit of nuance in the point of the crate reading the tagline in README:

A safe and simple cross platform crate to determine whether two files or directories are the same.

That is the actual point of same-file, no? That Windows also imposes some specific constraints is not the sole or, based on your wording there, main reason to develop same-file. Rather, from an outsider perspective, same-file serves to allow file identity comparison with platform differences such as file metadata API abstracted away.

Since my use case is valid, I think, if you remain of the position that, beyond that apparent point, same-file must also exclusively be ‘as close to correct as possible’, that’d mean that I’ll have to fork it and add just the serde feature there, or create a new crate with a completely different design. Even though the code would be 90 % identical. Isn’t that very wasteful?

That Windows also imposes some specific constraints is not the sole or

I'm not sure what Windows has to do with this? Do you think Unix also doesn't reuse inode numbers for example? It's a cross platform issue.

Sorry, but I don't think it's worth directly supporting your use case in this crate. Serialization of the metadata for the express purpose of comparing files across program runs is explicitly encouraging something that has problems:

  1. The obvious problem is that since the file handle is no longer open, the metadata may no longer be accurate.
  2. It makes it possible to compare file meta across platforms/machines, which doesn't make any sense. You could probably fix this by generating some machine specific identifier, but that's another can of worms. We could also document that this is nonsensical to do, but it still does become possible, which is a downside IMO.

Since my use case is valid, I think, if you remain of the position that, beyond that apparent point, same-file must also exclusively be ‘as close to correct as possible’, that’d mean that I’ll have to fork it and add just the serde feature there, or create a new project with a completely different design. Even though the code would be 90 % identical. Isn’t that very wasteful?

People disagree. It happens. It's fine. If you're sold on your method, then yes, you'd need to fork this crate or create something else entirely. My point here is that if I were OK with this serialization feature, then it would in turn imply that the design of the API exposed by this crate could be entirely different. The whole point of the API of this crate is to provide access to handles that guarantee that the file itself is still open, and therefore, provide a means of accurately comparing files for equality. (For example, you can't even reasonably implement Serialize or Deserialize for Handle since Handle doesn't represent something that is pure data; it represents a resource that is open.)

At a high level, yes, the crate is trying to solve the problem of file equality. But the entire design of the API is based on the fact that file handles need to remain open during the equality check. It's an implementation strategy that permeates every corner of this crate. If you reject that implementation strategy (which is what you're doing by using serialization across multiple program runs), then this crate no longer makes sense for your use case.

To be clear, I do think it is reasonable to reject this crate's implementation strategy. You might have a more restricted use case or a higher tolerance for errors. That just means you need to find some other way to solve your problem. I don't think this crate is it.

@BurntSushi:
The specific issues around Windows as related to this discussion: see your source code comment.
What I mean to say by referring to Windows in my comments: be sure that I do acknowledge there’s a nasty difference between Linux and Windows beyond FS API differences, that has had great influence on the design (constraints) of same-file.

Reusing the inode of file f_1 could happen after its deletion. That is different from possible invalidation of a previous f_1 inode when all handles to f_1 are invalidated, even as f_1 hasn’t been touched in any way. The former happens on Linux, the latter only on Windows, I understand from the referenced comment.

You do seem to misunderstand my intention wrt. comparing Handle values between platforms. I wrote in my opening post that I need not compare them. They can be considered inequal by necessity then. 100 % correct, so that meets your aim and there’s no issue.

I’m fine with starting a different crate, but I just hope that we put most effort into developing common OSS crates in the Rust community that serve practical, reasonably close needs, so on a higher level, yeah. Looking at the code I think it’s feasible to implement slightly different modes of operation: the optimally correct one and my practical mode. So ultimately, IMO it’d serve programmer needs better if they could control how much correctness they desire in comparing files in cross-platform code bases. I’d be fine with false negatives in some corner cases: files are considered inequal where they actually are equal. Like you, I’m not fine with false positives: files are considered equal where they aren’t. The feature I propose seems compatible with your design aims.

You do seem to misunderstand my intention wrt. comparing Handle values across platforms. I wrote in my opening post that I need not compare them. They can be considered inequal by necessity then. 100 % correct, so that meets your aim and there’s no issue.

I didn't misunderstand. What I meant was that if you can serialize/deserialize Handle values, then you become able to compare them across machines/platforms. Making those checks return false requires some additional meta data than what is in the crate today. What should that meta data be?

Reusing the inode of file f_1 could happen after its deletion. That is different from possible invalidation of a previous f_1 inode when all handles to f_1 are invalidated, even as f_1 hasn’t been touched in any way. The former happens on Linux, the latter only on Windows, I understand from the referenced comment.

Right, so if you serialize a handle and the corresponding file gets deleted, then when you load that handle back into memory it could actually now represent a different file altogether. This could wreak havoc on the results of comparing deserialized handles because the guarantee that the handle is open is no longer met.

Linux and Windows have the same issue here with respect to serializing metadata. The metadata can become out of date. Windows is indeed a bit more severe.
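
To make the point about stale metadata concrete, here is a minimal Rust sketch using only the standard library. It is Unix-only, since it relies on `std::os::unix::fs::MetadataExt`, and it is not same-file's implementation: `file_id` and `replace_while_open` are hypothetical helper names. It shows that an inode number stays reserved while a handle to the file is open, which is the guarantee that is lost once the identity is written to disk and the handle is closed.

```rust
use std::fs::{self, File};
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Pure-data identity of an open file: (device, inode). Unix-only sketch.
fn file_id(file: &File) -> io::Result<(u64, u64)> {
    let md = file.metadata()?;
    Ok((md.dev(), md.ino()))
}

/// Creates a file at `path`, deletes it while keeping the handle open, then
/// creates a second file at the same path. Returns both identities.
fn replace_while_open(path: &Path) -> io::Result<((u64, u64), (u64, u64))> {
    let old = File::create(path)?;
    let old_id = file_id(&old)?;
    fs::remove_file(path)?;

    // `old` is still open here, so its inode cannot be handed out again yet.
    let new = File::create(path)?;
    let new_id = file_id(&new)?;

    // Only from this point on may the old inode number be reused.
    drop(old);
    fs::remove_file(path)?;
    Ok((old_id, new_id))
}
```

While `old` is open the two identities differ. After it is dropped, a later file may receive the same `(dev, ino)` pair, so a pair stored on disk can end up matching an unrelated file.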

but I just hope that we put most effort in developing common OSS crates in the Rust community that serve practical, reasonably close needs

I'm all for that. But it doesn't have any applicability here. Part of OSS is that people disagree, and when people disagree, you have the freedom to go out and build your own solution. What I'm saying is that I think your use case shouldn't be solved by this crate. It is possible to design a crate that serves both use cases, I agree, but same-file isn't it. Its API is completely coupled to the notion that the file handles must be open during the equality check. I don't think you quite appreciate that yet. Look at the implementation of Handle itself and the methods available to it. They aren't pure-data objects, they're handles. Serializing them doesn't make sense. When you deserialize a handle, how are you going to implement the as_file method, for example? You can't, not without re-opening the file, which in turn requires even more meta data and isn't guaranteed to succeed. There is a gigantic impedance mismatch. So your next stop is to create a new pure-data type that can be deserialized/serialized that is distinct from Handle, and now you've really started to muddy the API.

Your use case is a fundamentally different problem than what this crate is trying to solve.
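
The split between a handle and a pure-data type described above might look roughly like this. It is a Unix-only sketch built on the standard library, not same-file's code: `LiveHandle` and `HandleData` are hypothetical names, and `HandleData` is the only part that could sensibly be serialized.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Hypothetical pure-data snapshot. It can be stored, but it is only a hint
/// once the file it was taken from is no longer open.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct HandleData {
    pub dev: u64,
    pub ino: u64,
}

/// Hypothetical stand-in for a live handle: it owns the open file.
pub struct LiveHandle {
    file: File,
    data: HandleData,
}

impl LiveHandle {
    pub fn from_path<P: AsRef<Path>>(path: P) -> io::Result<LiveHandle> {
        let file = File::open(path)?;
        let md = file.metadata()?;
        let data = HandleData { dev: md.dev(), ino: md.ino() };
        Ok(LiveHandle { file, data })
    }

    /// Always available, because the resource is open.
    pub fn as_file(&self) -> &File {
        &self.file
    }

    /// Detaches the metadata from the open resource.
    pub fn snapshot(&self) -> HandleData {
        self.data
    }
}
```

Nothing equivalent to `as_file` can be written for `HandleData` without reopening a path, and the snapshot does not contain one.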

@BurntSushi:

Comparing deserialized Handle values between platforms: not a requirement

The Handle structs are completely different between platforms (for reader’s reference: Unix, Windows). Deserializing one such value into another platform’s counterpart will not be possible normally. It’s not a problem if a value can only be deserialized successfully on the platform it was serialized on. No comparison between incompatible Handle values can then be done at all by same-file API consumers. The added value: the API consumer’s code doesn’t need to have platform-dependent logic yet it can compile for multiple platforms.

Stability of Handle values over time

Right, so if you serialize a handle and the corresponding file gets deleted, then when you load that handle back into memory it could actually now represent a different file altogether. This could wreak havoc on the results of comparing deserialized handles because the guarantee that the handle is open is no longer met.

Unlike the kinds of applications you seem to have in mind, my application first screens two file handle representations for basic equality; when they are found equal, further checks are done to avoid false positives. Fact is, the exclusive functionality of same-file right now is checking whether two file ‘entries’ are (probably) the same, not e.g. whether the contents of all files the handles reference are equal. You wrote in the crate doc comment for the Windows implementation that you’re checking file size equality too, to increase reliability. I haven’t spotted that logic in the code there, though. AFAICT you’d have to use the nFileSize{High, Low} fields, but I believe you only use nFileIndex{High, Low}.

Another discussion entirely, but doing deeper equality checking based on e.g. modification time and/or contents is a functionality that should also be possible with same-file, from a higher-level viewpoint.

So in conclusion, as it stands, I just cannot envision that picture you’re painting in that last sentence. In what concrete way would havoc ensue then?

Serializing them doesn't make sense. When you deserialize a handle, how are you going to implement the as_file method, for example?

I wrote in the opening post that the file field of Handle is indeed something that may require an API change to the signature of a few methods. You call this muddying the API and you seem to want to preserve the current design of same-file to the exclusion of a crate feature. That’s your call of course, but note that I do appreciate that changing things as I propose for this feature would be a substantial change. But it’s not like the crate is huge and impossible to maintain. I don’t really understand why you reiterate developers can have different views, I don’t think we’re debating that.

I am losing my patience for this conversation. I've stated my position a few times now, and I'm not sure why you are continuing to press me on this. This conversation is becoming frustrating because it feels like we're talking past each other.

If after reading this comment you think I'm confused about your feature request, then I'd like to in turn request that you seek to present a more detailed specification for your request instead of going around in circles.

Comparing deserialized Handle values between platforms: not a requirement

I did not say that comparing handle values between platforms was a requirement. I said that it would become possible. I also included the possibility of comparing handle values across different machines. If I serialized a handle and sent it to you and you were running the same Linux OS as I was, then you'd be able to deserialize it and compare it with handles generated on your own machine. This form of comparison is always invalid. I don't care if this is or isn't a requirement, my point is that it becomes possible given your feature request unless we take extra steps to mitigate against it. If we don't mitigate against it, then you have an API that encourages invalid equality checks. If we do mitigate against it, then it adds implementation complexity.
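
One possible mitigation can be sketched as follows, under the assumption that some stable per-machine identifier exists. The `origin` field is hypothetical, and how to obtain such an identifier reliably is exactly the open question. The `TaggedId` type and its text format are illustrations only and are not part of same-file.

```rust
/// Hypothetical serialized identity, tagged with the machine that produced it.
#[derive(Clone, Debug)]
pub struct TaggedId {
    /// Assumed to be a stable per-machine identifier (an assumption of this
    /// sketch; same-file has no such concept).
    pub origin: String,
    pub dev: u64,
    pub ino: u64,
}

impl TaggedId {
    /// Compact text form: `origin:dev:ino`.
    pub fn to_line(&self) -> String {
        format!("{}:{}:{}", self.origin, self.dev, self.ino)
    }

    /// Parses the text form. Splitting from the right lets `origin` itself
    /// contain ':' characters.
    pub fn from_line(line: &str) -> Option<TaggedId> {
        let mut parts = line.rsplitn(3, ':');
        let ino = parts.next()?.parse().ok()?;
        let dev = parts.next()?.parse().ok()?;
        let origin = parts.next()?.to_string();
        Some(TaggedId { origin, dev, ino })
    }

    /// Records from different origins never match, whatever their numbers.
    pub fn may_match(&self, other: &TaggedId) -> bool {
        self.origin == other.origin && self.dev == other.dev && self.ino == other.ino
    }
}
```

This makes cross-machine comparisons return false, at the price of an extra field, a serialization format to maintain, and a dependency on whatever produces `origin`. That is the added implementation complexity mentioned above.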

So in conclusion, as it stands, I just cannot envision that picture you’re painting in that last sentence. In what concrete way would havoc ensue then?

At no point in this issue have you talked about false positive checks in addition to handle equality. Your initial feature request asked for serialization routines to be added to Handle. In pseudo code:

// Pseudocode: `serialize_to` and `deserialize_from` are hypothetical
// methods that same-file does not provide.
let x = Handle::from_path("foo")?;
x.serialize_to("/tmp/whatever");

// some indeterminate time later, in a different process
let x = Handle::deserialize_from("/tmp/whatever");
// This comparison no longer makes any sense.
// `foo` could be the same file. It could be different,
// and this equality check could result in a false
// positive.
x == Handle::from_path("foo")?

You have suggested that your case will call for additional false positive checks. Do they live inside the equality check for Handle? If not, then the semantics of equality are now different based on whether Handle was created via deserialization or not (and is now something the caller must handle). If you do want to include additional false positive checks, then you now need to add additional data to the serialization of Handle that permits same-file to do the false positive check for you, and there is zero guarantee that it will work.
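
As an illustration of the second option, where the extra data travels with the serialized record, here is a sketch with hypothetical names (`StoredRecord`, `Verdict`, `screen`). The choice of file length and modification time as the extra data is an assumption made for the example, not something specified in this thread.

```rust
/// Hypothetical stored record: identity plus extra data for later checks.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct StoredRecord {
    pub dev: u64,
    pub ino: u64,
    pub len: u64,
    pub mtime_secs: i64,
}

/// Result of comparing a stored record with a freshly taken one.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Identity differs.
    Different,
    /// Identity matches but the extra metadata does not. Treated as
    /// different, which may be a false negative for a file edited in place.
    Changed,
    /// Everything recorded matches. Still only a probable match.
    ProbablySame,
}

/// Screens a stored record against a current one without opening any file.
pub fn screen(stored: &StoredRecord, current: &StoredRecord) -> Verdict {
    if (stored.dev, stored.ino) != (current.dev, current.ino) {
        Verdict::Different
    } else if (stored.len, stored.mtime_secs) != (current.len, current.mtime_secs) {
        Verdict::Changed
    } else {
        Verdict::ProbablySame
    }
}
```

The result is a three-way verdict rather than a `bool`, so it cannot simply back the `==` operator of an existing type. Even `ProbablySame` gives no guarantee, since a new file could reuse the inode and happen to have the same length and timestamp.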

It's not clear what you want though because your feature request isn't precise enough. However, from where I'm standing, I don't like any point in the design space, which is mostly why I jumped straight to rejecting this feature request.

I wrote in the opening post that the file field of Handle is indeed something that may require an API change to the signature of a few methods.

You did? I didn't catch that. I'm still not sure I see it. You mention the file field, but you don't discuss any public API ramifications.

You call this muddying the API and you seem to want to preserve the current design of same-file to the exclusion of a crate feature.

This sort of comment is frustrating to hear, because it feels like you think I'm presenting an argument in bad faith. I will be crystal clear. Here are some things I value (this is not an exhaustive list):

  1. I value cohesive and obvious APIs.
  2. I value theoretical purity.
  3. I value practical utility.

These things are at odds with each other. They must be weighed against each other. Right now, this crate has an incredibly small public API surface: a single Handle type with a few methods and a single free is_same_file function. Because of its size, in my mind, expanding the API has a large marginal cost. For example, adding a single type doubles the number of types in the public API. While you have not made a more concrete proposal, I cannot see how to implement this feature without either removing functionality from the current API (such as exposing &File and &mut File from a Handle) or by adding an additional type (such as HandleData). I'm against the former on its face because that functionality is necessary for effective reuse of resources. I'm not against the latter, but instead, require strong justification for it because it doubles the API surface area. Moreover, exposing a HandleData type that can be compared and serialized changes the assumptions with which this crate's API was designed. I talked about that earlier and would rather not rehash it.

Everything comes with trade offs. Feature requests must be weighed against their cost. In this case, one of its costs is that it makes the API more complex. As I've stated repeatedly by this point, this is not its only cost. As the maintainer of this crate, I need to make a judgment call that cannot be objectively quantified. I've given you my reasons already.

but note that I do appreciate that changing things as I propose for this feature would be a substantial change.

This was not at all clear. If you appreciate this, then it should be much easier for you to accept that creating a different crate is quite reasonable. In general, if you're going to make a feature request against a library that would itself require significant changes, then it should be extremely reasonable for a response that says, "no, it would be better if you went and built your own thing." And then be done with it instead of dragging this out.

But it’s not like the crate is huge and impossible to maintain.

same-file is not the only crate I maintain, and this line of reasoning doesn't scale. By the same token, indeed it isn't that large, which means that creating an alternative should be quite reasonable.

I don’t really understand why you reiterate developers can have different views, I don’t think we’re debating that.

I'm reiterating it because you said things like "Even though the code would be 90 % identical. Isn’t that very wasteful?" What do you want me to say? It's a leading question, but I still tried to answer it honestly. My answer was that it is OK to disagree in the sense that the disagreement might manifest itself in the ecosystem. e.g., "There are two crates for detecting whether two files are the same or not. There is same-file and also sanmai-NL-same-file. Which should I use?" Disagreement is in and of itself a problem because it creates alternatives that might be hard to choose between by someone without domain knowledge, but it is also necessary because people will not always agree on the best way to solve a problem. In other words, no, I don't think creating an alternative would be wasteful because we disagree.

I'm just going to cut straight to the point, because this is going in circles.

I have /no investment at all/ in what assumptions same-file was originally designed around.

I don't either. You're missing my point. My point is that the current API is designed around that assumption (that file handles must be open during equality). If you remove that assumption, then the current API probably no longer makes sense. That in turn suggests a completely different API. That sounds like a phenomenal reason for diverging and building your own solution to the problem. That also sounds like a bad reason to just up and redesign an entire crate that is already at 1.0. This is a completely normal, healthy and reasonable approach to take to advancing the ecosystem.

I don’t agree my proposal is insufficiently detailed though. If you need something from me be assured I will provide it (as long as we are still communicating).

Here is what I need:

  1. A complete specification of the public API that you're proposing. Expressing this as a delta between the current API and your proposed API is OK, so long as it is exhaustive.
  2. Today, the semantics of equality on Handle are that the equality check is performed only while both sides of the equality comparison have an open file handle. The equality check may produce false positives on some platforms although they are generally not expected, but false negatives should never occur. This is stated in the documentation. Does your feature request change these semantics? If so, how?

Technically, I'd consider (2) to be part of (1) since it is part of today's public API. However, it is important enough on its own to call out as a separate discussion point.

(And yes, changing methods that return &File today to Option<&File> is what I'd consider havoc. It forces case analysis on every single caller to support an extremely niche use case that you've outlined. And even then, the caller likely can't do anything reasonable when None is returned. I don't like that API one bit.)
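
The case analysis in question can be sketched like this. `MaybeLive` and `read_all` are hypothetical names used only to show the shape of the changed signature. This is not same-file's API.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Hypothetical handle that may have been created by deserialization, in
/// which case it has no open file behind it.
pub struct MaybeLive {
    file: Option<File>,
}

impl MaybeLive {
    pub fn live(file: File) -> MaybeLive {
        MaybeLive { file: Some(file) }
    }

    pub fn deserialized() -> MaybeLive {
        MaybeLive { file: None }
    }

    /// The changed signature: `Option<&File>` instead of `&File`.
    pub fn as_file(&self) -> Option<&File> {
        self.file.as_ref()
    }
}

/// Every caller now needs a branch for a case it cannot really handle.
pub fn read_all(handle: &MaybeLive) -> io::Result<Vec<u8>> {
    match handle.as_file() {
        Some(mut file) => {
            // `Read` is implemented for `&File`, so a shared reference suffices.
            let mut buf = Vec::new();
            file.read_to_end(&mut buf)?;
            Ok(buf)
        }
        None => Err(io::Error::new(
            io::ErrorKind::Other,
            "handle has no open file",
        )),
    }
}
```

With a plain `&File` return type the `None` arm does not exist, and callers that only ever create handles from paths never pay for it.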