meilisearch / heed

A fully typed LMDB wrapper with minimum overhead 🐦

Home Page:https://docs.rs/heed

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Env open with `Flags::MdbNoSubDir` flag requires target file to exists

shedar opened this issue · comments

I'm trying to create a DB like "some_dir/custom.mdb". Where "custom.mdb" is a file, not a directory. I use Flags::MdbNoSubDir to achieve it.
The current implementation requires the target file to exist due to the use of canonicalize, even though underlying mdb_env_open works without creating the file.

It works if I create an empty file manually, but it would be great to not touch the DB file explicitly.

I had the same issue, as a new Rust user the error message was super confusing Error: Io(Os { code: 2, kind: NotFound, message: "No such file or directory" }).

Thank you @shedar and @missinglink for the feedback.

Unfortunately, heed tries to make LMDB safe to use, and one limitation of LMDB is that one process must not open the same environment twice. Therefore, heed ensures that the targeted path doesn't point to the same env by canonicalizing it.

I can propose different solutions to fix this issue:

  1. Keep the current behavior but document it on the Flags::MdbNoSubDir. This forces you to create the file. I could also improve the error returned.
  2. Introduce a new EnvOpenOptions::no_sub_directory that will make the open to create the file for you. I'm not too fond of this solution as it is unsafe, according to the LMDB documentation.

    Do not have open an LMDB database twice in the same process at the same time. Not even from a plain open() call - close()ing it breaks flock() advisory locking.

  3. Introduce an unsafe EnvOpenOptions::no_canonicalization or no_uniqueness_check that will make the open method not check if the path is already open but with another path e.g. using symlinks.

What do you think about those solutions? Do you have more solutions to propose?

@Kerollmops

I was thinking about something like this:

  • Try to canonicalize the path as it is now;
  • If there is a "No such file or directory" error and self.flags contains Flags::MdbNoSubDir => Try to canonicalize the path without the last component (without file name)
  • If there’s still an error - return it
  • if Flags::MdbNoSubDir in flags and the second canonicalize is successful (parent directory exists and canonicalized) => try to open the “{canonicalized_dir}/{orignial_file_name}” after acquiring write lock

This way, the lock prevents parallel attempts to create the file (probably with an additional check that the file doesn't exist after acquiring the lock). At the same time, we only do additional file operations within the lock when opening non-existing env, so it shouldn’t be a performance hit for regular flow. And it doesn't require additional options.

The edge case I see is the following scenario:

  • heed checks - file doesn’t exist
  • another process creates a symlink with the same file name before heed opens the file
  • later, the app opens another env using a target file of the symlink created by a different process above.

But I have a hard time coming up with a real-world scenario where this edge-case flow makes sense.

What do you think?

Your solution is perfect for me. I implemented it in #158, but only the great lines. Can you review it and test it? Be careful. It is on the main branch, part of the v0.20.0 version. A lot of breaking changes are coming for the best.

@Kerollmops Thank you for the quick implementation. I've tried the PR branch with my test project - works great.
Also, updating to v0.20.0 was easier than I expected.

BTW, do you have any plans for #41 any time soon? I see it's not in the v0.20.0 milestone for now.

BTW, do you have any plans for #41 any time soon? I see it's not in the v0.20.0 milestone for now.

The amount of work to implement it in a correct and idiomatic way would be too much, but could you propose something on #41 and help me? I also see an issue with the Poly/Database::get method. Should it be changed to return an Iterator, and therefore, should we introduce a new type?

Hi @Kerollmops, I'm new to Rust (but have used LMDB in a few other languages), this doesn't seem right to me 🤔

I believe it's still possible to open the same file twice, once using the NoSubDir option and once without?
Looking at the filesystems created using the two options below:

subdir
└── example.mdb
    ├── data.mdb
    └── lock.mdb
nosubdir
├── example.mdb
└── example.mdb-lock

I could be wrong here, but isn't only the directory ./subdir/example.mdb/ considered by canonicalise and not the actual data path ./subdir/example.mdb/data.mdb? this path could be opened again in the same process using NoSubDir?

The merged patch seems to use a fallback method, running canonicalize_path twice, this seems to be unnecessary as the file path is known to be either example.mdb/data.mdb or example.mdb based on the flags, so only one check is necessary?

My suggestion would be to only ever canonicalize the actual data file and never a directory.

Hey @missinglink,

Thank you very much for this feedback. So what you recommend could be something like this: If the NoSubDir option is set, pop the file name, canonicalize the path and store it with the file name. If the NoSubDir option is missing, then do the same but consider data.mdb as the file name.

This way, we will only canonicalize the path once and always store the environment under the data file itself. We can only open the same environment once with different options.

What do you think about this?

That sounds reasonable, I'm just unclear about the naming convention 'data.mdb'.

I wonder where this is defined and whether the extension is influenced by the user path 🤔

I can have a look through the C source tomorrow and post links.

That sounds reasonable, I'm just unclear about the naming convention 'data.mdb'.
I wonder where this is defined and whether the extension is influenced by the user path 🤔

It looks like the names are static and will always be the same when the NoSubDir option is not specified.

Yeah nice, FWIW checking the paths isn't ever going to be perfect, what you'd really need to do is check that the two paths refer to the same device and inode as per https://github.com/BurntSushi/same-file

It's a convoluted example I admit, but you can open a database as A.db move it in the terminal using mv to B.db and then open it again at the path B.db, which is something the inode comparison would pick up.

This is one of those things that's actually way trickier than it initially seems.