quark-zju / gitrevset

Revset DSL for Git

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Integration with commit evolution

arxanas opened this issue · comments

Hi @quark-zju, this is a super cool project. I would like to integrate it into my project at https://github.com/arxanas/git-branchless, which simulates the workflows at companies like Facebook. I had a few questions I was hoping you could help me with.

Data structures?

What data structures are used to implement the DAG? I found this remark:

See slides/201904-segmented-changelog/segmented-changelog.pdf for pretty graphs about how segments help with ancestry queries.

at https://docs.rs/esl01-dag/0.1.1/esl01_dag/struct.IdDag.html, but I didn't find the associated slides (are they publicly available?). What kind of performance can I expect for various operations?

What kind of correctness guarantees can I expect? If I query the DAG for a node which hasn't yet been observed, what happens? Can I use it in a multi-threaded or multi-process context?

How stable is the DAG API? To what degree can I rely on it?

Performance with reference updates?

The performance for initializing the DAG when running git-revs is quite good on the repository I'm testing with (maybe 30 seconds, compared to minutes when running git commit-graph instead, but I didn't even measure the time because it took so long). But subsequent invocations take two or three seconds at a minimum.

My guess is that it's because crawling all the references here: https://docs.rs/gitdag/0.1.2/src/gitdag/gitdag.rs.html#82. I think you mentioned somewhere in the documentation that it will be slow if there are a lot of references. In the case of git-branchless, we keep track of the commit graph heads ourselves, and we don't care about remote references, so I should be able to significantly speed it up. However, I can't pass in my own GitDag to this library. Should I change the API, and if so, what changes do you recommend?

Commit evolution?

git-branchless implements its own commit evolution feature, not based on the reflog (see https://github.com/arxanas/git-branchless/wiki/Architecture). So I don't want the reflog-based commit evolution implementation here: https://github.com/quark-zju/gitrevset/blob/master/src/mutation.rs#L13. Similarly to the above, if I want to swap out the implementation for this behavior with my own, do I need to change the git-revset API, and if so, what changes do you recommend?

add_heads_and_flush

For this function DagPersistent::add_heads_and_flush: https://docs.rs/esl01-dag/0.2.1/esl01_dag/namedag/struct.NameDag.html#method.add_heads_and_flush, why does it care about the difference between master names and non-master names? git-branchless relies on a main branch, but I don't see why the DAG itself cares about which branches are "main".

Hi, @arxanas! Thanks for the questions. I'm interested in supporting use cases in other git tools.

Here are the answers:

but I didn't find the associated slides (are they publicly available?)

Yes. The slides can be found here: https://github.com/facebookexperimental/eden/blob/master/eden/scm/slides/201904-segmented-changelog/segmented-changelog.pdf

What kind of performance can I expect for various operations?

You can find some benchmark data here: facebook/sapling@e12b6c8

What kind of correctness guarantees can I expect?

Correctness is taken seriously. There are unit tests for sure. Some core algorithms (such as gca, ranges) were fuzz-tested against other known correct algorithms (see https://github.com/facebookexperimental/eden/tree/master/eden/scm/lib/dag/fuzz). The dag crate has completely replaced the old revlog-based commit graph algorithms and been running in production for months.

If I query the DAG for a node which hasn't yet been observed, what happens?

Do you mean detached heads? The current design is to build up dag based on all references. So if you access something that's not referred by git references, it will not be found. In theory it's possible to have some logic to update the DAG on demand. But that's not implemented yet.

Can I use it in a multi-threaded or multi-process context?

As long as it compiles.

How stable is the DAG API? To what degree can I rely on it?

There was a big change, though. That is switching from sync to async Rust. It was done at upstream. If you want sync interface then I might add some sync adapters.

Aside from that, the core APIs (DagAlgorithm) are kind of stable practically speaking, although there was no explicit guarantees.

In the case of git-branchless, we keep track of the commit graph heads ourselves, and we don't care about remote references,

I can take a look exposing APIs to control heads instead of reading git references. It shouldn't be too hard.

Similarly to the above, if I want to swap out the implementation for this behavior with my own, do I need to change the git-revset API

I think it wouldn't be too hard to support customizing revset functions so you can define your own predecessorsand override the builtin one. I can take a look (in two weeks).

You can create separate issues if you want to track the above 2 features more precisely.

why does it care about the difference between master names and non-master names?

For better performance. Ideally ::master is just a single span (ex. 0:1000) instead of multiple spans (ex. 0:500 + 502:1001). So it pre-allocates a large chunk of ids for the master branch. See also https://github.com/facebookexperimental/eden/blob/master/eden/scm/newdoc/dev/internals/SegmentedChangelog.rst#groups

Thank you very much for your detailed response and for the references. This is really fantastic work by you and your team.

Correctness is taken seriously.

Sorry, I wasn't clear about what I meant, but it's great to hear that it's battle-tested. I meant: do the data structures give false positives/false negatives or under-/over-approximations in some cases (like e.g. bloom filters)? But it's clear now from your references that they give exact answers.

There was a big change, though. That is switching from sync to async Rust. It was done at upstream. If you want sync interface then I might add some sync adapters.

Is the async interface already checked in/published to crates.io? I don't see it in the API documentation. By "upstream", do you mean that it was checked into https://github.com/facebookexperimental/eden, but the dependencies haven't been updated yet for this crate? Regardless, I don't think using sync versus async will be a problem on my end.

bloom filters

There are no bloom filters being used. That said, keep in mind that dag only cares about the graph, not commit. dag does not store or handle commit message, date, user, etc. It's this crate that deals with date, user, messages and currently if you specify something like:

date(2020)

Then it iterates though all commits examining the date

Unlike git, the revset is a powerful language so it is possible to short-circuit the search, something like:

first(date(2020))::

Assuming date(2020) is lazy, first(date(2020)) will stop the date iteration once founding the first matching entry. Note that date is not actually lazy in this crate yet. But it can be made so, and in hg it is lazy.

By "upstream", do you mean that it was checked into https://github.com/facebookexperimental/eden

Yes. We'd like to set up publishing crates directly from there. However crates.io does not support 2fac, which is required to officially publish crates. So I was using esl01- as a workaround and that hasn't been updated.

I don't think using sync versus async will be a problem on my end.

That's good to know. I think I can skip the sync adapter interface then.

@martinvonz You might already be familiar with Eden, but if not, then this thread might interest you. The Eden DAG implementation turns out to be pretty easy to integrate with, and it implements a host of useful operations (enough to power the revset engine in this project).

If you don't already have a solution for complex graph operations for https://github.com/martinvonz/jj, you might want to integrate with this. Note that it's GPL-2; I considered it worth it to relicense git-branchless to use it.

Here's the example Git integration: https://github.com/facebookexperimental/eden/blob/e5527715b701a4f72a1d474be546226264289d37/eden/scm/lib/dag/gitdag/src/gitdag.rs. You just need to provide an initial set of leaf nodes and a function to get the parents of a given node.

Thanks, @arxanas! Yes, I'm familiar with Eden and I know Jun/quark-zju from when we both worked on Mercurial :)

I already have support for revsets in jj. I want to support both Git and my other backend there, so I thought it would be hard to use this project there (and I think I only heard about it after I started implementing my own revsets anyway).

It sounds like I should at least be able to use the dag::Dag, so I'll have to look into that some day. It would be nice to get the speed of the segmented changelog. One constraint I have is that I've worked hard to make sure that jj does not require locking, so the append-only-ness of the segmented changelog is concerning (I'm worried that two concurrent processes or threads race to append data to the file). I'm sure there's a solution to that, though.

Also, thanks for the heads up about the license. I'll have to check what Google's policy is.