ghuser-io / ghuser.io

:octocat: Better GitHub profiles

Home Page:https://ghuser.io

Count committers and authors differently? // formerly: Phantom commits created when editing on GitHub.com

ocdtrekkie opened this issue · comments

Last week I created the https://github.com/ocdtrekkie/xrf_books repo. It has 13 commits, 10 submitted via the GitHub.com website.

It did add this repo to my ghuser.io profile, but bizarrely, specifies I made only 57% of the commits in the repo, which makes very little sense. Looking closely, ghuser.io seems to believe the repo has 23 commits (which it doesn't), and 13 of them (my actual 13 commits) are mine. So I posit that ghuser.io might be detecting some sort of additional phantom commit for each edit made directly in the GitHub.com web UI.

A known case in which we count too many commits is when people force-push: when crawling daily, if the last commit crawled yesterday has since been replaced, it's hard to know where we left off. I need to check whether editing via GitHub's website replaces or rewrites commits as well.

I'll have a closer look, hopefully this weekend. Thanks for reporting!

I have this problem too.

OK, I see. Sometimes a commit has a different author A and committer B, and we count it as if both A and B made a commit. This leads to a higher overall number of commits, but:

  • it makes sure that both A and B get credit for their work, and
  • I think it doesn't happen often: usually only on large repositories where people cherry-pick, merge, or rebase from one branch or release to another. There it increases the number of commits by 5-10% at most and no one notices (this is just my gut feeling).
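The counting rule described above can be sketched as follows. This is only an illustration under assumed data shapes, not ghuser.io's actual code: the function name `count_commits_both_roles`, the dictionary keys, and the users `alice` and `bob` are all hypothetical.

```python
from collections import Counter


def count_commits_both_roles(commits):
    """Credit each commit to its author and, if different, also to its committer."""
    counts = Counter()
    for commit in commits:
        counts[commit["author"]] += 1
        if commit["committer"] != commit["author"]:
            # Author A and committer B differ: both get one commit.
            counts[commit["committer"]] += 1
    return counts


# Hypothetical history of three commits.
commits = [
    {"author": "alice", "committer": "alice"},
    {"author": "alice", "committer": "bob"},  # e.g. bob cherry-picked alice's commit
    {"author": "bob", "committer": "bob"},
]

counts = count_commits_both_roles(commits)
total = sum(counts.values())  # larger than len(commits) when roles differ
```

With this rule the per-user totals add up to more than the real number of commits whenever a commit has two distinct identities attached to it.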

So I like this "feature", but what I didn't think about is that when you do some work on GitHub's website, the committer is web-flow ( https://github.com/web-flow ) and you are the author. We then count twice as many commits, and it looks like you did only half of the work. This is where the feature becomes a problem.

I'm now preparing special handling for that user so that we don't get these phantom commits. It should be quite easy.
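A minimal sketch of what such special handling could look like, using the numbers from the original report (13 commits, 10 of them made on GitHub.com). The function name `count_commits`, the `ignored_committers` parameter, and the data shape are assumptions made for illustration; the real fix may be implemented quite differently.

```python
from collections import Counter

WEB_FLOW = "web-flow"  # the committer GitHub records for edits made on GitHub.com


def count_commits(commits, ignored_committers=frozenset({WEB_FLOW})):
    """Count commits per user, skipping committers listed in ignored_committers."""
    counts = Counter()
    for commit in commits:
        counts[commit["author"]] += 1
        committer = commit["committer"]
        if committer != commit["author"] and committer not in ignored_committers:
            counts[committer] += 1
    return counts


# 10 commits made in the web UI plus 3 pushed normally, as in the report.
commits = (
    [{"author": "ocdtrekkie", "committer": WEB_FLOW}] * 10
    + [{"author": "ocdtrekkie", "committer": "ocdtrekkie"}] * 3
)

naive = count_commits(commits, ignored_committers=frozenset())
fixed = count_commits(commits)

naive_total = sum(naive.values())  # 13 author credits + 10 web-flow credits
naive_share = round(100 * naive["ocdtrekkie"] / naive_total)
fixed_share = round(100 * fixed["ocdtrekkie"] / sum(fixed.values()))
```

Without the special case this reproduces the reported figures (23 counted commits, 57% attributed to the author); with it, the author gets 100% of 13 commits.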

IMHO, a commit should be attributed solely to its author. A lot of hijinks can happen in the process of committing and merging code, and arguably the use of web-flow indicates that GitHub doesn't treat the committer identity as noteworthy from an authorship standpoint. I certainly don't think that someone who cherry-picks my work has done work of equal weight to my writing it.

Also, is the committer clearly revealed anywhere in the GitHub UI? If the goal is to have a clearly understood data source and a fairly predictable metric calculation, then a system which occasionally doubles the count for 5-10% of the commits, and attributes those extra commits to someone who is not the author and who does not appear in the GitHub UI, is an inconsistent bit of magic that is likely to confuse and confound.

TL;DR true (and this is very useful feedback, see more below) but that's too painful to improve right now and it can cause other regressions, sorry 🙁

Details:

"Also, is the committer clearly revealed anywhere in the GitHub UI?"

Yes, they appear with two avatars; you can see many of them here: https://github.com/brandon-rhodes/uncommitted/commits/master

FYI, the reasons why we crawl commits are:

  • the ratio of commits is used as an approximate way of determining whether you are the maintainer, a gold contributor, a silver contributor, etc., and there are already ideas about using other ways to determine this. The kind of contributor you are affects the number of stars you get, but the calculation is already very generous. In particular, it has been designed by @brillout so that a small number of commits won't significantly affect your score (i.e. ideally people shouldn't be tempted to create 5 more commits with the sole purpose of tweaking their score). There is room for quick improvements here; don't hesitate to comment on #156.
  • we create a link to these commits if you're a small contributor.
  • we want to create nice time range badges and graphs in the future.

For that we have now crawled all the commits of 156000+ repositories and stored the number of commits per user per day, but we haven't stored whether the user was author or committer. (Stupid, right? 😄 Not entirely, as we try to keep the DB small.)
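To illustrate why the stored data cannot simply be filtered down to authors afterwards, here is a small sketch. The storage layout (a mapping keyed by user and day), the function name `aggregate`, the users, and the date are all hypothetical; the point is only that two different histories can collapse to the same stored counts once the role is dropped.

```python
from collections import defaultdict


def aggregate(commits):
    """Collapse commits into counts keyed by (user, day); the role is not kept."""
    per_user_day = defaultdict(int)
    for commit in commits:
        per_user_day[(commit["author"], commit["day"])] += 1
        if commit["committer"] != commit["author"]:
            per_user_day[(commit["committer"], commit["day"])] += 1
    return dict(per_user_day)


DAY = "2018-09-01"  # arbitrary example date

# History 1: a single commit authored by alice and committed by bob.
one_shared_commit = [{"author": "alice", "committer": "bob", "day": DAY}]

# History 2: two separate commits, one fully by alice and one fully by bob.
two_own_commits = [
    {"author": "alice", "committer": "alice", "day": DAY},
    {"author": "bob", "committer": "bob", "day": DAY},
]

stored_1 = aggregate(one_shared_commit)
stored_2 = aggregate(two_own_commits)
same_storage = stored_1 == stored_2  # the two histories are indistinguishable
```

Because both histories produce identical stored counts, author-only numbers cannot be derived from data stored this way, and the commits would have to be crawled again.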

What you're saying makes sense, and it's really useful to get this opinion (thanks!), but if we now want to keep only the authors, we need to re-crawl everything, and with our current API rate limit that will take weeks. It will disturb the daily crawling (the API rate limit is our bottleneck), and we'll have to merge the output of this long crawl with the daily one. I'd rather go through this painful process only if there is at least one other issue we're trying to solve at the same time, or if this imprecision turns out to be very problematic.

Also, I'd need to think about this more, because I know other users who think that any contribution (documentation, user support, marketing, design, code review, etc.) should be visible on ghuser. Depending on how you merge a PR, you can end up being the committer, and since you reviewed that PR, this is a contribution. If you cherry-pick a commit to an older branch (i.e. you backport a feature) and even need to resolve a conflict, you will be the committer, and this is a contribution. Users with this in mind will consider the current mechanism "better".

So here is what I'll do: I'll implement the special web-flow case I mentioned for now, and I'll keep this issue open for the more general question of counting committers.

@AurelienLourot That all sounds quite reasonable, and fixing for web-flow will definitely remove the most visible case of confoundment. :)

Do note that when I express my "IMHOs", there's a big capital letter on the H part. I'm not expecting you to reinvent the wheel based on one person's opinion.

The web-flow phantom commits are now gone :)

xrf_books - XRF Library Module
this repo has 29 commits
ocdtrekkie wrote 29 commits (100% of all 29 commits)

(https://github.com/ocdtrekkie/xrf_books has 30 commits right now, but when it got crawled the last commit hadn't been pushed yet.)

Keeping this issue open for the more general issue of counting committers.