erikbern / git-of-theseus

Analyze how a Git repo grows over time

Home Page: https://erikbern.com/2016/12/05/the-half-life-of-code.html

Some contributors appear several times under a different name

vhoulbreque opened this issue

I tried this project on https://github.com/vinzeebreak/ironcar

What I did:

git-of-theseus-analyze ironcar
git-of-theseus-stack-plot authors.json

And I get this:

[image: stack_plot]

But several authors are the same person (and they appear under only one name in GitHub's list of commits):

  • Houlbreque, Vincent Houlbrèque, Vinzeebreak and vinzeebreak
  • Hugo Masclet, Hugoo, Masclet Hugo

Shouldn't they appear under the same name?

I have the same issue. I tried working with a .mailmap file, but it made no difference.

Weird, I thought .mailmap would do the trick.

feel free to investigate

OK, thanks. What I found is that if the .mailmap has just one entry, it is recognized. Also, git shortlog -sne gives the correct output with a full-blown .mailmap.
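
Since git shortlog -sne applies the .mailmap correctly, one possible way around the library is to let git itself report each raw author name next to its mailmap-resolved name. The following is only a sketch under assumptions: it assumes the `git` executable is on the PATH, and `alias_map_from_log` and `read_alias_map` are hypothetical helper names that are not part of git-of-theseus. It relies on the `%an` (raw author name), `%aN` (author name after .mailmap) and `%x09` (tab) placeholders of `git log --format`.

```python
import subprocess


def alias_map_from_log(log_output):
    """Map each raw author name to its .mailmap-resolved name.

    Expects one commit per line in the form "<raw name>\t<resolved name>".
    """
    alias_map = {}
    for line in log_output.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        raw, resolved = line.split("\t", 1)
        # Keep the first mapping seen for each raw name.
        alias_map.setdefault(raw, resolved)
    return alias_map


def read_alias_map(repo_path):
    """Run git in repo_path and build the alias map (assumes git is installed)."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%an%x09%aN"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return alias_map_from_log(out)
```

Calling `read_alias_map("ironcar")` would then give a dictionary from the names that show up as labels in authors.json to the names chosen in .mailmap. Note the simplification: if the same raw name is used with several e-mail addresses that .mailmap resolves differently, this sketch keeps only the first resolution it sees.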

Weird, maybe GitPython doesn't parse .mailmap?

No, they don't: gitpython-developers/GitPython#764
But they also propose a solution...

feel free to commit a fix for this!

Does this problem persist? Is there any solution?
I didn't understand whether the .mailmap must be added to the git repo or can be used at the plot generation step.

Pretty sure the problem still exists, so feel free to try to fix it!

Workaround:
Use this JavaScript script to fix the authors.json file:

fix-authors.js

const fs = require("fs");
const authors = JSON.parse(fs.readFileSync("./authors.json"));

const labels = authors.labels;

const output = {
  ...authors,
};

const mailMap = {
  Houlbreque: "Vincent Houlbr",
  "Hugo Masclet": "Hugo Masclet",
  Hugoo: "Hugo Masclet",
  "Masclet Hugo": "Hugo Masclet",
  "Vincent Houlbr\u00e8que": "Vincent Houlbr",
  Vinzeebreak: "Vincent Houlbr",
  adizout: "adizout",
  mathrb: "mathrb",
  srdadian: "srdadian",
  vinzeebreak: "Vincent Houlbr",
};

let memo = {},
  memoIndex = 0;

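const map = labels.map((name) => {
  // Labels that are missing from mailMap keep their own name.
  const toName = mailMap[name] ?? name;

  // Compare against undefined: index 0 is falsy, so `!memo[toName]`
  // would hand the first author a new index on every later alias.
  if (memo[toName] === undefined) {
    memo[toName] = memoIndex++;
  }
  return memo[toName];
});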

output.y = output.y.reduce((output, item, index) => {
  const toMap = map[index];

  item.forEach((value, i2) => {
    output[toMap] = output[toMap] || [];
    output[toMap][i2] = output[toMap][i2] || 0;
    output[toMap][i2] += value;
  });

  return output;
}, []);

output.labels = Object.keys(memo);

fs.writeFileSync("./authors.out.json", JSON.stringify(output, null, 4));

Then you can plot with:

git-of-theseus-stack-plot authors.out.json --out stack.authors.png

I tried @dht's script, but ended up with some authors getting mixed up.

I wrote a comparable script in Python that could probably be converted into a PR without too much effort (I just ran out of time to figure out how to integrate file paths with the CLI and the complexities of the analyze function).

"""
Aggregates contribution data from the `authors.json` file generated
by the `git-of-theseus` tool using an `authors_map.json` file.

The `authors_map.json` file must have the following format:
{
    "authorA": ["aliasA", "aliasA2", ...],
    "authorB": ["aliasB", "aliasB2", ...]
}
"""
import json


def read_authors_map(path):
    with open(path, "r") as f:
        authors_map = json.load(f)
    return authors_map


def read_authors_json(path):
    with open(path, "r") as aj:
        authors_json = json.load(aj)
    return authors_json


def parse_raw_contributions(authors_json):
    """
    The `authors.json` has the following format
    {
        "y": [
            [<line_count1>, <line_count2>, ...],
            [<line_count1>, <line_count2>, ...],
            ...
        ],
        "ts": ["date1", "date2", ...],
        "labels": ["aliasA", "aliasB", ...]
    }

    Each author's line count over time is stored separately
    from the author list. The association is made by index.

    This function parses the `authors.json` into the following
    format:
    {
        "aliasA": [<line_count1>, <line_count2>, ...],
        "aliasB": [<line_count1>, <line_count2>, ...],
        ...
    }
    """
    raw_contributions = {}
    for idx, alias in enumerate(authors_json["labels"]):
        raw_contributions[alias] = authors_json["y"][idx]
    return raw_contributions


def aggregate_contributions(authors_map, raw_contributions):
    """
    Aggregates the contribution data from each `alias` in the
    `raw_contributions` based on the `authors_map`.

    Returns a dictionary of the following format:
    {
        "authorA": [<line_count1>, <line_count2>, ...],
        "authorB": [<line_count1>, <line_count2>, ...],
    }
    where the values of each `author` are the sum of the contribution
    data for each author's corresponding aliases in the `authors_map`.

    For example, if the author `authorA` has aliases `aliasA` and `aliasA2`,
    and the `raw_contributions` data looks like this:
    {
        "aliasA": [10, 20],
        "aliasA2": [5, 20],
    }
    then the aggregated contribution data will look like this:
    {
        "authorA": [15, 40],
    }
    """
    contributions = {}
    for author, aliases in authors_map.items():
        alias_contributions = [
            raw_contributions[a] for a in aliases if a in raw_contributions
        ]
        if len(alias_contributions) > 0:
            contributions[author] = [
                sum(ac[idx] for ac in alias_contributions)
                for idx in range(len(alias_contributions[0]))
            ]

    return contributions


def format_new_authors_json(authors_map, authors_json, contributions):
    """
    Formats the `contributions` data into the `authors.json` format.
    """
    return {
        "y": [
            contributions[author]
            for author in authors_map.keys()
            if author in contributions
        ],
        "ts": authors_json["ts"],
        "labels": [author for author in authors_map.keys() if author in contributions],
    }


def write_authors_json(path, authors_json):
    with open(path, "w") as f:
        json.dump(authors_json, f)


if __name__ == "__main__":
    authors_map = read_authors_map("authors_map.json")
    authors_json = read_authors_json("authors.json")
    raw_contributions = parse_raw_contributions(authors_json)
    contributions = aggregate_contributions(authors_map, raw_contributions)
    new_authors_json = format_new_authors_json(authors_map, authors_json, contributions)
    write_authors_json("authors.out.json", new_authors_json)
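
One thing to be aware of with the script above: any label in `authors.json` that is not listed as an alias in `authors_map.json` is left out of the output. Below is a minimal sketch of a variant that keeps unmapped labels under their own name. `merge_aliases` is a hypothetical helper, not part of the script or of git-of-theseus, and it assumes every row in `y` has the same length.

```python
def merge_aliases(authors_json, authors_map):
    """Merge alias rows of an authors.json-style dict.

    Labels that no author in authors_map claims are kept under their own name.
    """
    # Invert {"author": ["alias", ...]} into {"alias": "author"}.
    canonical = {
        alias: author
        for author, aliases in authors_map.items()
        for alias in aliases
    }
    merged = {}  # dicts keep insertion order, so output follows first appearance
    for label, row in zip(authors_json["labels"], authors_json["y"]):
        name = canonical.get(label, label)
        if name in merged:
            # Element-wise sum of the line counts at each timestamp.
            merged[name] = [a + b for a, b in zip(merged[name], row)]
        else:
            merged[name] = list(row)
    return {
        "y": list(merged.values()),
        "ts": authors_json["ts"],
        "labels": list(merged.keys()),
    }


example = {
    "y": [[10, 20], [5, 20], [1, 2]],
    "ts": ["date1", "date2"],
    "labels": ["aliasA", "aliasA2", "other"],
}
result = merge_aliases(example, {"authorA": ["aliasA", "aliasA2"]})
print(result["labels"])  # ['authorA', 'other']
print(result["y"])       # [[15, 40], [1, 2]]
```

The example reuses the numbers from the `aggregate_contributions` docstring and adds one unmapped label, `other`, to show that it survives the merge.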

I think a mailmap file might resolve it, but I'm not sure.

@erikbern I tried:

  • adding a .mailmap file
  • checking it in (not certain this would be a requirement)
  • re-running git-of-theseus-analyze (not certain this would be a requirement)

But, the created graphs still don't disambiguate between authors using what is specified in .mailmap. I.e., it doesn't seem to work.

@Whathecode It doesn't look like git-of-theseus currently considers a .mailmap when computing author statistics. I understood erikbern's comment to mean that he would prefer a solution based on parsing a .mailmap over my proposed solution which uses a custom JSON format.
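
For anyone who wants to explore the .mailmap route, here is a rough sketch of what parsing the file directly could look like. It is a simplified approximation of git's matching rules, not git's or GitPython's actual implementation, and `parse_mailmap` and `resolve_name` are hypothetical names. It would also have to run at analysis time, because it needs the commit e-mail address, which `authors.json` does not contain.

```python
import re

# "[Proper Name] <email> [[Commit Name] <commit email>] [# comment]"
_ENTRY = re.compile(r"^([^<]*)<([^>]*)>(?:([^<#]*)<([^>]*)>)?\s*(?:#.*)?$")


def parse_mailmap(text):
    """Return (proper_name, proper_email, commit_name, commit_email) tuples.

    Fields that an entry does not specify are None.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or full-line comment
        m = _ENTRY.match(line)
        if m is None:
            continue  # not a recognizable entry
        name1, email1, name2, email2 = m.groups()
        proper_name = name1.strip() or None
        if email2 is None:
            # "Proper Name <commit@email>": the only address is the one matched.
            entries.append((proper_name, None, None, email1.strip().lower()))
        else:
            entries.append((
                proper_name,
                email1.strip() or None,
                name2.strip() or None,
                email2.strip().lower(),
            ))
    return entries


def resolve_name(entries, name, email):
    """Return the proper name for a commit author, or `name` if nothing applies."""
    email = email.lower()
    fallback = name
    for proper_name, _proper_email, commit_name, commit_email in entries:
        if commit_email != email:
            continue
        if commit_name is None:
            if proper_name is not None:
                fallback = proper_name  # e-mail-only entry
        elif commit_name.lower() == name.lower():
            return proper_name or name  # name + e-mail entry is more specific
    return fallback
```

With entries parsed from the repository's .mailmap, `resolve_name(entries, commit_author_name, commit_author_email)` could be applied to each commit author before the per-author line counts are accumulated.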

I thought .mailmap would maybe work through the git library that git-of-theseus uses.

I guess not? Would be nice to support .mailmap files!

Thanks for checking @Whathecode – really appreciate it!

I also just ran into this. The .mailmap issue is still unresolved in GitPython, and apparently that repo is now in maintenance mode.

Not sure if that means the dependency will ultimately need to be swapped out, although I have no idea how big that job would be or what alternatives exist.

@owenlamont The maintainer of GitPython actively responds to PRs, including PRs for new features (I had one merged in a few months ago). If someone contributed .mailmap support to GitPython I'm reasonably confident it would be accepted.

Good to know, cheers. I kind of got mixed messages from the README as to how much it was still supported. I'll try to have a look at what is involved.