Some contributors appear several times under a different name

Question

vhoulbreque opened this issue 6 years ago

Vincent Houlbrèque commented 6 years ago

I tried this project on https://github.com/vinzeebreak/ironcar

What I did:

git-of-theseus-analyze ironcar
git-of-theseus-stack-plot authors.json

And I get this:

However, several of these authors are the same person (and each of them appears under only one name in GitHub's list of commits):

Houlbreque, Vincent Houlbrèque, Vinzeebreak and vinzeebreak
Hugo Masclet, Hugoo, Masclet Hugo

Shouldn't they appear under the same name?

andilar · Answer 1 · Thu Nov 29 2018 19:00:10 GMT+0800 (China Standard Time)

I have the same issue. I tried working with the .mailmap file, but there is no difference.

Erik Bernhardsson · Answer 2 · Thu Nov 29 2018 21:29:41 GMT+0800 (China Standard Time)

weird, i thought .mailmap would do the trick

feel free to investigate

andilar · Answer 3 · Thu Nov 29 2018 22:54:19 GMT+0800 (China Standard Time)

OK, thanks. What I found out is that if you have just one entry in your .mailmap, it is recognized. Also, my output from git shortlog -sne comes out correctly with a full-blown .mailmap.

Erik Bernhardsson · Answer 4 · Thu Nov 29 2018 23:51:27 GMT+0800 (China Standard Time)

weird, maybe gitpython doesn't parse .mailmap?

Thomas Vestergaard Trolle · Answer 5 · Thu Jun 06 2019 15:42:58 GMT+0800 (China Standard Time)

No, they don't: gitpython-developers/GitPython#764
But they also propose a solution...
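For anyone who wants to experiment while GitPython lacks this, here is a minimal sketch of a .mailmap parser in plain Python. It is not the solution proposed in the linked GitPython issue and not part of git-of-theseus; the function names `parse_mailmap` and `resolve` are made up for illustration. It only covers the four documented line forms and whole-line `#` comments, and it assumes case-insensitive matching as git does.

```python
import re

# One "Name <email>" group; the name part may be empty.
_IDENT = re.compile(r"\s*([^<>]*?)\s*<([^<>]*)>")


def parse_mailmap(text):
    """Map (commit_name or None, commit_email) -> (proper_name, proper_email).

    Hypothetical helper: keys are lower-cased, missing parts are None.
    """
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        idents = [(name or None, email) for name, email in _IDENT.findall(line)]
        if len(idents) == 1:
            # Form: Proper Name <commit@email>
            proper_name, commit_email = idents[0]
            table[(None, commit_email.lower())] = (proper_name, None)
        elif len(idents) == 2:
            # Forms with a proper identity followed by a commit identity.
            (proper_name, proper_email), (commit_name, commit_email) = idents
            key_name = commit_name.lower() if commit_name else None
            table[(key_name, commit_email.lower())] = (proper_name, proper_email)
    return table


def resolve(table, name, email):
    """Return the canonical (name, email) for a commit author."""
    key_email = email.lower()
    # An entry that also names the commit author wins over an email-only entry.
    hit = table.get((name.lower(), key_email)) or table.get((None, key_email))
    if hit is None:
        return name, email
    proper_name, proper_email = hit
    return proper_name or name, proper_email or email
```

The email addresses in a call such as `resolve(table, "vinzeebreak", "vinzee@example.com")` would have to come from the real commit history; the ones used here are placeholders.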

Erik Bernhardsson · Answer 6 · Thu Jun 06 2019 22:25:42 GMT+0800 (China Standard Time)

feel free to commit a fix for this!

Martin Irigaray · Answer 7 · Tue Dec 31 2019 22:04:27 GMT+0800 (China Standard Time)

Does this problem persist? Is there a solution?
I didn't understand whether the .mailmap must be added to the git repo or can be used at the plot generation step.

Erik Bernhardsson · Answer 8 · Thu Jan 02 2020 02:03:24 GMT+0800 (China Standard Time)

Pretty sure the problem still exists, so feel free to try to fix it!

dht · Answer 9 · Sun Mar 27 2022 01:38:47 GMT+0800 (China Standard Time)

Workaround:
Use this JavaScript script to fix the authors.json file:

fix-authors.js

const fs = require("fs");
const authors = JSON.parse(fs.readFileSync("./authors.json", "utf8"));

const labels = authors.labels;

const output = {
  ...authors,
};

// Maps each label found in authors.json to the name it should be merged into.
const mailMap = {
  Houlbreque: "Vincent Houlbr",
  "Hugo Masclet": "Hugo Masclet",
  Hugoo: "Hugo Masclet",
  "Masclet Hugo": "Hugo Masclet",
  "Vincent Houlbr\u00e8que": "Vincent Houlbr",
  Vinzeebreak: "Vincent Houlbr",
  adizout: "adizout",
  mathrb: "mathrb",
  srdadian: "srdadian",
  vinzeebreak: "Vincent Houlbr",
};

// Output row index of each merged name, in order of first appearance.
const memo = {};
let memoIndex = 0;

const map = labels.map((name) => {
  // Labels missing from mailMap keep their own name.
  const toName = mailMap[name] ?? name;

  // Compare with undefined: index 0 is falsy, so `!memo[toName]` would
  // assign the first merged name a new row every time it appears.
  if (memo[toName] === undefined) {
    memo[toName] = memoIndex++;
  }
  return memo[toName];
});

// Sum the rows of all labels that share the same output index.
output.y = authors.y.reduce((rows, item, index) => {
  const toMap = map[index];

  item.forEach((value, i2) => {
    rows[toMap] = rows[toMap] || [];
    rows[toMap][i2] = rows[toMap][i2] || 0;
    rows[toMap][i2] += value;
  });

  return rows;
}, []);

output.labels = Object.keys(memo);

fs.writeFileSync("./authors.out.json", JSON.stringify(output, null, 4));

Then you can plot with:

git-of-theseus-stack-plot authors.out.json --out stack.authors.png

Joseph Hale, MS SE · Answer 10 · Sat Jul 09 2022 04:59:04 GMT+0800 (China Standard Time)

I tried @dht 's script, but ended up with some authors getting mixed up.

I wrote a comparable script in Python that could probably be converted into a PR without too much effort. (I just ran out of time to figure out how to integrate file paths with the CLI and to work through the complexities of the analyze function.)

"""
Aggregates contribution data from the `authors.json` file generated
by the `git-of-theseus` tool using an `authors_map.json` file.

The `authors_map.json` file must have the following format:
{
    "authorA": ["aliasA", "aliasA2", ...],
    "authorB": ["aliasB", "aliasB2", ...],
}
"""
import json


def read_authors_map(path):
    with open(path, "r") as f:
        authors_map = json.load(f)
    return authors_map


def read_authors_json(path):
    with open(path, "r") as aj:
        authors_json = json.load(aj)
    return authors_json


def parse_raw_contributions(authors_json):
    """
    The `authors.json` has the following format
    {
        "y": [
            [<line_count1>, <line_count2>, ...],
            [<line_count1>, <line_count2>, ...],
            ...
        ],
        "ts": ["date1", "date2", ...],
        "labels": ["aliasA", "aliasB", ...]
    }

    Each author's line count over time is stored separately
    from the author list. The association is made by index.

    This function parses the `authors.json` into the following
    format:
    {
        "aliasA": [<line_count1>, <line_count2>, ...],
        "aliasB": [<line_count1>, <line_count2>, ...],
        ...
    }
    """
    raw_contributions = {}
    for idx, alias in enumerate(authors_json["labels"]):
        raw_contributions[alias] = authors_json["y"][idx]
    return raw_contributions


def aggregate_contributions(authors_map, raw_contributions):
    """
    Aggregates the contribution data from each `alias` in the
    `raw_contributions` based on the `authors_map`.

    Returns a dictionary of the following format:
    {
        "authorA": [<line_count1>, <line_count2>, ...],
        "authorB": [<line_count1>, <line_count2>, ...],
    }
    where the values of each `author` are the sum of the contribution
    data for each author's corresponding aliases in the `authors_map`.

    For example, if the author `authorA` has aliases `aliasA` and `aliasA2`,
    and the `raw_contributions` data looks like this:
    {
        "aliasA": [10, 20],
        "aliasA2": [5, 20],
    }
    then the aggregated contribution data will look like this:
    {
        "authorA": [15, 40],
    }
    """
    contributions = {}
    for author, aliases in authors_map.items():
        alias_contributions = [
            raw_contributions[a] for a in aliases if a in raw_contributions
        ]
        if len(alias_contributions) > 0:
            contributions[author] = [
                sum(ac[idx] for ac in alias_contributions)
                for idx in range(len(alias_contributions[0]))
            ]

    return contributions


def format_new_authors_json(authors_map, authors_json, contributions):
    """
    Formats the `contributions` data into the `authors.json` format.
    """
    return {
        "y": [
            contributions[author]
            for author in authors_map.keys()
            if author in contributions
        ],
        "ts": authors_json["ts"],
        "labels": [author for author in authors_map.keys() if author in contributions],
    }


def write_authors_json(path, authors_json):
    with open(path, "w") as f:
        json.dump(authors_json, f)


if __name__ == "__main__":
    authors_map = read_authors_map("authors_map.json")
    authors_json = read_authors_json("authors.json")
    raw_contributions = parse_raw_contributions(authors_json)
    contributions = aggregate_contributions(authors_map, raw_contributions)
    new_authors_json = format_new_authors_json(authors_map, authors_json, contributions)
    write_authors_json("authors.out.json", new_authors_json)
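Since authors getting mixed up or silently dropped is the main risk with this kind of post-processing, a small sanity check can help. The sketch below is an add-on, not part of the script above; the helper names are made up. It assumes the `authors.json` layout described in the docstrings (`y`, `ts`, `labels`) and checks two things: that every label is listed as an alias somewhere in the map, and that the total line count per timestamp is unchanged after merging.

```python
def column_totals(authors_json):
    """Total line count across all labels, one value per timestamp."""
    return [sum(column) for column in zip(*authors_json["y"])]


def unmapped_labels(authors_json, authors_map):
    """Labels that no author in authors_map lists as an alias."""
    known = {alias for aliases in authors_map.values() for alias in aliases}
    return [label for label in authors_json["labels"] if label not in known]


def check_merge(before, after, authors_map):
    """Return (labels that would be dropped, whether totals are preserved)."""
    missing = unmapped_labels(before, authors_map)
    same_totals = column_totals(before) == column_totals(after)
    return missing, same_totals
```

With the map format above, an author's canonical name only contributes its own row if it also appears in its alias list, so a non-empty `missing` list or `same_totals == False` usually points at an incomplete `authors_map.json` or at an alias listed under two authors.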

Erik Bernhardsson · Answer 11 · Sat Jul 09 2022 05:35:24 GMT+0800 (China Standard Time)

I think a mailmap file might resolve it, but I'm not sure

Steven Jeuris · Answer 12 · Thu Jul 14 2022 22:04:39 GMT+0800 (China Standard Time)

@erikbern I tried:

adding a .mailmap file
checking it in (not certain this would be a requirement)
re-running git-of-theseus-analyze (not certain this would be a requirement)

But, the created graphs still don't disambiguate between authors using what is specified in .mailmap. I.e., it doesn't seem to work.

Joseph Hale, MS SE · Answer 13 · Fri Jul 15 2022 00:11:27 GMT+0800 (China Standard Time)

@Whathecode It doesn't look like git-of-theseus currently considers a .mailmap when computing author statistics. I understood erikbern's comment to mean that he would prefer a solution based on parsing a .mailmap over my proposed solution which uses a custom JSON format.
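One way to bridge the two approaches in the meantime would be to derive the custom JSON map from an existing .mailmap. The following is only a sketch under assumptions: `mailmap_to_authors_map` is a hypothetical helper, it matches by name only (because the labels in `authors.json` are author names), and it skips .mailmap lines that carry no proper name.

```python
import re

# Captures the name part in front of each <email>; may be empty.
_NAME = re.compile(r"\s*([^<>]*?)\s*<[^<>]*>")


def mailmap_to_authors_map(mailmap_text):
    """Build {"proper name": ["alias", ...]} from .mailmap text."""
    authors_map = {}
    for line in mailmap_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names = _NAME.findall(line)
        if not names or not names[0]:
            continue  # email-only line: nothing to match a label against
        proper = names[0]
        aliases = authors_map.setdefault(proper, [])
        # The proper name is its own alias so that its own row is kept.
        for alias in [proper] + [n for n in names[1:] if n]:
            if alias not in aliases:
                aliases.append(alias)
    return authors_map
```

The result can be written out with `json.dump` as `authors_map.json`. Because emails are ignored, two different people who commit under the same name would be merged, which real .mailmap handling avoids.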

Erik Bernhardsson · Answer 14 · Fri Jul 15 2022 05:23:35 GMT+0800 (China Standard Time)

I thought .mailmap would maybe work through the git library that git-of-theseus uses

I guess not? Would be nice to support .mailmap files!

Thanks for checking @Whathecode – really appreciate it!

Owen Lamont · Answer 15 · Wed Feb 08 2023 19:46:07 GMT+0800 (China Standard Time)

I also just ran into this. The .mailmap issue is still unresolved at GitPython and apparently that repo is now in maintenance mode and no longer actively maintained.

Not sure if that means the dependency will ultimately need to be swapped out, although I have no idea how big that job would be or what alternatives exist.

Joseph Hale, MS SE · Answer 16 · Thu Feb 09 2023 00:59:08 GMT+0800 (China Standard Time)

@owenlamont The maintainer of GitPython actively responds to PRs, including PRs for new features (I had one merged in a few months ago). If someone contributed .mailmap support to GitPython I'm reasonably confident it would be accepted.

Owen Lamont · Answer 17 · Thu Feb 09 2023 19:46:23 GMT+0800 (China Standard Time)

Good to know, cheers. I kind of got mixed messages from the README as to how much it was still supported. I'll try to have a look at what is involved.