igrigorik / gharchive.org

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

Home Page: https://www.gharchive.org

PushEvents missing

tja opened this issue

I am pulling data from the GH archive to extract SHA-1 commit hashes from PushEvent records. However, it seems that some PushEvent records are missing.

For example, the last two commits in the PassiHD2004/phoenixts.eu repository are:

  • 44d03b5 — feat: CATPPUCCIN
  • b145639 — fix: stuff

The commits were made within minutes of each other. It is unclear whether the commits were pushed at the same time or in separate pushes. However, the first commit shows up in 2023-08-02-8.json.gz:

{
  "id": "30838278409",
  "type": "PushEvent",
  "actor": {
    "id": 59457929,
    "login": "PassiHD2004",
    "display_login": "PassiHD2004",
    "gravatar_id": "",
    "url": "https://api.github.com/users/PassiHD2004",
    "avatar_url": "https://avatars.githubusercontent.com/u/59457929?"
  },
  "repo": {
    "id": 616647657,
    "name": "PassiHD2004/phoenixts.eu",
    "url": "https://api.github.com/repos/PassiHD2004/phoenixts.eu"
  },
  "payload": {
    "repository_id": 616647657,
    "push_id": 14531916434,
    "size": 1,
    "distinct_size": 1,
    "ref": "refs/heads/main",
    "head": "44d03b57a63c8e0306c8846f8fba130355360de1",
    "before": "5aa78df0c1e68abae8dce23f3746cd1f692cfb89",
    "commits": [
      {
        "sha": "44d03b57a63c8e0306c8846f8fba130355360de1",
        "author": {
          "email": "passihd2004@gmail.com",
          "name": "PassiHD"
        },
        "message": "feat: CATPPUCCIN\n\nSigned-off-by: PassiHD <passihd2004@gmail.com>",
        "distinct": true,
        "url": "https://api.github.com/repos/PassiHD2004/phoenixts.eu/commits/44d03b57a63c8e0306c8846f8fba130355360de1"
      }
    ]
  },
  "public": true,
  "created_at": "2023-08-02T08:55:40Z"
}

Neither the second commit nor any other PushEvent record for the repository appears in any archive file up to 2023-08-19, even though the commit is clearly visible on the GitHub website.

Since the missing commit was made at 10:59 local time (08:59 UTC), just before the top of the hour, could it have "fallen between the cracks" of two archives?

The missing event can be fetched from GitHub directly via https://api.github.com/repos/PassiHD2004/phoenixts.eu/events:

{
    "actor": {
        "avatar_url": "https://avatars.githubusercontent.com/u/59457929?",
        "display_login": "PassiHD2004",
        "gravatar_id": "",
        "id": 59457929,
        "login": "PassiHD2004",
        "url": "https://api.github.com/users/PassiHD2004"
    },
    "created_at": "2023-08-02T08:59:53Z",
    "id": "30838391059",
    "payload": {
        "before": "44d03b57a63c8e0306c8846f8fba130355360de1",
        "commits": [
            {
                "author": {
                    "email": "passihd2004@gmail.com",
                    "name": "PassiHD"
                },
                "distinct": true,
                "message": "fix: stuff\n\nSigned-off-by: PassiHD <passihd2004@gmail.com>",
                "sha": "b1456399949384acf2d38b57f50f18f8006b6006",
                "url": "https://api.github.com/repos/PassiHD2004/phoenixts.eu/commits/b1456399949384acf2d38b57f50f18f8006b6006"
            }
        ],
        "distinct_size": 1,
        "head": "b1456399949384acf2d38b57f50f18f8006b6006",
        "push_id": 14531968662,
        "ref": "refs/heads/main",
        "repository_id": 616647657,
        "size": 1
    },
    "public": true,
    "repo": {
        "id": 616647657,
        "name": "PassiHD2004/phoenixts.eu",
        "url": "https://api.github.com/repos/PassiHD2004/phoenixts.eu"
    },
    "type": "PushEvent"
}
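For anyone who wants to repeat this check, here is a minimal Python sketch of pulling a repository's recent events and picking out the push with a given head SHA. The endpoint is the one quoted above; the function names `fetch_repo_events` and `find_push_by_head` are made up for illustration. Keep in mind that the events API only serves recent events, so this only works for a limited time after the push, and unauthenticated requests are rate limited.

```python
import json
import urllib.request


def fetch_repo_events(repo_name):
    """Fetch the first page of recent public events for a repository.

    Not called here: it needs network access and only returns recent events.
    """
    url = f"https://api.github.com/repos/{repo_name}/events"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_push_by_head(events, head_sha):
    """Return the first PushEvent whose payload.head equals head_sha, or None."""
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        if event.get("payload", {}).get("head") == head_sha:
            return event
    return None
```

Usage would look like `find_push_by_head(fetch_repo_events("PassiHD2004/phoenixts.eu"), "b1456399949384acf2d38b57f50f18f8006b6006")`, which returns `None` once the event has aged out of the API.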

Given that the event was created at 2023-08-02T08:59:53Z, chances are it got lost between two archives.
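To narrow down where such an event should have landed, a small Python sketch can map a `created_at` value to the hourly archive names to inspect. It assumes that events are bucketed by the UTC hour of `created_at`, which matches the first event above (08:55:40Z in 2023-08-02-8.json.gz) but is not confirmed anywhere in this thread. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def archive_name(moment):
    """Build an hourly archive name such as 2023-08-02-8.json.gz (hour is not zero-padded)."""
    return f"{moment:%Y-%m-%d}-{moment.hour}.json.gz"


def candidate_archives(created_at):
    """Return the archive names for the hour before, the hour of, and the hour after an event.

    Assumption: archives are cut on UTC hour boundaries of created_at.
    """
    created = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    hour_start = created.replace(minute=0, second=0)
    return [archive_name(hour_start + timedelta(hours=offset)) for offset in (-1, 0, 1)]


print(candidate_archives("2023-08-02T08:59:53Z"))
# ['2023-08-02-7.json.gz', '2023-08-02-8.json.gz', '2023-08-02-9.json.gz']
```

Checking the neighbouring hours as well covers the case where an event near the boundary was written to the adjacent file.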

What query did you use to extract the SHA commit hashes?

I am not sure I understand the question.

Sorry, I assumed you were using a dataset with SQL-like queries.
I was curious: what did you use to inspect all the *.json.gz files and extract the SHA commit hashes?

Ah, OK. I basically just used jq. For instance, to dump full PushEvents you can do:

curl -sSL https://data.gharchive.org/2023-08-02-8.json.gz | gunzip | jq 'select(.type == "PushEvent")'

Or, to dump all SHA commit hashes you can drill deeper:

curl -sSL https://data.gharchive.org/2023-08-02-8.json.gz | gunzip | jq -r 'select(.type == "PushEvent") | .payload.commits[].sha'

Finally, to dump all commits for a specific repository:

curl -sSL https://data.gharchive.org/2023-12-05-22.json.gz | gunzip | jq 'select(.type == "PushEvent") | select(.repo.name == "yt-dlp/yt-dlp") | .payload.commits[]'
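If jq is not available, roughly the same extraction can be sketched in Python. This assumes the archive is gzipped, newline-delimited JSON with one event per line, which is what the jq commands above rely on. The function name `iter_commit_shas` is made up for this example.

```python
import gzip
import io
import json


def iter_commit_shas(gz_bytes, repo_name=None):
    """Yield commit SHAs from the PushEvents in one gzipped archive.

    gz_bytes: raw bytes of a downloaded *.json.gz file.
    repo_name: optional "owner/repo" filter, like the last jq command.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            if event.get("type") != "PushEvent":
                continue
            if repo_name is not None and event.get("repo", {}).get("name") != repo_name:
                continue
            for commit in event.get("payload", {}).get("commits", []):
                yield commit["sha"]
```

With a file saved by curl, usage would be along the lines of `list(iter_commit_shas(open("2023-08-02-8.json.gz", "rb").read(), "PassiHD2004/phoenixts.eu"))`.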

Unfortunately, we can't and don't guarantee 100% coverage. The events API is bursty, and it's possible that we occasionally miss some events. There has also been downtime on both ends. It's hard to say why this particular set of commits is missing.
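Since coverage is not guaranteed, one way to at least notice missing pushes is to use the `before` and `head` fields shown in the records above: on a given ref, each push normally starts where the previous one ended. The following is only a sketch of that idea, not something GH Archive provides, and `find_push_gaps` is a hypothetical helper. A force push breaks the chain in the same way, so a reported gap is a hint rather than proof, and a missing push with no later archived push on the same ref (as in this issue) cannot be detected this way.

```python
def find_push_gaps(push_events):
    """Report pushes whose payload.before differs from the previous head on the same ref.

    Returns a list of (event_id, expected_before, actual_before) tuples.
    """
    last_head = {}  # (repo name, ref) -> head SHA of the latest push seen so far
    gaps = []
    # ISO 8601 timestamps in UTC ("...Z") sort correctly as plain strings.
    for event in sorted(push_events, key=lambda e: e["created_at"]):
        payload = event["payload"]
        key = (event["repo"]["name"], payload["ref"])
        expected = last_head.get(key)
        if expected is not None and payload["before"] != expected:
            gaps.append((event["id"], expected, payload["before"]))
        last_head[key] = payload["head"]
    return gaps
```

Feeding it the two events quoted in the issue yields no gap, because the second push's `before` equals the first push's `head`.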