EleutherAI / the-pile

Metadata `file_name` in the GitHub part of The Pile is a bit off

thomwolf opened this issue · comments

Hi,

Apologies if this is not the right place to note this, but after downloading and exploring the preprocessed GitHub part of The Pile, I've noticed that the `file_name` metadata is sometimes a little off, which can make it harder to filter files by file extension.

For instance here, in the first sample of data_114_time1601108762_default.jsonl downloaded from https://the-eye.eu/public/AI/pile_preliminary_components/, file_name is indicated to be jadx_termux.sh but this appears to be an extract from the changelog of the same repo.

Not sure how important this is for people here but maybe it should be mentioned somewhere?

{
 "text": "## [1.1]\n### Added\n- Added Update() for auto-update\n\n## [1.2]\n### Added\n- extra flag or option `-a` to use __aapt2__ instead of __aapt__.\n- Issue template\n### Changed\n- use getopts for parameters handling\n### Fixed\n- fix update()\n\n## [1.3]\n### Added\n- add aapt2 to bind()\n### Fixed\n- set `LD_LIBRARY_PATH` to avoid libraries access from termux i.e `$PREFIX/lib`\n\n## [1.4]\n### Added\n- patched binaries of aapt2 to skip invalid names while recompiling\n### Fixed\n- fixes #10\n\n## [1.5]\n### Changed\n- stick to alpine v3.10.2 instead of latest one\n\n## [1.6]\n### Added\n- custom path of framework directory\n- new flag `-V` to enable verbose mode for decompiling & recompiling only\n### Changed\n- update apktool to 2.4.1 \n- remove framework app __1.apk__ after each decompiling\n\n## [1.7]\n### Added\n- new option `--no-res` to decompile app except resources.\n- new option `--no-smali` to prevent disassembly of the dex file(s)\n\n## [1.8]\n### Added\n- new option `--no-assets` to prevent decoding of unknown assets files\n- `-z` for zipalign\n- `--frame-path` to specify framework directory\n- `-R` recompile + sign\n\n## [1.9]\n### Added\n- new option `--enable-perm` to enable all permissions automatically in binded or non binded payloads\n\n## [2.0]\n### Added\n- Kali support\n### Changed\n- remove option `-a` & defaults to `aapt2`\n\n## [2.1]\n### Added \n- jadx support\n- new option `--to-java` to decode [dex,apk,zip] to java sources\n- `--deobf` can use along with `--to-java`\n\n## [2.2]\n### Changed\n- now apksigner in termux is from sdk so a key ( PKCS12 ) is added.\n",
 "meta":
   {"repo_name": "Hax4us/Apkmod",
    "stars": "114",
    "repo_language": "Shell",
    "file_name": "jadx_termux.sh",
    "mime_type": "text/plain"}
}

I have the same issue after inspecting the data downloaded from http://eaidata.bmk.sh/data/github_small.jsonl.zst. It seems the value of the 'file_name' key is identical for every file within a given repo.
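One way to check this on a local sample is to count the distinct `file_name` values per repo. This is only a sketch: it assumes each line of the decompressed file is a JSON object with the `text`/`meta` layout shown above, and the helper name `distinct_file_names_per_repo` and the sample lines are made up for illustration.

```python
import json
from collections import defaultdict


def distinct_file_names_per_repo(lines):
    """Map each repo_name to the number of distinct file_name values seen."""
    names = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        meta = json.loads(line)["meta"]
        names[meta["repo_name"]].add(meta["file_name"])
    return {repo: len(seen) for repo, seen in names.items()}


# Hypothetical sample: two records from one repo sharing the same file_name.
sample_lines = [
    json.dumps({"text": "## [1.1]", "meta": {"repo_name": "Hax4us/Apkmod", "file_name": "jadx_termux.sh"}}),
    json.dumps({"text": "#!/bin/sh", "meta": {"repo_name": "Hax4us/Apkmod", "file_name": "jadx_termux.sh"}}),
]
counts = distinct_file_names_per_repo(sample_lines)
print(counts)
```

A repo that contributes several records but shows a count of 1 is a likely sign of the problem. To run this on real data, pass an open file handle over the decompressed `.jsonl` instead of `sample_lines`.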

This is a bug caused by https://github.com/EleutherAI/github-downloader/blob/345e7c4cbb9e0dc8a0615fd995a08bf9d73b3fe6/download_repo_text.py#L201C25-L201C49

The code appends a reference to the same dict on every iteration, so only the name and type of the last file end up stored in meta.
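The aliasing pattern can be reproduced in a few lines. This is a minimal sketch, not the actual code from `download_repo_text.py`: the function names, the `(file_name, mime_type, text)` tuples and the `CHANGELOG.md` file name are hypothetical and only illustrate the behaviour.

```python
def collect_buggy(files, repo_meta):
    """Reuse one meta dict for every file: all records alias the same object."""
    meta = dict(repo_meta)
    records = []
    for file_name, mime_type, text in files:
        meta["file_name"] = file_name
        meta["mime_type"] = mime_type
        records.append({"text": text, "meta": meta})  # same dict appended each time
    return records


def collect_fixed(files, repo_meta):
    """Build a fresh meta dict per file so each record keeps its own values."""
    records = []
    for file_name, mime_type, text in files:
        meta = dict(repo_meta)  # new copy on every iteration
        meta["file_name"] = file_name
        meta["mime_type"] = mime_type
        records.append({"text": text, "meta": meta})
    return records


# Hypothetical inputs modelled on the sample in this issue.
repo_meta = {"repo_name": "Hax4us/Apkmod", "stars": "114", "repo_language": "Shell"}
files = [
    ("CHANGELOG.md", "text/plain", "## [1.1]\n### Added\n"),
    ("jadx_termux.sh", "text/plain", "#!/bin/sh\n"),
]

buggy = collect_buggy(files, repo_meta)
fixed = collect_fixed(files, repo_meta)
print([r["meta"]["file_name"] for r in buggy])  # ['jadx_termux.sh', 'jadx_termux.sh']
print([r["meta"]["file_name"] for r in fixed])  # ['CHANGELOG.md', 'jadx_termux.sh']
```

In the buggy version the changelog text ends up labelled `jadx_termux.sh`, which matches the mismatch reported above. Copying the dict per file (or constructing it inside the loop) avoids the shared reference.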