EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.

Home Page: https://epistasislab.github.io/pmlb/

Clean up commit history

JDRomano2 opened this issue · comments

#30 addressed the lack of Git LFS for the large dataset files. It makes sense to remove these files from the commit history as well. The main effect is a smaller repository on clone, but there are beneficial side effects too, such as a commit history that is easier to browse and navigate.

Aside from removing large dataset files from the history, is there anything else we can/should clean up?

Used bfg-repo-cleaner to remove all .gz and .html blobs from the history (the most recent commit is untouched).
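For anyone who wants to double-check a cleanup like this locally, here is a rough sketch that lists every path still reachable in history with one of the purged extensions. It is not part of the cleanup itself: the function names and the default suffix tuple are made up for illustration, and it only relies on `git rev-list --objects --all` printing one `<sha> <path>` pair per line.

```python
import subprocess

# Extensions mentioned in this thread; adjust for your own repository.
PURGED_SUFFIXES = (".gz", ".html")


def purged_paths_in_history(rev_list_output, suffixes=PURGED_SUFFIXES):
    """Return sorted unique paths from `git rev-list --objects` output
    whose names end with one of the given suffixes."""
    hits = set()
    for line in rev_list_output.splitlines():
        # Each line is "<sha> <path>"; commit lines carry only the sha.
        parts = line.split(" ", 1)
        if len(parts) == 2 and parts[1].endswith(suffixes):
            hits.add(parts[1])
    return sorted(hits)


def scan_repo(repo_dir="."):
    """Run git in repo_dir and report matching paths (hypothetical helper)."""
    result = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return purged_paths_in_history(result.stdout)
```

Since the most recent commit was left untouched, any matching files that still exist there will show up in the list too, so a non-empty result is not automatically a problem.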

For example, no .tsv.gz source file is present in the following directory: https://github.com/EpistasisLab/penn-ml-benchmarks/tree/51207e96ce3ccb047908fd0d2532344d77573fc6/datasets/1027_ESL

All users should re-clone the repository to avoid adding 'dirty' files back in when new features are merged into master. For the near future, new pull requests should be inspected to make sure old dataset files or profiling reports haven't been reintroduced (though this should be fairly obvious).
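The pull-request inspection could be partly automated. The sketch below is only one possible approach, not something the project ships: it flags large blobs introduced by the commits on a branch, on the assumption that Git LFS pointer files are tiny while the old dataset files and reports are not. The function names, the `origin/master` default, and the 1 MB threshold are all illustrative assumptions.

```python
import subprocess


def large_blobs(batch_check_output, min_bytes=1_000_000):
    """Parse '<type> <sha> <size> <path>' lines and return (path, size)
    for every blob of at least min_bytes (threshold is an assumption)."""
    hits = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob" and int(parts[2]) >= min_bytes:
            hits.append((parts[3], int(parts[2])))
    return hits


def scan_range(base="origin/master", head="HEAD", repo_dir="."):
    """List large blobs reachable from head but not from base (hypothetical helper)."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", f"{base}..{head}"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    checked = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo_dir, input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return large_blobs(checked)
```

A branch that was started from a pre-cleanup clone would typically drag the old history along with it, so a check like this would likely return a long list, which matches the "fairly obvious" remark above.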