Dataset integrity checks
ejhumphrey opened this issue
Write a single mainfile to verify the integrity / consistency of the dataset
- File-wise checksums
- Audio file durations (via pysoundfile)
- Feature shapes
- Compare distributions against known targets
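
As a rough illustration of the duration item above, here is a minimal sketch, not a settled design. It assumes the known target durations are available as a dict keyed by path; `check_durations`, `measure_duration`, and the 10 ms tolerance are hypothetical choices. The only pysoundfile call used is `soundfile.info(...).duration`.

```python
def measure_duration(path):
    """Duration in seconds, read from the file header via pysoundfile."""
    import soundfile as sf  # third-party; imported lazily so the rest runs without it
    return sf.info(str(path)).duration


def check_durations(expected, measure=measure_duration, tolerance=0.01):
    """Return {path: (expected, measured)} for files off by more than `tolerance` seconds.

    `expected` is an assumed mapping of path -> known duration in seconds.
    """
    failures = {}
    for path, want in sorted(expected.items()):
        got = measure(path)
        if abs(got - want) > tolerance:
            failures[path] = (want, got)
    return failures
```

An empty result means every file matched its target. Passing `measure` as an argument keeps the comparison logic testable without real audio on disk.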
haven't verified, but @bmcfee says this should generate the file-wise checksums.
find . -name '*.ogg' -type f -exec md5sum {} + | LC_ALL=C sort -k 2 > checksums.txt
One mild concern I'd have with using this, though: the verification script will probably be written in Python, so there could (possibly) be some inconsistency between the shell and Python in filesystem sort order or MD5 hashing. A little digging should indicate whether this is worth worrying about.
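
For reference, a minimal sketch of how the Python side could build and compare the checksums, assuming .ogg files under a single root directory. `md5_of_file`, `build_manifest`, and `diff_manifests` are hypothetical names. `hashlib.md5` hashes the raw file bytes, so its hex digest matches what `md5sum` prints. Ordering is the part that can differ: Python sorts strings by code point, while shell `sort` follows the current locale. Keying the comparison on relative path rather than on line order sidesteps that.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, pattern="*.ogg"):
    """Map each matching file (POSIX path relative to `root`) to its MD5 digest."""
    root = Path(root)
    # Sort relative paths as plain strings (code-point order) for a stable listing.
    rels = sorted(p.relative_to(root).as_posix() for p in root.rglob(pattern) if p.is_file())
    return {rel: md5_of_file(root / rel) for rel in rels}


def diff_manifests(expected, actual):
    """Return (missing, unexpected, mismatched) relative paths, each sorted."""
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    mismatched = sorted(k for k in set(expected) & set(actual) if expected[k] != actual[k])
    return missing, unexpected, mismatched
```

Three empty lists from `diff_manifests` would mean the files on disk match the stored checksums.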
@bmcfee @ejhumphrey I can hop on that one, but I've never seen a mainfile with a similar purpose. Could you link a project that does something similar?
distribution stuff is probably overkill for the time being; we can improve it later as needed