cosmir / openmic-2018

Tools and tutorials for the OpenMIC-2018 dataset.

Dataset integrity checks

ejhumphrey opened this issue

Write a single mainfile to verify the integrity / consistency of the dataset

  • File-wise checksums
  • Audio file durations (via pysoundfile)
  • Feature shapes
  • Compare distributions against known targets
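
A minimal sketch of how a single mainfile could be laid out, assuming each check is a function that takes the dataset root and returns a list of problem strings. The names `check_audio_present`, `run_checks`, and `main` are hypothetical, and the one check shown is only a stand-in; checksums, durations, and feature shapes would each get their own function registered the same way.

```python
import sys
from pathlib import Path


def check_audio_present(root):
    """Stand-in check (hypothetical): some .ogg files exist and none is empty."""
    problems = []
    files = sorted(Path(root).rglob("*.ogg"))
    if not files:
        problems.append("no .ogg files found")
    for path in files:
        if path.stat().st_size == 0:
            problems.append(f"empty file: {path}")
    return problems


# Each entry is a callable: root -> list of problem strings (empty means OK).
# Checksum, duration, and feature-shape checks would be appended here.
CHECKS = [check_audio_present]


def run_checks(root, checks=CHECKS):
    """Run every check and collect its problems, keyed by function name."""
    return {check.__name__: check(root) for check in checks}


def main(argv=None):
    """Print a per-check report; return 0 if everything passed, else 1."""
    args = sys.argv[1:] if argv is None else argv
    root = args[0] if args else "."
    results = run_checks(root)
    for name, problems in results.items():
        print(f"{'FAIL' if problems else 'OK'}: {name}")
        for problem in problems:
            print(f"  - {problem}")
    return 1 if any(results.values()) else 0
```

Calling `sys.exit(main())` under the usual `if __name__ == "__main__":` guard would turn this into a script whose exit code a CI job could act on.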

Haven't verified, but @bmcfee says this should generate the file-wise checksums:

(for i in $(find . -name '*.ogg'); do echo "$i,$(md5sum "$i" | cut -d' ' -f1)"; done) | sort > checksums.txt

One mild concern I'd have with using this, though: the verification script will probably be written in Python, so there could (possibly) be some inconsistency in filesystem sorting or MD5 hashing between the shell and Python. A little digging should indicate whether this is worry-worthy.
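
On the concern above: an MD5 digest depends only on the file's bytes, so `hashlib` and `md5sum` should agree. Ordering is the part that can differ, since `sort` follows the locale's collation while Python's `sorted` compares strings by code point. A rough sketch that sidesteps ordering by comparing dictionaries instead of sorted lines is below. The function names are hypothetical, and the parser assumes each manifest line starts with `path,md5` and that paths contain no commas.

```python
import hashlib
from pathlib import Path, PurePosixPath


def md5_of_file(path, chunk_size=1 << 20):
    """Hex MD5 of a file's bytes, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compute_checksums(root, pattern="*.ogg"):
    """Map each matching file's root-relative POSIX path to its MD5."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in root.rglob(pattern)
    }


def load_manifest(manifest_path):
    """Parse 'path,md5' lines; anything after the hash on a line is ignored."""
    expected = {}
    with open(manifest_path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            path, rest = line.split(",", 1)
            # PurePosixPath drops a leading './' so keys match compute_checksums.
            expected[PurePosixPath(path).as_posix()] = rest.split()[0]
    return expected


def compare(expected, actual):
    """Return sorted lists of (missing, unexpected, mismatched) paths."""
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    mismatched = sorted(
        p for p in set(expected) & set(actual) if expected[p] != actual[p]
    )
    return missing, unexpected, mismatched
```

Something like `compare(load_manifest("checksums.txt"), compute_checksums("."))` returning three empty lists would mean the local copy matches the manifest.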

@bmcfee @ejhumphrey I can hop on that one, but I've never seen a mainfile with a similar purpose. Could you link a project that does something similar?

The distribution comparison is probably overkill for the time being; we can improve on it later as needed.