MACH is a hash-based package for extreme multi-class classification. It supports both sparse and dense datasets. Training is implemented in TensorFlow and supports GPU acceleration. Inference consists of two stages: prediction and merging. In the prediction stage, MACH uses TensorFlow to run each meta-classifier. In the merging stage, MACH uses NumPy to combine the results from all meta-classifiers, and it uses Python's multiprocessing module to parallelize the merge across CPU cores. GPU acceleration for the merging stage is planned for a future release.
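To make the merging stage concrete, here is a minimal NumPy sketch of one common way to combine meta-classifier outputs: each class is scored by averaging the probabilities of the buckets it was hashed into. The function name `merge_meta_predictions`, the array layouts, and the averaging rule are assumptions for illustration, not the package's actual code, and the multiprocessing step is left out.

```python
import numpy as np

def merge_meta_predictions(bucket_probs, class_to_bucket):
    """Combine bucket probabilities from R meta-classifiers into class scores.

    bucket_probs:    float array of shape (R, N, B), the output of each
                     meta-classifier for N samples over B buckets (assumed layout).
    class_to_bucket: int array of shape (R, K), the bucket that each of the
                     K classes maps to in each meta-classifier (assumed layout).
    Returns an (N, K) array of averaged scores.
    """
    num_reps, num_samples, _ = bucket_probs.shape
    num_classes = class_to_bucket.shape[1]
    scores = np.zeros((num_samples, num_classes), dtype=np.float64)
    for r in range(num_reps):
        # Column k of the gathered matrix is the probability of the bucket
        # that class k was hashed into by meta-classifier r.
        scores += bucket_probs[r][:, class_to_bucket[r]]
    return scores / num_reps

# Toy example: 2 meta-classifiers, 1 sample, 2 buckets, 3 classes.
bucket_probs = np.array([
    [[0.9, 0.1]],  # meta-classifier 0
    [[0.2, 0.8]],  # meta-classifier 1
])
class_to_bucket = np.array([
    [0, 0, 1],  # class -> bucket for meta-classifier 0
    [1, 0, 1],  # class -> bucket for meta-classifier 1
])
scores = merge_meta_predictions(bucket_probs, class_to_bucket)
predicted = scores.argmax(axis=1)  # highest-scoring class per sample
```

Because every sample is scored independently, a merge like this could be split into chunks of samples and handed to separate worker processes.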
- Python 3
- TensorFlow
- NumPy
Currently, MACH provides two demos: ODP and Imagenet. The following steps will show users how to download datasets and successfully run MACH on them.
- Download all files from link
- Download the datasets by running `make odp_train.vw.gz` and `make odp_test.vw.gz` in a shell, then decompress the `.gz` files to obtain the `.vw` files.
- Open the `odp` folder and convert the datasets from `vw` format to `tfrecords` format with `python3 save_tfrecords.py vwFileName outputFileName`. For example, to save the original training set as `training.tfrecords`, run `python3 save_tfrecords.py odp_train.vw training.tfrecords`.
- After converting the files to `tfrecords` format, set the `TRAIN_FILE` and `TEST_FILE` fields in `odp_demo.py` to the locations of your ODP datasets.
- To train and predict on the ODP dataset, run `python3 odp_demo.py -b 32 -r 50`. This trains 50 meta-classifiers, each with 32 buckets. Change the parameters to run different experiments.
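The `-b` and `-r` options set how many buckets each meta-classifier predicts over and how many meta-classifiers are trained. As a rough illustration of what that means, the sketch below builds one class-to-bucket table per meta-classifier. It is a sketch under assumptions: the function `build_class_to_bucket` is hypothetical, a seeded pseudo-random assignment stands in for whatever hash functions the package actually uses, and the class count of 1000 is only a placeholder.

```python
import numpy as np

def build_class_to_bucket(num_classes, num_buckets, num_reps, seed=0):
    """Assign every class to one bucket in each of `num_reps` repetitions.

    Returns an int array of shape (num_reps, num_classes) whose entry [r, k]
    is the bucket of class k in meta-classifier r. A seeded random assignment
    is used here as a stand-in for a real hash function (assumption).
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_buckets, size=(num_reps, num_classes))

# Mirrors `-b 32 -r 50`; num_classes=1000 is a placeholder, not the ODP size.
lookup = build_class_to_bucket(num_classes=1000, num_buckets=32, num_reps=50)
```

Each meta-classifier then solves a 32-way problem instead of one over all classes, which is why fewer buckets make each model smaller while more repetitions help tell apart classes that share a bucket.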
- Download all files from link
- Download the datasets by running `make training.txt.gz` and `make testing.txt.gz` in a shell, then decompress the `.gz` files to obtain the `.txt` files.
- Open the `imagenet` folder and convert the datasets from `txt` format to `tfrecords` format with `python3 save_tfrecords.py txtFileName outputFileName`. For example, to save the original training set as `training.tfrecords`, run `python3 save_tfrecords.py training.txt training.tfrecords`. Both the source file and the target file are extremely large, so make sure you have enough disk space.
- After converting the files to `tfrecords` format, set the `TRAIN_FILE` and `TEST_FILE` fields in `imagenet_demo.py` to the locations of your Imagenet datasets.
- To train and predict on the Imagenet dataset, run `python3 imagenet_demo.py -b 512 -r 20`. This trains 20 meta-classifiers, each with 512 buckets. Change the parameters to run different experiments.
- By modifying the source code in the `odp` or `imagenet` folder, users can run MACH on other large-scale datasets.
- The ODP dataset used in the demo is sparse, so all the code in the `odp` folder is designed for sparse datasets.
- Because both training and prediction rely on TensorFlow and the `tfrecords` format, users must first convert their datasets to the `tfrecords` format specified by the `save_to_tfrecords` function in `odp/util.py`. This function reads sparse-format data line by line, stores the indices and values separately for each data entry, and writes the results in `tfrecords` format. Feature indices and labels must start from 0.
- After the conversion finishes, modify `NUM_FEATURES`, `NUM_CLASSES`, `TRAIN_FILE`, and `TEST_FILE` in `odp_demo.py` to match your dataset. To run only training or only prediction, modify `train_odp.py` and `predict_odp.py` in the same way.
- Running MACH is then similar to the steps shown in the Quickstart section.
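The sparse conversion described above keeps indices and values in separate lists for each entry. The sketch below shows that parsing step only, without the TensorFlow writing step. It assumes a hypothetical line layout of `label idx:val idx:val ...`; the layout that `save_to_tfrecords` actually expects is defined in `odp/util.py`, and `parse_sparse_line` is an illustrative name, not a function from the package.

```python
def parse_sparse_line(line):
    """Split one sparse-format line into (label, indices, values).

    Assumed layout (hypothetical): "label idx:val idx:val ...", with
    zero-based labels and feature indices.
    """
    tokens = line.split()
    label = int(tokens[0])
    indices, values = [], []
    for token in tokens[1:]:
        idx, val = token.split(":")
        indices.append(int(idx))
        values.append(float(val))
    # Labels and feature indices must start from 0, so negatives are invalid.
    if label < 0 or any(i < 0 for i in indices):
        raise ValueError("labels and feature indices must be non-negative")
    return label, indices, values

label, indices, values = parse_sparse_line("3 0:1.5 7:2.0")
```

If your data uses one-based indices, subtract 1 from each label and index before writing the records.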
- The Imagenet dataset used in the demo is dense, so all the code in the `imagenet` folder is designed for dense datasets.
- Because both training and prediction rely on TensorFlow and the `tfrecords` format, users must first convert their datasets to the `tfrecords` format specified by the `save_to_tfrecords` function in `imagenet/util.py`. This function reads sparse-format data line by line, creates an empty NumPy array, fills in the values at the corresponding indices, and writes the results in `tfrecords` format. The new file may be larger than the original because of the densifying operation. Feature indices and labels must start from 0.
- After the conversion finishes, modify `NUM_FEATURES`, `NUM_CLASSES`, `TRAIN_FILE`, and `TEST_FILE` in `imagenet_demo.py` to match your dataset. To run only training or only prediction, modify `train_imagenet.py` and `predict_imagenet.py` in the same way.
- Running MACH is then similar to the steps shown in the Quickstart section.
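The densifying step described above can be sketched in a few lines of NumPy. This is an illustration under assumptions: `densify` is a hypothetical helper name, `float32` is an assumed storage type, and the TensorFlow writing step is omitted.

```python
import numpy as np

def densify(indices, values, num_features):
    """Expand one sparse entry into a dense vector of length `num_features`.

    Positions not listed in `indices` stay at zero. The float32 dtype is an
    assumption made for this sketch.
    """
    dense = np.zeros(num_features, dtype=np.float32)
    dense[indices] = values  # indices are zero-based
    return dense

dense = densify(indices=[0, 3], values=[1.5, 2.0], num_features=5)
```

Every entry now stores `num_features` numbers regardless of how many were non-zero, which is why the converted file can be much larger than the sparse original.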