barseghyanartur / elasticmsd

Transfer the Million Song Dataset (MSD) in an Elasticsearch index

ElasticMSD

This project enables you to convert the Million Song Dataset into an Elasticsearch index.

Why?

Elasticsearch is a distributed, RESTful search and analytics engine that supports powerful text searches. Although MSD is primarily an audio-feature dataset, it also contains metadata that is useful for research.

Installation

You need the Python elasticsearch and tables packages. Working in a Python virtual environment is recommended.

Set up your virtualenv:

pip install virtualenv
virtualenv ~/.env/elasticmsd
source ~/.env/elasticmsd/bin/activate

Install dependencies:

git clone https://github.com/deezer/elasticmsd
cd elasticmsd
pip install -r requirements.txt

Install hdf5_getters.py from the tbertinmahieux/MSongsDB repository. You must then run pt2to3 on this file (a program shipped with the tables package), even if you stay on Python 2, because hdf5_getters uses an old tables naming convention:

wget https://raw.githubusercontent.com/tbertinmahieux/MSongsDB/master/PythonSrc/hdf5_getters.py -O hdf5_getters_2.py
pt2to3 hdf5_getters_2.py > hdf5_getters.py
rm hdf5_getters_2.py

Download the MSD summary file (~300 MB):

wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/msd_summary_file.h5 -O msd_summary_file.h5

Or download the full MSD (~200 GB) from OSDC (Open Science Data Cloud):

rsync -avzuP publicdata.opensciencedatacloud.org::ark:/31807/osdc-c1c763e4/ /path/to/local_copy

If needed, you can run a local Elasticsearch server via Docker:

docker run --rm -p 9200:9200 -p 9300:9300 -d --name=local_elasticsearch elasticsearch:2.3

Usage

This command reads the MSD summary file (one large h5 file) and ingests its content into an Elasticsearch index.

Note: If you want to ingest the entire dataset and not just the summary, use the -d argument instead, for example -d /path/to/local/msd

python msd_to_es.py \
        -H localhost \
        -p 9200 \
        -i research_msd \
        -f \
        -m msd_summary_file.h5

Output logs will look like:

2018-03-13 11:01:13,702 Found 1000000 songs in summary file
2018-03-13 11:01:17,037 1000 files read. Bulk ingest.
2018-03-13 11:01:17,037 Last MSD id read: TRMMENV12903CDDA6A
2018-03-13 11:01:22,221 2000 files read. Bulk ingest.
2018-03-13 11:01:22,221 Last MSD id read: TRMWQUX12903CD7496
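The logs above suggest that songs are sent to Elasticsearch in batches of 1000. The sketch below shows one way such batching could be done. It is not taken from msd_to_es.py: the helper names batched and to_bulk_actions, the (track_id, document) input pairs and the "song" document type are all assumptions made for illustration. The action dictionaries follow the format accepted by elasticsearch.helpers.bulk.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def to_bulk_actions(songs, index, doc_type):
    """Wrap (track_id, document) pairs as bulk actions.

    The keys _index, _type, _id and _source are the ones read by
    elasticsearch.helpers.bulk; the input shape is a hypothetical choice.
    """
    return [
        {"_index": index, "_type": doc_type, "_id": track_id, "_source": doc}
        for track_id, doc in songs
    ]


# Hypothetical stand-in for songs read from the summary file.
songs = ((f"TR{n:05d}", {"msd_title": f"Song {n}"}) for n in range(2500))

for batch in batched(songs, 1000):
    actions = to_bulk_actions(batch, "research_msd", "song")
    # With a live server, `actions` could be passed to elasticsearch.helpers.bulk.
    print(len(actions), "actions built. Last MSD id read:", actions[-1]["_id"])
```

Batching keeps memory use bounded and avoids one HTTP request per song, which matters when ingesting a million documents.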

Parameters

python msd_to_es.py -h
usage: msd_to_es.py [-h] [-H ESHOST] [-p ESPORT] [-i ESINDEX] [-t ESTYPE]
                    [-m MSDSUMMARYFILE] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -H ESHOST, --eshost ESHOST
                        Host of elasticsearch.
  -p ESPORT, --esport ESPORT
                        Port of elasticsearch host.
  -i ESINDEX, --esindex ESINDEX
                        Name of index to store to.
  -t ESTYPE, --estype ESTYPE
                        Type of index to store to.
  -m MSDSUMMARYFILE, --msdsummaryfile MSDSUMMARYFILE
                        MSD summary file (one h5 file for 1M songs)
  -d MSDDIRECTORY, --msddirectory MSDDIRECTORY
                        MSD directory structure (one h5 file per song)
  -f, --force           Force writing in existing ES index.
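For readers who want to script around the tool, the options listed above could be declared with argparse roughly as follows. This is a sketch reconstructed from the help text only, not the script's actual source: the function name build_parser and every default value are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options shown in the help output.

    All defaults here are assumed; the real script may use different ones.
    """
    parser = argparse.ArgumentParser(prog="msd_to_es.py")
    parser.add_argument("-H", "--eshost", default="localhost",
                        help="Host of elasticsearch.")
    parser.add_argument("-p", "--esport", type=int, default=9200,
                        help="Port of elasticsearch host.")
    parser.add_argument("-i", "--esindex", default=None,
                        help="Name of index to store to.")
    parser.add_argument("-t", "--estype", default=None,
                        help="Type of index to store to.")
    parser.add_argument("-m", "--msdsummaryfile", default=None,
                        help="MSD summary file (one h5 file for 1M songs)")
    parser.add_argument("-d", "--msddirectory", default=None,
                        help="MSD directory structure (one h5 file per song)")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Force writing in existing ES index.")
    return parser


# Parse the same arguments as the usage example.
args = build_parser().parse_args(
    ["-H", "localhost", "-p", "9200", "-i", "research_msd",
     "-f", "-m", "msd_summary_file.h5"]
)
print(args.eshost, args.esport, args.esindex, args.force)
```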

Document in ES

A document in Elasticsearch will look like this:

{
    "msd_tempo" : 120.299,
    "msd_artist_name" : "Darrell Scott",
    "msd_artist_mbid" : "98063361-cdd8-4a9e-b95c-1f29bff780d6",
    "msd_title" : "Shattered Cross",
    "msd_artist_id" : "ARZKPUC1187B99052C",
    "msd_year" : 2006,
    "msd_duration" : 325.53751,
    "msd_mode" : 1,
    "msd_artist_location" : "London, KY",
    "msd_release" : "Transatlantic Sessions - Series 3: Volume One",
    "msd_key" : 9
}
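Once documents of this shape are indexed, they can be searched on the metadata fields. The sketch below builds a request body that matches an artist name and restricts results to a range of years. The function name build_song_query and its parameters are hypothetical; only the field names msd_artist_name and msd_year come from the document above.

```python
import json


def build_song_query(artist_name, year_from, year_to, size=10):
    """Return a search body: full-text match on artist, filtered by year range."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text match on the artist name.
                "must": [{"match": {"msd_artist_name": artist_name}}],
                # Non-scoring filter on the release year (inclusive bounds).
                "filter": [
                    {"range": {"msd_year": {"gte": year_from, "lte": year_to}}}
                ],
            }
        },
    }


body = build_song_query("Darrell Scott", 2000, 2010)
print(json.dumps(body, indent=2))
```

Assuming the index name from the usage example, this body could be sent to the research_msd/_search endpoint, or passed as the body of a search call in the Python elasticsearch client.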
