fortuna / freebase-movies

Overview

This project contains a series of command-line tools for processing the Freebase movie data into a data set for use by the Discovery Engine.

Freebase data is available under the Creative Commons Attribution license. See this page for example HTML you can include if you use their data on your web site.

Note that the TSV format that this tool uses appears to no longer be available.

See https://developers.google.com/freebase/data for the current data formats available.

Feeling lazy?

Many of these files are available from our public S3 bucket, s3://t11e.datasets. You can download a complete changeset here. Freebase recently stopped hosting the tab-separated mini data dumps (e.g. http://download.freebase.com/datadumps/latest/browse/film.tar.bz2). We have an old snapshot of the film.tar.bz2 file in our public S3 bucket.

Prerequisites

  • Python 2.7 or newer
  • An internet connection
  1. Install the required Python modules:

    easy_install elementtree # for facet_to_dimension.py and json_to_tree_dimension.py
    easy_install google-api-python-client # for export_genres.py
  2. Obtain the latest Freebase film data dump and extract it locally

    wget http://download.freebase.com/datadumps/latest/browse/film.tar.bz2
    tar --bzip2 --extract --verbose --file film.tar.bz2
  3. Process the film TSV files into a JSON intermediate form

    time ./parse_tsv.py film > film.jsons
  4. Optionally filter out pornographic movies

    time pv film.jsons | ./jsons_filter.py > filtered.jsons
  5. Then convert the filtered output into a Discovery Engine changeset. If you do not have pv installed, use cat instead.

    time pv filtered.jsons | ./jsons_to_changeset.py | gzip -9 > changeset.xml.gz
  6. Alternatively, run steps 3 to 5 as a single pipeline, using tee to retain copies of the intermediate output

    time ./parse_tsv.py film | tee film.jsons | ./jsons_filter.py | tee filtered.jsons \
    | ./jsons_to_changeset.py | tee changeset.xml | gzip -9 > changeset.xml.gz
  7. To export a keyword dimension to a tree dimension definition

    ./facet_to_dimension.py {keyword_dimension_id} {min_count_filter} | xmllint --format -
  8. To export a tree structure of film genres based on an MQL query of the live Freebase data:

     a. Go to https://code.google.com/apis/console/
     b. Create a project.
     c. Create a new server API key. You will use this below.
     d. Enable the Freebase API for the project.
     e. Retrieve the data using the API:

```sh
time ./export_genres.py API_KEY > genres.json
```
  9. Convert the genre dump to an XML tree dimension for hand editing and inclusion in your dimensions.xml

    cat genres.json | ./json_to_tree_dimension.py | xmllint --format -
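
For a feel of what steps 3 and 4 do, here is a minimal Python 3 sketch of the general technique: turn tab-separated rows into one JSON object per line, then filter those lines by genre. It is not the code in parse_tsv.py or jsons_filter.py. The function names, the column names (name, id, genre), the comma-separated genre cell and the sample rows are all assumptions made up for illustration.

```python
import csv
import io
import json


def tsv_to_json_lines(tsv_text):
    """Yield one JSON string per TSV row, keyed by the header row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Drop empty cells (and any overflow columns) so sparse rows stay compact.
        record = {key: value for key, value in row.items() if key and value}
        yield json.dumps(record, sort_keys=True)


def filter_json_lines(lines, blocked_genres):
    """Yield only lines whose hypothetical 'genre' field has no blocked genre."""
    for line in lines:
        record = json.loads(line)
        # Assumption: genres are stored as one comma-separated string.
        genres = set(record.get("genre", "").split(","))
        if genres.isdisjoint(blocked_genres):
            yield line


if __name__ == "__main__":
    # Made-up sample data; real Freebase dumps have different columns.
    sample = (
        "name\tid\tgenre\n"
        "Example Film A\t/m/example_a\tDrama,Comedy\n"
        "Example Film B\t/m/example_b\tBlocked Genre\n"
        "Example Film C\t/m/example_c\t\n"
    )
    for line in filter_json_lines(tsv_to_json_lines(sample), {"Blocked Genre"}):
        print(line)
```

Because both functions are generators, rows stream through one at a time, which mirrors how the shell pipeline above passes data between scripts without loading a whole dump into memory.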

About

License: BSD 3-Clause "New" or "Revised" License