bersace / stackexchange-dump-to-postgres

Python scripts to import StackExchange data dump into Postgres DB.

StackOverflow data to Postgres

This is a quick script to move the StackOverflow data from the StackExchange data dump (Sept '14) to a PostgreSQL database.

Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.

Dependencies

Usage

  • Create the database stackoverflow in your database: CREATE DATABASE stackoverflow;
    • You can use a custom database name as well; make sure to pass it explicitly when executing the script later.
  • Move the following files to the folder from which the program is executed: Badges.xml, Comments.xml, PostHistory.xml, PostLinks.xml, Posts.xml, Tags.xml, Users.xml, Votes.xml.
    • In some older dumps, the filename casing differs.
  • Execute in the current folder (in parallel, if desired; see the sketch just after this list):
    • python load_into_pg.py -t Badges
    • python load_into_pg.py -t Posts
    • python load_into_pg.py -t Tags (not present in earliest dumps)
    • python load_into_pg.py -t Users
    • python load_into_pg.py -t Votes
    • python load_into_pg.py -t PostLinks
    • python load_into_pg.py -t PostHistory
    • python load_into_pg.py -t Comments
  • Finally, after all the initial tables have been created:
    • psql stackoverflow < ./sql/final_post.sql
    • If you used a different database name, make sure to use that instead of stackoverflow while executing this step.
  • For some additional indexes and tables, you can also execute the following:
    • psql stackoverflow < ./sql/optional_post.sql
    • Again, remember to use the correct database name here if it is not stackoverflow.
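
If you want to script the parallel loading rather than launch each loader by hand, a small driver along the lines of the sketch below works. Only load_into_pg.py and its -t switch come from this README; the helper itself, the use of concurrent.futures, and the choice of three workers are illustrative assumptions.

    # parallel_load.py -- hypothetical helper, not part of the repository.
    # Runs one load_into_pg.py process per table, a few at a time.
    import subprocess
    import sys
    from concurrent.futures import ThreadPoolExecutor

    TABLES = [
        "Badges", "Posts", "Tags", "Users",
        "Votes", "PostLinks", "PostHistory", "Comments",
    ]

    def load(table):
        # Invoke the loader exactly as in the manual steps above; add any
        # further options you need (custom database name, --with-post-body, ...).
        return subprocess.run(
            [sys.executable, "load_into_pg.py", "-t", table],
            check=True,
        )

    if __name__ == "__main__":
        # Three loaders at a time keeps the database busy without
        # overwhelming a small machine; tune to taste.
        with ThreadPoolExecutor(max_workers=3) as pool:
            list(pool.map(load, TABLES))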

Loading a complete stackexchange project

You can use the script to download a given Stack Exchange project's compressed dump from archive.org and load all of its tables at once, using the -s switch.

You will need the urllib and libarchive modules.
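
For reference, the download-and-extract work those two modules cover looks roughly like the sketch below. This is only an illustration of the idea, not the script's actual code; in particular, the archive.org URL pattern and the call to libarchive-c's extract_file are assumptions.

    # Hypothetical sketch of the download/extract step that the -s switch automates.
    import urllib.request
    import libarchive  # provided by the libarchive-c package

    site = "dba"  # short project name, as passed to -s
    url = f"https://archive.org/download/stackexchange/{site}.stackexchange.com.7z"
    archive_path = f"{site}.stackexchange.com.7z"

    urllib.request.urlretrieve(url, archive_path)  # fetch the compressed dump
    libarchive.extract_file(archive_path)          # unpack the XML files into the current directory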

If you give a schema name using the -n switch, all the tables will be moved to the given schema. This schema will be created by the script.

To load the dba.stackexchange.com project in the dba schema, you would execute: ./load_into_pg.py -s dba -n dba
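
Once that finishes, a quick way to confirm that the tables landed in the expected schema is a plain catalog query. The snippet below is a generic psycopg2 check, not something the script provides, and the connection parameters are assumptions to adapt.

    # Hypothetical check: list the tables created in the target schema.
    import psycopg2

    conn = psycopg2.connect(dbname="dba")  # adjust to your connection settings
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = %s ORDER BY table_name",
            ("dba",),
        )
        for (name,) in cur.fetchall():
            print(name)
    conn.close()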

The final scripts sql/final_post.sql and sql/optional_post.sql do not set the search_path themselves. To run them, first set the search_path to your schema name: SET search_path TO <myschema>;
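
If you would rather drive that step from Python than from an interactive psql session, something along these lines runs the final script with the right search_path. It assumes psycopg2 is available and that the file contains plain SQL (no psql meta-commands) that can run inside a single transaction.

    # Hypothetical helper: run sql/final_post.sql against a schema, e.g. "dba".
    import psycopg2

    schema = "dba"
    conn = psycopg2.connect(dbname="dba")  # adjust connection parameters to your setup
    with conn, conn.cursor() as cur:
        # Make the unqualified table names in the script resolve inside the target schema.
        cur.execute(f"SET search_path TO {schema}")
        with open("sql/final_post.sql") as f:
            cur.execute(f.read())
    conn.close()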

Caveats and TODOs

  • It prepares some indexes and views which may not be necessary for your analysis.
  • The Body field in the Posts table is NOT populated by default. You have to use the --with-post-body argument to include it (for example: python load_into_pg.py -t Posts --with-post-body).
  • The EmailHash field in the Users table is NOT populated.

Sept 2011 data dump

  • Tags.xml is missing from that data dump. Hence, the PostTag and UserTagQA tables will be empty after final_post.sql.
  • The ViewCount in Posts is sometimes an empty value; it is replaced by NULL in those cases.

Acknowledgement

@madtibo made significant contributions by adding jsonb and Foreign Key support.


License: MIT

