Import Discogs' data dumps into Postgres efficiently.
Uses Hexpat for parsing XML and Postgres' COPY to store the result.
Inspired by discogs-xml2db.
I wanted Discogs' data in a Postgres database and I had some issues with the existing solution:
- it was slow, I let it run for 5 hours and it didn't finish
- SAX events parsing for nested data is too difficult to maintain IMHO
- using INSERT is not efficient, and I didn't want UNLOGGED tables
- I wanted some data it didn't import and didn't want some it did
- performance, I wanted it to be fast on commodity hardware
The projects uses stack so it should be as easy as:
stack install
and adding stack's binary path to your PATH (if you haven't already).
Supposing we downloaded the 20150810 dump:
- discogs_20150810_artists.xml.gz
- discogs_20150810_labels.xml.gz
- discogs_20150810_masters.xml.gz
- discogs_20150810_releases.xml.gz
and we didn't decompress them, then we can run on the repo's directory:
$ createdb discogs
$ psql discogs < sql/tables.sql
$ discogs2pg -g -d 20150810 -c dbname=discogs
$ # wait an hour or two ...
$ psql discogs < sql/indexes.sql
discogs2pg can has two forms of operation, by file and by date.
You can import a single file with discogs2pg -c <blah> some_releases.xml
or
import releases, artists, labels and master for a date DATE
in the current directory with discogs2pg -c <blah> -d DATE
.
Also the -g | --gzip
option lets you import the compressed files directly.
If you want it to be even faster, and your computer can handle it, you can pass an optional --aggressive
to make it open all file in parallel. Consider that since we are using COPY it will open a
connection per table: that's 15 connections at once.
I don't have a Windows machine to build and test the binary, so I'd love it if someone could get into that.
Also, I can think of two things:
- someone with experience in DBs could improve the data structure and indexes/primary keys
- someone with experience in structuring Haskell projects could help with the types/typeclasses and modules
- the code uses lazy lists (for real), I'd like to refactor it with conduit or something
I'd be glad to hear suggestions and take pull requests, of course.