1. Download jan13dd.rtf from http://www.nber.org/data/cps_basic.html
2. RTF is a nasty format! Mangle it into plaintext with rtf2txt.sh
3. Edit this plaintext by hand to separate the categories and remove the header / footer
4. Parse into JSON with jsonify.py
5. Download may13pub.dat from http://www.nber.org/data/cps_basic.html
6. Convert to CSV with dat2csv.py

More ideas:

http://www.nber.org/data/progs/cps-basic/cpsbjan13.do defines labels for the factors, and with that the range of acceptable values. It's a regular language (Stata .do), so relatively easy to parse.
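
A rough sketch of the .do parsing idea, in Python. It assumes the file declares its labels with statements shaped like `label define NAME value "text" value "text" ...`, possibly spread over several lines; that shape has not been checked against cpsbjan13.do here. The function name `parse_labels` and the sample labels are made up for illustration.

```python
import re

# Assumption: each label set starts with "label define NAME" and runs until
# the next "label define" (or the end of the file).
LABEL_DEFINE = re.compile(
    r'label\s+define\s+(\w+)(.*?)(?=label\s+define|\Z)', re.DOTALL
)
# An integer value followed by its quoted description. The lookbehind keeps
# a variable name ending in a digit (e.g. pemlr2 "...") from being misread.
PAIR = re.compile(r'(?<![\w.])(-?\d+)\s+"([^"]*)"')


def parse_labels(do_text):
    """Map each label name to a {value: description} dict."""
    labels = {}
    for name, body in LABEL_DEFINE.findall(do_text):
        pairs = {int(value): text for value, text in PAIR.findall(body)}
        labels.setdefault(name, {}).update(pairs)
    return labels


# Hypothetical sample, not copied from the real .do file.
sample = '''
label define yesno 1 "Yes" 2 "No" ;
label define region
    1 "Northeast"
    2 "Midwest"
    -1 "Blank"
;
'''

labels = parse_labels(sample)
# The acceptable values for each factor are just the keys of its label set.
acceptable = {name: sorted(values) for name, values in labels.items()}
```

The `acceptable` dict could then be used to sanity-check the columns that dat2csv.py writes out, flagging any value that has no label.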