This repo contains resources for dealing with data problems. In particular, the tool Open Refine.
- Download Open Refine
- Workbench - includes much of Refine's functionality including clustering
- Dataproofer
- Data Wrangler (no longer supported)
- Gender API and Genderize.io (also available as an R package) can be used to classify names by gender
- Parserator can be used to split addresses
- I keep a list of data cleaning tools on Pinboard
- Numbers/dates treated as strings (often because of currency or percentage signs, or even spaces - try find and replace)
- Strings treated as numbers: e.g. company ‘numbers’, phone numbers and codes often have leading zeroes removed when they are an integral part of the code.
- Numbers and units combined in sentences structures
- Combined data (addresses)
- Different data in one column (country, region and authority, for example, with spaces or formatting used to indicate the difference)
- Variant spellings
- Inconsistently entered info (e.g. £5k vs £5,000)
- Different terms for same thing
- Mistypings - missing decimals etc.
- Merged cells
- Empty rows
- Headings across multiple rows
- Converted PDFs
- Missing information
- Duplicate information
- Format
- Need to extract information - e.g. first name/surname; street name/region; year/month
- Need to classify information - e.g. male vs female name
I keep a series of bookmarked materials on cleaning using Pinboard at https://pinboard.in/u:paulbradshaw/t:cleaning
See the dirtydata folder in this repo for examples of dirty data.
This sample dirty dataset can be used for basic data cleaning in Open Refine
The European Investment Bank database can be downloaded (look for Export to Excel near the bottom) and provides a useful example of data where dates are formatted as strings.
I also bookmark examples of dirty data at https://pinboard.in/u:paulbradshaw/t:dirtydata
For working with XML files try the ones that can be downloaded from the Food Standards Agency API page
For JSON files try petition.parliament.uk - go to any petition and look for the JSON link at the bottom of the page.
- Open Refine's 'common transforms' features for basic cleaning
- How to: clean up spreadsheet headings that run across multiple rows using Open Refine
- Converting JSON or XML into spreadsheets using Open Refine
- Converting sentences to numbers using a spreadsheet
- Tony Hirst explains Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column
- I've written about Scraping data with Google Refine
- Former MA Online Journalism student Ion Mates explains how he used some of Refine's more advanced functionality, and regex, in his post on cleaning a converted PDF
- Former MA Online Journalism student Cristian Giuletti explains How to: combine multiple rows in a dataset where text is split across them
- More at https://onlinejournalismblog.com/tag/google-refine/page/2/