biafarrugia / cleaning

Resources on dealing with dirty data problems and using Open Refine

Dealing with dirty data problems (and using Open Refine)

This repo contains resources for dealing with data problems, in particular with the tool Open Refine.

Data cleaning tools

Typical data problems

  • Numbers/dates treated as strings (often because of currency or percentage signs, or even spaces - try find and replace)
  • Strings treated as numbers: e.g. company ‘numbers’, phone numbers and codes often have leading zeroes removed even though they are an integral part of the code.
  • Numbers and units combined in sentence structures
  • Combined data (addresses)
  • Different data in one column (country, region and authority, for example, with spaces or formatting used to indicate the difference)
  • Variant spellings
  • Inconsistently entered info (e.g. £5k vs £5,000)
  • Different terms for same thing
  • Mistypings - missing decimals etc.
  • Merged cells
  • Empty rows
  • Headings across multiple rows
  • Converted PDFs
  • Missing information
  • Duplicate information
  • Format
  • Need to extract information - e.g. first name/surname; street name/region; year/month
  • Need to classify information - e.g. male vs female name
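Some of the problems above can also be tackled in a few lines of script. The following is a minimal Python sketch, not part of this repo, covering three of them: inconsistently entered amounts (£5k vs £5,000), codes that have lost their leading zeroes, and splitting a first name from a surname. The function names, the set of currency symbols and the fixed code width are assumptions made for illustration.

```python
import re


def parse_money(text):
    """Convert strings such as '£5k' or '£5,000' to a float.

    Returns None when the text does not look like an amount.
    """
    # Remove thousands separators and stray spaces before matching.
    cleaned = text.strip().lower().replace(",", "").replace(" ", "")
    match = re.fullmatch(r"[£$€]?(\d+(?:\.\d+)?)([km]?)", cleaned)
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = {"": 1, "k": 1_000, "m": 1_000_000}[suffix]
    return float(number) * multiplier


def restore_leading_zeroes(code, width):
    """Pad a code back to a fixed width (the width is an assumption you supply)."""
    return str(code).strip().zfill(width)


def split_name(full_name):
    """Split on the final space: a rough heuristic that will fail on some names."""
    first, _, last = full_name.strip().rpartition(" ")
    return first, last
```

For example, `parse_money("£5k")` and `parse_money("£5,000")` both give `5000.0`, so the two spellings can be compared. The name splitter is deliberately naive and should be checked against the actual data.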

I keep a series of bookmarked materials on cleaning using Pinboard at https://pinboard.in/u:paulbradshaw/t:cleaning

Examples of dirty data

See the dirtydata folder in this repo for examples of dirty data.

This sample dirty dataset can be used for basic data cleaning in Open Refine

The European Investment Bank database can be downloaded (look for Export to Excel near the bottom) and provides a useful example of data where dates are formatted as strings.
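As a rough illustration of what converting string dates involves outside Open Refine, here is a minimal Python sketch. The candidate formats are assumptions, not taken from the European Investment Bank export, so check them against the file you actually download.

```python
from datetime import datetime

# Assumed formats, tried in order; adjust to match the real data.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y")


def parse_date(text):
    """Return a date for the first format that fits, or None if none do."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next format
    return None
```

Values that come back as `None` are worth inspecting by hand. Note that a string like 03/04/2015 is ambiguous between day-first and month-first conventions, and this sketch assumes day-first.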

I also bookmark examples of dirty data at https://pinboard.in/u:paulbradshaw/t:dirtydata

For working with XML files try the ones that can be downloaded from the Food Standards Agency API page
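To get a feel for XML before loading it into Open Refine, a short Python sketch like the one below turns repeated elements into rows. The sample XML and its tag names are made up for illustration and are not the Food Standards Agency schema; swap in the record tag you find in the downloaded file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag names are illustrative only.
SAMPLE_XML = """<Establishments>
  <Establishment>
    <Name>Cafe One</Name>
    <Rating>5</Rating>
  </Establishment>
  <Establishment>
    <Name>Cafe Two</Name>
  </Establishment>
</Establishments>"""


def xml_to_rows(xml_text, record_tag):
    """Return one dict per record element, keyed by the child tag names."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        rows.append({child.tag: (child.text or "").strip() for child in record})
    return rows


rows = xml_to_rows(SAMPLE_XML, "Establishment")
```

The second record has no rating, so its row simply lacks that key, which makes missing information easy to spot.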

For JSON files try petition.parliament.uk - go to any petition and look for the JSON link at the bottom of the page.
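JSON is often nested, which is part of what makes it awkward to clean. The Python sketch below flattens nested JSON into a single level of dotted keys so it can be inspected like a table row. The sample data is hypothetical and does not reproduce the structure of the petition JSON.

```python
import json

# Hypothetical sample: the keys are illustrative only.
SAMPLE_JSON = """{
  "title": "Example",
  "counts": [
    {"area": "A", "n": 1},
    {"area": "B", "n": 2}
  ]
}"""


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # A plain value: store it under the path built so far.
        flat[prefix.rstrip(".")] = obj
    return flat


flat = flatten(json.loads(SAMPLE_JSON))
```

Here `flat["counts.1.area"]` is `"B"`. List positions become part of the key, which is fine for exploring a file but may not be the shape you want for analysis.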

Tutorials

A series of introductory guides to Open Refine can be found in the GitHub repo for one of my modules at Birmingham City University here
