dedupeio / address-matching

Python script for matching a list of messy addresses against a gazetteer using dedupe.

address-matching

Python script for matching a list of messy addresses against a gazetteer using dedupe. It also works as a pseudo-geocoder if your gazetteer includes lat/long information.

Part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data.

Setup

Here's how to get this script working without having dedupe already installed:

git clone git@github.com:datamade/address-matching.git
cd address-matching
pip install "numpy>=1.6"
pip install -r requirements.txt

Gazetteer

You will need a Gazetteer of all unique addresses in a given area. For this example, we used the Cook County Address Point shapefile.
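As a rough illustration of what "a gazetteer of unique addresses" can look like in memory, the sketch below reads a small CSV into a dictionary keyed by record id, with one dictionary of fields per record. The column names (id, Address, Latitude, Longitude) and the helper read_records are hypothetical and are not taken from this repository; adjust them to whatever your own gazetteer export contains.

```python
import csv
import io

# Hypothetical gazetteer extract; real column names depend on your data.
GAZETTEER_CSV = """id,Address,Latitude,Longitude
1,123 N MAIN ST,41.8800,-87.6300
2,456 W OAK AVE,41.9000,-87.6500
"""


def read_records(handle, id_column="id"):
    """Return {record_id: {column: value}} from a CSV file-like object."""
    records = {}
    for row in csv.DictReader(handle):
        record_id = row.pop(id_column)
        # Store blank cells as None so missing values are explicit.
        records[record_id] = {
            key: ((value or "").strip() or None) for key, value in row.items()
        }
    return records


gazetteer = read_records(io.StringIO(GAZETTEER_CSV))
print(gazetteer["1"]["Address"])
```

Keying records by a unique id makes it easy to check that the gazetteer really has one entry per address before matching against it.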

List of addresses you want to match

This program takes a list of addresses and matches them to individual records in the Gazetteer. For this example, we are using a messy list of early childhood education locations in Chicago. This file can have multiple entries referring to the same place.
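To make "multiple entries referring to the same place" concrete, here is a minimal sketch that normalizes address strings and groups the ones that become identical. The sample addresses and the helper normalize_address are made up for illustration, and this step is not claimed to be part of address_matching.py. It only catches trivial differences such as case, punctuation, and spacing; the remaining fuzzy variation is what dedupe is for.

```python
import re
from collections import defaultdict


def normalize_address(raw):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical messy input: "a" and "b" refer to the same place.
messy = {
    "a": "123 N. Main St.",
    "b": "123  n main st",
    "c": "456 W Oak Ave",
}

# Group record ids whose normalized address is identical.
groups = defaultdict(list)
for record_id, address in messy.items():
    groups[normalize_address(address)].append(record_id)

for address, record_ids in groups.items():
    print(address, record_ids)
```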

Usage

Once you have a gazetteer and a messy input file, run address_matching.py:

python address_matching.py

You will be prompted to label some training pairs, which dedupe uses to learn how to match records.
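The sketch below shows the general shape of a labeling step: each candidate pair is shown to a person, and the answer sorts the pair into a "match" or "distinct" bucket. This is a simplified stand-in, not dedupe's own labeler, which chooses which pairs to ask about. The function label_pairs, the Address field, and the "match"/"distinct" key names are assumptions made for illustration. Answers are scripted here so the example runs without a prompt; pass ask=input to answer interactively.

```python
def label_pairs(pairs, ask=input):
    """Sort candidate pairs into 'match' and 'distinct' based on y/n answers."""
    labeled = {"match": [], "distinct": []}
    for messy_record, gazetteer_record in pairs:
        prompt = (
            f"{messy_record['Address']!r} vs "
            f"{gazetteer_record['Address']!r}: same place? (y/n) "
        )
        answer = ask(prompt).strip().lower()
        if answer == "y":
            labeled["match"].append((messy_record, gazetteer_record))
        elif answer == "n":
            labeled["distinct"].append((messy_record, gazetteer_record))
        # Any other answer leaves the pair unlabeled.
    return labeled


# Hypothetical candidate pairs: (messy record, gazetteer record).
candidate_pairs = [
    ({"Address": "123 n main street"}, {"Address": "123 N MAIN ST"}),
    ({"Address": "123 n main street"}, {"Address": "456 W OAK AVE"}),
]

# Scripted answers stand in for a person at the keyboard.
answers = iter(["y", "n"])
labeled = label_pairs(candidate_pairs, ask=lambda prompt: next(answers))
print(len(labeled["match"]), len(labeled["distinct"]))
```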

The output will be saved to address_matching_output.csv.
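If your gazetteer carries lat/long information, matched records can be used as a rough geocoding result by copying coordinates from the matched gazetteer entry. The sketch below works on in-memory dictionaries rather than on address_matching_output.csv, because the layout of that file is not described here. The names attach_coordinates, Latitude, and Longitude, and the sample ids, are hypothetical.

```python
def attach_coordinates(matches, gazetteer):
    """Map each messy id to the (lat, long) of its matched gazetteer record.

    Messy ids with no match, or with an unknown gazetteer id, map to None.
    """
    located = {}
    for messy_id, gazetteer_id in matches.items():
        record = gazetteer.get(gazetteer_id)
        if record is None:
            located[messy_id] = None
            continue
        located[messy_id] = (
            float(record["Latitude"]),
            float(record["Longitude"]),
        )
    return located


# Hypothetical gazetteer and match results (messy id -> gazetteer id).
gazetteer = {
    "1": {"Address": "123 N MAIN ST", "Latitude": "41.88", "Longitude": "-87.63"},
    "2": {"Address": "456 W OAK AVE", "Latitude": "41.90", "Longitude": "-87.65"},
}
matches = {"a": "1", "b": "1", "c": None}

located = attach_coordinates(matches, gazetteer)
print(located)
```

Keeping unmatched ids in the result with a value of None makes it easy to see which input addresses still need manual review.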
