mit-spatial-action / deduplicate-owners

This repository deduplicates property owners in Massachusetts using the MassGIS standardized assessors' parcel dataset and the Secretary of the Commonwealth's Corporate Database. The process extends that documented by Hangen and O'Brien (2022, in preprint).

Deduplicate Owners

This repository deduplicates property owners in Massachusetts using the MassGIS standardized assessors' parcel dataset and the Secretary of the Commonwealth's Corporate Database. The process builds on Hangen and O'Brien's methods (2022, in preprint), which are themselves similar (though not identical) to methods used by Henry Gomory (2021) and the Anti-Eviction Mapping Project's Evictorbook (see e.g., McElroy and Amir-Ghassemi 2021). In outline...

  1. Prepare data using a large number of string-standardizing functions, some of which are place-based. (In other words, when adapting to non-Massachusetts locations, you'll want to consider how to adapt our codebase to your locale.)
  2. Perform naive deduplication on assessors' tables using concatenated name and address.
  3. Perform cosine-similarity-based deduplication on assessors' tables using concatenated name and address.
  4. Join parcels to companies using simple string matching. When an owner fails to match but belongs to a cosine-similarity group (see step 3) that contains successful matches, that owner is assigned the company id of one of the successful matches.
  5. Identify agents of companies that are companies themselves (distinguishing between law firms and other companies) and agents of companies that are individuals.
  6. Deduplicate individuals (including individual agents) associated with companies that match parcel owners using both naive and cosine similarity methods.
  7. Identify communities within corporate-individual networks. (This is done using the igraph implementation of the fast greedy modularity optimization algorithm.)
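
Steps 2 and 3 can be illustrated with a minimal base-R sketch. It is not the repository's implementation: the toy records, the use of character trigrams, the 0.8 threshold, and the helper names (trigram_counts, cosine_sim) are all assumptions made for illustration.

```r
# Toy owner records; names and addresses are made up for illustration.
owners <- data.frame(
  name = c("ACME REALTY LLC", "ACME REALTY LLC", "ACME REALITY LLC", "JANE DOE"),
  address = c("1 MAIN ST", "1 MAIN ST", "1 MAIN ST", "22 ELM ST"),
  stringsAsFactors = FALSE
)
owners$key <- paste(owners$name, owners$address)

# Step 2 (naive): identical concatenated strings share a group id.
owners$naive_id <- match(owners$key, unique(owners$key))

# Step 3 (cosine): represent each string by its character trigram counts.
# Trigrams are an assumption; other tokenizations would also work.
trigram_counts <- function(x) {
  n <- nchar(x)
  if (n < 3) return(table(x))
  table(substring(x, 1:(n - 2), 3:n))
}

# Cosine similarity between the trigram count vectors of two strings.
cosine_sim <- function(a, b) {
  ta <- trigram_counts(a)
  tb <- trigram_counts(b)
  shared <- intersect(names(ta), names(tb))
  dot <- sum(as.numeric(ta[shared]) * as.numeric(tb[shared]))
  dot / (sqrt(sum(as.numeric(ta)^2)) * sqrt(sum(as.numeric(tb)^2)))
}

# Pairwise similarity between the distinct keys.
keys <- unique(owners$key)
sim <- outer(
  seq_along(keys), seq_along(keys),
  Vectorize(function(i, j) cosine_sim(keys[i], keys[j]))
)
linked <- sim >= 0.8  # threshold is an arbitrary assumption

# Merge linked keys into groups: each key repeatedly takes the smallest
# label among the keys it is linked to, until no label changes.
label <- seq_along(keys)
repeat {
  new_label <- vapply(
    seq_along(keys),
    function(i) min(label[linked[i, ]]),
    integer(1)
  )
  if (all(new_label == label)) break
  label <- new_label
}
owners$cosine_id <- label[match(owners$key, keys)]
```

In this toy data the naive pass keeps "ACME REALTY LLC" and "ACME REALITY LLC" in separate groups, while the cosine pass merges them because most of their trigrams are shared.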

Getting Started

This library's dependencies are managed using renv. To install necessary dependencies, simply install renv and run renv::restore(). If you are using Windows, you'll probably have to install the Rtools bundle appropriate for your version of R.

Setting up .Renviron

Eviction filings are pulled from a PostGIS database. As written, the code expects the PostgreSQL connection parameters to be defined as environment variables in a .Renviron file:

DB_HOST="<host_location>"
DB_USER="<user_name>"
DB_PASS="<password>"
DB_PORT="<port>"
DB_NAME="<name_of_eviction_db>"
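
As a rough sketch of how these variables might be consumed, the helper below reads them with Sys.getenv() and fails early if any is unset. The function name read_db_params is hypothetical, and the commented-out connection call assumes the DBI and RPostgres packages, which this README does not name.

```r
# Read the five connection parameters named above from the environment.
# R loads .Renviron at startup, so Sys.getenv() sees them in a new session.
read_db_params <- function() {
  vars <- c("DB_HOST", "DB_USER", "DB_PASS", "DB_PORT", "DB_NAME")
  vals <- Sys.getenv(vars, unset = NA)
  missing <- vars[is.na(vals) | vals == ""]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  as.list(vals)
}

# Hypothetical connection step (assumes DBI and RPostgres are installed):
# p <- read_db_params()
# con <- DBI::dbConnect(
#   RPostgres::Postgres(),
#   host = p$DB_HOST, user = p$DB_USER, password = p$DB_PASS,
#   port = as.integer(p$DB_PORT), dbname = p$DB_NAME
# )
```

Checking for missing variables up front gives a clearer error than a failed connection attempt later in a long run.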

Running the Script

We provide an omnibus run() function in run.R. It takes two parameters:

  1. subset: If value is "test" (default), processes only Somerville. If value is "hns", processes only HNS municipalities. If value is "all", runs entire state. Otherwise, it stops and generates an error.
  2. return_results: If TRUE (default), return results in a named list. If FALSE, return nothing. In either case, results are output to delimited text and *.RData files.

In other words...

# Runs on Somerville.
run(subset = "test")
# Runs on Healthy Neighborhoods municipalities.
run(subset = "hns")
# Runs on entire state.
run(subset = "all")
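
The documented argument handling can be mirrored in a small stand-in. The function run_sketch below is hypothetical and does no real processing; it only shows how the two parameters described above might behave, including the error on an unrecognized subset.

```r
# Hypothetical stand-in for run(); it only mirrors the documented
# argument handling and performs no deduplication.
run_sketch <- function(subset = "test", return_results = TRUE) {
  scope <- switch(subset,
    test = "Somerville",
    hns = "HNS municipalities",
    all = "entire state",
    stop("subset must be one of 'test', 'hns', or 'all'.")
  )
  results <- list(scope = scope)  # placeholder for the real named list
  # (The real function would also write its output files at this point.)
  if (return_results) results else invisible(NULL)
}

# Capture the named list, or discard it when only the files are needed.
res <- run_sketch(subset = "hns")
nothing <- run_sketch(subset = "test", return_results = FALSE)
```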

If run.R is executed from a non-interactive environment (i.e., a terminal), it will run on the entire state. (In other words: don't do this unless you want to wait 8 hours for results.)

This function automatically saves its results to...

  • a simplified table of owners (by default, owners.csv, set using the OWNERS_OUT_NAME global variable at the top of run.R),
  • a table of matched companies (by default, corps.csv, set using the CORPS_OUT_NAME global variable at the top of run.R),
  • a table of individuals (by default, inds.csv, set using the INDS_OUT_NAME global variable at the top of run.R),
  • a table of assessors' records, supplemented by an owner-occupancy flag (by default, assess.csv, set using the ASSESS_OUT_NAME global variable at the top of run.R), and
  • a simplified igraph community object (by default, community.csv, set using the COMMUNITY_OUT_NAME global variable at the top of run.R).
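
To load these files back into R after a run, something like the following could be used. It is a sketch under assumptions: the helper name read_outputs and its dir argument are hypothetical, the files are assumed to be comma-delimited with a header row, and the file names are the defaults listed above.

```r
# Default output names listed above; adjust if the *_OUT_NAME globals changed.
default_outputs <- c(
  owners = "owners.csv",
  corps = "corps.csv",
  inds = "inds.csv",
  assess = "assess.csv",
  community = "community.csv"
)

# Read whichever of the files exist in `dir` into a named list.
# Assumes comma-delimited files with a header row.
read_outputs <- function(dir = ".", files = default_outputs) {
  paths <- file.path(dir, files)
  names(paths) <- names(files)
  found <- paths[file.exists(paths)]
  lapply(found, read.csv, stringsAsFactors = FALSE)
}
```

Files that are absent are skipped silently, so the returned list contains only the tables that were actually written.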

Data

The two databases necessary for this analysis are...

Acknowledgements

This work received grant support from the Conservation Law Foundation and was developed under the auspices of the Healthy Neighborhoods Study in the Department of Urban Studies and Planning at MIT.

References

License: MIT License

Languages: R 97.6%, Cypher 2.4%