jdkato / openpdi

A Python 3 library for decentralized aggregation of data from the Police Data Initiative (PDI).

OpenPDI

OpenPDI is an unofficial effort to document and standardize data submitted to the Police Data Initiative (PDI). The goal is to make the data more accessible by addressing several issues caused by a lack of standardization:

  • File types: While some agencies make use of the Socrata Open Data API, many provide their data in raw .csv, .xlsx, or .xls files of varying structures.
  • Column names: Many columns that represent the same data (e.g., race) are named differently across departments, cities, and states.
  • Value formats: Dates, times, and other comparable fields are submitted in many different formats.
  • Column availability: It's currently very difficult to identify data sources that contain certain columns (e.g., Use of Force data specifying the hire date of the involved officer(s)).
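
To make the "value formats" issue concrete, here is a minimal sketch of normalizing dates that arrive in several formats. The format list and the `normalize_date` helper are hypothetical and are not part of OpenPDI; they only illustrate the kind of standardization described above.

```python
from datetime import datetime

# Hypothetical list of formats that different sources might use.
# This list is an assumption for illustration, not taken from OpenPDI.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%m-%d-%y"]


def normalize_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None
```

Returning `None` for unrecognized values, rather than raising, keeps one malformed cell from aborting the processing of an entire source.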

Getting Started

Installation
$ pip install openpdi
Usage
Dataset        ID     Source
Use of Force   uof    https://www.policedatainitiative.org/datasets/use-of-force/
import csv
import openpdi

# The library has a single entry point:
dataset = openpdi.Dataset(
    # The dataset ID (see the table above).
    "uof",
    # Limit the data sources to a specific state using its two-letter code.
    #
    # Default: `scope=[]`.
    scope=["TX"],
    # A list of columns that must be provided in every data source included in
    # this dataset. See `openpdi/meta/{ID}/schema.json` for the available
    # columns.
    #
    # Default: `columns=[]`.
    columns=["reason"],
    # If `True`, only return the user-specified columns -- i.e., those listed
    # in the `columns` parameter.
    #
    # Default: `strict=False`.
    strict=False)

# The names of the agencies included in this dataset:
print(dataset.agencies)

# The URLs of the external data sources included in this dataset:
print(dataset.sources)

# `gen` is a generator object for iterating over the CSV-formatted dataset.
gen = dataset.download()

# Write to a CSV file (the csv module expects files opened with `newline=""`):
with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_ALL)
    writer.writerows(gen)
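
Because `download()` returns a generator, rows can be inspected lazily before writing everything to disk. The sketch below shows one way to peek at the first few rows. The `preview` helper and the `fake_download` stand-in are hypothetical and only mimic an iterable of rows; they are not part of the library.

```python
import itertools


def preview(rows, limit=5):
    """Return at most `limit` rows from any iterable of rows.

    Only the requested rows are pulled from the iterable, so a large
    download is not consumed in full.
    """
    return list(itertools.islice(rows, limit))


def fake_download():
    """Stand-in generator used here instead of a real network download."""
    for i in range(1000):
        yield ["row", str(i)]


first_rows = preview(fake_download(), limit=3)
```

With the library installed, the same helper could be called as `preview(dataset.download())`, assuming the generator yields one row per iteration.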

Datasets

To avoid unnecessary bloat (in terms of GBs), we don't store any PDI data in this repository. Instead, we store small, JSON-formatted descriptions of externally hosted datasets. For example, uof/CA/meta.json:

[
    {
        "url": "https://www.norwichct.org/Archive.aspx?AMID=61&Type=Recent",
        "type": "csv",
        "start": 1,
        "columns": {
            "date": {
                "index": 0,
                "specifier": "%m/%d/%Y"
            },
            "city": {
                "raw": "Richmond"
            },
            "state": {
                "raw": "CA"
            },
            "service_type": {
                "index": 1
            },
            "force_type": {
                "index": 10
            },
            "light_conditions": {
                "index": 8
            },
            "weather_conditions": {
                "index": 7
            },
            "reason": {
                "index": 2
            },
            "officer_injured": {
                "index": 6
            },
            "officer_race": {
                "index": 9
            },
            "subject_injured": {
                "index": 5
            },
            "aggravating_factors": {
                "index": 3
            },
            "arrested": {
                "index": 4
            }
        }
    }
]

This file describes a Use of Force (uof) dataset from Richmond, CA. Each entry in the columns object maps a column from the externally hosted data to a column in the dataset's schema file (uof/schema.json).
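
As a rough illustration of how such a mapping could be applied to one row of source data, consider the sketch below. The meaning of the keys is inferred from the example rather than confirmed by the project: `raw` is treated as a constant value, `index` as a position in the source row, and `specifier` as a `strptime` format. The `map_row` helper and the sample row are made up for illustration.

```python
from datetime import datetime


def map_row(raw_row, columns):
    """Build a {schema_column: value} dict from one raw source row.

    Assumed semantics (inferred from the example, not OpenPDI's code):
      - "raw": a constant value used for every row
      - "index": the position of the value in the source row
      - "specifier": a strptime format for the date found at "index"
    """
    mapped = {}
    for name, spec in columns.items():
        if "raw" in spec:
            mapped[name] = spec["raw"]
            continue
        value = raw_row[spec["index"]]
        if "specifier" in spec:
            value = datetime.strptime(value, spec["specifier"]).date().isoformat()
        mapped[name] = value
    return mapped


# A subset of the mapping shown above, plus an invented sample row.
columns = {
    "date": {"index": 0, "specifier": "%m/%d/%Y"},
    "city": {"raw": "Richmond"},
    "state": {"raw": "CA"},
    "reason": {"index": 2},
}
example = map_row(["01/15/2016", "Patrol", "Traffic stop"], columns)
```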

The schema.json file assigns a format to every possible column in a particular dataset. Each format corresponds to a Python function tasked with standardizing a raw column value (see openpdi/validators.py).
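
The idea of a format name that resolves to a standardizing function can be sketched as a small registry. Everything here is hypothetical: the format names, the schema fragment, and the functions are invented for illustration and do not reflect the actual contents of schema.json or openpdi/validators.py.

```python
def to_bool(value):
    """Standardize yes/no style values to True/False (None if unrecognized)."""
    text = value.strip().lower()
    if text in {"y", "yes", "true", "1"}:
        return True
    if text in {"n", "no", "false", "0"}:
        return False
    return None


def to_title(value):
    """Collapse repeated whitespace and title-case the text."""
    return " ".join(value.split()).title()


# Hypothetical format names mapped to their standardizing functions.
FORMATS = {"bool": to_bool, "title": to_title}

# Hypothetical schema fragment: each column names the format it uses.
SCHEMA = {
    "officer_injured": {"format": "bool"},
    "reason": {"format": "title"},
}


def standardize(column, raw_value):
    """Look up the column's format and apply the matching function."""
    return FORMATS[SCHEMA[column]["format"]](raw_value)
```

Keeping the functions in a lookup table means a new format only needs one new function and one new registry entry, without changes to the code that walks the rows.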

License: MIT License