A geocoder that relies on offline TIGER/Line data, which makes it useful for geocoding private health information.
Note that you can call an older version of the geocoder by specifying its version number in the docker call.
- the geocoder now uses 2019 TIGER/Line address range files
- performs some cleaning of address text (i.e., removes excess whitespace and non-alphanumerics)
- returns matched address components
- returns geocoding diagnostics and a summary printed to the console
- imprecise geocodes are filtered out of the output file
- output file names now include the version number
- suppress all startup messages from R
- import all columns as characters to prevent incidental changing of data through R
- initial release of DeGAUSS geocoder
- Other columns may be present, but it is recommended to only include `address` and an optional identifier column (e.g., `id`). Fewer columns will increase geocoding speed.
- Address data must be in one column called `address`.
- Separate the different address components with a space
- Do not include apartment numbers or "second address line" (but it's okay if you can't remove them)
- ZIP codes must be five digits (e.g., `32709`) and not "plus four" (e.g., `32709-0000`)
- Do not try to geocode addresses without a valid five-digit ZIP code; the geocoder uses the ZIP code to complete its initial searches, so addresses without one will likely return incorrect matches
- Spelling should be as accurate as possible, but the program performs "fuzzy matching", so an exact match is not necessary
- Capitalization does not affect results
- Abbreviations may be used (e.g., `St.` instead of `Street` or `OH` instead of `Ohio`)
- Use Arabic numerals instead of written numbers (e.g., `13` instead of `thirteen`)
- Address strings with out-of-order items could return NA (e.g., `3333 Burnet Ave Cincinnati 45229 OH`)
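
The formatting rules above can be partly automated before the file is handed to the geocoder. The following is a minimal sketch, not the geocoder's own cleaning step: `tidy_address` and `has_five_digit_zip` are hypothetical helper names, and the check assumes the ZIP code is the last item in the address string.

```python
import re

# Matches a "plus four" ZIP code such as 32709-0000 and captures the first five digits.
ZIP_PLUS_FOUR = re.compile(r"\b(\d{5})-\d{4}\b")
# Matches a five-digit ZIP code at the very end of the address string.
TRAILING_ZIP = re.compile(r"\b\d{5}$")


def tidy_address(raw: str) -> str:
    """Collapse runs of whitespace and shorten ZIP+4 codes to five digits."""
    text = " ".join(raw.split())
    return ZIP_PLUS_FOUR.sub(r"\1", text)


def has_five_digit_zip(address: str) -> bool:
    """Return True if the tidied address ends with a five-digit ZIP code."""
    return TRAILING_ZIP.search(tidy_address(address)) is not None
```

Rows for which `has_five_digit_zip` returns `False` (including out-of-order strings such as `3333 Burnet Ave Cincinnati 45229 OH`) can be set aside for manual review instead of being geocoded.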
If `my_address_file.csv` is a file in the current working directory with an address column named `address`, then

`docker run --rm -v $PWD:/tmp degauss/geocoder:3.0.1 my_address_file.csv`

will produce `my_address_file_geocoded_v3.0.csv` with added columns including `lat`, `lon`, and geocoding diagnostic information.
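
Once the output file exists, it can be loaded for downstream work. The sketch below is one possible way to do that in Python using only the standard library. The `read_geocoded` helper is hypothetical, and the set of strings treated as missing (`""` and `"NA"`) is an assumption about how suppressed coordinates are written, so check it against your own output file.

```python
import csv

# Assumption: suppressed coordinates appear as an empty field or the text "NA".
MISSING = {"", "NA"}


def read_geocoded(handle):
    """Read geocoder output rows, converting lat/lon to floats or None."""
    rows = []
    for row in csv.DictReader(handle):
        lat, lon = row.get("lat", ""), row.get("lon", "")
        if lat in MISSING or lon in MISSING:
            row["lat"], row["lon"] = None, None
        else:
            row["lat"], row["lon"] = float(lat), float(lon)
        rows.append(row)
    return rows
```

For example, `with open("my_address_file_geocoded_v3.0.csv", newline="") as f: rows = read_geocoded(f)` keeps every other column as text, in line with the geocoder importing all columns as characters.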
Note: If you are using a Windows machine to run Docker, please review this page for Windows-specific changes that likely need to be made to successfully use DeGAUSS. You can ignore this if you are using macOS or Linux.
The geocoder's output file includes the following columns:
- `matched_street`, `matched_city`, `matched_state`, `matched_zip`: matched address components (e.g., `matched_street` is the street the geocoder matched with the input address); can be used to investigate input address misspellings, typos, etc.
- `precision`: The qualitative precision of the geocode. The value will be one of:
    - `range`: interpolated based on address ranges from street segments
    - `street`: center of the matched street
    - `intersection`: intersection of two streets
    - `zip`: centroid of the matched zip code
    - `city`: centroid of the matched city
- `score`: The percentage of text match between the given address and the geocoded result, expressed as a number between 0 and 1. A higher score indicates a closer match. Note that each score is relative within a precision method (i.e., a `score` of `0.8` with a `precision` of `range` is not the same as a `score` of `0.8` with a `precision` of `street`).
- `lat` and `lon`: geocoded coordinates for matched address
- `geocode_result`: A qualitative summary of the geocoding result. The value will be one of:
    - `po_box`: the address was not geocoded because it is a PO Box
    - `cincy_inst_foster_addr`: the address was not geocoded because it is a known institutional address, not a residential address
    - `non_address_text`: the address was not geocoded because it was blank or listed as "foreign", "verify", or "unknown"
    - `imprecise_geocode`: the address was geocoded, but results were suppressed because the `precision` was `intersection`, `zip`, or `city` and/or the `score` was less than `0.5`.
    - `geocoded`: the address was geocoded with a `precision` of either `range` or `street` and a `score` of `0.5` or greater.
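
The relationship between `precision`, `score`, and the last two `geocode_result` values can be written as a small rule. This is a sketch of the rule as documented above, not the geocoder's actual implementation; `classify_geocode` is a hypothetical name, and the function only covers addresses that were actually geocoded (the `po_box`, `cincy_inst_foster_addr`, and `non_address_text` outcomes are decided from the input text and are not modeled here).

```python
# Precision methods whose coordinates are kept in the output.
PRECISE_METHODS = {"range", "street"}


def classify_geocode(precision: str, score: float, threshold: float = 0.5) -> str:
    """Return the documented geocode_result for an address that was geocoded."""
    if precision in PRECISE_METHODS and score >= threshold:
        return "geocoded"
    # Either the method is too coarse or the text match is too weak.
    return "imprecise_geocode"
```

Because scores are relative within a precision method, the same `threshold` applied to `range` and `street` results does not imply the same match quality for both.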
- Geocodes with a resulting precision of `intersection`, `zip`, or `city` are returned with a missing `lat` and `lon` because they are likely too inaccurate and/or too imprecise to be used for further analysis.
- By default, `lat` and `lon` are also returned as missing if the `score` is less than `0.5` (regardless of the precision). This threshold can be changed by including an optional argument in the docker call (`docker run --rm -v $PWD:/tmp degauss/geocoder:3.0 my_address_file.csv 0.4`).
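
When the geocoder is run from a script, the docker call with its optional score threshold can be assembled programmatically. The following is a minimal sketch under stated assumptions: `geocoder_command` is a hypothetical helper, its defaults mirror the example call above, and the range check on the threshold follows from the score being a number between 0 and 1.

```python
import os


def geocoder_command(csv_name, version="3.0", score_threshold=None, workdir=None):
    """Build the argument list for the docker call shown above."""
    workdir = workdir or os.getcwd()  # plays the role of $PWD in the shell call
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/tmp",
        f"degauss/geocoder:{version}",
        csv_name,
    ]
    if score_threshold is not None:
        if not 0 <= score_threshold <= 1:
            raise ValueError("score_threshold must be between 0 and 1")
        # The threshold is passed as an optional trailing argument.
        cmd.append(str(score_threshold))
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list instead of a shell string avoids quoting problems with file names that contain spaces.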
For detailed documentation on DeGAUSS, including general usage and installation, please see the DeGAUSS homepage.