cenpy-devs / cenpy

Explore and download data from Census APIs

Incorrect values returned for >50 variable 2018 ACS queries

kdw2126 opened this issue

Description:

When cenpy.remote.APIConnection.query() is used to request more than 50 variables from the 2018 ACS detailed tables across all MSAs or counties, some queries return incorrect results.

More specifically, each column of the resulting pandas DataFrame does contain the correct values for its variable across MSAs, but the rows are misaligned, so the "metropolitan statistical area" or "state/county" columns no longer correctly identify the data in a given row.

One possible cause is that the current version of the internal function API_Connection_bigcolq() splits requests for more than 50 variables into chunks of 50 variables each, sends each chunk to the Census API, and then concatenates the results. This is problematic when using wildcards to fetch all MSAs, because separate queries over the same set of geographies may return their rows in different orders, so the same row position can refer to different geographic units in different chunks.
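The suspected failure mode can be reproduced with pandas alone. The sketch below is only an illustration: the two frames stand in for two chunk responses, and the column names VAR_A and VAR_B and all values are made up. It shows that pairing rows by position mixes up geographies once the row order differs, while merging on the geography columns does not.

```python
import pandas as pd

# Hypothetical responses for two chunks of variables over the same counties.
# All values are invented; only the differing row order matters.
chunk_a = pd.DataFrame({
    "state": ["08", "01"],
    "county": ["001", "003"],
    "VAR_A": [100, 200],
})
chunk_b = pd.DataFrame({
    "state": ["01", "08"],
    "county": ["003", "001"],
    "VAR_B": [20, 10],
})

# pd.concat(axis=1) aligns on the index. Both frames carry the default
# 0..n-1 index, so rows are effectively paired by position, not geography.
positional = pd.concat([chunk_a, chunk_b[["VAR_B"]]], axis=1)

# Merging on the geography keys pairs each county with its own values,
# regardless of the order in which each chunk returned its rows.
keyed = chunk_a.merge(
    chunk_b, on=["state", "county"], how="outer", validate="one_to_one"
)

print(positional)
print(keyed)
```

In `positional`, the row labelled state 08, county 001 carries the VAR_B value that belongs to state 01, county 003, which is the same kind of mismatch described in this report.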

Example:

import pandas as pd
import cenpy

col_list = ["B01001_001E","B01001_002E","B01001_003E","B01001_004E","B01001_005E","B01001_006E","B01001_007E","B01001_008E","B01001_009E","B01001_010E","B01001_011E","B01001_012E","B01001_013E","B01001_014E","B01001_015E","B01001_016E","B01001_017E","B01001_018E","B01001_019E","B01001_020E","B01001_021E","B01001_022E","B01001_023E","B01001_024E","B01001_025E","B01001_026E","B01001_027E","B01001_028E","B01001_029E","B01001_030E","B01001_031E","B01001_032E","B01001_033E","B01001_034E","B01001_035E","B01001_036E","B01001_037E","B01001_038E","B01001_039E","B01001_040E","B01001_041E","B01001_042E","B01001_043E","B01001_044E","B01001_045E","B01001_046E","B01001_047E","B25087_007E","B25087_008E","B25087_009E","B25087_010E","B25087_011E","B25087_012E","B25087_013E","B25087_014E","B25087_015E","B25087_016E","B25087_017E","B25087_018E","B25087_019E","B25087_021E","B25087_022E","B25087_023E","B25087_024E","B25087_025E","B25087_026E","B25087_027E","B25087_028E","B25087_029E","B25087_030E","B25087_031E","B25087_032E","B25087_033E","B25087_034E","B25087_035E","B25087_036E","B25087_037E","B25087_038E","B25087_039E","B25081_002E","B25081_008E","B25063_002E","B25063_003E","B25063_004E","B25063_005E","B25063_006E","B25063_007E","B25063_008E","B25063_009E","B25063_010E","B25063_011E","B25063_012E","B25063_013E"]

cxn = cenpy.remote.APIConnection('ACSDT1Y2018')
df = cxn.query(cols=col_list, geo_unit="county:*")

print(df.loc[(df['state'] == "08") & (df["county"] == "001"), ['B01001_001E', 'B25063_013E']])

  B01001_001E B25063_013E
0      511868         227

Note that, compared with data pulled directly from the Census API, the figure above for B01001_001E (total population) in the county with FIPS 08001 (Adams County, CO) is correct, while the figure for B25063_013E (rent between $550 and $599) is wrong. The rent figure is instead the value of B25063_013E for FIPS 01003.

(Note that the first row of the direct query containing B01001_001E corresponds to FIPS 08001, while the first row of the direct query containing B25063_013E corresponds to FIPS 01003.)
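Until a fix is available, one possible workaround is to split the variable list yourself and join the pieces on the geography columns. The helper below is a minimal sketch, not part of cenpy: `query_in_chunks`, `query_fn`, `key_cols`, and the default chunk size of 40 are all hypothetical names and choices, with the chunk size picked only to stay under the 50-variable threshold described above.

```python
from functools import reduce

import pandas as pd


def query_in_chunks(query_fn, cols, key_cols, chunk_size=40):
    """Query `cols` in chunks and merge the results on `key_cols`.

    `query_fn` is any callable that takes a list of variable names and
    returns a DataFrame holding those variables plus the geography
    columns named in `key_cols`. (Hypothetical helper, not a cenpy API.)
    """
    if not cols:
        raise ValueError("cols must contain at least one variable")

    frames = [
        query_fn(cols[start:start + chunk_size])
        for start in range(0, len(cols), chunk_size)
    ]

    # Join on the geography keys so that the row order returned for each
    # chunk is irrelevant; validate guards against duplicated keys.
    return reduce(
        lambda left, right: left.merge(
            right, on=key_cols, how="outer", validate="one_to_one"
        ),
        frames,
    )
```

With the connection from the example above, this might be called as `query_in_chunks(lambda c: cxn.query(cols=c, geo_unit="county:*"), col_list, ["state", "county"])`, assuming each chunk's result includes the "state" and "county" columns as in the output shown earlier.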

Interpreter Packages:

Package Version
appdirs 1.4.4
attrs 19.3.0
beautifulsoup4 4.9.1
bs4 0.0.1
cenpy 1.0.0.post2
certifi 2020.4.5.2
chardet 3.0.4
click 7.1.2
click-plugins 1.1.1
cligj 0.5.0
distlib 0.3.0
filelock 3.0.12
Fiona 1.8.13.post1
fuzzywuzzy 0.18.0
geopandas 0.7.0
idna 2.9
importlib-metadata 1.6.1
importlib-resources 2.0.1
Jinja2 2.11.2
libpysal 4.2.2
MarkupSafe 1.1.1
munch 2.5.0
numpy 1.18.5
pandas 1.0.5
pip 20.1.1
pipenv 2020.6.2
pyproj 2.6.1.post1
python-dateutil 2.8.1
pytz 2020.1
requests 2.24.0
Rtree 0.9.4
scipy 1.4.1
setuptools 47.3.1
Shapely 1.7.0
six 1.15.0
soupsieve 2.0.1
urllib3 1.25.9
virtualenv 20.0.23
virtualenv-clone 0.5.4
zipp 3.1.0

I will fix this and make a post-release to address this concern tonight!

This should be resolved in 1.0.0.post4. We're aiming to incorporate some new features before making a 1.1 release, soon.

Thank you for the quick update -- appreciate it!