cenpy-devs / cenpy

Explore and download data from Census APIs

data columns not returned as numeric

dfolch opened this issue · comments

Previous versions of cenpy returned data columns as numeric values. Running that same code today returns objects. This is a feature request to go back to the previous behavior.

In [6]: api_conn = cen.remote.APIConnection('ACSDT5Y2018')                      

In [7]: data = api_conn.query(['B01003_001E'], geo_unit='tract', geo_filter={'state':'04', 'county':'005'})                                                                                                                                                                                     

In [8]: data.B01003_001E.dtype                                                                                                                                                             
Out[8]: dtype('O')

Yeah... IIRC this was because of a change in pandas. They removed DataFrame.convert_objects(), and DataFrame.infer_objects() has different behavior. Happy to use something like the _coerce function over in the products API, or revisit the infer_objects approach.

Should be a very simple change!

>>> api_conn = cenpy.remote.APIConnection('ACSDT5Y2018')
>>> data = api_conn.query(['B01003_001E'], geo_unit='tract', geo_filter={'state':'04', 'county':'005'})
>>> data.B01003_001E.infer_objects().dtype
dtype('O')
>>> data.B01003_001E.convert_dtypes().dtype
StringDtype
>>> data.B01003_001E.astype(int).dtype
dtype('int64')

Why does neither of these functions (infer_objects() nor convert_dtypes()) return a Series of a numeric type, while astype() does?
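For what it's worth, the difference is that infer_objects() only soft-converts object columns that already hold numeric Python objects, and convert_dtypes() promotes strings to the nullable string extension dtype; neither one parses digit strings. A small illustration of all three behaviors:

```python
import pandas as pd

# The Census API hands back every value as a quoted string, so the
# column arrives as object dtype holding Python str values.
s = pd.Series(["100", "250", "37"])

# infer_objects() never parses strings; the column stays object.
assert s.infer_objects().dtype == object

# convert_dtypes() promotes to the string extension dtype, still unparsed.
assert str(s.convert_dtypes().dtype).startswith("string")

# Parsing requires an explicit cast: astype(), or pd.to_numeric, which is
# the kind of call a _coerce-style helper can wrap.
assert pd.to_numeric(s).dtype == "int64"
```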

@ljwolf, it's looking to me like _coerce is the way to go.

Also, @dfolch, how do you get colored syntax highlighting in your markdown? Is it because you copied from a notebook?

Is there ever a case where you wouldn't want data columns to be of integer type? Of course, you never want the geography columns to be of a numeric type.

Yes, FIPS codes for geographic identifiers ought to be kept as strings.
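To illustrate why with a toy example (not cenpy code): Arizona's state FIPS code is "04", and an integer cast silently drops the leading zero.

```python
import pandas as pd

# Geography identifiers like state and county FIPS codes carry meaningful
# leading zeros; an integer cast destroys them.
fips = pd.Series(["04", "05", "013"])
as_int = fips.astype(int)

assert as_int.tolist() == [4, 5, 13]          # leading zeros gone
assert fips.tolist() == ["04", "05", "013"]   # strings keep the identifier
```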

@ljwolf, how does one get the _coerce() from products.py to be used on remote.py's APIConnection class?

In remote.py, I've tried:

  • from .products import _coerce leads to ImportError: cannot import name 'APIConnection' from partially initialized module 'cenpy.remote' (most likely due to a circular import)
  • from products import _coerce leads to ModuleNotFoundError: No module named 'products'
  • from . import products as prod leads to ImportError: cannot import name 'APIConnection' from partially initialized module 'cenpy.remote' (most likely due to a circular import)

Could we move _coerce out of products.py into tools.py, or possibly create a new utils.py file? If this is the route, then maybe move everything after line 886 out of products.py.
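A minimal sketch of what a shared utilities module could hold, assuming a hypothetical coerce signature (the real _coerce in products.py may differ). Because it imports only pandas, both products.py and remote.py could import it without a cycle:

```python
# Hypothetical cenpy/utilities.py: imports nothing from products.py or
# remote.py, so either module can use it without a circular import.
import pandas as pd

def coerce(df, cast_to=float, ignore=("state", "county", "tract")):
    """Cast every column not named in `ignore` to `cast_to`,
    leaving columns that fail to parse untouched."""
    out = df.copy()
    for col in out.columns:
        if col in ignore:
            continue
        try:
            out[col] = out[col].astype(cast_to)
        except (ValueError, TypeError):
            pass  # non-numeric column (e.g. NAME): leave as-is
    return out
```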

@dfolch I'm pretty sure this takes care of the import issues. Should all of the functions that are now in utilities.py still be private?

In products.py, can the import be from .utilities import *? As far as I understand it, the namespace wouldn't change from how it currently is. Or is it better here to be more explicit and say where certain utilities.py functions came from (in products.py and remote.py)?

Is it the case that you want all or no data columns to be converted to integers or do you want to convert all of the ones that can be converted?

I made _coerce private because it lived in products.py, and if you from cenpy import products, I wanted that to be very clean.

If coerce gets moved to utilities, then it's ok to become coerce, but when it's imported in products.py make sure to use from .utilities import coerce as _coerce.
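The underscore alias works because star imports (and casual namespace browsing) skip underscore-prefixed names when a module defines no __all__. A small demonstration of that rule, using a throwaway module and math.sqrt as a stand-in for the utilities import:

```python
import types

# Build a throwaway module that performs an aliased import, mimicking
# `from .utilities import coerce as _coerce` inside products.py.
mod = types.ModuleType("products_demo")
exec("from math import sqrt as _sqrt", mod.__dict__)

# Only public (non-underscore) names are exposed to `import *` consumers.
public = [name for name in vars(mod) if not name.startswith("_")]
assert "_sqrt" in vars(mod)      # the helper is still usable internally
assert "_sqrt" not in public     # but stays out of the public namespace
```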

@ljwolf I'm sorry, I didn't see your comment until after I made the PR. I will work on this.

What about the rest of the functions in utilities? Should they be made private as well? I'm not sure if I should be making comments here or on the PR.

Some data columns should be float, not int: most anything with 'median', 'average', or 'rate' in the table name. I've used predicateType from the variables DataFrame to do conversions (although there's at least one case I've found where the Census API returns an incorrect value).

This gives some sense of the range of valid float values in the data and also flushes out the NaNs where they creep in.
api_conn.variables[(api_conn.variables['predicateType'] != 'int') & (api_conn.variables['group'] != 'N/A')]

I've also since realized that the real problem is with the Census API, which returns numbers as quoted strings. JSON numbers shouldn't be quoted. See (and upvote) uscensusbureau/api#5

@JoeGermuska Would you still recommend using the predicateType to cast variables? It's an adaptive solution that caters to the Census API instead of casting everything to one type. This is of course assuming the predicateType provided is the correct value.

Here's a quick solution doing just that (staged inside cenpy.remote.APIConnection):

df = {some recently pulled data inside class APIConnection}

# map each queried column to the type named by its predicateType entry
type_dict = {
    k: eval(self.variables.predicateType.loc[k.upper()])
    for k in df.columns
}
df = df.astype(type_dict, errors='ignore')

Note: This would also require some cleansing of the predicateType values. Two things would need to be addressed in the variables property:

  1. Convert 'string' to 'str'
  2. Convert np.nan to 'str'

df.predicateType = df.predicateType.replace(['string', np.nan], 'str')
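The same idea can be sketched without eval() on metadata fetched over the network. Everything here is hypothetical naming (PREDICATE_DTYPES, cast_by_predicate), and it assumes variables is a DataFrame indexed by variable name:

```python
import pandas as pd

# Hypothetical explicit mapping from the API's predicateType strings to
# pandas dtypes; anything unmapped (e.g. 'string', NaN) is left alone,
# which removes the need for eval() and for cleansing NaN first.
PREDICATE_DTYPES = {"int": "int64", "float": "float64"}

def cast_by_predicate(df, variables):
    """Cast each column per its predicateType entry in `variables`;
    unknown or missing predicate types leave the column untouched."""
    out = df.copy()
    for col in out.columns:
        pred = variables["predicateType"].get(col.upper())
        if isinstance(pred, str) and pred in PREDICATE_DTYPES:
            try:
                out[col] = out[col].astype(PREDICATE_DTYPES[pred])
            except (ValueError, TypeError):
                pass  # bad value from the API: keep the string column
    return out
```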