cenpy-devs / cenpy

Explore and download data from Census APIs

data columns not returned as numeric

dfolch opened this issue · comments

Previous versions of cenpy returned data columns as numeric values. Running that same code today returns objects. This is a feature request to go back to the previous behavior.

In [6]: api_conn = cen.remote.APIConnection('ACSDT5Y2018')                      

In [7]: data = api_conn.query(['B01003_001E'], geo_unit='tract', geo_filter={'state':'04', 'county':'005'})                                                                                                                                                                                     

In [8]: data.B01003_001E.dtype                                                                                                                                                             
Out[8]: dtype('O')

Yeah... IIRC this was because of a change in pandas. They removed DataFrame.convert_objects(), and DataFrame.infer_objects() has different behavior. Happy to use something like the _coerce function over in the products API, or revisit the infer_objects approach.

Should be a very simple change!

>>> api_conn = cenpy.remote.APIConnection('ACSDT5Y2018')
>>> data = api_conn.query(['B01003_001E'], geo_unit='tract', geo_filter={'state':'04', 'county':'005'})
>>> data.B01003_001E.infer_objects().dtype
dtype('O')
>>> data.B01003_001E.convert_dtypes().dtype
StringDtype
>>> data.B01003_001E.astype(int).dtype
dtype('int64')

Why does neither of these functions (infer_objects() nor convert_dtypes()) return a Series of a numeric type, while astype() does?
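For what it's worth, the difference is that infer_objects() only soft-converts object columns that already hold numeric Python objects, and convert_dtypes() promotes strings to the nullable string extension dtype; neither one parses digit strings. A small illustration of all three behaviors:

```python
import pandas as pd

# The Census API hands back every value as a quoted string, so the
# column arrives as object dtype holding Python str values.
s = pd.Series(["100", "250", "37"])

# infer_objects() never parses strings; the column stays object.
assert s.infer_objects().dtype == object

# convert_dtypes() promotes to the string extension dtype, still unparsed.
assert str(s.convert_dtypes().dtype).startswith("string")

# Parsing requires an explicit cast: astype(), or pd.to_numeric, which is
# the kind of call a _coerce-style helper can wrap.
assert pd.to_numeric(s).dtype == "int64"
```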

@ljwolf, it's looking to me like _coerce is the way to go.

Also, @dfolch, how do you get colored syntax highlighting in your markdown? Is it because you copied from a notebook?

Is there ever a case where you wouldn't want data columns to be of integer type? Of course, you never want the geography columns to be of a numeric type.

Yes, FIPS codes for geographic identifiers ought to be kept as strings.
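To illustrate why with a toy example (not cenpy code): Arizona's state FIPS code is "04", and an integer cast silently drops the leading zero.

```python
import pandas as pd

# Geography identifiers like state and county FIPS codes carry meaningful
# leading zeros; an integer cast destroys them.
fips = pd.Series(["04", "05", "013"])
as_int = fips.astype(int)

assert as_int.tolist() == [4, 5, 13]          # leading zeros gone
assert fips.tolist() == ["04", "05", "013"]   # strings keep the identifier
```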

@ljwolf, how does one get the _coerce() from products.py to be used on remote.py's APIConnection class?

In remote.py, I've tried:

  • from .products import _coerce leads to ImportError: cannot import name 'APIConnection' from partially initialized module 'cenpy.remote' (most likely due to a circular import)
  • from products import _coerce leads to ModuleNotFoundError: No module named 'products'
  • from . import products as prod leads to ImportError: cannot import name 'APIConnection' from partially initialized module 'cenpy.remote' (most likely due to a circular import)

Could we move _coerce out of products.py into tools.py, or possibly create a new utils.py file? If this is the route, then maybe move everything after line 886 out of products.py.
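A minimal sketch of what a shared utilities module could hold, assuming a hypothetical coerce signature (the real _coerce in products.py may differ). Because it imports only pandas, both products.py and remote.py could import it without a cycle:

```python
# Hypothetical cenpy/utilities.py: imports nothing from products.py or
# remote.py, so either module can use it without a circular import.
import pandas as pd

def coerce(df, cast_to=float, ignore=("state", "county", "tract")):
    """Cast every column not named in `ignore` to `cast_to`,
    leaving columns that fail to parse untouched."""
    out = df.copy()
    for col in out.columns:
        if col in ignore:
            continue
        try:
            out[col] = out[col].astype(cast_to)
        except (ValueError, TypeError):
            pass  # non-numeric column (e.g. NAME): leave as-is
    return out
```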

@dfolch I'm pretty sure this takes care of the import issues. Should all of the functions that are now in utilities.py still be private?

In products.py, can the import be from .utilities import *? As far as I understand it, the namespace wouldn't change from how it currently is. Or is it better here to be more explicit and say where certain utilities.py functions came from (in products.py and remote.py)?

Is it the case that you want all or no data columns to be converted to integers or do you want to convert all of the ones that can be converted?

I made _coerce private because it lived in products.py, and if you from cenpy import products, I wanted that to be very clean.

If coerce gets moved to utilities, then it's ok to become coerce, but when it's imported in products.py make sure to use from .utilities import coerce as _coerce.
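The underscore alias works because star imports (and casual namespace browsing) skip underscore-prefixed names when a module defines no __all__. A small demonstration of that rule, using a throwaway module and math.sqrt as a stand-in for the utilities import:

```python
import types

# Build a throwaway module that performs an aliased import, mimicking
# `from .utilities import coerce as _coerce` inside products.py.
mod = types.ModuleType("products_demo")
exec("from math import sqrt as _sqrt", mod.__dict__)

# Only public (non-underscore) names are exposed to `import *` consumers.
public = [name for name in vars(mod) if not name.startswith("_")]
assert "_sqrt" in vars(mod)      # the helper is still usable internally
assert "_sqrt" not in public     # but stays out of the public namespace
```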

@ljwolf I'm sorry, I didn't see your comment until after I made the PR. I will work on this.

What about the rest of the functions in utilities? Should they be made private as well? I'm not sure if I should be making comments here or on the PR.

Some data columns should be float, not int: most anything with 'median', 'average', or 'rate' in the table name. I've used predicateType from the variables DataFrame to do conversions (although there's at least one case I've found where the Census API returns an incorrect value).

This gives some sense of the range of valid float values in the data and also flushes out the NaNs where they creep in.
api_conn.variables[(api_conn.variables['predicateType'] != 'int') & (api_conn.variables['group'] != 'N/A')]

I've also since realized that the real problem is with the Census API, which returns numbers as quoted strings. JSON numbers shouldn't be quoted. See (and upvote) uscensusbureau/api#5

@JoeGermuska Would you still recommend using the predicateType to cast variables? It's an adaptive solution that caters to the Census API instead of casting everything to one type. This is of course assuming the predicateType provided is the correct value.

Here's a quick solution doing just that (staged inside cenpy.remote.APIConnection):

df = {some recently pulled data inside class APIConnection}

# map each queried column to the type named by its predicateType entry
type_dict = {
    k: eval(self.variables.predicateType.loc[k.upper()])
    for k in df.columns
}
df = df.astype(type_dict, errors='ignore')

Note: This would also require some cleansing of the predicateType values. Two things would need to be addressed in the variables property:

  1. Convert 'string' to 'str'
  2. Convert np.nan to 'str'

df.predicateType = df.predicateType.replace(['string', np.nan], 'str')
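The same idea can be sketched without eval() on metadata fetched over the network. Everything here is hypothetical naming (PREDICATE_DTYPES, cast_by_predicate), and it assumes variables is a DataFrame indexed by variable name:

```python
import pandas as pd

# Hypothetical explicit mapping from the API's predicateType strings to
# pandas dtypes; anything unmapped (e.g. 'string', NaN) is left alone,
# which removes the need for eval() and for cleansing NaN first.
PREDICATE_DTYPES = {"int": "int64", "float": "float64"}

def cast_by_predicate(df, variables):
    """Cast each column per its predicateType entry in `variables`;
    unknown or missing predicate types leave the column untouched."""
    out = df.copy()
    for col in out.columns:
        pred = variables["predicateType"].get(col.upper())
        if isinstance(pred, str) and pred in PREDICATE_DTYPES:
            try:
                out[col] = out[col].astype(PREDICATE_DTYPES[pred])
            except (ValueError, TypeError):
                pass  # bad value from the API: keep the string column
    return out
```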