data columns not returned as numeric
dfolch opened this issue · comments
Previous versions of cenpy returned data columns as numeric values. Running that older code today returns columns of object dtype. This is a feature request to go back to the previous approach.
In [6]: api_conn = cen.remote.APIConnection('ACSDT5Y2018')
In [7]: data = api_conn.query(['B01003_001E'], geo_unit='tract', geo_filter={'state':'04', 'county':'005'})
In [8]: data.B01003_001E.dtype
Out[8]: dtype('O')
Yeah.... IIRC this was because of a change in pandas. They removed pandas.convert_objects(), and pandas.infer_objects() has different behavior. Happy to use things like the _coerce function over in the products API or revisit the infer_objects approach.
Should be a very simple change!
>>> api_conn = cenpy.remote.APIConnection('ACSDT5Y2018')
>>> data = api_conn.query(['B01003_001E'], geo_unit='tract', geo_filter={'state':'04', 'county':'005'})
>>> data.B01003_001E.infer_objects().dtype
dtype('O')
>>> data.B01003_001E.convert_dtypes().dtype
StringDtype
>>> data.B01003_001E.astype(int).dtype
dtype('int64')
Why does neither of these functions (infer_objects() nor convert_dtypes()) return a Series of a numeric type, while astype() does?
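One plausible explanation, sketched here under the assumption that every value in the column arrives as a Python string: infer_objects() only looks at the objects already stored and does not parse text, whereas astype(int) and pd.to_numeric() do. The sample values below are made up for illustration and are not real query results.

```python
import pandas as pd

# Hypothetical stand-in for a column returned by the API: numbers held as strings.
s = pd.Series(["4521", "3870", "5102"], dtype=object)

# infer_objects() inspects the Python objects already stored. They are str
# objects, so the result is still not a numeric dtype.
inferred = s.infer_objects()

# pd.to_numeric() and astype(int) parse the text of each value instead.
parsed = pd.to_numeric(s)
cast = s.astype(int)

print(pd.api.types.is_numeric_dtype(inferred))
print(parsed.dtype, cast.dtype)
```

If that assumption holds, pd.to_numeric() is another option alongside astype(), since it picks an integer or float type from the parsed values.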
@ljwolf, it's looking to me like _coerce is the way to go.
Also, @dfolch, how do you get colored syntax highlighting in your markdown? Is it because you copied from a notebook?
Is there ever a case where you wouldn't want data columns to be of integer type? Of course, you never want the geography columns to be of a numeric type.
Yes, FIPS codes for geographic identifiers ought to be kept as strings.
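A minimal sketch of that split, assuming the caller knows which columns are geographic identifiers. The helper name coerce_data_columns, its geo_columns parameter, and the sample values are hypothetical and are not part of cenpy.

```python
import pandas as pd


def coerce_data_columns(df, geo_columns):
    """Return a copy with non-geography columns parsed as numbers where possible."""
    out = df.copy()
    for col in out.columns:
        if col in geo_columns:
            continue  # identifiers such as FIPS codes stay as strings
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # leave columns that do not parse cleanly untouched
    return out


# Made-up rows shaped like a tract-level query result.
raw = pd.DataFrame(
    {
        "B01003_001E": ["4521", "3870"],
        "NAME": ["Tract A", "Tract B"],
        "state": ["04", "04"],
        "county": ["005", "005"],
    },
    dtype=object,
)
clean = coerce_data_columns(raw, geo_columns={"state", "county"})
print(clean.dtypes)
```

Skipping the geography columns explicitly matters because "04" and "005" would parse as numbers and lose their leading zeros.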
@ljwolf, how does one get the _coerce() from products.py to be used on remote.py's APIConnection class?
In remote.py, I've tried:
from .products import _coerce
leads to ImportError: cannot import name 'APIConnection' from partially initialized module 'cenpy.remote' (most likely due to a circular import)
from products import _coerce
leads to ModuleNotFoundError: No module named 'products'
from . import products as prod
leads to ImportError: cannot import name 'APIConnection' from partially initialized module 'cenpy.remote' (most likely due to a circular import)
Could move _coerce out of products.py into tools.py, or possibly create a new utils.py file? If this is the route, then maybe move everything after line 886 out of products.py.
@dfolch I'm pretty sure this takes care of the import issues. Should all of the functions that are now in utilities.py still be private?
In products.py, can the import be from utilities import *? As far as I understand it, the namespace wouldn't change from how it currently is. Or is it better here to be more explicit and say where certain utilities.py functions came from (in products.py and remote.py)?
Is it the case that you want all or no data columns to be converted to integers or do you want to convert all of the ones that can be converted?
I made _coerce private because it lived in products.py, and if you from cenpy import products, I wanted that to be very clean.
If coerce gets moved to utilities, then it's OK for it to become coerce, but when it's imported in products.py make sure to use from .utilities import coerce as _coerce.
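The layout being discussed can be sketched with a throwaway package. This assumes products.py imports APIConnection from remote.py, as the ImportError above suggests. The package name demo_pkg and the body of coerce are placeholders, not cenpy's actual code. The point is only that once the helper lives in a module that imports neither of the other two, both can import it without a cycle.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical three-module package mirroring the layout discussed above.
FILES = {
    "__init__.py": "",
    # utilities.py imports nothing from the package, so it cannot join a cycle.
    "utilities.py": (
        "def coerce(column, kind):\n"
        "    # Placeholder body, not cenpy's implementation.\n"
        "    try:\n"
        "        return column.astype(kind)\n"
        "    except (ValueError, TypeError):\n"
        "        return column\n"
    ),
    "remote.py": (
        "from .utilities import coerce\n"
        "class APIConnection:\n"
        "    pass\n"
    ),
    # products.py keeps the helper private by aliasing it on import.
    "products.py": (
        "from .remote import APIConnection\n"
        "from .utilities import coerce as _coerce\n"
    ),
}

root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
for name, body in FILES.items():
    (pkg / name).write_text(body)

sys.path.insert(0, str(root))
importlib.invalidate_caches()
products = importlib.import_module("demo_pkg.products")
remote = importlib.import_module("demo_pkg.remote")
print(products._coerce is remote.coerce)
```

With the alias, from cenpy import products would still expose only the underscore-prefixed name, which seems to be the cleanliness goal described above.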
@ljwolf I'm sorry, I didn't see your comment until after I made the PR. I will work on this.
What about the rest of the functions in utilities? Should they be made private as well? I'm not sure if I should be making comments here or on the PR.
Some data columns should be float, not int: most anything that has 'median', 'average' or 'rate' in the table name. I've used predicateType from the variables DataFrame to do conversions (although there's at least one case I've found where the Census API returns an incorrect value).
This gives some sense of the range of valid float values in the data and also flushes out the NaN values where they creep in.
api_conn.variables[(api_conn.variables['predicateType'] != 'int') & (api_conn.variables['group'] != 'N/A')]
I've also since realized that the real problem is with the Census API, which returns numbers as quoted strings. JSON numbers shouldn't be quoted. See (and upvote) uscensusbureau/api#5
@JoeGermuska Would you still recommend using the predicateType to cast variables? It's an adaptive solution that caters to the Census API instead of casting everything to one type. This is of course assuming the predicateType provided is the correct value.
Here's a quick solution doing just that (staged inside cenpy.remote.APIConnection):
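# df is some recently pulled data inside class APIConnection
type_dict = {
    # eval turns a type name such as 'int' or 'float' into the Python type
    k: eval(self.variables.predicateType.loc[k.upper()])
    for k in df.columns
}
# errors='ignore' leaves a column unchanged if it cannot be cast
df = df.astype(type_dict, errors='ignore')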
Note: This would also require some data cleansing of the predicateType values. Two things would need to be addressed in the variables property:
- Convert string to str
- Convert np.nan to str
df.predicateType = df.predicateType.replace(['string', np.nan], 'str')
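The same idea can be sketched without eval or the cleanup step by looking each label up in an explicit dictionary and falling back to str for 'string', NaN, or anything unrecognized. The function name cast_by_predicate_type, the PREDICATE_TYPES mapping, the RATE_001E column, and the sample values are all assumptions made for illustration. Only the predicateType column name comes from the discussion above.

```python
import numpy as np
import pandas as pd

# Assumed mapping from predicateType labels to Python types.
PREDICATE_TYPES = {"int": int, "float": float, "string": str, "str": str}


def cast_by_predicate_type(df, variables):
    """Cast each column to the type named by its predicateType label."""
    out = df.copy()
    for col in out.columns:
        # Series.get returns None when the column has no entry in variables.
        label = variables["predicateType"].get(col.upper())
        # Missing, NaN, or unrecognized labels fall back to str.
        target = PREDICATE_TYPES.get(label, str)
        try:
            out[col] = out[col].astype(target)
        except (ValueError, TypeError):
            pass  # leave the column unchanged if the cast fails
    return out


# Made-up variables table and query result.
variables = pd.DataFrame(
    {"predicateType": ["int", "float", "string", np.nan]},
    index=["B01003_001E", "RATE_001E", "NAME", "STATE"],
)
df = pd.DataFrame(
    {
        "B01003_001E": ["4521", "3870"],
        "RATE_001E": ["12.5", "7.25"],
        "NAME": ["Tract A", "Tract B"],
        "state": ["04", "04"],
    },
    dtype=object,
)
result = cast_by_predicate_type(df, variables)
print(result.dtypes)
```

An explicit lookup avoids evaluating strings that come from a remote service, at the cost of having to list every label the API might return.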