datacommonsorg / api-python

Python client library to access Data Commons

Home Page:https://www.datacommons.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Return list of lists instead of series

lukesmurray opened this issue · comments

Returning a series of lists from the api does not add much benefit to the user.

Example

# get state dcids 
state_dcids = dc.get_places_in(pd.Series(['country/USA']), 'State')[0]
# get state names
state_names = dc.get_property_values(pd.Series(state_dcids), 'name')
# naive user tries to use boolean mask to find the state alabama
state_names[state_names == ['Alabama']]   # failure since series are supposed to be 1d

Pandas series are designed to be a 1-d array so using them to store multidimensional data may throw some people off. A user can transform a list of lists to a pandas series easily if they want. The intended use case of returning pandas series was to support column assignment in dataframes. But lists of lists can be used to create dataframe columns without any conversion to a pandas series.

I suspect the majority of operations performed on the returned series will be iterating and indexing which lists support.

Results of discussion with @ACscooter @shifucun. These are the possible options.

  1. We keep the current API and potentially change the output from a series to a list of lists.
  2. We change the API so that the input is any iterable of str and the output is always a list of lists.
  3. We change the API so that the input is any iterable of str and the output can be returned as a dict if the user sets a parameter as_dict=True

Trade off and Differences between 2. and 3.

In option 2. the API has one path, it will always return a list of lists. If the user wants to get output as a dictionary the user can utilize the zip method and a dictionary comprehension.

state_dcids = dc.get_places_in(['country/USA'], 'State')[0]
state_names = dc.get_property_values(state_dcids, 'name')
state_dict = {k:v for k,v in zip(state_dcid, state_names)}  # dcid to name

The transformation to a dictionary may not be straightforward for beginners or non python programmers. Secondly, the list of lists has implied structure based on ordering. The user may prefer to have the more explicit structure of a dictionary.

As an example consider that a user gets state dcids in an IPython environment and wants to determine which dcid is associated with California. If we return a dictionary then the user can determine the dcid by scanning the output.

>>> state_dcids = dc.get_places_in(['country/USA'], 'State')[0]
>>> dc.get_property_values(state_dcids, 'name')
{ 'geoId/06': ['California'],
...
} 

We created option 3. to support dictionary output with minimal data transformation from the user. In option 3. an optional parameter as_dict can be set to True to return the output as a dictionary. The dictionary is produced using the zip construct explained earlier. The additional parameter has the benefit of being useful for programmers and approachable for beginners. The parameter does change the return type of the function which is uncommon.

Feedback from anyone is appreciated.

@lukesmurray , what do you expect the intended call pattern to be for Option 2?

Re: Option 3, it seems unnecessary for the API to implement the dictionary conversion to support random lookups (would one use the python API for that type of browsing vs the browser?) What other use cases did you have in mind that would require a dictionary response?

For option 2 the call pattern is essentially the same but less complicated and lower overhead.

Current Call Pattern

# create pandas series to get series as output. feels unnecessary
# get the first value since its a series of lists
counties = dc.get_places_in(pd.Series(['geoId/06']), 'County')[0]
# but the first value is a list so we need to make it a series
counties = pd.Series(counties)
# rest is ok
names =  dc.get_property_values(counties, 'name')
populations = dc.get_populations(counties, "Person", {})
observations = dc.get_observations(populations, "count", "measured_value", "2017", measurement_method='CenusACS5yrSurvey')
dc_data = pd.DataFrame({'dcid': counties, 'name': names, 'population': observations})

Proposed Call Pattern

# no unnecessary pandas objects. no reliance on pandas. code is shorter
counties = dc.get_places_in(['geoId/06'], 'County')[0]
names =  dc.get_property_values(counties, 'name')
populations = dc.get_populations(counties, "Person", {})
observations = dc.get_observations(populations, "count", "measured_value", "2017", measurement_method='CenusACS5yrSurvey')
dc_data = pd.DataFrame({'dcid': counties, 'name': names, 'population': observations})

Option 3 could also be used for rapid fire saving json objects with the input mapping encoded. For example get the names of all the states and save a mapping to a file. I personally have never used a dictionary output so I am not the best advocate for it. I have yet to come across an example where I needed one.

Still trying to understand the proposal. What's the difference between the proposed call pattern and the following code snippet?

# This snippet is currently supported by version 1.0.0
data = pd.DataFrame({'state': ['geoId/06']})

# Get places in California, and flatten the results out.
data['counties'] = dc.get_places_in(data['state'], 'County')
data = dc.flatten_data(data)

# Perform the rest of the calls
data['pop'] = dc.get_populations(data['counties'], 'Person')
data['obs'] = dc.get_observations(data['pop'], 'count', 'measuredValue', '2017', measurement_method='CenusACS5yrSurvey')

@ACscooter the proposed call pattern will look exactly the same.

The proposed call pattern is for the input to be an iterable. The example snippet passes Pandas Series to the API which are iterables. In the example, passing a different iterable is odd and requires an unnecessary cast but changing the type of iterable does not change the output.

The proposed call pattern is for the output to be a list of lists. In the example snippet we assign a series of lists to a Pandas column using __setitem__ (df['foo'] = ...). We could also assign a list of lists to a Pandas column and get exactly the same output. Internally Pandas calls pd.Series on the value passed to _setitem__. If we pass a series of lists to __setitem__ then the call to pd.Series is a no-op. If we pass a list of lists to __setitem__ then the call to pd.Series converts the list of lists to a pd.Series of lists and then sets the value to the converted series. Currently we do the conversion from a list to a series in the API.

Hopefully the fact that option 2 leads to similar code is seen as beneficial. If we view the snippet as a common example then we find that there is no need for series to be returned. The series will just be flattened and is simply acting as a bucket to store lists.

While writing this I realized we may want to amend the original proposal to allow for any sequence as input rather than any iterator as input. Since sequences will have inherent order and iterators may not.