
safegraph_py

safegraph_py is a Python library designed to make your experience with SafeGraph data as easy as possible.

These functions are demonstrated on SafeGraph data in this Colab Notebook.

Installation

Use the package manager pip to install safegraph_py.

pip install -q --upgrade git+https://github.com/SafeGraphInc/safegraph_py

safegraph_py_functions

Usage

from safegraph_py_functions import safegraph_py_functions as sgpy

sgpy.test_me() # returns 'Hello World' to ensure you have downloaded the library
sgpy.help() # returns a list of all active functions and their arguments in the safegraph_py library
sgpy.read_pattern_single(f_path) # returns a Pandas DF from a single patterns file
# etc. . . 

Functions

A quick note before delving into the functions: SafeGraph data contains two types of JSON columns, so two different functions are required. The unpack_json function is designed specifically for key:value JSON objects such as 'visitor_home_cbgs', 'visitor_daytime_cbgs', etc. For columns that contain a JSON array of values (a list of a single data type), such as 'visits_by_day' and 'visits_by_each_hour', there is the explode_json_array function. The two cannot be used interchangeably due to the differing structure of the JSON objects.
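For illustration, the two shapes might look like this in a toy patterns row (the values here are made up, not real SafeGraph data):

import pandas as pd

toy_df = pd.DataFrame({
    'safegraph_place_id': ['sg:abc123'],                                   # hypothetical place ID
    'visitor_home_cbgs': ['{"360610112021": 603, "460610112021": 243}'],   # key:value JSON -> unpack_json
    'visits_by_day': ['[33, 22, 33, 22, 33, 22, 22]'],                     # JSON array -> explode_json_array
})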

unpack_json(df_, json_column='visitor_home_cbgs', key_col_name='visitor_home_cbg', value_col_name='cbg_visitor_count')

The unpack_json function explodes a JSON column vertically into a new DF, one row per key:value pair (see the sketch after the list below). The default for this function is the visitor_home_cbgs column, but it can be set to any of the key:value JSON columns you might come across in SafeGraph data. NOTE: This should be used with key:value columns only -- i.e. the 'visitor_home_cbgs' column, whose values look as follows: {"360610112021": 603, "460610112021": 243, "560610112021": 106, "660610112021": 87, "760610112021": 51}

  • To change the column name where the Key from the Key:Value pair will go, simply add the argument 'key_col_name'
  • To change the column name where the Value from the Key:Value pair will go, simply add the argument 'value_col_name'
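A minimal sketch of a call, assuming the toy_df from the illustration above (the output column names follow the key_col_name and value_col_name arguments):

from safegraph_py_functions import safegraph_py_functions as sgpy

home_cbgs = sgpy.unpack_json(toy_df,
                             json_column='visitor_home_cbgs',
                             key_col_name='visitor_home_cbg',
                             value_col_name='cbg_visitor_count')
# one row per key:value pair, e.g. ('360610112021', 603) and ('460610112021', 243)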

unpack_json_fast(df, json_column='visitor_home_cbgs', key_col_name='visitor_home_cbg', value_col_name='cbg_visitor_count', chunk_n=1000)

Multithreaded version of unpack_json(); see above for more details. The parameter 'chunk_n' is the size of one chunk: the dataframe is split into len(df)//chunk_n chunks, which are then distributed across multiple threads.
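For example, with a hypothetical 100,000-row patterns DF and the default chunk_n of 1000, the frame is split into roughly 100 chunks:

home_cbgs = sgpy.unpack_json_fast(patterns_df, chunk_n=1000)  # patterns_df is assumed to be a loaded patterns DF
# len(patterns_df) // 1000 chunks are unpacked in parallel and recombined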

unpack_json_and_merge(df, json_column='visitor_home_cbgs', key_col_name='visitor_home_cbg', value_col_name='cbg_visitor_count', keep_index=False)

The unpack_json_and_merge function explodes a JSON column vertically and then merges the result back into the current DF (see the sketch after the list below). The default for this function is the visitor_home_cbgs column, but it can be set to any of the JSON columns you might come across in SafeGraph data.

  • To change the column name where the Key from the Key:Value pair will go, simply add the argument 'key_col_name'
  • To change the column name where the Value from the Key:Value pair will go, simply add the argument 'value_col_name'
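A sketch of a typical call, again assuming patterns_df is a loaded patterns DF:

merged = sgpy.unpack_json_and_merge(patterns_df,
                                    json_column='visitor_home_cbgs',
                                    key_col_name='visitor_home_cbg',
                                    value_col_name='cbg_visitor_count')
# each original patterns row is repeated once per exploded key:value pair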

unpack_json_and_merge_fast(df, json_column='visitor_home_cbgs', key_col_name='visitor_home_cbg', value_col_name='cbg_visitor_count', keep_index=False, chunk_n=1000)

Multithreaded version of unpack_json_and_merge(); see above for more details. The parameter 'chunk_n' is the size of one chunk: the dataframe is split into len(df)//chunk_n chunks, which are then distributed across multiple threads.

explode_json_array(df_, array_column='visits_by_day', value_col_name='day_visit_counts', place_key='safegraph_place_id', file_key='date_range_start', array_sequence='day', keep_index=False, zero_index=False)

The explode_json_array function is similar to the unpack_json functions, except it is designed to handle arrays of plain values (as opposed to key:value pairs). The default for this function is the 'visits_by_day' column, but it can be set to any simple array column in SafeGraph data by reassigning the array_column argument. NOTE: This function should only be used with JSON arrays of values (as opposed to key:value pairs). For instance, 'visits_by_day' holds a JSON array of values only, which appears as follows: [33, 22, 33, 22, 33, 22, 22] (see the sketch after the list below).

  • To change the column name where the array values will be displayed, simply add the argument value_col_name
  • To change the column name where the array sequence will be displayed (i.e. - days, months, hours, etc), simply add the argument array_sequence
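A sketch of a typical call on a weekly patterns DF (with the default zero_index=False, the 'day' sequence presumably starts at 1):

daily = sgpy.explode_json_array(patterns_df,
                                array_column='visits_by_day',
                                value_col_name='day_visit_counts',
                                array_sequence='day')
# one row per array element: day 1 -> 33, day 2 -> 22, and so on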

explode_json_array_fast(df, array_column='visits_by_day', value_col_name='day_visit_counts', place_key='safegraph_place_id', file_key='date_range_start', array_sequence='day', keep_index=False, zero_index=False, chunk_n=1000)

Multithreaded version of explode_json_array(); see above for more details. The parameter 'chunk_n' is the size of one chunk: the dataframe is split into len(df)//chunk_n chunks, which are then distributed across multiple threads.

read_core_folder(path_to_core, compression='gzip', *args, **kwargs)

The read_core_folder function is designed to take an unpacked Core download and read in the five core files, creating a complete Core Files DF with specified datatypes. All pandas arguments and keyword arguments can be passed through this function.
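A sketch, assuming the path is a folder of unpacked core files and that pandas keyword arguments such as usecols are forwarded to pandas.read_csv as described above:

core_df = sgpy.read_core_folder('/path/to/unpacked/core/',   # hypothetical path
                                usecols=['safegraph_place_id', 'location_name', 'naics_code'])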

read_core_folder_zip(path_to_core, compression='gzip', *args, **kwargs)

The read_core_folder_zip function is designed to take the raw zipped file you receive directly from SafeGraph and create a complete Core Files DF with specified datatypes. All pandas arguments and keyword arguments can be passed through this function.

read_geo_zip(path_to_geo, compression='gzip', *args, **kwargs)

The read_geo_zip function is designed to take the raw zipped geo file you receive directly from the SafeGraph shop and create a pandas DF. All pandas arguments and keyword arguments can be passed through this function.

read_pattern_single(f_path, compression='gzip', *args, **kwargs)

The read_pattern_single function is designed to let the user read in a single patterns file of any type (weekly or monthly) and create a pandas DF with specified datatypes. All pandas arguments and keyword arguments can be passed through this function.

read_pattern_multi(path_to_pattern, compression='gzip', *args, **kwargs)

The read_pattern_multi function is designed to read in multiple patterns files and combine them into one DF with specified datatypes. (Warning: if columns are not specified, you can run out of memory very quickly.) All pandas arguments and keyword arguments can be passed through this function.
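One way to heed that warning, assuming the pandas usecols keyword is forwarded as described above, is to read only the columns you need:

patterns_df = sgpy.read_pattern_multi('/path/to/patterns/',  # hypothetical folder of patterns files
                                      usecols=['safegraph_place_id', 'date_range_start', 'visits_by_day'])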

merge_core_pattern(core_df, patterns_df, how='inner', *args, **kwargs)

The merge_core_pattern function is designed to take a patterns DF and join it against a core DF. The resulting DF contains all of the values from your patterns DF plus the matching values from your core DF (the merge is done on 'safegraph_place_id'). All pandas arguments and keyword arguments can be passed through this function.
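A sketch, assuming core_df and patterns_df were produced by the readers above:

combined = sgpy.merge_core_pattern(core_df, patterns_df, how='inner')
# keeps every patterns column and attaches the matching core columns via 'safegraph_place_id'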

merge_socialDist_by_dates(path_to_social_dist, start_date, end_date, *args, **kwargs)

The merge_socialDist_by_dates function is designed to merge the social distancing data from a given start_date to a given end_date. The resulting DF contains all social distancing data from start_date to end_date. All pandas arguments and keyword arguments can be passed through this function.

  • start_date and end_date are strings formatted as "year-month-day", as in the sketch below
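A sketch with illustrative dates (the path is hypothetical):

sd_df = sgpy.merge_socialDist_by_dates('/path/to/social_distancing/',
                                       start_date='2020-03-01',
                                       end_date='2020-03-31')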

cbg_functions

Usage

from safegraph_py_functions import cbg_functions as sgpy

sgpy.test_me_cbg() # returns 'Hello World' to ensure you have downloaded the library
sgpy.help_cbg() # returns a list of all active functions and their arguments in the cbg_functions library
sgpy.get_cbg_field_descriptions(year) # creates a reference table of Census data columns and their definitions
# etc. . . 

Functions

get_drive_id(year, drive_ids)

This function is used to pull input files from Google Drive. It requires a year (2016, 2017, 2018, or 2019) and a dictionary of Google Drive IDs for the respective data. It is called automatically within the other functions, so knowledge of that dictionary is not necessary.

pd_read_csv_drive(id, drive, dtype=None)

This function is used to pull input files from Google Drive into pandas DataFrames. Its first required input is the Drive ID returned for the chosen year by the previous function, get_drive_id. The second required input is a Google Drive object, created automatically within the functions that use these helpers.

get_cbg_field_descriptions(year=2019)

This function authenticates and creates a PyDrive client, then builds (via the previous two functions) a pandas DataFrame describing each Census column for user reference. Information is available for 2016 to 2019.
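A minimal sketch (the column name in the comment is illustrative):

from safegraph_py_functions import cbg_functions as sgpy

field_ref = sgpy.get_cbg_field_descriptions(year=2019)
# field_ref pairs Census column names (e.g. 'B01001e1') with human-readable descriptions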

get_census_columns(columns, year)

This function authenticates and creates a PyDrive client, then builds (via the first two functions) a pandas DataFrame of Census data for every census block group present in the data, for the selected columns in the selected year (years available are 2016-2019). The input columns must be given as a list and must match the names in the reference table from the function above.
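A sketch, assuming 'B01001e1' is a valid column name from the reference table above:

pop_df = sgpy.get_census_columns(columns=['B01001e1'], year=2019)
# one row per census block group, with the requested column(s) attached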

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.


License

Apache License 2.0
