A set of tools for data wrangling and data processing, built to work with the Python library Pandas. These tools let the user give a Pandas DataFrame almost any complex structure, arranging columns and rows as if they were part of a Rubik's cube.
Share rubik with all your panda friends!
- fillna_list()
- concat_to_list()
- ungroup_list()
- ungroup_dict()
- list_to_columns()
- groupto_list()
- groupto_tuple()
- groupto_sorted_tuple()
- groupto_dict()
- groupto_set()
- groupto_sorted_set()
- extend_column()
- table()
- flat_list()
- chunkify()
- fillna_dict()
To install rubik from the terminal, first create a virtual environment (venv), then use the pip install command:
python3 -m venv venv
source venv/bin/activate
pip install git+https://github.com/josemariasosa/rubik
To make sure the installation was correct, check Rubik's version using the following command from the terminal:
python -c 'import rubik; print(rubik.__version__)'
# 2.1.0
Import the module using the rk alias for rubik.
import rubik as rk
import pandas as pd
This is a list of the functions, each with a very simple usage example.
import pandas as pd
from operator import itemgetter
pd.set_option('display.min_rows', 30)
pd.set_option('display.max_rows', 60)
pd.set_option('display.max_columns', 20)
Since Pandas version 0.23, the user must explicitly specify the number of columns that will be printed to the standard output. When the Pandas library is loaded, the set_option method sets the default to 20.
rk.fillna_list(data_frame, column_name)
From any column in a DataFrame, replace the NaN values with empty lists.
data_frame - The DataFrame we are going to work with.
column_name - A String with the column name we are going to modify.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | NaN |
1 | user-452 | [1] |
2 | user-21 | [5, 2] |
3 | user-621 | NaN |
4 | user-5512 | [3, 4] |
5 | user-25 | [1, 2, 3] |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | [ ] |
1 | user-452 | [1] |
2 | user-21 | [5, 2] |
3 | user-621 | [ ] |
4 | user-5512 | [3, 4] |
5 | user-25 | [1, 2, 3] |
The code is:
new = rk.fillna_list(original, 'Roles')
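For readers who want to see the transformation without rubik installed, here is a plain-pandas sketch of the same idea. It is not rubik's own implementation, and the sample frame is a hypothetical subset of the table above.

```python
import pandas as pd

# Hypothetical input: a shortened version of the "original" table.
original = pd.DataFrame({
    'Id': ['user-123', 'user-452', 'user-21'],
    'Roles': [float('nan'), [1], [5, 2]],
})

# Keep cells that already hold a list; give every other cell a fresh [].
new = original.copy()
new['Roles'] = new['Roles'].apply(lambda v: v if isinstance(v, list) else [])

print(new['Roles'].tolist())  # [[], [1], [5, 2]]
```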
rk.concat_to_list(data_frame, column_list, column_new_name)
Concatenate multiple columns of a data frame into a single list.
data_frame - The DataFrame we are going to work with.
column_list - A List with the column names we are going to work with.
column_new_name - A String with the column name we are going to create.
The original table is:
Entry | Id | Role 1 | Role 2 |
---|---|---|---|
0 | user-123 | 1 | 2 |
1 | user-452 | 1 | 3 |
2 | user-21 | 5 | 2 |
3 | user-621 | 3 | 1 |
4 | user-5512 | 3 | 4 |
5 | user-25 | 1 | 3 |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | [1, 2] |
1 | user-452 | [1, 3] |
2 | user-21 | [5, 2] |
3 | user-621 | [3, 1] |
4 | user-5512 | [3, 4] |
5 | user-25 | [1, 3] |
The code is:
new = rk.concat_to_list(original, ['Role 1', 'Role 2'], 'Roles')
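A rough equivalent in plain pandas, shown only to make the reshaping concrete. This is a sketch under the assumption that the source columns are dropped, as in the table above; the sample data is a hypothetical subset.

```python
import pandas as pd

# Hypothetical input: a shortened version of the "original" table.
original = pd.DataFrame({
    'Id': ['user-123', 'user-452', 'user-21'],
    'Role 1': [1, 1, 5],
    'Role 2': [2, 3, 2],
})
columns = ['Role 1', 'Role 2']

# Build one list per row from the selected columns, then drop the sources.
rows_as_lists = original[columns].to_numpy().tolist()
new = original.drop(columns=columns)
new['Roles'] = pd.Series(rows_as_lists, index=original.index)

print(new['Roles'].tolist())  # [[1, 2], [1, 3], [5, 2]]
```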
rk.ungroup_list(data_frame, column_name)
This function unnests a 'Series of Lists' in a Pandas data frame.
⚡️ Note that the number of rows for the result may increase.
data_frame - The DataFrame we are going to work with.
column_name - A String with the column name we are going to modify.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | [1, 2] |
1 | user-452 | [5, 7] |
2 | user-21 | [3] |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | 1 |
0 | user-123 | 2 |
1 | user-452 | 5 |
1 | user-452 | 7 |
2 | user-21 | 3 |
The code is:
new = rk.ungroup_list(original, 'Roles')
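The row growth mentioned in the note can be reproduced with pandas' own DataFrame.explode (available since pandas 0.25). This is a sketch of the same reshaping, not necessarily how rubik does it internally.

```python
import pandas as pd

# Input matching the "original" table above.
original = pd.DataFrame({
    'Entry': [0, 1, 2],
    'Id': ['user-123', 'user-452', 'user-21'],
    'Roles': [[1, 2], [5, 7], [3]],
})

# One output row per list element; the other columns are repeated.
new = original.explode('Roles').reset_index(drop=True)

print(new['Roles'].tolist())  # [1, 2, 5, 7, 3]
print(new['Entry'].tolist())  # [0, 0, 1, 1, 2]
print(len(original), '->', len(new))  # 3 -> 5
```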
rk.ungroup_dict(data_frame, column_name, prefix=False)
This function flattens a data frame with dictionaries in a column.
data_frame - The DataFrame we are going to work with.
column_name - A String with the column name we are going to modify.
prefix - Use the prefix argument as follows:
- False: default, regular behavior, column names are the dict keys.
- True: Use as prefix the original column name followed by an underscore.
- String: The user can give any prefix.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | {"main": 1, "secondary": 2} |
1 | user-452 | {"main": 3, "secondary": 1} |
2 | user-21 | {"main": 7} |
3 | user-621 | {"main": 2, "secondary": 6} |
4 | user-5512 | {"main": 7, "secondary": 5} |
5 | user-25 | {"main": 3} |
The new table is:
Entry | Id | main | secondary |
---|---|---|---|
0 | user-123 | 1 | 2 |
1 | user-452 | 3 | 1 |
2 | user-21 | 7 | NaN |
3 | user-621 | 2 | 6 |
4 | user-5512 | 7 | 5 |
5 | user-25 | 3 | NaN |
The code is:
new = rk.ungroup_dict(original, 'Roles')
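To illustrate what the flattening and the prefix option produce, here is a plain-pandas sketch. It is an approximation written for this README, not rubik's code, and the frame is a hypothetical subset of the table above.

```python
import pandas as pd

# Hypothetical input: a shortened version of the "original" table.
original = pd.DataFrame({
    'Id': ['user-123', 'user-452', 'user-21'],
    'Roles': [
        {'main': 1, 'secondary': 2},
        {'main': 3, 'secondary': 1},
        {'main': 7},
    ],
})

# Each dict key becomes a column; missing keys become NaN.
expanded = pd.DataFrame(original['Roles'].tolist(), index=original.index)
new = original.drop(columns='Roles').join(expanded)
print(list(new.columns))  # ['Id', 'main', 'secondary']

# Column names comparable to prefix=True: original name plus an underscore.
prefixed = expanded.add_prefix('Roles_')
print(list(prefixed.columns))  # ['Roles_main', 'Roles_secondary']
```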
rk.list_to_columns(data_frame, column_name)
This function creates multiple columns from a single column with lists.
data_frame - The DataFrame we are going to work with.
column_name - A String with the column name we are going to modify.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | [1, 2] |
1 | user-452 | [1, 3] |
2 | user-21 | [5, 2] |
3 | user-621 | [3, 1] |
4 | user-5512 | [3, 4] |
5 | user-25 | [1, 3] |
The new table is:
Entry | Id | Roles_1 | Roles_2 |
---|---|---|---|
0 | user-123 | 1 | 2 |
1 | user-452 | 1 | 3 |
2 | user-21 | 5 | 2 |
3 | user-621 | 3 | 1 |
4 | user-5512 | 3 | 4 |
5 | user-25 | 1 | 3 |
The code is:
new = rk.list_to_columns(original, 'Roles')
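The same result can be approximated in plain pandas, as sketched below. The numbering of the new columns follows the table above; everything else is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical input: a shortened version of the "original" table.
original = pd.DataFrame({
    'Id': ['user-123', 'user-452', 'user-21'],
    'Roles': [[1, 2], [1, 3], [5, 2]],
})

# Spread each list over positional columns, then name them Roles_1, Roles_2, ...
expanded = pd.DataFrame(original['Roles'].tolist(), index=original.index)
expanded.columns = ['Roles_{}'.format(i + 1) for i in range(expanded.shape[1])]
new = original.drop(columns='Roles').join(expanded)

print(list(new.columns))  # ['Id', 'Roles_1', 'Roles_2']
print(new['Roles_2'].tolist())  # [2, 3, 2]
```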
rk.groupto_list(data_frame, column_list, column_name)
Group a variable (column_name) into a single list for each combination of a group of variables (column_list).
data_frame - The DataFrame we are going to work with.
column_list - A List with the column names we are going to work with as pivot columns.
column_name - A String with the column name we are going to modify.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | 1 |
0 | user-123 | 2 |
1 | user-452 | 5 |
1 | user-452 | 7 |
2 | user-21 | 3 |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | [1, 2] |
1 | user-452 | [5, 7] |
2 | user-21 | [3] |
The code is:
new = rk.groupto_list(original, ['Entry', 'Id'], 'Roles')
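This is the inverse of ungroup_list. A minimal plain-pandas sketch of the grouping, assuming the row order of the groups should be preserved (hence sort=False), could look like this:

```python
import pandas as pd

# Input matching the "original" table above.
original = pd.DataFrame({
    'Entry': [0, 0, 1, 1, 2],
    'Id': ['user-123', 'user-123', 'user-452', 'user-452', 'user-21'],
    'Roles': [1, 2, 5, 7, 3],
})

# Collect the Roles values of each (Entry, Id) group into one list.
new = (original
       .groupby(['Entry', 'Id'], sort=False)['Roles']
       .apply(list)
       .reset_index())

print(new['Roles'].tolist())  # [[1, 2], [5, 7], [3]]
```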
rk.groupto_tuple(data_frame, column_list, column_name)
Group a variable (column_name) into a tuple for each combination of a group of variables (column_list).
data_frame - The DataFrame we are going to work with.
column_list - A List with the column names we are going to work with as pivot columns.
column_name - A String with the column name we are going to modify.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | 1 |
0 | user-123 | 2 |
1 | user-452 | 5 |
1 | user-452 | 7 |
2 | user-21 | 3 |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | (1, 2) |
1 | user-452 | (5, 7) |
2 | user-21 | (3, ) |
The code is:
new = rk.groupto_tuple(original, ['Entry', 'Id'], 'Roles')
rk.groupto_sorted_tuple(data_frame, column_list, column_name, n=0)
Group a variable (column_name) into a single sorted tuple for each combination of a group of variables (column_list). A list of tuples can be sorted by the first, second, or any other element, selected with the index n.
data_frame - The DataFrame we are going to work with.
column_list - A List with the column names we are going to work with.
column_name - A String with the column name we are going to modify.
n - An integer with the index of the sorting value (default n = 0).
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | 2 |
0 | user-123 | 1 |
1 | user-452 | 7 |
1 | user-452 | 5 |
2 | user-21 | 3 |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | (1, 2) |
1 | user-452 | (5, 7) |
2 | user-21 | (3, ) |
The code is:
new = rk.groupto_sorted_tuple(original, ['Entry', 'Id'], 'Roles')
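The sketch below reproduces the table above in plain pandas and then shows, on a small hypothetical list of pairs, what sorting by a positional index means. How rubik applies n internally is not documented here, so the second part is an assumption based on the itemgetter import shown earlier.

```python
from operator import itemgetter

import pandas as pd

# Input matching the "original" table above.
original = pd.DataFrame({
    'Entry': [0, 0, 1, 1, 2],
    'Id': ['user-123', 'user-123', 'user-452', 'user-452', 'user-21'],
    'Roles': [2, 1, 7, 5, 3],
})

# Sort the values of each group and store them as a tuple.
new = (original
       .groupby(['Entry', 'Id'], sort=False)['Roles']
       .apply(lambda s: tuple(sorted(s.tolist())))
       .reset_index())
print(new['Roles'].tolist())  # [(1, 2), (5, 7), (3,)]

# Hypothetical data: sorting pairs by the element at index 1.
pairs = [('b', 2), ('a', 9), ('c', 1)]
by_second = tuple(sorted(pairs, key=itemgetter(1)))
print(by_second)  # (('c', 1), ('b', 2), ('a', 9))
```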
rk.groupto_dict(data_frame, column_list, column_new_name)
Generate a new column with dictionaries holding the values of other columns.
data_frame - The DataFrame we are going to work with.
column_list - A List with the column names we are going to work with.
column_new_name - A String with the column name we are going to create.
The original table is:
Entry | Id | main | secondary |
---|---|---|---|
0 | user-123 | 1 | 2 |
1 | user-452 | 3 | 1 |
2 | user-21 | 7 | 3 |
3 | user-621 | 2 | 6 |
4 | user-5512 | 7 | 5 |
5 | user-25 | 3 | 3 |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | {"main": 1, "secondary": 2} |
1 | user-452 | {"main": 3, "secondary": 1} |
2 | user-21 | {"main": 7, "secondary": 3} |
3 | user-621 | {"main": 2, "secondary": 6} |
4 | user-5512 | {"main": 7, "secondary": 5} |
5 | user-25 | {"main": 3, "secondary": 3} |
The code is:
new = rk.groupto_dict(original, ['main', 'secondary'], 'Roles')
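As an illustration only, the same packing can be written with DataFrame.to_dict('records'). The snippet assumes the source columns are dropped, as in the table above, and uses a hypothetical subset of the rows.

```python
import pandas as pd

# Hypothetical input: a shortened version of the "original" table.
original = pd.DataFrame({
    'Id': ['user-123', 'user-452', 'user-21'],
    'main': [1, 3, 7],
    'secondary': [2, 1, 3],
})
columns = ['main', 'secondary']

# One dict per row, keyed by the selected column names.
records = original[columns].to_dict('records')
new = original.drop(columns=columns)
new['Roles'] = pd.Series(records, index=original.index)

print(new.loc[0, 'Roles'])  # {'main': 1, 'secondary': 2}
```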
rk.groupto_set(data_frame, column_list, column_name)
Group a variable (column_name) into a single set for each combination of a group of variables (column_list).
The returned object type is List, not an actual set.
data_frame - The DataFrame we are going to work with.
column_list - A List with the column names we are going to work with.
column_name - A String with the column name we are going to modify.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | 2 |
0 | user-123 | 1 |
1 | user-452 | 7 |
1 | user-452 | 7 |
2 | user-21 | 3 |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | [2, 1] |
1 | user-452 | [7] |
2 | user-21 | [3] |
The code is:
new = rk.groupto_set(original, ['Entry', 'Id'], 'Roles')
rk.groupto_sorted_set(data_frame, column_list, column_name)
Group a variable (column_name) into a sorted set for each combination of a group of variables (column_list).
The returned object type is List, not an actual set.
data_frame - The DataFrame we are going to work with.
column_list - A List with the column names we are going to work with.
column_name - A String with the column name we are going to modify.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | 2 |
0 | user-123 | 1 |
1 | user-452 | 7 |
1 | user-452 | 7 |
2 | user-21 | 3 |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | [1, 2] |
1 | user-452 | [7] |
2 | user-21 | [3] |
The code is:
new = rk.groupto_sorted_set(original, ['Entry', 'Id'], 'Roles')
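The difference between groupto_set and groupto_sorted_set is easiest to see side by side. The plain-pandas sketch below assumes that the unsorted variant keeps values in first-seen order, which matches the [2, 1] row in the groupto_set table; rubik itself may order them differently.

```python
import pandas as pd

# Input matching the "original" table above.
original = pd.DataFrame({
    'Entry': [0, 0, 1, 1, 2],
    'Id': ['user-123', 'user-123', 'user-452', 'user-452', 'user-21'],
    'Roles': [2, 1, 7, 7, 3],
})
grouped = original.groupby(['Entry', 'Id'], sort=False)['Roles']

# Drop duplicates but keep first-seen order (dict keys are unique and ordered).
as_set = grouped.apply(lambda s: list(dict.fromkeys(s.tolist()))).reset_index()
# Drop duplicates and sort the remaining values.
as_sorted_set = grouped.apply(lambda s: sorted(set(s.tolist()))).reset_index()

print(as_set['Roles'].tolist())         # [[2, 1], [7], [3]]
print(as_sorted_set['Roles'].tolist())  # [[1, 2], [7], [3]]
```

Both results hold lists, in line with the remark that the returned object type is List and not an actual set.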
rk.extend_column(data_frame, col_name_1, col_name_2, col_new_name)
Combine two Pandas Series whose elements are lists into a single column of lists, joining the two lists of each row.
data_frame - The DataFrame we are going to work with.
col_name_1 - A String with the column name we are going to modify.
col_name_2 - A String with the column name we are going to modify.
col_new_name - A String with the column name we are going to create.
The original table is:
Entry | Id | Roles1 | Roles2 |
---|---|---|---|
0 | user-123 | [1, 2] | [ ] |
1 | user-452 | [3, 1] | [2] |
2 | user-21 | [7] | [5, 4] |
3 | user-621 | [2, 6] | [ ] |
4 | user-5512 | [7, 5] | [1] |
5 | user-25 | [3] | [4, 5] |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | [1, 2] |
1 | user-452 | [3, 1, 2] |
2 | user-21 | [7, 5, 4] |
3 | user-621 | [2, 6] |
4 | user-5512 | [7, 5, 1] |
5 | user-25 | [3, 4, 5] |
The code is:
new = rk.extend_column(original, 'Roles1', 'Roles2', 'Roles')
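A short plain-pandas sketch of the row-wise list concatenation follows. It is an illustration under the assumption that both source columns are dropped, as in the table above, and it uses a hypothetical subset of the rows.

```python
import pandas as pd

# Hypothetical input: a shortened version of the "original" table.
original = pd.DataFrame({
    'Id': ['user-123', 'user-452', 'user-21'],
    'Roles1': [[1, 2], [3, 1], [7]],
    'Roles2': [[], [2], [5, 4]],
})

# Concatenate the two lists of each row; a + b builds a new list every time.
merged = [a + b for a, b in zip(original['Roles1'], original['Roles2'])]
new = original.drop(columns=['Roles1', 'Roles2'])
new['Roles'] = pd.Series(merged, index=original.index)

print(new['Roles'].tolist())  # [[1, 2], [3, 1, 2], [7, 5, 4]]
```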
rk.table(_list)
This function works like table() in R. It returns a data frame with the frequency of the elements in a given list.
The response is a Pandas DataFrame.
_list - The List we are going to work with.
The original list is:
[100, 103, 555, 102, 100, 100, 100, 102, 103, 103]
The new table is:
value | freq |
---|---|
100 | 4 |
103 | 3 |
102 | 2 |
555 | 1 |
The code is:
original = [100, 103, 555, 102, 100, 100, 100, 102, 103, 103]
new = rk.table(original)
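For comparison, the same frequency table can be built with collections.Counter from the standard library. This is a sketch of the idea, not rubik's implementation; the column names value and freq are taken from the table above.

```python
from collections import Counter

import pandas as pd

original = [100, 103, 555, 102, 100, 100, 100, 102, 103, 103]

# most_common() returns (value, count) pairs, most frequent first.
counts = Counter(original).most_common()
new = pd.DataFrame(counts, columns=['value', 'freq'])

print(new.to_dict('list'))
# {'value': [100, 103, 102, 555], 'freq': [4, 3, 2, 1]}
```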
rk.flat_list(_list)
Flatten a list with nested lists.
_list - The List we are going to work with.
The original list is:
[[100,[103, [555]]], 102]
The new list is:
[100, 103, 555, 102]
The code is:
original = [[100,[103, [555]]], 102]
new = rk.flat_list(original)
print(new)
# [100, 103, 555, 102]
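The flattening itself can be expressed as a small recursive function. The helper name flatten below is hypothetical, and the sketch only descends into list objects; rubik's flat_list may treat other containers differently.

```python
def flatten(items):
    """Return a flat list with the elements of arbitrarily nested lists."""
    flat = []
    for item in items:
        if isinstance(item, list):
            # Recurse into nested lists and append their elements in order.
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


print(flatten([[100, [103, [555]]], 102]))  # [100, 103, 555, 102]
```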
rk.chunkify(chunk_this_list, chunk_size)
Split a list into smaller chunks of a given size.
chunk_this_list - The List we are going to work with.
chunk_size - An integer with the number of elements in a chunk.
The original list is:
[100, 103, 555, 102, 100, 100, 100, 102, 103]
The new list is:
[[100, 103], [555, 102], [100, 100], [100, 102], [103]]
The code is:
original = [100, 103, 555, 102, 100, 100, 100, 102, 103]
new = rk.chunkify(original, 2)
print(new)
# [[100, 103], [555, 102], [100, 100], [100, 102], [103]]
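Chunking by slicing is a common way to get this behavior. The helper name chunks below is hypothetical and the sketch only covers plain lists, not the DataFrame input that chunkify also accepts.

```python
def chunks(items, size):
    """Split items into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


original = [100, 103, 555, 102, 100, 100, 100, 102, 103]
print(chunks(original, 2))
# [[100, 103], [555, 102], [100, 100], [100, 102], [103]]
```

The last chunk is shorter whenever the list length is not a multiple of the chunk size.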
rk.fillna_dict(data_frame, column_name)
From any column in a DataFrame, replace the NaN values with empty dictionaries.
data_frame - The DataFrame we are going to work with.
column_name - A String with the column name we are going to modify.
The original table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | NaN |
1 | user-452 | NaN |
2 | user-21 | {'r': 1} |
3 | user-621 | NaN |
4 | user-5512 | {'r': 2} |
5 | user-25 | NaN |
The new table is:
Entry | Id | Roles |
---|---|---|
0 | user-123 | { } |
1 | user-452 | { } |
2 | user-21 | {'r': 1} |
3 | user-621 | { } |
4 | user-5512 | {'r': 2} |
5 | user-25 | { } |
The code is:
new = rk.fillna_dict(original, 'Roles')
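One detail worth keeping in mind when filling with mutable values is that every row should receive its own dictionary. The plain-pandas sketch below, which is not rubik's implementation and uses a hypothetical subset of the table above, builds a new {} for each missing cell.

```python
import pandas as pd

# Hypothetical input: a shortened version of the "original" table.
original = pd.DataFrame({
    'Id': ['user-123', 'user-21', 'user-621'],
    'Roles': [float('nan'), {'r': 1}, float('nan')],
})

# Keep existing dicts; create a separate empty dict for every other cell.
filled = [v if isinstance(v, dict) else {} for v in original['Roles']]
new = original.copy()
new['Roles'] = pd.Series(filled, index=original.index)

print(new['Roles'].tolist())  # [{}, {'r': 1}, {}]
```

Because the comprehension evaluates {} once per row, later edits to one cell do not leak into the other filled cells.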
Get the code of the latest version here.
- version - 2.2.3 'Pareciera ser todo más oscuro acá abajo.'
  - flat_list is more compatible with Pandas.
  - I removed the versions' funny names from the code.
- version - 2.2.2 'My guitar is not too loud!'
  - Fixing edge case for the flat_list function.
- version - 2.2.1 'Never stop until the cube is done.'
  - Fixing edge case for the ungroup_dict function using math. https://docs.python.org/3/library/math.html#math.isnan
  - New function. fillna_dict.
- version - 2.2 'Pandemic leisure.'
  - Updating function. For ungroup_dict, the user may use a prefix for the new columns that will be created.
- version - 2.1 'This is the end of a decade.'
  - (deleted) New function. Expand a column with a list, into multiple columns.
  - Updating function. chunkify now receives a list or a DataFrame.
- version - 2.0. 'PyCon Latam 2019 - Puerto Vallarta.'
  - New function names. Again! In compliance with PEP8.
  - Create the rubik Package for git.
  - pip install git+https://github.com/josemariasosa/rubik
- version - 1.3.2. 'New job. New opportunities.'
  - Displaying a DataFrame in the standard output in a pretty way.
  - Once the display.max_rows is exceeded, the display.min_rows option determines how many rows are shown in the truncated repr.
- version - 1.3.1. 'Just a little bit higher. Not too much.'
  - Standardizing names and the format.
- version - 1.3. 'I should not be high in classes.'
  - Improvements in the flatDict function.
    - Avoid name clashes with the dictionary keys.
  - Adding the chunkify function.
- version - 0. 'I am not the original one, but I'm old, thought.'