Explore and transform your datasets
Exploretransform is a collection of data exploration functions and custom pipeline transformers. It aims to streamline exploratory data analysis and extend some of scikit-learn's data transformers.
For a complete guide on using this package, please refer to this article or examples.ipynb. Details about each function or class (docstrings) can be accessed using ?name.
Install from PyPI:
!pip install exploretransform
Import the exploretransform package:
import exploretransform as et
Function / Class | Description |
---|---|
loadboston | loads the Boston housing dataset |
peek | returns dtype, levels, # of observations, and first five observations for a dataframe |
explore | provides various statistics on a dataframe (zeros, inf, missing, levels, dtypes) |
nested | takes a list, series or dataframe and returns the location of nested objects |
freq | for categorical or ordinal features, provides the count, percent, and cumulative percent for each level |
plotfreq | generates a bar plot using the data generated by freq |
corrtable | generates a table of all pairwise correlations and uses the average correlation for the row and column to decide on potential drop/filter candidates |
calcdrop | analyzes corrtable output and determines which features should be filtered/dropped |
skewstats | returns the skewness statistics and magnitude for each numeric feature |
ascores | calculates various association scores (kendall, pearson, mic, dcor, spearman) between predictors and target |
ColumnSelect | custom transformer that selects columns for pipeline |
CategoricalOtherLevel | custom transformer that creates "other" level in categorical / ordinal data based on threshold |
CorrelationFilter | custom transformer that filters numeric features based on pairwise correlation |
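
As a rough illustration of the kind of table `freq` is described as producing (count, percent, and cumulative percent per level), here is a minimal pandas sketch. It does not call exploretransform; the function name `level_frequencies`, the output column names, and the toy data are assumptions made for this example, and the package's actual output may differ.

```python
import pandas as pd


def level_frequencies(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: count, percent, and cumulative percent per level."""
    # value_counts() sorts levels by descending count and skips missing values
    counts = series.value_counts()
    out = pd.DataFrame({"level": counts.index, "count": counts.to_numpy()})
    out["percent"] = 100 * out["count"] / out["count"].sum()
    out["cumulative_percent"] = out["percent"].cumsum()
    return out


# Toy categorical data, made up for illustration only
toy = pd.Series(["4", "5", "4", "24", "4", "5"], name="rad")
table = level_frequencies(toy)
print(table)
```

A cumulative-percent column like this is what makes it easy to pick a threshold for lumping rare levels into an "other" level, which is the idea behind `CategoricalOtherLevel`.
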
More examples of using the exploretransform functions and classes are contained in examples.ipynb. The docstring for any function or class can be displayed with ?, for example:
?et.explore
df, X, y = et.loadboston()
et.explore(X)
| | variable | obs | q_zer | p_zer | q_na | p_na | q_inf | p_inf | dtype |
---|---|---|---|---|---|---|---|---|---|
0 | town | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | object |
1 | lon | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
2 | lat | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
3 | crim | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
4 | zn | 506 | 372 | 73.52 | 0 | 0.0 | 0 | 0.0 | float64 |
5 | indus | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
6 | chas | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | category |
7 | nox | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
8 | rm | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
9 | age | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
10 | dis | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
11 | rad | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | category |
12 | tax | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | int64 |
13 | ptratio | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
14 | b | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
15 | lstat | 506 | 0 | 0.00 | 0 | 0.0 | 0 | 0.0 | float64 |
Column | Description |
---|---|
variable | name of variable |
obs | number of observations |
q_zer | number of zeros |
p_zer | percentage of zeros |
q_na | number of missing values |
p_na | percentage of missing values |
q_inf | number of infinite values |
p_inf | percentage of infinite values |
dtype | pandas dtype of the column |
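
To make the column definitions above concrete, the following sketch computes the same statistics with plain pandas and NumPy. It is not the package's implementation: `explore_sketch` is a hypothetical name, the toy dataframe is made up, and the sketch assumes that `obs` is the row count, that zeros and infinities are only counted for numeric columns, and that percentages are rounded to two decimals.

```python
import numpy as np
import pandas as pd


def explore_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of the summary columns described above."""
    n = len(df)

    def pct(q):
        # Percentage of rows, guarded against an empty dataframe
        return round(100 * q / n, 2) if n else 0.0

    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        # Assumption: zeros and infinities are only meaningful for numeric columns
        q_zer = int((col == 0).sum()) if numeric else 0
        q_inf = int(np.isinf(col.astype(float)).sum()) if numeric else 0
        q_na = int(col.isna().sum())
        rows.append({
            "variable": name, "obs": n,
            "q_zer": q_zer, "p_zer": pct(q_zer),
            "q_na": q_na, "p_na": pct(q_na),
            "q_inf": q_inf, "p_inf": pct(q_inf),
            "dtype": str(col.dtype),
        })
    return pd.DataFrame(rows)


# Toy data, made up for illustration only
toy_df = pd.DataFrame({
    "town": ["a", "b", "c", "d"],
    "zn": [0.0, 18.0, 0.0, np.nan],
    "rm": [6.5, np.inf, 7.1, 6.0],
})
stats = explore_sketch(toy_df)
print(stats)
```

In this toy frame, `zn` has two zeros and one missing value out of four rows, and `rm` has one infinite value, so the corresponding percentages are 50, 25, and 25.
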
- 1.0.0
  - First release
- 1.0.1 - 1.0.7
  - Minor adjustments to get package working correctly
Brian Pietracatella – bpietrac@gmail.com
Distributed under the MIT license. See LICENSE for more information.