DataMunge is a Python module for cleaning and organizing data before analysis. It includes functions for handling missing values, removing outliers, encoding categorical variables, normalizing data, reducing dimensionality, and removing duplicate rows.
pip install DataMunge
Remove outliers from a given column of a dataframe
Params:
- data: dataframe
- column: string, column name
- threshold: int, threshold value
Returns:
- dataframe
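The README does not say how the threshold is applied, so the following is only a minimal sketch of one common interpretation: treating `threshold` as a z-score cutoff. It assumes the dataframe is a pandas DataFrame, and `remove_outliers_sketch` is a hypothetical name, not DataMunge's actual function.

```python
import pandas as pd


def remove_outliers_sketch(data: pd.DataFrame, column: str, threshold: int = 3) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose z-score in `column` is within `threshold`.

    Assumption: the threshold is a z-score cutoff. Rows with a missing value
    in `column` are dropped, because their z-score is undefined.
    """
    mean = data[column].mean()
    std = data[column].std(ddof=0)  # population standard deviation
    if std == 0:
        # Every value is identical, so nothing can be an outlier.
        return data.copy()
    z_scores = (data[column] - mean) / std
    return data[z_scores.abs() <= threshold].copy()


df = pd.DataFrame({"value": [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]})
cleaned = remove_outliers_sketch(df, "value", threshold=2)
print(cleaned)
```

With a single extreme value in a small sample, the z-score can never get very large, which is why the example uses a cutoff of 2 rather than 3.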
Handle missing values in a dataframe
Params:
- data: dataframe
- strategy: string, strategy for handling missing values (mean, median or mode)
Returns:
- dataframe
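As an illustration of the three strategies, here is a minimal sketch written directly against pandas. It assumes the dataframe is a pandas DataFrame; `handle_missing_sketch` is a hypothetical name, and skipping non-numeric columns for mean and median is a choice made for this example, not documented DataMunge behavior.

```python
import pandas as pd


def handle_missing_sketch(data: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Hypothetical helper: fill missing values column by column."""
    if strategy not in ("mean", "median", "mode"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    result = data.copy()
    for col in result.columns:
        series = result[col]
        if not series.isna().any():
            continue  # nothing to fill in this column
        if strategy == "mode":
            modes = series.mode(dropna=True)
            if modes.empty:
                continue  # the column is entirely missing
            fill_value = modes.iloc[0]  # ties resolve to the smallest mode
        else:
            if not pd.api.types.is_numeric_dtype(series):
                continue  # mean and median are undefined for non-numeric data
            fill_value = series.mean() if strategy == "mean" else series.median()
        result[col] = series.fillna(fill_value)
    return result


df = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["Oslo", "Oslo", None]})
filled_mean = handle_missing_sketch(df, "mean")
filled_mode = handle_missing_sketch(df, "mode")
print(filled_mean)
print(filled_mode)
```

Note that the mean strategy leaves the text column untouched, while the mode strategy can fill both numeric and text columns.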
Encode categorical variables in a dataframe
Params:
- data: dataframe
- columns: list of strings, column names
Returns:
- dataframe
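The README does not state which encoding scheme is used. The sketch below shows one-hot encoding with `pandas.get_dummies` as one plausible choice; label encoding would be another. It assumes a pandas DataFrame, and `encode_categorical_sketch` is a hypothetical name.

```python
import pandas as pd


def encode_categorical_sketch(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode the given columns.

    Assumption: one-hot encoding. Each listed column is replaced by one
    0/1 indicator column per distinct category.
    """
    return pd.get_dummies(data, columns=columns, dtype=int)


df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
encoded = encode_categorical_sketch(df, ["color"])
print(encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category.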
Normalize data in a dataframe
Params:
- data: dataframe
- columns: list of strings, column names
Returns:
- dataframe
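"Normalize" can mean several things, and the README does not specify which. The following sketch assumes min-max scaling to the range [0, 1]; z-score standardization is an equally common alternative. It assumes a pandas DataFrame, and `normalize_sketch` is a hypothetical name.

```python
import pandas as pd


def normalize_sketch(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: min-max scale the given columns to [0, 1]."""
    result = data.copy()
    for col in columns:
        col_min = result[col].min()
        col_range = result[col].max() - col_min
        if col_range == 0:
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            result[col] = 0.0
        else:
            result[col] = (result[col] - col_min) / col_range
    return result


df = pd.DataFrame({"height": [150, 160, 170], "label": ["a", "b", "c"]})
scaled = normalize_sketch(df, ["height"])
print(scaled)
```

Columns that are not listed, such as `label` here, pass through unchanged.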
Reduce dimensionality of a dataframe using PCA
Params:
- data: dataframe
- n_components: int, number of components
Returns:
- dataframe
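To make the PCA step concrete, here is a minimal sketch that computes principal components with a singular value decomposition in NumPy. It is not DataMunge's implementation: it assumes a pandas DataFrame with only numeric columns and no missing values, and both the name `reduce_dimensionality_sketch` and the `PC1`, `PC2`, ... output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def reduce_dimensionality_sketch(data: pd.DataFrame, n_components: int) -> pd.DataFrame:
    """Hypothetical helper: project numeric data onto its first principal components."""
    X = data.to_numpy(dtype=float)
    if not 1 <= n_components <= min(X.shape):
        raise ValueError("n_components must be between 1 and min(n_rows, n_columns)")

    # PCA operates on mean-centered data.
    X_centered = X - X.mean(axis=0)

    # Rows of vt are the principal axes, ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)

    projected = X_centered @ vt[:n_components].T
    names = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(projected, columns=names, index=data.index)


df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.1, 6.0, 8.1],
    "z": [0.5, 0.4, 0.6, 0.5],
})
reduced = reduce_dimensionality_sketch(df, n_components=2)
print(reduced)
```

PCA is sensitive to column scale, so in practice it is usually applied after normalizing the data.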
Remove duplicate rows from a dataframe
Params:
- data: dataframe
Returns:
- dataframe
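For reference, the equivalent operation in plain pandas is a single call to `DataFrame.drop_duplicates`. The sketch below assumes a pandas DataFrame; `remove_duplicates_sketch` is a hypothetical name, and resetting the index afterwards is a choice made for this example.

```python
import pandas as pd


def remove_duplicates_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop fully identical rows, keeping the first occurrence."""
    return data.drop_duplicates().reset_index(drop=True)


df = pd.DataFrame({"id": [1, 1, 2], "name": ["a", "a", "b"]})
deduped = remove_duplicates_sketch(df)
print(deduped)
```

Rows count as duplicates only when every column matches, so two rows that share an `id` but differ in `name` would both be kept.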
File Structure
|-- DataMunge/
|   |-- __init__.py
|   |-- functions.py
|-- tests/
|   |-- test_functions.py
|-- setup.py
|-- README.md
|-- LICENSE
Contribution: Contributions are always welcome. If you have ideas for new features or improvements, feel free to open an issue or submit a pull request.
Note:
This module is designed to be flexible and adaptable to different types of data and use cases.
It is important to understand the underlying assumptions and limitations of each function and how they apply to your specific data before using them.
It is also recommended to test the functions on a small subset of your data before applying them to the entire dataset.