Descriptions are provided for the following modules:
- Module: dataUtils.py
- Module: ModelAnalysis.py
- Module: pipelines.py
Module dataUtils.py contains:
General | Value |
---|---|
Lines of Code (LoC) | 0231 |
Lines of documentation (LoD) | 0050 |
Empty lines (LoN) | 0042 |
Number of classes (NoC) | 0001 |
Number of functions (NoF) | 0000 |

    """
    # Author : Thomas Neuer (tneuer)
    # File Name : dataUtils.py
    # Creation Date : Thu 23 Aug 2018 16:34:43 CEST
    # Last Modified : Fri 26 Oct 2018 17:54:47 CEST
    # Description : Common utilities needed in Data Science
    """

Following packages are imported:
Package | Imported as | Imported objects |
---|---|---|
re | - | - |
pickle | - | - |
numpy | np | - |
pandas | pd | - |
xgboost | xgb | - |
matplotlib | plt | - |
sklearn | - | PCA |
This module contains following classes:
- FeatureImportance (187, 38)
General | Value |
---|---|
Start line (Start) | 0022 |
End line (End) | 0210 |
Lines of Code (LoC) | 0187 |
Lines of documentation (LoD) | 0038 |
Empty lines (LoN) | 0033 |
Number of methods | 0005 |
Number of Attributes | 0008 |
Number of parents | 0001 |

    """ Determine feature importance.

    Feature importance assessment is an important task in data science. This class
    provides several methods to determine this importance.
    1) XGB: Boosted decision trees allow for an easy and straightforward determination of
       feature importance by counting the number of times a feature was used to
       perform a cut.
    2) PCA: Principal component analysis determines which feature is responsible for the
       largest amount of explained variance.
    """

This class inherits from:
This class contains following methods:
- __init__
- fit
- assess_importance
- xgb_importance
- pca_importance
General | Value |
---|---|
Start line (Start) | 0034 |
End line (End) | 0066 |
Lines of Code (LoC) | 0031 |
Lines of documentation (LoD) | 0009 |
Empty lines (LoN) | 0005 |

    """
    Parameters
    ----------
    method : str
        Has to be in ["xgb", "PCA"] or an error will be raised. See class documentation
        for more information.
    parameter_dict : dict
        Keyword arguments for chosen method.
    """

Arguments | Default |
---|---|
self | |
method | "xgb" |
parameter_dict | None |
*args | |
**kwargs |
General | Value |
---|---|
Start line (Start) | 0066 |
End line (End) | 0090 |
Lines of Code (LoC) | 0023 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0003 |
No documentation available
Arguments | Default |
---|---|
self | |
X | |
y | None |
**kwargs |
General | Value |
---|---|
Start line (Start) | 0090 |
End line (End) | 0114 |
Lines of Code (LoC) | 0023 |
Lines of documentation (LoD) | 0008 |
Empty lines (LoN) | 0004 |

    """
    Parameters
    ----------
    plot : bool
        If True, a figure object is returned, which can be plotted or saved.
    n_best : int
        Only used if plot==True, number of features shown in the importance plot.
    """

Arguments | Default |
---|---|
self | |
plot | True |
n_best | 20 |
- Return 1: importance[0] & importance[1]
- Return 2: importance[0] & importance[1]
General | Value |
---|---|
Start line (Start) | 0114 |
End line (End) | 0156 |
Lines of Code (LoC) | 0041 |
Lines of documentation (LoD) | 0005 |
Empty lines (LoN) | 0008 |

    """ Supervised feature importance assessment for a specific goal.

    Uses the XGBoost algorithm to determine the best feature for a classification/
    regression task at hand.
    """

Arguments | Default |
---|---|
self | |
plot | True |
n_best | 20 |
- Return 1: importance & fig
- Return 2: importance
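
The class documentation describes XGB importance as counting how often each feature is used to perform a cut. The following is a minimal sketch of that counting idea only, not the module's implementation. The nested-dict tree layout and the names `count_splits` and `split_count_importance` are hypothetical. In XGBoost itself, this kind of count is what `Booster.get_score(importance_type="weight")` reports.

```python
from collections import Counter


def count_splits(tree, counts=None):
    """Recursively count how often each feature is used as a split."""
    if counts is None:
        counts = Counter()
    # In this hypothetical layout, only internal nodes carry a "feature" key.
    if "feature" in tree:
        counts[tree["feature"]] += 1
        count_splits(tree["left"], counts)
        count_splits(tree["right"], counts)
    return counts


def split_count_importance(trees, n_best=20):
    """Sum split counts over all trees and keep the n_best features."""
    total = Counter()
    for tree in trees:
        count_splits(tree, total)
    return total.most_common(n_best)


# Two toy trees (assumed structure, for illustration only).
trees = [
    {
        "feature": "age",
        "left": {"leaf": 0.1},
        "right": {
            "feature": "income",
            "left": {"leaf": -0.2},
            "right": {"leaf": 0.3},
        },
    },
    {"feature": "age", "left": {"leaf": 0.0}, "right": {"leaf": 0.1}},
]

importance = split_count_importance(trees)
print(importance)
```

Here `n_best` plays the same role as the argument of the same name above: it only limits how many features are reported.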
General | Value |
---|---|
Start line (Start) | 0156 |
End line (End) | 0210 |
Lines of Code (LoC) | 0053 |
Lines of documentation (LoD) | 0006 |
Empty lines (LoN) | 0011 |

    """ Unsupervised feature importance assessment technique.

    Uses the scikit-learn principal component analysis in order to find
    the feature explaining the largest amount of variance.
    In essence, the features are sorted by variance.
    """

Arguments | Default |
---|---|
self | |
plot | True |
n_best | 20 |
- Return 1: importance & (fig1 & fig2)
- Return 2: importance
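
How the class maps principal components back to individual features is not documented here. The sketch below shows one plausible reading, computed with plain NumPy instead of scikit-learn: the explained-variance ratio of each component is obtained from an SVD of the centered data, and each feature is scored by its absolute loadings weighted by that ratio. The scoring rule and the names `pca_explained_variance` and `rank_features` are assumptions for illustration.

```python
import numpy as np


def pca_explained_variance(X):
    """Return explained-variance ratios and principal axes of X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal axes, ordered by singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / (X.shape[0] - 1)
    return explained / explained.sum(), Vt


def rank_features(X, names):
    """Score features by |loading| weighted by explained-variance ratio."""
    ratio, components = pca_explained_variance(X)
    # Assumption: a feature matters if it loads strongly on strong components.
    scores = (np.abs(components) * ratio[:, None]).sum(axis=0)
    order = np.argsort(scores)[::-1]
    return [(names[i], float(scores[i])) for i in order]


# Synthetic data: feature "a" has by far the largest variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])

ratio, _ = pca_explained_variance(X)
ranking = rank_features(X, ["a", "b", "c"])
print(ranking)
```

Without prior scaling, this ranking mostly reflects the raw variance of each feature, which matches the remark in the docstring that the features are sorted by variance.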
A list of the used attributes:
0 trained; 1 method; 2 available; 3 namesGiven; 4 n_instances; 5 model; 6 accuracy; 7 featureNames;
This module contains following functions:
Module ModelAnalysis.py contains:
General | Value |
---|---|
Lines of Code (LoC) | 0178 |
Lines of documentation (LoD) | 0047 |
Empty lines (LoN) | 0031 |
Number of classes (NoC) | 0000 |
Number of functions (NoF) | 0002 |

    """
    # Author : Thomas Neuer (tneuer)
    # File Name : ModelAnalysis.py
    # Creation Date : Fri 26 Oct 2018 17:59:05 CEST
    # Last Modified : Sat 27 Oct 2018 22:15:47 CEST
    # Description : Some utilities to help analyze a given model.
    """

Following packages are imported:
Package | Imported as | Imported objects |
---|---|---|
pickle | - | - |
numpy | np | - |
pandas | pd | - |
keras | - | load_model |
matplotlib | plt | - |
This module contains following classes:
This module contains following functions:
- plot_confidence_in_true_label (66, 19)
- plot_confidence_in_predicted_label (66, 19)
General | Value |
---|---|
Start line (Start) | 0021 |
End line (End) | 0088 |
Lines of Code (LoC) | 0066 |
Lines of documentation (LoD) | 0019 |
Empty lines (LoN) | 0009 |

    """ Plot the confidence of the input separated by true class label.

    If predictions are not given, the other keyword arguments (data and model) must be
    given in order to be able to calculate these predictions. If predictions is not None,
    these inputs are ignored.

    Arguments
    ---------
    predProba : np.ndarray [None]
        Array of dimension (nr_examples, nr_classes) where each entry is the confidence
        in the different classes.
    data : pd.DataFrame or np.ndarray [None]
        Contains data in shape (nr_examples, nr_features). Ignored if predictions are given.
    labels : pd.Series, np.ndarray or list of integers
        Contains the labels in shape (nr_examples, ) where each class is represented by an index,
        e.g.: [0,1,0,0,3,4,1,...] for 5 (or more) classes
    model : model [None]
        Trained model, which has a method called predict_proba to predict the class confidence.
    """

Arguments | Default |
---|---|
labels | |
predProba | None |
data | None |
model | None |
- Return 1: fig
General | Value |
---|---|
Start line (Start) | 0088 |
End line (End) | 0155 |
Lines of Code (LoC) | 0066 |
Lines of documentation (LoD) | 0019 |
Empty lines (LoN) | 0009 |

    """ Plot the confidence of the input separated by predicted class label.

    If predictions are not given, the other keyword arguments (data and model) must be
    given in order to be able to calculate these predictions. If predictions is not None,
    these inputs are ignored.

    Arguments
    ---------
    predProba : np.ndarray [None]
        Array of dimension (nr_examples, nr_classes) where each entry is the confidence
        in the different classes.
    data : pd.DataFrame or np.ndarray [None]
        Contains data in shape (nr_examples, nr_features). Ignored if predictions are given.
    labels : pd.Series, np.ndarray or list of integers
        Contains the labels in shape (nr_examples, ) where each class is represented by an index,
        e.g.: [0,1,0,0,3,4,1,...] for 5 (or more) classes
    model : model [None]
        Trained model, which has a method called predict_proba to predict the class confidence.
    """

Arguments | Default |
---|---|
labels | |
predProba | None |
data | None |
model | None |
- Return 1: fig
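
The two plotting functions differ in which confidence is picked per example and how the examples are grouped. The sketch below shows only that selection step on a `predProba`-shaped array, without any plotting. It assumes that the "true label" variant looks up the probability assigned to the true class and that the "predicted label" variant uses the maximum probability. The function names are hypothetical.

```python
import numpy as np


def confidence_by_true_label(pred_proba, labels):
    """Group the probability assigned to the true class by true class."""
    pred_proba = np.asarray(pred_proba, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pick, for every row, the column given by its true label.
    conf = pred_proba[np.arange(len(labels)), labels]
    return {int(c): conf[labels == c] for c in np.unique(labels)}


def confidence_by_predicted_label(pred_proba):
    """Group the winning probability by predicted class."""
    pred_proba = np.asarray(pred_proba, dtype=float)
    predicted = pred_proba.argmax(axis=1)
    conf = pred_proba.max(axis=1)
    return {int(c): conf[predicted == c] for c in np.unique(predicted)}


# Shape (nr_examples, nr_classes), as described for predProba.
pred_proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
labels = [0, 1, 1]

by_true = confidence_by_true_label(pred_proba, labels)
by_pred = confidence_by_predicted_label(pred_proba)
print(by_true)
print(by_pred)
```

Each dictionary value could then be passed to a histogram, one per class.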
Module pipelines.py contains:
General | Value |
---|---|
Lines of Code (LoC) | 0300 |
Lines of documentation (LoD) | 0061 |
Empty lines (LoN) | 0060 |
Number of classes (NoC) | 0007 |
Number of functions (NoF) | 0000 |

    """
    # Author : Thomas Neuer (tneuer)
    # File Name : pipelines.py
    # Creation Date : Wed 22 Aug 2018 18:29:57 CEST
    # Last Modified : Sun 28 Oct 2018 10:41:19 CET
    # Description :
    """

Following packages are imported:
Package | Imported as | Imported objects |
---|---|---|
pickle | - | - |
numpy | np | - |
pandas | pd | - |
collections | - | Counter |
sklearn | - | Pipeline, FeatureUnion |
sklearn | - | BaseEstimator, TransformerMixin |
sklearn | - | OneHotEncoder, StandardScaler, LabelEncoder |
This module contains following classes:
- multipleOneHot (32, 2)
- multipleStandardScalar (29, 2)
- Binarizer (50, 10)
- GroupTransformer (37, 13)
- Logtransform (21, 2)
- RareConstructor (78, 21)
- FeatureRemover (14, 2)
General | Value |
---|---|
Start line (Start) | 0023 |
End line (End) | 0056 |
Lines of Code (LoC) | 0032 |
Lines of documentation (LoD) | 0002 |
Empty lines (LoN) | 0007 |
Number of methods | 0003 |
Number of Attributes | 0003 |
Number of parents | 0002 |

    """ Basically a wrapper of sklearn.OneHotEncoder for multiple features.
    """

This class inherits from:
- BaseEstimator
- TransformerMixin
This class contains following methods:
- __init__
- fit
- transform
General | Value |
---|---|
Start line (Start) | 0027 |
End line (End) | 0031 |
Lines of Code (LoC) | 0003 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
features | |
overwrite |
General | Value |
---|---|
Start line (Start) | 0031 |
End line (End) | 0042 |
Lines of Code (LoC) | 0010 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0002 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: self
General | Value |
---|---|
Start line (Start) | 0042 |
End line (End) | 0056 |
Lines of Code (LoC) | 0013 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0003 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: X
A list of the used attributes:
0 encDict; 1 overwrite; 2 features;
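
Since `fit` and `transform` are undocumented, the following plain-Python sketch only illustrates the general idea of one-hot encoding several features at once. It works on a dict of column lists instead of a DataFrame. The column naming scheme, the reading of `overwrite` as "drop the original column", and the helper names are assumptions.

```python
def fit_categories(columns, features):
    """Learn the sorted category list of each feature (the fit step)."""
    return {f: sorted(set(columns[f])) for f in features}


def one_hot(columns, categories, overwrite=False):
    """Add one indicator column per learned category (the transform step)."""
    out = {name: list(values) for name, values in columns.items()}
    for feature, cats in categories.items():
        for cat in cats:
            # Categories unseen during fit yield all-zero indicators.
            out[f"{feature}_{cat}"] = [int(v == cat) for v in columns[feature]]
        if overwrite:
            del out[feature]  # assumed meaning of overwrite
    return out


data = {
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "S"],
    "price": [1.0, 2.0, 3.0],
}

cats = fit_categories(data, ["color", "size"])
encoded = one_hot(data, cats, overwrite=True)
print(sorted(encoded))
```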
General | Value |
---|---|
Start line (Start) | 0056 |
End line (End) | 0086 |
Lines of Code (LoC) | 0029 |
Lines of documentation (LoD) | 0002 |
Empty lines (LoN) | 0007 |
Number of methods | 0003 |
Number of Attributes | 0003 |
Number of parents | 0002 |

    """ Basically a wrapper of sklearn.StandardScaler for multiple features.
    """

This class inherits from:
- BaseEstimator
- TransformerMixin
This class contains following methods:
- __init__
- fit
- transform
General | Value |
---|---|
Start line (Start) | 0060 |
End line (End) | 0064 |
Lines of Code (LoC) | 0003 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
features | |
overwrite | False |
General | Value |
---|---|
Start line (Start) | 0064 |
End line (End) | 0073 |
Lines of Code (LoC) | 0008 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0002 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: self
General | Value |
---|---|
Start line (Start) | 0073 |
End line (End) | 0086 |
Lines of Code (LoC) | 0012 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0003 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: X
A list of the used attributes:
0 scalerDict; 1 overwrite; 2 features;
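
As a rough illustration of per-feature standard scaling, the sketch below stores a mean and a population standard deviation per feature during fitting and applies them during transformation, which mirrors what `StandardScaler` computes. It uses only the standard library. The `_scaled` column suffix and the helper names are hypothetical, and `overwrite` is assumed to mean "replace the original column".

```python
from statistics import fmean, pstdev


def fit_scalers(columns, features):
    """Store (mean, population std) for every feature to be scaled."""
    return {f: (fmean(columns[f]), pstdev(columns[f])) for f in features}


def scale(columns, scalers, overwrite=False):
    """Apply (x - mean) / std using the statistics learned during fit."""
    out = {name: list(values) for name, values in columns.items()}
    for feature, (mean, std) in scalers.items():
        std = std or 1.0  # avoid division by zero for constant columns
        scaled = [(v - mean) / std for v in columns[feature]]
        # Hypothetical naming: keep the original unless overwrite is set.
        out[feature if overwrite else f"{feature}_scaled"] = scaled
    return out


train = {"age": [20.0, 30.0, 40.0], "city": ["a", "b", "c"]}

scalers = fit_scalers(train, ["age"])
scaled = scale(train, scalers)
print(scaled["age_scaled"])
```

Keeping the learned statistics in a dictionary keyed by feature name lets the same values be reused on unseen data.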
General | Value |
---|---|
Start line (Start) | 0086 |
End line (End) | 0137 |
Lines of Code (LoC) | 0050 |
Lines of documentation (LoD) | 0010 |
Empty lines (LoN) | 0011 |
Number of methods | 0003 |
Number of Attributes | 0004 |
Number of parents | 0002 |

    """ Performs a binary cut on certain discrete or continuous features.

    The input during initialization has to be a dictionary, where each key is
    a column name and each value is a list of 4 elements (in that order):
        - cut-off
        - operation ("==", ">", ...)
        - Positive name (given if operation is True)
        - Negative name (given if operation is False)

    """

This class inherits from:
- BaseEstimator
- TransformerMixin
This class contains following methods:
- __init__
- fit
- transform
General | Value |
---|---|
Start line (Start) | 0098 |
End line (End) | 0103 |
Lines of Code (LoC) | 0004 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
binaryDict | |
overwrite | False |
integer_encoding | True |
General | Value |
---|---|
Start line (Start) | 0103 |
End line (End) | 0106 |
Lines of Code (LoC) | 0002 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: self
General | Value |
---|---|
Start line (Start) | 0106 |
End line (End) | 0137 |
Lines of Code (LoC) | 0030 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0006 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: X
A list of the used attributes:
0 binaryDict; 1 encoding; 2 overwrite; 3 labeler;
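
The four-element specification described above can be sketched as follows. This is an illustration of the documented input format, not the class itself: the operator table, the sample columns, and the helper names are assumptions, and `integer_encode` imitates a label encoder by numbering the sorted names.

```python
import operator

# Supported comparison strings (assumed set; the docstring lists "==", ">", ...).
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def binarize(values, cutoff, op, positive, negative):
    """Return the positive name where `value op cutoff` holds, else the negative."""
    compare = OPS[op]
    return [positive if compare(v, cutoff) else negative for v in values]


def integer_encode(values):
    """Map names to integers in sorted order, as a label encoder would."""
    codes = {name: i for i, name in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Column name -> [cut-off, operation, positive name, negative name]
binary_dict = {
    "age": [40, ">", "older", "younger"],
    "country": ["US", "==", "US", "non-US"],
}
data = {"age": [25, 40, 61], "country": ["US", "DE", "US"]}

binarized = {col: binarize(data[col], *spec) for col, spec in binary_dict.items()}
encoded_age = integer_encode(binarized["age"])
print(binarized)
print(encoded_age)
```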
General | Value |
---|---|
Start line (Start) | 0137 |
End line (End) | 0175 |
Lines of Code (LoC) | 0037 |
Lines of documentation (LoD) | 0013 |
Empty lines (LoN) | 0007 |
Number of methods | 0003 |
Number of Attributes | 0002 |
Number of parents | 0002 |

    """ Groups together entries in a column under a new name.

    Give a dictionary where the key is a column in the data. The value of the
    dictionary is another dictionary where the keys are the new names and the value
    is a list of values which should be replaced.
    Example : replaceColumns = {
        "education": {"HighEducation": ["Doctorate", "Prof-school", "Master"],
                      "Assoc": ["Assoc-acdm", "Assoc-voc"]},
        "marital-status" : {"Absent": ["Married-spouse-absent", "Separated", "Widowed"]}
    }

    Every item in a list then gets substituted by its key in the corresponding column.
    """

This class inherits from:
- BaseEstimator
- TransformerMixin
This class contains following methods:
- __init__
- fit
- transform
General | Value |
---|---|
Start line (Start) | 0151 |
End line (End) | 0161 |
Lines of Code (LoC) | 0009 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
replaceColumns | |
overwrite | False |
General | Value |
---|---|
Start line (Start) | 0161 |
End line (End) | 0164 |
Lines of Code (LoC) | 0002 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: self
General | Value |
---|---|
Start line (Start) | 0164 |
End line (End) | 0175 |
Lines of Code (LoC) | 0010 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0003 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
A list of the used attributes:
0 replacer; 1 overwrite;
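
The substitution described in the docstring amounts to inverting each inner dictionary into an old-value to new-name lookup. The sketch below applies the `replaceColumns` example from the docstring to a few illustrative rows held as plain lists. The helper names and the sample rows are assumptions, and values without a group are left unchanged.

```python
def invert_groups(groups):
    """Turn {new_name: [old, ...]} into {old: new_name}."""
    return {old: new for new, olds in groups.items() for old in olds}


def group_values(columns, replace_columns):
    """Replace every listed value by the name of its group."""
    out = {name: list(values) for name, values in columns.items()}
    for col, groups in replace_columns.items():
        lookup = invert_groups(groups)
        # Values that belong to no group are kept as they are.
        out[col] = [lookup.get(v, v) for v in columns[col]]
    return out


replaceColumns = {
    "education": {
        "HighEducation": ["Doctorate", "Prof-school", "Master"],
        "Assoc": ["Assoc-acdm", "Assoc-voc"],
    },
    "marital-status": {
        "Absent": ["Married-spouse-absent", "Separated", "Widowed"],
    },
}

# Illustrative rows only.
data = {
    "education": ["Doctorate", "Bachelors", "Assoc-voc"],
    "marital-status": ["Widowed", "Never-married", "Separated"],
}

grouped = group_values(data, replaceColumns)
print(grouped)
```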
General | Value |
---|---|
Start line (Start) | 0175 |
End line (End) | 0197 |
Lines of Code (LoC) | 0021 |
Lines of documentation (LoD) | 0002 |
Empty lines (LoN) | 0006 |
Number of methods | 0003 |
Number of Attributes | 0002 |
Number of parents | 0002 |

    """ Basic log transformation on certain features
    """

This class inherits from:
- BaseEstimator
- TransformerMixin
This class contains following methods:
- __init__
- fit
- transform
General | Value |
---|---|
Start line (Start) | 0179 |
End line (End) | 0183 |
Lines of Code (LoC) | 0003 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
features | |
overwrite | False |
General | Value |
---|---|
Start line (Start) | 0183 |
End line (End) | 0186 |
Lines of Code (LoC) | 0002 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: self
General | Value |
---|---|
Start line (Start) | 0186 |
End line (End) | 0197 |
Lines of Code (LoC) | 0010 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0003 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: X
A list of the used attributes:
0 overwrite; 1 features;
General | Value |
---|---|
Start line (Start) | 0197 |
End line (End) | 0276 |
Lines of Code (LoC) | 0078 |
Lines of documentation (LoD) | 0021 |
Empty lines (LoN) | 0009 |
Number of methods | 0003 |
Number of Attributes | 0004 |
Number of parents | 0002 |

    """ Groups categories in a column to a combined rare class.

    Some categorical features might have too many categories which are either
    not important or too underrepresented to help. This pipeline helps to
    remove those categories by combining them into a "rare" class.
    """

This class inherits from:
- BaseEstimator
- TransformerMixin
This class contains following methods:
- __init__
- fit
- transform
General | Value |
---|---|
Start line (Start) | 0205 |
End line (End) | 0240 |
Lines of Code (LoC) | 0034 |
Lines of documentation (LoD) | 0015 |
Empty lines (LoN) | 0001 |

    """
    Arguments
    ---------
    features : List or string
        Indicates which columns should be cut
    cuts : List or None [None]
        All categories with fewer counts than indicated by cuts are combined into the rare class.
        If None for a feature, the counts are printed per category to the terminal and
        the user can decide on a useful cut.
    overwrite : bool [False]
        Indicates whether the column in question should be replaced or a new column
        is added.
    new_cats : -
        Value which gets substituted for categories below the threshold
    """

Arguments | Default |
---|---|
self | |
features | |
cuts | None |
new_cats | "rare" |
overwrite | False |
General | Value |
---|---|
Start line (Start) | 0240 |
End line (End) | 0243 |
Lines of Code (LoC) | 0002 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: self
General | Value |
---|---|
Start line (Start) | 0243 |
End line (End) | 0276 |
Lines of Code (LoC) | 0032 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0005 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: X
A list of the used attributes:
0 new_cats; 1 overwrite; 2 cuts; 3 features;
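
The thresholding described for `cuts` can be sketched with `collections.Counter`, which this module also imports. The sketch handles a single column and a single cut. The helper names are hypothetical, and "fewer counts than the cut" is read as a strict comparison.

```python
from collections import Counter


def fit_rare(values, cut):
    """Collect the categories that occur fewer than `cut` times."""
    counts = Counter(values)
    return {cat for cat, n in counts.items() if n < cut}


def apply_rare(values, rare, new_cat="rare"):
    """Replace every rare category by the combined class."""
    return [new_cat if v in rare else v for v in values]


values = ["a"] * 5 + ["b"] * 3 + ["c", "d"]

# Inspecting the counts first helps to choose a useful cut.
print(Counter(values).most_common())

rare = fit_rare(values, cut=3)
result = apply_rare(values, rare)
print(Counter(result))
```

Learning the rare set once and reusing it keeps the grouping identical between training and unseen data.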
General | Value |
---|---|
Start line (Start) | 0276 |
End line (End) | 0291 |
Lines of Code (LoC) | 0014 |
Lines of documentation (LoD) | 0002 |
Empty lines (LoN) | 0005 |
Number of methods | 0003 |
Number of Attributes | 0001 |
Number of parents | 0002 |

    """Drops columns from dataframe as a pipeline.
    """

This class inherits from:
- BaseEstimator
- TransformerMixin
This class contains following methods:
- __init__
- fit
- transform
General | Value |
---|---|
Start line (Start) | 0280 |
End line (End) | 0283 |
Lines of Code (LoC) | 0002 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
features |
General | Value |
---|---|
Start line (Start) | 0283 |
End line (End) | 0286 |
Lines of Code (LoC) | 0002 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0001 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: self
General | Value |
---|---|
Start line (Start) | 0286 |
End line (End) | 0291 |
Lines of Code (LoC) | 0004 |
Lines of documentation (LoD) | 0000 |
Empty lines (LoN) | 0002 |
No documentation available
Arguments | Default |
---|---|
self | |
X |
- Return 1: X
A list of the used attributes:
0 features;
This module contains following functions: