Select metrics that account for class imbalance, for example Cohen's kappa or the Matthews correlation coefficient. Choose a class-balancing algorithm, for example oversampling with SMOTE or similar methods; we can also try data generation approaches (GANs).
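To see why accuracy alone is misleading on imbalanced classes, here is a minimal sketch that computes accuracy, Cohen's kappa, and the Matthews correlation coefficient from a binary confusion matrix. The toy labels and the convention of returning 0.0 when a metric is undefined are assumptions made for illustration. In practice scikit-learn's `cohen_kappa_score` and `matthews_corrcoef` would normally be used instead of hand-written functions.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def cohen_kappa(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal label frequencies.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_expected == 1.0:
        return 0.0  # assumed convention: undefined kappa is reported as 0
    return (p_observed - p_expected) / (1.0 - p_expected)

def matthews_corrcoef(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # assumed convention: undefined MCC is reported as 0
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy data (hypothetical): 95 negatives, 5 positives,
# and a "model" that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("kappa:   ", cohen_kappa(y_true, y_pred))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))
```

The majority-class predictor looks good on accuracy, while kappa and MCC both show that it carries no information about the minority class.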
This feature requires separate analysis and should be split into at least two features (name, device). The name and device statistics are unknown; judging by the names, the number of categories can be quite large, which can be a problem. We may need a feature reduction scheme or a transformation, for example PCA, autoencoders, or something similar.
This feature has many missing values; we need to think about how to handle them.
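One possible way to handle missing values, sketched here under the assumption that the feature is numeric, is median imputation combined with a separate "was missing" indicator, so the model can still use the fact that a value was absent. The function names and the sample column are hypothetical.

```python
import math
from statistics import median

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def impute_with_indicator(values):
    """Fill missing entries with the median and return a 0/1 missing flag."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    imputed = [fill if is_missing(v) else v for v in values]
    indicator = [1 if is_missing(v) else 0 for v in values]
    return imputed, indicator

# Hypothetical column with two missing entries.
column = [3.0, None, 7.0, float("nan"), 5.0]
imputed, was_missing = impute_with_indicator(column)
print(imputed)
print(was_missing)
```

In a real pipeline the median should be computed on the training split only and then reused for validation and test data.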
This feature has NaNs; we need to think about how to handle them.
This feature is distributed unevenly. It is possible to exclude low-frequency values or to group them into one or more groups. We can also think about data generation (GANs or something similar).
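Grouping low-frequency values can be sketched as follows. The 5% threshold, the `OTHER` label, and the sample data are assumptions chosen for illustration; the right threshold depends on the actual distribution.

```python
from collections import Counter

def group_rare_categories(values, min_share=0.05, other_label="OTHER"):
    """Replace categories whose relative frequency is below min_share."""
    counts = Counter(values)
    total = len(values)
    frequent = {cat for cat, n in counts.items() if n / total >= min_share}
    return [v if v in frequent else other_label for v in values]

# Hypothetical unevenly distributed categorical feature.
raw = ["a"] * 50 + ["b"] * 45 + ["c"] * 3 + ["d"] * 2
grouped = group_rare_categories(raw, min_share=0.05)
print(Counter(grouped))
```

As with imputation, the set of frequent categories should be learned on the training data and then applied unchanged to new data.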
This feature has NaNs; we need to think about how to handle them.
This feature is distributed unevenly. It is possible to exclude low-frequency values or to group them into one or more groups. We can also think about data generation (GANs or something similar).
The click feature has low variability; it is better to exclude it from model building.
This feature has a non-normal distribution; we need to consider this when working with classifiers that are sensitive to this assumption. Think about a method for normalizing this feature.
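One candidate normalization, assuming the feature is non-negative and right-skewed (the notes do not state the shape of the distribution), is a `log1p` transform. The sketch below compares sample skewness before and after the transform on made-up values; other options such as a Box-Cox or quantile transform may fit the real data better.

```python
import math

def skewness(values):
    """Sample skewness (third standardized moment, population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: skewness reported as 0 by convention
    return m3 / m2 ** 1.5

def log1p_transform(values):
    """Apply log(1 + x); valid only for values greater than -1."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed feature with one large outlier.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 150]
transformed = log1p_transform(raw)

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(transformed))
```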
We need to choose a scheme for converting categories to numbers (one-hot encoding, embeddings, etc.); there are many approaches. There are open-source Python libraries for this (for example, Category Encoders).
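As an illustration of the simplest scheme, here is a minimal one-hot encoder in plain Python. The category values are hypothetical, and encoding unseen categories as all zeros is one possible design choice, not the only one. Library encoders would normally replace this in a real pipeline.

```python
def fit_one_hot(values):
    """Learn a stable category -> column index mapping from training values."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}

def transform_one_hot(values, mapping):
    """Encode each value as a 0/1 row using the learned mapping."""
    rows = []
    for v in values:
        row = [0] * len(mapping)
        if v in mapping:  # unseen categories stay as an all-zero row
            row[mapping[v]] = 1
        rows.append(row)
    return rows

# Hypothetical device categories.
train = ["mobile", "desktop", "tablet", "mobile"]
mapping = fit_one_hot(train)
encoded = transform_one_hot(["desktop", "mobile", "tv"], mapping)
print(mapping)
print(encoded)
```

One-hot encoding grows one column per category, so for features with a very large number of categories a more compact encoding is likely a better fit.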
Check correlations between features.
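For numeric features, a correlation check could look like the sketch below, which lists feature pairs whose absolute Pearson correlation exceeds a threshold. The feature names, the data, and the 0.9 threshold are assumptions. Pearson correlation is not meaningful for unordered categorical features, where a measure such as Cramér's V is more appropriate.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: correlation reported as 0 by convention
    return cov / (sx * sy)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical numeric features; f2 is an exact multiple of f1.
columns = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],
    "f3": [5, 1, 4, 2, 3],
}
pairs = highly_correlated_pairs(columns)
print(pairs)
```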
Feature importance can be calculated in various ways: chi-squared test, mutual information, CatBoost feature importance, SHAP values, etc.
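Of the methods listed, mutual information is simple enough to sketch directly. The example below estimates it from counts for a discrete feature and a discrete target; the data is made up, and continuous features would first need binning or a different estimator.

```python
import math
from collections import Counter

def mutual_information(feature, target):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    px = Counter(feature)
    py = Counter(target)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Hypothetical target and two candidate features.
target = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]    # fully determines the target
uninformative = ["a", "b", "a", "b"]  # independent of the target

print("informative:  ", mutual_information(informative, target))
print("uninformative:", mutual_information(uninformative, target))
```

A feature that fully determines a balanced binary target reaches the maximum of ln 2 nats, while an independent feature scores zero.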
The first model to try is CatBoost: it handles categorical features out of the box, is fast, has many settings, and supports GPU training. Its main problem is overfitting, so it is desirable to compare it with logistic regression. If the difference in accuracy is small, we can think about tuning the logistic regression. In any case, it is better to compare two models, one linear and one non-linear.