mutilabelclassification imbalanced-classification smote-oversampler feature-engineering labelencoding target-encoding automl hackerearth-solutions machinelearningchallenge data-science

Solution-for-HackerEarth-Machine-Learning-challenge (26th place)

HackerEarth Machine Learning challenge: Of Genomes And Genetics link.

Why did I take part in this competition?

Because:

I wanted to practice some feature engineering techniques for tabular data, ensembling, and so on, that I found on Kaggle link.
I wanted to take part in an ongoing competition. It made me feel like I was taking part in the Olympics, where a lot of competitors compete against each other.

Time to complete

When I found this competition, it had only 1 week left before closing, so my work concentrated on EDA and feature engineering. In this post, I will share how I did the feature engineering.

Train and Test

When performing Label Encoding below, you must encode train and test together (reference from)
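
Encoding train and test together can be sketched as follows. This is a minimal illustration, not the original solution code: the two small frames and the `Unknown` value are made up, and `pd.factorize` is just one way to assign the integer codes.

```python
import pandas as pd

# Toy stand-ins for the real train/test files (values are made up).
train = pd.DataFrame({"Status": ["Alive", "Deceased", "Alive"]})
test = pd.DataFrame({"Status": ["Deceased", "Unknown"]})

# Stack both frames so every category seen in either one gets a code.
combined = pd.concat([train, test], keys=["train", "test"])

# factorize assigns integer codes in order of first appearance.
codes, uniques = pd.factorize(combined["Status"])
combined["Status_le"] = codes

# Split back using the outer index level added by `keys`.
train_enc = combined.xs("train")
test_enc = combined.xs("test")
```

Because the codes come from the combined data, a category that only appears in test still gets a consistent integer instead of failing at prediction time.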

Feature Engineering Techniques

Label Encode (Categorical features)

Features with 2 values:
- Genes in mother's side
- Birth defects
- History of anomalies in previous pregnancies
- Assisted conception IVF/ART
- H/O serious maternal illness
- Folic acid details (peri-conceptional)
- Place of birth
- Heart Rate (rates/min
- Respiratory Rate (breaths/min)
- Follow-up
- Inherited from father
- Maternal gene
- Status
- Paternal gene

I chose the encoded values based on what each feature's values mean (for example, whether they express a quantity or an order). For example, for the Follow-up feature: High --> 2, Low --> 1.
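
A hand-written mapping like the Follow-up example can be applied with `Series.map`. The sketch below uses a made-up four-row frame, and the output column name `Follow-up_le` is hypothetical.

```python
import pandas as pd

# Toy data; the real column comes from the competition files.
df = pd.DataFrame({"Follow-up": ["High", "Low", None, "High"]})

# Hand-chosen codes that keep the natural order: Low < High.
follow_up_codes = {"Low": 1, "High": 2}

# Values missing from the mapping (including NaN) stay NaN after map().
df["Follow-up_le"] = df["Follow-up"].map(follow_up_codes)
```
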
Features with more than 2 values:
- Same approach as before, but some values such as -, Not available, Not applicable, ... also appear, so I had to assign labels to them as well.
- From some text features, such as Location of Institute, Institute Name, Family Name, and Father's name, I extracted new features and then encoded them:
  - Location of Institute: for example, 125 PARKER HILL AV\nJAMAICA PLAIN, MA 02120\n(42.329611374844326, -71.10616871232227). From this I created the following features:
    1. JAMAICA PLAIN : district
    2. MA 02120 : POST CODE
    3. 42.329611374844326 : Latitude
    4. -71.10616871232227 : Longitude
  - Then I hash-encoded: district, POST CODE, Family Name, Father's name
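
The location parsing and hashing steps could look roughly like this. It is a sketch under assumptions: `parse_location`, `hash_code`, and `n_buckets` are hypothetical names, the three-line address layout is taken from the single example above, and MD5 modulo a bucket count is only one possible hashing scheme (the original does not say which one was used).

```python
import hashlib


def parse_location(text):
    """Split a 'street / district, post code / (lat, lon)' string into parts."""
    lines = [part.strip() for part in text.split("\n")]
    city_line, coord_line = lines[1], lines[2]
    district, post_code = [p.strip() for p in city_line.split(",", 1)]
    lat, lon = (float(v) for v in coord_line.strip("()").split(","))
    return {
        "district": district,
        "post_code": post_code,
        "latitude": lat,
        "longitude": lon,
    }


def hash_code(value, n_buckets=1000):
    """Map a string to a stable integer bucket in [0, n_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


example = (
    "125 PARKER HILL AV\n"
    "JAMAICA PLAIN, MA 02120\n"
    "(42.329611374844326, -71.10616871232227)"
)
parts = parse_location(example)
district_code = hash_code(parts["district"])
```

A hash from `hashlib` is used instead of Python's built-in `hash()` because the built-in one is salted per process for strings, so its codes would change between runs.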

Transforming

Log transform some numerical features: 'Patient Age', 'Blood cell count (mcL)', "Mother's age", "Father's age", 'White Blood cell count (thousand per microliter)'
Interaction (ratio): create ratio columns, for example:

    df['patient_per_mom'] = df['Patient Age'] / df["Mother's age"]
    df['patient_per_dad'] = df['Patient Age'] / df["Father's age"]
    df['age_per_bcc'] = df['Patient Age'] / df['Blood cell count (mcL)']
    df['age_per_wbcc'] = df['Patient Age'] / df['White Blood cell count (thousand per microliter)']
    df['wbcc_per_bcc'] = df['White Blood cell count (thousand per microliter)'] / df['Blood cell count (mcL)']

Coordinate features:

    lat = df["latitude"]
    lon = df["longtitude"]
    df["x_dimen"] = np.cos(lat) * np.cos(lon)
    df["y_dimen"] = np.cos(lat) * np.sin(lon)
    df["z_dimen"] = np.sin(lat)
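
The three transforms above can be tried together on a toy frame. This is a hedged variant rather than the original code. It assumes `log1p` for the log transform (the text does not say which logarithm was used), guards the ratio against a zero denominator, and converts the coordinates to radians first, on the assumption that latitude and longitude are stored in degrees as in the location example. NumPy's trigonometric functions expect radians.

```python
import numpy as np
import pandas as pd

# Toy rows; the second row uses lat=0, lon=90 so the result is easy to check.
df = pd.DataFrame({
    "Patient Age": [0.0, 7.0],
    "Mother's age": [20.0, 35.0],
    "latitude": [42.329611374844326, 0.0],
    "longtitude": [-71.10616871232227, 90.0],
})

# Log transform: log1p(x) = log(1 + x), so an age of 0 maps to 0, not -inf.
df["Patient Age_log"] = np.log1p(df["Patient Age"])

# Ratio: a zero denominator becomes NaN so the ratio is NaN instead of inf.
df["patient_per_mom"] = df["Patient Age"] / df["Mother's age"].replace(0, np.nan)

# Coordinates: convert degrees to radians before the trigonometric calls.
lat = np.radians(df["latitude"])
lon = np.radians(df["longtitude"])
df["x_dimen"] = np.cos(lat) * np.cos(lon)
df["y_dimen"] = np.cos(lat) * np.sin(lon)
df["z_dimen"] = np.sin(lat)
```

With the conversion in place, each row's (x_dimen, y_dimen, z_dimen) is a point on the unit sphere, so nearby locations get nearby feature values.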

Create IS_NULL flags for some impactful features
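
A missing-value flag is a 0/1 column marking where the original value was absent. In the sketch below the two columns and the `_IS_NULL` suffix are illustrative choices; the text does not list which features were flagged.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column.
df = pd.DataFrame({
    "Maternal gene": ["Yes", np.nan, "No"],
    "Symptom 1": [1.0, 0.0, np.nan],
})

# Hypothetical selection of features to flag.
flag_cols = ["Maternal gene", "Symptom 1"]

for col in flag_cols:
    # 1 where the value is missing, 0 otherwise.
    df[col + "_IS_NULL"] = df[col].isnull().astype(int)
```
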

Target Encoding
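
Target encoding replaces each category with a statistic of the target for that category. The sketch below is a simplified illustration, not the original code: it assumes a binary `target` column (the competition targets are multi-class, which would need one encoding per class), and the `smoothing` value and the `_te` suffix are hypothetical.

```python
import pandas as pd

# Toy data; category names and targets are made up.
train = pd.DataFrame({
    "Institute Name": ["A", "A", "B", "B", "B"],
    "target": [1, 0, 1, 1, 1],
})
test = pd.DataFrame({"Institute Name": ["A", "B", "C"]})

# Statistics are computed on train only, so no test information leaks in.
global_mean = train["target"].mean()
stats = train.groupby("Institute Name")["target"].agg(["mean", "count"])

# Shrink each category mean toward the global mean; rare categories shrink more.
smoothing = 2.0  # hypothetical strength
smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
    stats["count"] + smoothing
)

train["Institute Name_te"] = train["Institute Name"].map(smoothed)
# Categories never seen in train fall back to the global mean.
test["Institute Name_te"] = test["Institute Name"].map(smoothed).fillna(global_mean)
```

Encoding the training rows with their own targets, as done here for brevity, can overfit; computing the encoding out-of-fold is a common way to reduce that.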

USING SMOTE TO DEAL WITH AN IMBALANCED DATASET
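
SMOTE creates synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy function below, with the hypothetical name `smote_sketch`, only illustrates that idea. It is not the code used in this solution. In practice the `SMOTE` class from the imbalanced-learn package is a common choice.

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X[base] + gap * (X[picked] - X[base])


# Four minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=10, k=2)
```

Every synthetic row lies on a segment between two existing minority rows, so oversampling should be applied to the training split only, after the train/validation split.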

MODEL

Because time was limited, I chose AutoML for the model. This is the source code of my solution.
