GVRQ / J2D_Data-Science_2022


Results of Hackathon: Ranking #1 🥇

First Place

Connect with me:

Alexander Gavrilov, Data Scientist: Telegram, Email, LinkedIn

Air quality classification

Background

The Paris Agreement is an international treaty on climate change adopted by 196 Parties at COP21 in Paris. Its goal is to limit global warming to well below 2 degrees Celsius, preferably to 1.5, compared to pre-industrial levels. To reach this long-term temperature goal, countries aim to peak greenhouse gas emissions as soon as possible and achieve a climate-neutral planet by mid-century. For this reason, the European Union is allocating substantial resources to new technologies that help fight pollution. One of these is a new type of laser-based sensor that detects air quality from several measurements.

We have two datasets (train.csv and test.csv) with two kinds of variables:

  • Features: The dataset contains 8 features in 8 columns, which are the parameters measured by the different sensors. These correspond to the different interactions that the laser beams have had when passing through the air particles.

  • Target: The target is the 'label' that classifies the quality of the air:

      • Target 0 = Good air quality
      • Target 1 = Moderate air quality
      • Target 2 = Dangerous air quality

Datasets:

  • train.csv: This dataset contains both the predictor variables and the type of air quality classification.
  • test.csv: This dataset contains only the predictor variables; the type of air quality has to be predicted from them.
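To make the data layout concrete, here is a minimal sketch of loading the training data and separating the predictors from the target. The column names (`feature1` to `feature8`, `target`) are assumptions for illustration, not the actual headers, and a tiny in-memory CSV stands in for train.csv.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for train.csv; the header names are hypothetical.
csv_text = """feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,target
0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0
1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2
"""

# With the real file this would be pd.read_csv("train.csv").
train = pd.read_csv(io.StringIO(csv_text))

# Split the 8 predictor columns from the label column.
X = train.drop(columns=["target"])
y = train["target"]

# Map the numeric classes to the meanings listed above.
LABELS = {0: "Good", 1: "Moderate", 2: "Dangerous"}
readable = y.map(LABELS)
print(readable.tolist())
```

test.csv would be read the same way, except that it has no target column to drop.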

Problem

To predict the type of air quality in the test dataset, we build a predictive model using Random Forest.

Results

The results are in the 'predictions.csv' file.

Model: Random Forest with tuned hyperparameters. After training several models, the best result obtained with the selected model is an F1 score of 0.9.
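For readers unfamiliar with the metric, the sketch below computes a multi-class F1 score by hand. The README does not say which averaging was used, so macro averaging (the unweighted mean of per-class F1) is an assumption here, and the labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for the three air quality classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 4))
```

`sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity.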


Analysis

The dataset contains 8 features.

  • We analyzed the target: the classes are balanced.

  • We analyzed the correlations between features.

  • We analyzed the feature importances.
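The three checks above can be sketched as follows. This is an illustration on synthetic data, not the notebook's code: the column names, the 0.9 correlation threshold, and the small forest size are all assumptions, and features 5 and 6 are made correlated on purpose.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for train.csv: 8 features and a 3-class target.
feature_cols = [f"feature{i}" for i in range(1, 9)]  # hypothetical names
train = pd.DataFrame(rng.normal(size=(300, 8)), columns=feature_cols)
train["feature6"] = train["feature5"] + 0.1 * rng.normal(size=300)
train["target"] = rng.integers(0, 3, size=300)

# 1. Class balance: share of rows per target value.
class_share = train["target"].value_counts(normalize=True).sort_index()

# 2. Pairwise correlations, listing pairs above an (assumed) threshold.
corr = train[feature_cols].corr().abs()
high_pairs = [
    (a, b)
    for i, a in enumerate(feature_cols)
    for b in feature_cols[i + 1:]
    if corr.loc[a, b] > 0.9
]

# 3. Impurity-based feature importances from a fitted forest.
model = RandomForestClassifier(n_estimators=50, random_state=1990)
model.fit(train[feature_cols], train["target"])
importances = pd.Series(
    model.feature_importances_, index=feature_cols
).sort_values(ascending=False)

print(class_share)
print(high_pairs)
print(importances)
```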

Solution

The correlation analysis showed a high correlation between Features 5 and 6. We kept both, because removing either one made the predictions worse. The feature importance analysis showed that Features 3 and 6 are the most important and Features 7 and 8 the least important. Features 7 and 8 were removed to reduce noise.
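One way to check decisions like these is to compare cross-validated scores for different feature subsets. The sketch below does this on synthetic data. The column names, the macro F1 scoring, and the small forest are assumptions, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: feature 6 is strongly correlated with feature 5,
# and the 3-class target depends on features 3 and 6.
feature_cols = [f"feature{i}" for i in range(1, 9)]  # hypothetical names
X = pd.DataFrame(rng.normal(size=(300, 8)), columns=feature_cols)
X["feature6"] = X["feature5"] + 0.1 * rng.normal(size=300)
y = pd.qcut(X["feature3"] + X["feature6"], q=3, labels=False)


def cv_macro_f1(columns):
    """Mean 5-fold macro F1 of a small forest trained on the given columns."""
    model = RandomForestClassifier(n_estimators=50, random_state=1990)
    return cross_val_score(model, X[columns], y, cv=5, scoring="f1_macro").mean()


candidates = {
    "all 8 features": feature_cols,
    "without feature 5": [c for c in feature_cols if c != "feature5"],
    "without features 7 and 8": feature_cols[:6],
}
scores = {name: cv_macro_f1(cols) for name, cols in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A subset is worth keeping when its score is at least as good as the full feature set.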

Params:

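  • RandomForestClassifier(random_state = 1990)
  • 'bootstrap': True
  • 'criterion': 'gini'
  • 'max_depth': 16
  • 'max_features': 3
  • 'max_leaf_nodes': 128
  • 'n_estimators': 256
  • 'n_jobs': 4
  • 'verbose': 4
  • 'cv': 5 (number of cross-validation folds; this is a setting of the cross-validation step, not an argument of RandomForestClassifier)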
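A minimal sketch of how these settings could fit together is shown below. It assumes the tuning was done with scikit-learn's `GridSearchCV`, that `cv` and `verbose` were passed to the search while `n_jobs` was set on the classifier, and that macro F1 was the scoring metric. None of this is confirmed by the README. Synthetic data with 6 features stands in for the training set after dropping Features 7 and 8.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data with 6 features (stand-in for the reduced train set).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Grid containing only the reported best values; a real search would
# list several candidates per parameter.
param_grid = {
    "bootstrap": [True],
    "criterion": ["gini"],
    "max_depth": [16],
    "max_features": [3],
    "max_leaf_nodes": [128],
    "n_estimators": [256],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=1990, n_jobs=4),
    param_grid=param_grid,
    scoring="f1_macro",  # assumption: the averaging is not stated
    cv=5,
    verbose=4,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))

# The refitted best model predicts the air quality class (0, 1 or 2).
predictions = search.predict(X[:5])
```

With the real data, `search.predict` would be applied to the test.csv features and the result written to predictions.csv.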

License
