gathuruM / WorldQuant-Data-Science-Projects


⭐Applied Data Science Lab [2023]


The Applied Data Science Lab, offered by WorldQuant University, is an immersive online program that equipped me with practical skills for tackling complex, real-world problems.
Throughout the program I completed a series of end-to-end data science projects, building hands-on proficiency in data wrangling, analysis, model building and effective communication of results.


  • Imported multiple CSV files from a private repository into a pandas DataFrame using for loops
  • Created preliminary and exploratory histograms, scatter plots, box-and-whisker plots and bar charts
  • Examined the relationship between variables by assessing Pearson correlation coefficients
  • Cleaned and wrangled the raw data with a custom wrangle function (see the sketch after this list)
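
The snippet below is a minimal sketch of that import-and-wrangle step; the directory name, column names and cleaning rules are hypothetical stand-ins for the private data used in the project.

```python
import glob

import pandas as pd


def wrangle(filepath):
    """Read one CSV file and apply basic cleaning (hypothetical columns)."""
    df = pd.read_csv(filepath)
    # Drop rows with a missing target and trim extreme outliers
    df = df.dropna(subset=["price"])
    df = df[df["price"] < df["price"].quantile(0.99)]
    return df


# Loop over every CSV in the (private) data directory and stack the results
frames = []
for path in glob.glob("data/*.csv"):
    frames.append(wrangle(path))
df = pd.concat(frames, ignore_index=True)

# Quick exploratory checks: a histogram and pairwise Pearson correlations
df["price"].hist(bins=30)
print(df.select_dtypes("number").corr()["price"].sort_values())
```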

  • Built ML pipelines with scikit-learn's Ridge, OneHotEncoder, SimpleImputer, LinearRegression and make_pipeline (see the sketch after this list)
  • Applied L2 regularization (Ridge) to prevent overfitting in Linear Regression models
  • Created an interactive dashboard with the ipywidgets library to generate predictions from different input features (a sketch follows the snapshot below)
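
As a rough illustration of the first two bullets, here is a small pipeline of that shape fit on made-up data; the column names, the use of category_encoders' OneHotEncoder (which encodes only the categorical columns of a DataFrame), and the alpha value are assumptions rather than the project's exact settings.

```python
import pandas as pd
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical wrangled data: numeric and categorical features, continuous target
df = pd.DataFrame({
    "surface_covered_in_m2": [55, 70, 120, 45, 95, 80, 60, 105],
    "neighborhood": ["A", "B", "A", "C", "B", "C", "A", "B"],
    "price": [110_000, 150_000, 260_000, 90_000, 210_000, 175_000, 120_000, 220_000],
})
X, y = df.drop(columns="price"), df["price"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Encode categoricals, fill missing values, then fit an L2-regularized (Ridge) regression
model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy="mean"),
    Ridge(alpha=1.0),
)
model.fit(X_train, y_train)
print("Training MAE:", mean_absolute_error(y_train, model.predict(X_train)))
print("Validation MAE:", mean_absolute_error(y_val, model.predict(X_val)))
```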

Snapshot of the interactive dashboard
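
A rough sketch of how a dashboard like the one pictured above can be wired up with ipywidgets.interact; it assumes a fitted pipeline named model (such as the one sketched earlier), and the widget names and ranges are illustrative.

```python
import pandas as pd
from ipywidgets import Dropdown, FloatSlider, interact


def make_prediction(surface_covered_in_m2, neighborhood):
    """Build a one-row DataFrame from the widget values and run it through the fitted pipeline."""
    data = pd.DataFrame({
        "surface_covered_in_m2": [surface_covered_in_m2],
        "neighborhood": [neighborhood],
    })
    prediction = model.predict(data)[0]
    return f"Predicted price: {prediction:,.2f}"


# Each widget change re-runs make_prediction and refreshes the displayed prediction
interact(
    make_prediction,
    surface_covered_in_m2=FloatSlider(min=30, max=200, step=5, value=80),
    neighborhood=Dropdown(options=["A", "B", "C"]),
)
```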

  • Connected to a MongoDB server with the pymongo library to locate and extract the required data (ETL).
  • Applied rolling average, autocorrelation and lag operations to time series variables.
  • Utilized Train Test Split procedures to create proper train and test datasets for a Linear Regression model.
  • Built, explored and interpreted autocorrelation (ACF) and partial autocorrelation (PACF) plots.
  • Using statsmodels, constructed autoregressive (AR) and ARMA models and validated them via walk-forward validation (see the sketch after this list).
  • Tuned the number of lagged observations and the moving-average window size via GridSearchCV.
  • Identified an optimal balance between model performance and computational cost.
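
A condensed sketch of that time-series workflow, from a pymongo extraction to a walk-forward-validated AutoReg model; the connection string, database, collection and field names are hypothetical, and the fixed lag order shown would in practice be tuned as described above.

```python
import numpy as np
import pandas as pd
from pymongo import MongoClient
from sklearn.metrics import mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg

# Extract the readings from MongoDB into a time-indexed Series (hypothetical names)
client = MongoClient("mongodb://localhost:27017")
collection = client["air_quality"]["readings"]
records = collection.find({}, {"timestamp": 1, "value": 1, "_id": 0})
y = (
    pd.DataFrame(records)
    .set_index("timestamp")["value"]
    .resample("1H")
    .mean()
    .ffill()
)

# Exploratory transforms and diagnostics: rolling mean, lag feature, ACF/PACF plots
rolling_mean = y.rolling(window=24).mean()
lag_1 = y.shift(1)
plot_acf(y)
plot_pacf(y)

# Chronological train/test split (no shuffling for time series)
cutoff = int(len(y) * 0.9)
y_train, y_test = y.iloc[:cutoff], y.iloc[cutoff:]

# Walk-forward validation: refit on the growing history, forecast one step ahead
history = y_train.copy()
predictions = []
for timestamp in y_test.index:
    model = AutoReg(history, lags=24).fit()
    predictions.append(np.asarray(model.forecast(steps=1))[0])
    history = pd.concat([history, y_test.loc[[timestamp]]])

print("Walk-forward MAE:", mean_absolute_error(y_test, predictions))
```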

  • Connected to a SQL database and wrangled data using SQL magic commands and the sqlite3 library
  • Executed a randomized train-test split to create proper training, validation and test datasets
  • Built ML pipelines using scikit-learn's OrdinalEncoder, DecisionTreeClassifier, LogisticRegression and make_pipeline (see the sketch after this list)
  • In addition to computing and evaluating training and validation accuracy scores:
    • For decision tree models, tuned the tree's depth and explained its predictions by assessing the Gini importance of its features
    • For logistic regression models, evaluated odds ratios to explain its predictions
  • Reviewed the ethical, environmental and social impacts that machine learning models can have as a result of data bias
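
To make the modeling bullets above concrete, here is a compact sketch of both classifiers on a toy dataset; the feature and target names are invented, the encoder comes from category_encoders (which maps a DataFrame's categorical columns to integers and leaves numeric columns untouched), and the depth and other hyperparameters are placeholders.

```python
import numpy as np
import pandas as pd
from category_encoders import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical tabular data: predict whether a building suffered severe damage
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "foundation_type": rng.choice(["mud", "cement", "stone"], size=200),
    "age_building": rng.integers(1, 80, size=200),
    "severe_damage": rng.integers(0, 2, size=200),
})
X, y = df.drop(columns="severe_damage"), df["severe_damage"]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Decision tree pipeline: set the depth, then inspect Gini feature importances
tree_model = make_pipeline(OrdinalEncoder(), DecisionTreeClassifier(max_depth=4, random_state=42))
tree_model.fit(X_train, y_train)
print("Tree validation accuracy:", tree_model.score(X_val, y_val))
importances = tree_model.named_steps["decisiontreeclassifier"].feature_importances_
print(pd.Series(importances, index=X_train.columns).sort_values())

# Logistic regression pipeline: odds ratios are exp(coefficients)
logreg_model = make_pipeline(OrdinalEncoder(), LogisticRegression(max_iter=1000))
logreg_model.fit(X_train, y_train)
print("LogReg validation accuracy:", logreg_model.score(X_val, y_val))
odds_ratios = np.exp(logreg_model.named_steps["logisticregression"].coef_[0])
print(pd.Series(odds_ratios, index=X_train.columns).sort_values())
```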
