The Applied Data Science Lab, offered by WorldQuant University, is an immersive online program that gave me practical skills for tackling complex, real-world problems.
Throughout the program, I engaged in a series of comprehensive data science projects that helped me develop proficiency in data wrangling, analysis, model-building and effective communication through hands-on experience.
- Imported multiple CSV files from a private repository into a pandas DataFrame using for loops
- Created preliminary and exploratory histograms, scatter plots, box-and-whisker plots and bar charts
- Examined the relationship between variables by assessing Pearson correlation coefficients
- Cleaned and wrangled raw data by creating a custom wrangle function
- Built ML pipelines with sklearn's Ridge, OneHotEncoder, SimpleImputer and LinearRegression classes and the make_pipeline helper
- Applied L2 regularization (Ridge) to reduce overfitting in linear regression models
- Created an interactive dashboard using the ipywidgets library to explore model predictions for different input feature values
Image Snapshot of Interactive Dashboard
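
The regression workflow listed above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the data is synthetic, the column names (`area_m2`, `neighborhood`, `price`) are hypothetical, and `make_column_transformer` is an added assumption used to route numeric and categorical columns to the right transformer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic, illustrative data (hypothetical columns, not the program's dataset).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "area_m2": rng.uniform(30, 150, n),
    "neighborhood": rng.choice(["north", "south", "center"], n),
})
premium = df["neighborhood"].map({"north": 0, "south": 10_000, "center": 30_000})
df["price"] = 1_000 * df["area_m2"] + premium + rng.normal(0, 5_000, n)

# Knock out some values so the imputer has something to do.
df.loc[rng.choice(n, 20, replace=False), "area_m2"] = np.nan

X = df[["area_m2", "neighborhood"]]
y = df["price"]

# Impute the numeric column, one-hot encode the categorical one,
# then fit a linear model with an L2 penalty (Ridge).
preprocess = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["area_m2"]),
    (OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
)
model = make_pipeline(preprocess, Ridge(alpha=1.0))
model.fit(X, y)

# Compare against a naive baseline that always predicts the mean price.
baseline_mae = mean_absolute_error(y, np.full(len(y), y.mean()))
model_mae = mean_absolute_error(y, model.predict(X))
print(f"Baseline MAE: {baseline_mae:,.0f}")
print(f"Ridge MAE:    {model_mae:,.0f}")
```

The `alpha` argument controls the strength of the L2 penalty; larger values shrink the coefficients more, which trades a little training accuracy for less overfitting.
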
- Connected to a MongoDB server using the pymongo library to locate and extract the required data (ETL).
- Applied rolling average, autocorrelation and lag operations to time series variables.
- Utilized Train Test Split procedures to create proper train and test datasets for a Linear Regression model.
- Built, explored and interpreted autocorrelation function (ACF) and partial autocorrelation function (PACF) plots.
- Using statsmodels, constructed autoregressive (AR) and ARMA models and validated them via walk-forward validation.
- Tuned the number of lagged observations and the moving-average window size via GridSearchCV.
- Identified a workable trade-off between model performance and computational cost
- Connected to a SQL database and wrangled data using SQL magic commands and the sqlite3 library
- Executed randomized Train Test Split to create proper training, testing and validation datasets
- Built ML pipelines with sklearn's OrdinalEncoder, DecisionTreeClassifier and LogisticRegression classes and the make_pipeline helper
- In addition to computing and evaluating training and validation accuracy scores:
  - For decision tree models, tuned the tree's depth and interpreted its predictions through the Gini importance of its features
  - For logistic regression models, used odds ratios to explain the predictions
- Reviewed the ethics of the environmental and social impact that machine learning models can have when trained on biased data
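
The classification steps above (depth tuning, Gini importance and odds ratios) can be sketched as follows. This is a hedged illustration on synthetic data: the feature names (`roof_type`, `foundation`), the damage probabilities and the explicit category order passed to `OrdinalEncoder` are all assumptions made for the example, not details from the program.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: heavy roofs and mud foundations raise the damage probability.
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "roof_type": rng.choice(["light", "heavy"], n),
    "foundation": rng.choice(["cement", "mud"], n),
})
p = 0.2 + 0.4 * (X["roof_type"] == "heavy") + 0.3 * (X["foundation"] == "mud")
y = (rng.uniform(size=n) < p).astype(int)

# Randomized split into training, validation and test sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.2, random_state=42)

# Fix the category order so code 1 means "heavy" and "mud" respectively.
categories = [["light", "heavy"], ["cement", "mud"]]

# Tune the tree depth on the validation set.
val_scores = {}
for depth in range(1, 4):
    candidate = make_pipeline(
        OrdinalEncoder(categories=categories),
        DecisionTreeClassifier(max_depth=depth, random_state=42),
    )
    candidate.fit(X_train, y_train)
    val_scores[depth] = candidate.score(X_val, y_val)
best_depth = max(val_scores, key=val_scores.get)

tree = make_pipeline(
    OrdinalEncoder(categories=categories),
    DecisionTreeClassifier(max_depth=best_depth, random_state=42),
)
tree.fit(X_train, y_train)

# Gini importance: each feature's share of the total impurity reduction.
importances = pd.Series(
    tree.named_steps["decisiontreeclassifier"].feature_importances_,
    index=X_train.columns,
)

# Odds ratios: exponentiated logistic regression coefficients.
logit = make_pipeline(OrdinalEncoder(categories=categories), LogisticRegression())
logit.fit(X_train, y_train)
odds_ratios = pd.Series(
    np.exp(logit.named_steps["logisticregression"].coef_[0]),
    index=X_train.columns,
)

print("Validation accuracy by depth:", val_scores)
print("Gini importances:", importances.to_dict())
print("Odds ratios:", odds_ratios.to_dict())
```

An odds ratio above 1 means that moving from the first category to the second (for example, from a light to a heavy roof) multiplies the odds of the positive class by that factor. With more than two categories per feature, ordinal codes impose an artificial order, so one-hot encoding is usually the safer choice for logistic regression.
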