Using Machine Learning to Predict Alcohol and Drug Use with Personality Traits and Socio-Demographic Characteristics
EXAMINING HEALTH OUTCOMES USING MACHINE LEARNGING
- Purpose of Project
- Project Summary
- List of key files
- Python Libraries
- Data Description
- Methodology
- Results
- Conclusion
- Creator & Credits
The current project explores health outcomes and machine learning for my Udacity Data Scientist Nanodegree Capstone project. The overall goal was to explore if machine learning models could help better predict drug use from personality and socio-demographic variables.
This repository will contain the literature review, code, analysis, and visualizations. Three classes of drugs are examined: stimulants, depressants, and hallucinogens. Machine-learning models were applied to predict the consumption of each class of drug based on the risk factors.
For the full write up, please see my three-part blog on Medium.com, The links are the following:
For Part 1 (A brief literature review on drug use, personality and previous methods used to study the associations between the two, the study methodology, explanation of the methods used and chosen metrics): Use this Link. For Part 2 (The project's exploratory analysis, Data-Cleaning/Pre-Processing phase, Feature Selection and Implementing the machine learning models): Use this Link. For Part 3 (The project's results, discussing how the research questions were answered and concludes with looking at possible future improvements): Use this Link.
Introduction:
Alcohol consumption, illicit drug use and non-illicit drug use represent a major source of mortality and morbidity, as well as social and economic costs in Canada(Canada, 2019; Thomas & Davis, 2007). Studies have highlighted the complex relationship between drug use and a range of factors, including psychological, geographical, environmental, socio-economic, and individual risk factors (Cleveland, Feinberg, Bontempo, & Greenberg, 2008; Ventura, de Souza, Hayashida, & Ferreira, 2015). Improving our understanding of how these risk factors contribute to alcohol and drug use can create more effective individual-level treatments and help health agencies craft policies and programs that provide harm-reduction and enable prevention-based approaches.
Dataset and Research Questions:
The current data set includes 1885 respondents and their usage of eighteen different illicit and non-illicit drugs (Dua & Graff, 2019). These eighteen different drugs can be groupd into three major classes of substances: Stimulants, Depressants and Hallucinogens.
Respondent demographics and five dimensions of personality following the five-factor model (Terracciano, Löckenhoff, Crum, Bienvenu, & Costa, 2008) are also included in the dataset which provide the opportunity to make use of machine-learning approaches to better understand the association between socio-demographic and personality traits with alcohol and drug use. The following two research questions are addressed:
-
What are the personality traits and demographic variables that best predict each drug use outcome?
-
Can we determine which machine-learning approaches/methods is the most effective for predicting consumption?
In the exploratory analysis and data mining phase of the project, there will be additional opportunities to unearth new questions that we can also answer.
Methods & Analysis:
To obtain a detailed understanding of the respondents, exploratory analysis was performed, including descriptive statistics, visualizations, bivariate regressions and data mining. A preliminary analysis of the data showed that drug use is captured in frequency categories, indicating that a classification approach would be most appropriate in analyzing the data. There was also the potential to group drug use labels into binary outcomes or leave them as multi-class problems. Following exploratory analysis, to improve the sample size and reduce the complexity of the problem the eighteen drug outcomes were collapsed into three categories: stimulants, depressants, and hallucinogens. The frequency class labels for the drug outcomes were also collapsed from seven labels to three (non-user, infrequent user, and high use).
Following data preparation, initial models were created using the following algorithms:
- Logistic Regression
- Multinomial Logit Regression (Softmax Regression)
- Support Vector Machines
- Decision Trees/Random Forests/Gradient Boosted Trees
- K-Nearest Neighbours classifier
- Linear Discriminant Analysis (LDA)
After further modelling, four models were selected:
- Logistic Regression
- Support Vector Machines
- Random Forests
- Neural Network
Imbalanced data was then fixed with oversampling using SMOTE before the models were then re-calibrated. A third binarized approach was then used to nest the four classifiers into the OneVsRest classifier and re-calibrate the models to allow the AUC evaluation metric to be used.
Finally, the best set of hyperparameters and features were searched for using GridSerachCV and applied to each of the models for one last calibration before examining the evaluation metrics to determine the best model. Feature importance from the random forest classifier and coefficients from multinomial logistic regressions were used to assess significant predictors of drug use outcomes.
Tools and Techniques:
Python will be used as the primary tool for this project. Data preparation, exploratory analysis and data modelling code will be saved in Jupyter notebooks or .py files.
Evaluation Metrics:
To assess the performance of the models in predicting drug use from the risk factors, the following metrics will be calculated:
- Accuracy
- Precision/Specificity
- Recall/Sensitivity
- AUC
- nltk
- sklearn
- joblib
- sqlqlchemy
- time
- numpy
- re
- pandas
- scipy
- seaborn
- matplotlib
- statsmodels
The dataset for the study was obtained from the from the open source repository - UCI Machine Learning Repository. See the following link:
http://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29#
Alternatively, a cleaned version can also be found in this repository:
In this drug consumption dataset, there are five sociodemographic variables (Age, Gender, Education, Country, and Ethnicity), seven measures of personality (NEO-FFI-R/FFM: Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness; Impulsiveness, Sensation Seeking) and one row identifier (ID). Only ID will not be used as an input feature as it provides nothing more than a unique row count. The table below provides descriptive statistics for the 12 + 1 (ID) independent variables.
The data set contains eighteen outcome variables of drug use that are all categorical and are labelled with the same output classes. The table below outlines all the class labels and the respective definitions. One potential transformation might be having to group up the classes into only two or 3 classes depending on how imbalanced the classes are (see Data Preparation Transformation Column)
For this study, the eighteen different drugs will also be grouped into three broader classes of drugs. The groupings will be:
Data Preparation – Outcome variables Transformation
Major Class of Drug | Drug |
---|---|
Stimulants | amphetamines, nicotine, cocaine powder, crack cocaine, caffeine, and chocolate |
Depressants | alcohol, amyl nitrite, benzodiazepines, tranquilizers, solvents and inhalants, heroin and methadone/prescribed opiates |
Hallucinogens | cannabis, ecstasy, ketamine, LSD, and magic mushrooms |
The original distributions for eighteen drugs are as follows:
The methodology for the current study is presented in the following diagram:
In carrying out the study the neural network was determined to be the most accurate algorithm in predicting stimulant drug use via F1-score. For depressants and hallucinogens, the random forest algorithm was able to produce the best estimates with both drug outcomes. While the findings are not all significant and caution needs to be taken, this study was able to demonstrate that traits like neuroticism, agreeableness, impulsiveness, and sensation seeking behaviour were predictive of drug use outcomes, along with other sociodemographic variables.
The first section of results tables summarizes the feature imporance for each drug use outcome to answer rsearch question one. The second section of tables provides the model metrics for each of the four models used by drug use outcome. A best model is indicated for each drug use outcome, with the overall goal of using these metrics to maximize the accuracy score of the predictions, while balancing precision and recall scores.
Top Features by Feature Importance - Stimulants
Feature | Impurity-based Importance |
---|---|
NEO_Neuroticism | 0.212085 |
Ethnicity | 0.195933 |
Gender | 0.143130 |
Top Features by Feature Importance - Depressants
Feature | Impurity-based Importance |
---|---|
Ethnicity | 0.431104 |
Sensation Seeking | 0.264706 |
Impulsiveness | 0.126786 |
NEO_Agreeableness | 0.136938 |
Top Features by Feature Importance - Hallucinogens
Feature | Impurity-based Importance |
---|---|
Country | 0.295012 |
Age | 0.273682 |
Sensation Seeking | 0.206501 |
Education | 0.137931 |
Initial Models for Stimulants
Support Vector Machine (SVM) | Logistic Regression | Random Forest | Neural Network | |||||
---|---|---|---|---|---|---|---|---|
Imbalanced | Balanced | Imbalanced | Balanced | Imbalanced | Balanced | Imbalanced | Balanced | |
Mean Accuracy | 0.2068 | 0.3050 | 0.9938 | 0.60212 | 0.9934 | 0.8820 | 0.9934 | 0.9125 |
Initial Models for Depressants
Support Vector Machine (SVM) | Logistic Regression | Random Forest | Neural Network | ||||||
---|---|---|---|---|---|---|---|---|---|
Imbalanced | Balanced | Imbalanced | Balanced | Imbalanced | Balanced | Imbalanced | Balanced | ||
Mean Accuracy | 0.8488 | 0.44694 | 0.85013 | 0.42175 | 0.8488 | 0.63395 | 0.85145 | 0.67771 |
Initial Models for Hallucinogens
Support Vector Machine (SVM) | Logistic Regression | Random Forest | Neural Network | |||||
---|---|---|---|---|---|---|---|---|
Imbalanced | Balanced | Imbalanced | Balanced | Imbalanced | Balanced | Imbalanced | Balanced | |
Mean Accuracy | 0.4204 | 0.24270 | 0.69496 | 0.66047 | 0.70689 | 0.67374 | 0.67241 | 0.684350 |
Final Models for Stimulants
Metrics | Support Vector Machine (SVM) | Logistic Regression | Random Forest | Neural Network** |
---|---|---|---|---|
Overall Accuracy | 0.873468 | 0.755832 | 0.906682 | 0.946224 |
Precision | 0.862176 | 0.742619 | 0.926640 | 0.944476 |
Recall | 0.849348 | 0.676157 | 0.866251 | 0.933719 |
F1 - Score | 0.855166 | 0.688689 | 0.887821 | 0.938777 |
AUC | 0.91564 | 0.8575 | 0.9892 | 0.9801 |
Timing | 0.035 s | 0.143 s | 1.505 s | 2.415 s |
** Best Model
Final Models for Depressants
Metrics | Support Vector Machine (SVM) | Logistic Regression | Random Forest** | Neural Network |
---|---|---|---|---|
Overall Accuracy | 0.717602 | 0.685148 | 0.844039 | 0.706108 |
Precision | 0.683079 | 0.724233 | 0.851524 | 0.667796 |
Recall | 0.633874 | 0.533807 | 0.790399 | 0.664131 |
F1 - Score | 0.640769 | 0.475795 | 0.809779 | 0.665795 |
AUC | 0.7293 | 0.70741 | 0.9281 | 0.73894 |
Timing | 0.0689 s | 0.08719 s | 1.536 s | 0.6619 s |
** Best Model
Final Models for Hallucinogens
Metrics | Support Vector Machine (SVM) | Logistic Regression | Random Forest** | Neural Network |
---|---|---|---|---|
Overall Accuracy | 0.753274 | 0.693705 | 0.783270 | 0.756232 |
Precision | 0.736559 | 0.757249 | 0.791162 | 0.735555 |
Recall | 0.675539 | 0.545944 | 0.703105 | 0.685678 |
F1 - Score | 0.687642 | 0.497090 | 0.720170 | 0.697635 |
AUC | 0.79457 | 0.74284 | 0.866084 | 0.8045 |
Timing | 0.0695 s | 0.5953 s | 6.0910 s | 4.1214 s |
** Best Model
This project contributes to the literature and work on predicting drug use outcomes by demonstrating the use of machine learning algorithms in predicting drug use accurately. Being able to more accurately predict drug use can help to inform health interventions and to improve their clinical effectiveness. Understanding the drivers of drug use also provide opportunities for public health officials and policy makers to enact programs and policies that help provide supportive and safe environments that aid individuals in reducing or completely stopping their usage of drugs. This study also provided insight on personality traits and drug use, which has not been a commonly studied association. This work supports the existing body of work and understanding of drug use. The use of neural networks and support vector machines to this dataset were novel approaches. As well, the addition of the OneVsRest classifier to augment the four algorithms to deal with the drug outcomes as multi-class outcomes instead of just binary outcomes was an approach not found previously in the literature. In summary, this study helps to highlight the potential that machine learning algorithms have in assisting in health research and drug use prediction.
Future work should focus on gathering larger amounts of data to address the sample size issues and imbalanced classes found in this study. Additionally, a much larger dataset would help maximize the performance abilities of machine learning algorithms such as neural networks and random forests in producing a more robust and accurate estimate. A larger number of features also need to be retrieved to help provide more predictors to help improve the models and decrease overall model bias. Once the three border outcomes of stimulants, depressants and hallucinogens can be predicted accurately, future studies can then look to predicting very specific drug outcomes, as was found in original dataset.
- Andrew Leung
- Udacity