This project is a home assignment. The problem is stated as follows:
We want to build a tool that scrapes the Stack Overflow website and, for every question (a question generally reflects a symptom of a problem), predicts two things:
- The root cause of the problem
- The solution (answer) of the problem
Given the following constraints:
- Build the basics of a long-term project: your code needs to be easy to understand and reusable.
- Code does not need to be optimal in terms of performance but needs to be fast enough to provide a good user experience.
- Code needs to be trusted: it does what it is meant to do.
- Code will be peer reviewed and released on Github.
- A few hours to develop a first solution.
Bonus points:
- Deploy the solution on the cloud.
- Document the code.
Given the constraints, I decided to focus my efforts on questions that have an accepted answer.
The rationale is that if an answer is accepted, it is more likely to mention a solution to the problem and/or a root cause.
I identified 3 potential approaches to solving the problem on this type of question:

1. Building a Q&A system:
   - Given a passage (a Stack Overflow answer) and a question (either "How to solve the problem?" or "Why is this happening?"), extract the relevant entities from the passage.
   - That being said, we only ask 2 types of questions ("how to" and "why"), which means a Q&A system might be overkill for this problem.
2. Building a YOLO-like model, but for NLP:
   - YOLO is a system for detecting objects in images. It predicts both the class of the object and its position (by giving a bounding box).
   - We could build a model inspired by this approach that predicts the "bounding box" (`start_at`, `end_at` indices) within the text and the class (`solution`, `root_cause`).
   - That being said, a quick look-up on Google did not give any link towards a YOLO-like model applied to NLP.
3. Building a simple 3-class classifier at the sentence level:
   - A simpler approach would be to build a classifier at the sentence level that predicts whether a sentence belongs to one of the classes `solution`, `root_cause`, or `other`.
   - This approach would return a full sentence as a result to the initial problem, without filtering out the parts of the sentence that are not relevant to the solution.
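The sentence-level approach (option 3) can be sketched as follows. This is a minimal illustration assuming averaged word2vec sentence vectors fed into scikit-learn's `GradientBoostingClassifier`; the embedding table and the labelled sentences below are toy stand-ins, not the project's actual data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

CLASSES = ["solution", "root_cause", "other"]

def sentence_vector(tokens, w2v, dim=50):
    """Average the word vectors of a sentence (zero vector if no token is known)."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy embedding table standing in for the custom word2vec model.
rng = np.random.default_rng(0)
w2v = {w: rng.normal(size=50) for w in ["use", "fix", "because", "error", "the"]}

# Toy labelled sentences (already tokenized); real labels come from the labelling CLI.
sentences = [["use", "the", "fix"], ["because", "error"], ["the", "the"]] * 10
labels = ["solution", "root_cause", "other"] * 10

X = np.stack([sentence_vector(s, w2v) for s in sentences])
clf = GradientBoostingClassifier(random_state=0).fit(X, labels)
pred = clf.predict(X[:1])[0]  # class of the first sentence
```

The key design choice is that each sentence becomes one fixed-size vector, so any off-the-shelf classifier can be plugged in.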
To develop this project I followed these steps:
| step | functionality | description | url |
|---|---|---|---|
| 1 | WebCrawler | crawler to collect questions & answers data; relies on the StackOverflow API; can automatically generate an API key; useful for creating a training set | crawler |
| 2 | Preprocessing | methods to preprocess the HTML content of an answer; cleans text (removes tags, lemmatizes, tokenizes...); vectorizes (custom word2vec trained on the dataset) | preprocessing |
| 3 | Data Labelling | interface (CLI) to label the data more easily | labelling |
| 3bis | Unsupervised exploration | tried (but failed) to analyze the dataset with KMeans; the aim was to identify potential clusters matching `solution` and `root_cause` and avoid hand labelling; interactive visualizations | kmeans |
| 4 | Classifier | gradient-boosted classifier that classifies each sentence as `solution`, `root_cause`, or `other`; little hyperparameter tuning was done, but model performance was estimated using cross-validation and a hold-out set | classifier |
| 5 | Interface | interface (CLI) to use the model & play with it; pass a URL or a question id and visualize the results of the algorithm | interface |
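As a rough illustration of the preprocessing step (step 2), here is a minimal tag-stripping and tokenizing sketch using only the standard library; it is an assumption of mine for illustration, and the real module also lemmatizes and vectorizes with the custom word2vec model:

```python
import re
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text content of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def clean_answer(html_text):
    """Strip HTML tags, lowercase, and tokenize an answer body."""
    stripper = TagStripper()
    stripper.feed(html_text)
    text = " ".join(stripper.chunks)
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = clean_answer("<p>Use <code>pip install numpy</code> to fix it.</p>")
# → ['use', 'pip', 'install', 'numpy', 'to', 'fix', 'it']
```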
A few examples of the features that were developed during this project:
- Interactive visualisation of the KMeans results. The approach did not work in the end (I identified clusters related to the technologies mentioned in the questions, e.g. numpy, tensorflow, docker, etc., instead of `solution` or `root_cause`).
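For context, the kind of KMeans experiment described above can be sketched with scikit-learn. The vectors below are synthetic blobs standing in for the word2vec sentence embeddings, so this is illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated toy blobs standing in for sentence embeddings.
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 5))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 5))
X = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Each blob should land entirely in one cluster.
purity_a = len(set(km.labels_[:20]))
purity_b = len(set(km.labels_[20:]))
```

On the real data, the clusters aligned with technologies (numpy, docker, ...) rather than with the target classes, which is why the approach was abandoned.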
- Labelling interface
The definition of `root_cause` and `solution` is a bit fuzzy, and while labelling the data it was not always very clear whether a sentence really belonged to one class or another.
Therefore, the performance of the model depends a lot on the quality of this labelled data.
I analyzed the performance of the model:
- with a 5-fold cross-validation using `macro` metrics,
- and by splitting the data into a 70%/30% train-test split to generate a classification report and a confusion matrix.
As I did not use any technique to handle the class imbalance, the most common class (`other`) is the best predicted. Performance on the `root_cause` class is acceptable, while prediction on the `solution` class is rather weak.
One potential explanation is that `root_cause` explanations are often longer than a simple solution and may contain more signal.
A few examples of predictions (the first 2 seem to work, the last one is a failure) using the interface:
Still under construction (code is not 100% reproducible as of today)
Some commands to execute / scripts to run (you might need to prepend `PYTHONPATH=.` to the command):
- Collect data: `python stack_under_flow/adhoc_scripts/collect_data.py`
- Train the word2vec model: `python stack_under_flow/adhoc_scripts/train_embeddings.py`
- Train the classifier: `python stack_under_flow/adhoc_scripts/train_classifier.py`
- Launch the labelling CLI: `python stack_under_flow/labelling_tool/labelling_cli.py -n 5`
- Launch the main CLI: `python stack_under_flow/stack_under_flow_cli.py -id 46826218`
- Run the tests: `py.test`