About the Project || Data Dictionary || Initial Hypotheses/Thoughts || Project Plan || How to Reproduce
We are future data scientists testing our knowledge of Natural Language Processing. For this project, we will scrape data from nearly 600 GitHub repository READMEs using a variety of methods and build machine learning models from the data we pool.
GitHub is a subsidiary of Microsoft that provides hosting for software development and version control using Git. As of January 2020, GitHub reported having over 40 million users and more than 190 million repositories (including at least 28 million public repositories), making it the largest host of source code in the world.
Acknowledgement: The dataset was mined from github.com.
Our goal for this project is to build a model that can predict the programming language of a repository, given the text from its README file. We will deliver the following in a GitHub repository:
- A clearly named final notebook containing thorough markdown documentation and cleaned-up code.
- A README that explains what the project is, how to reproduce the work, and our notes from project planning.
- Python modules that automate the data acquisition and preparation process. These modules will be imported and used in the final notebook.
- A set of Google Slides suitable for a general audience that summarizes our findings.
Features | Definition |
---|---|
readme | text of the readme file |
words | readme text that has been cleaned and separated into individual words |
watchers | number of users watching the repository |
stars | number of users who have starred the repository |
forks | number of users who have forked the repository |
commits | number of commits the owner has made to the repository |
Target | Definition |
---|---|
language | the main programming language used throughout the GitHub repository |
- Could we add new features?
- Should we turn the categorical variables into booleans?
acquire
- Web scrape X READMEs, along with watchers, forks, stars, and commits, from Y topics using the requests library
- Decide on 4 programming languages:
- C++, Java, JavaScript, Python
- Use a number of different topics to introduce variety:
- Sports, Biology, Artificial Intelligence, Data Engineering
- create an acquire.py to automate the process
- create a JSON file for future use
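The caching step above can be sketched as follows. The function and file names are assumptions for illustration; the actual scraping lives in a fetch callable (hypothetical here) so the JSON cache is only rebuilt on a miss:

```python
import json
import os


def load_or_acquire(path="data.json", fetch=None):
    """Return cached repo data from a JSON file, scraping only on a cache miss.

    `fetch` is a zero-argument callable that performs the actual web
    scraping (hypothetical here) and returns a list of dicts like
    {"repo": ..., "language": ..., "readme": ...}.
    """
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = fetch()               # scrape READMEs, stars, forks, etc.
    with open(path, "w") as f:   # cache for future runs
        json.dump(data, f)
    return data
```

Keeping the scrape behind a cache means the notebook can be re-run without hammering github.com.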
prepare
- clean the language column by removing the trailing percentage
- change columns to numeric types as needed
- normalize, tokenize, stem, lemmatize and remove stop words
- split into train, validate, and test
- create a prepare.py to automate the process
- create a CSV file for future use
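The cleaning and tokenizing steps above can be sketched as below. The stop-word list is an abbreviated stand-in (a library such as nltk would normally supply stop words, stemming, and lemmatizing), and the scraped language strings are assumed to look like "Python 67.4%":

```python
import re
import unicodedata

# abbreviated stand-in for a full stop-word list
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}


def clean_language(raw):
    """Strip the trailing percentage, e.g. 'Python 67.4%' -> 'Python'."""
    return re.sub(r"\s*\d+(\.\d+)?%$", "", raw)


def basic_clean(text):
    """Normalize unicode, lowercase, and keep only letters, digits, and spaces."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text into words, dropping stop words."""
    return [w for w in basic_clean(text).split() if w not in STOPWORDS]
```

For example, `tokenize("The Quick, Brown fox!")` yields `["quick", "brown", "fox"]`.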
explore
- split words into unigrams, bigrams, and trigrams (sets of 1, 2, and 3 words)
- determine significance both visually and statistically
- document and consider the results for modeling
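Splitting words into sets of 1, 2, and 3 amounts to building unigrams, bigrams, and trigrams; a minimal sketch:

```python
def ngrams(words, n):
    """Return the n-word sequences from an ordered word list."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ["machine", "learning", "is", "fun"]
# ngrams(words, 1) -> ["machine", "learning", "is", "fun"]
# ngrams(words, 2) -> ["machine learning", "learning is", "is fun"]
```

Counting these n-grams per language is what lets us test, visually and statistically, which word sequences are significant.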
model and evaluation
- find which features are most influential
- try different algorithms:
- Ridge Classifier
- Random Forest
- Gradient Boost
- evaluate on train
- evaluate on validate
- select best model and test to verify
- create a preprocessing.py and model.py to automate the process
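A sketch of trying the algorithms above, assuming TF-IDF features built from the cleaned README text; the four-document corpus below is a toy stand-in for the real data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# toy stand-in for the cleaned README text and its language labels
docs = [
    "import pandas numpy dataframe python",
    "public static void main string java",
    "function console log const javascript",
    "include iostream cout vector cpp",
]
labels = ["Python", "Java", "JavaScript", "C++"]

# try different algorithms; in the real workflow each is scored on
# train, then validate, and only the best model ever sees test
for clf in (RidgeClassifier(), RandomForestClassifier(random_state=42)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(docs, labels)
    print(type(clf).__name__, model.score(docs, labels))
```

The pipeline keeps vectorizing and classifying together, so the same object can be fit on train and scored on validate without leaking vocabulary between splits.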
conclusion
- summarize findings
- provide next steps
- Download the data CSV from here or use the acquire.py functions
- Prepare the data with the prepare.py functions or use the prepped CSV here
- Run a Jupyter notebook, importing the necessary libraries and functions.
- Follow along in the summary notebook or forge your own exploratory path.