About the Project || Data Dictionary || Initial Hypotheses/Thoughts || Project Plan || How to Reproduce
We are future data scientists testing our knowledge of Natural Language Processing. For this project, we will scrape data from nearly 600 GitHub repository READMEs using a variety of methods and build machine learning models from the data we pool.
GitHub is a subsidiary of Microsoft that provides hosting for software development and version control using Git. As of January 2020, GitHub reported having over 40 million users and more than 190 million repositories (including at least 28 million public repositories), making it the largest host of source code in the world.
Acknowledgement: The dataset was mined from github.com.
Our goal for this project is to build a model that can predict the programming language of a repository, given the text from its README file. We will deliver the following in a GitHub repository:
- A clearly named final notebook containing thorough markdown documentation and cleaned-up code.
- A README that explains what the project is, how to reproduce the work, and our notes from project planning.
- Python modules that automate the data acquisition and preparation process. These modules will be imported and used in the final notebook.
- A set of Google Slides suitable for a general audience that summarizes our findings.
Features | Definition |
---|---|
readme | text of the readme file |
words | readme text that has been cleaned and separated into individual words |
watchers | number of users watching the repository |
stars | number of users who have starred the repository |
forks | number of users who have forked the repository |
commits | number of commits the owner has made to the repository |
Target | Definition |
---|---|
language | the main programming language used throughout the GitHub repository |
- Could we add new features?
- Should we turn the categorical variables into booleans?
acquire
- Web scrape X READMEs, along with watchers, forks, stars, and commits, from Y topics using the requests library
- Decide on 4 programming languages:
- C++, Java, JavaScript, Python
- Use a number of different topics to introduce variety:
- Sports, Biology, Artificial Intelligence, Data Engineering
- create an acquire.py to automate the process
- create a JSON file for future use
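The caching step above can be sketched as follows. The function and file names are assumptions for illustration; the actual scraping lives in a fetch callable (hypothetical here) so the JSON cache is only rebuilt on a miss:

```python
import json
import os


def load_or_acquire(path="data.json", fetch=None):
    """Return cached repo data from a JSON file, scraping only on a cache miss.

    `fetch` is a zero-argument callable that performs the actual web
    scraping (hypothetical here) and returns a list of dicts like
    {"repo": ..., "language": ..., "readme": ...}.
    """
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = fetch()               # scrape READMEs, stars, forks, etc.
    with open(path, "w") as f:   # cache for future runs
        json.dump(data, f)
    return data
```

Keeping the scrape behind a cache means the notebook can be re-run without hammering github.com.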
prepare
- clean the language column by removing the trailing percentage
- change columns to numeric types as needed
- normalize, tokenize, stem, lemmatize and remove stop words
- split into train, validate, and test
- create a prepare.py to automate the process
- create a CSV file for future use
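The cleaning and tokenizing steps above can be sketched as below. The stop-word list is an abbreviated stand-in (a library such as nltk would normally supply stop words, stemming, and lemmatizing), and the scraped language strings are assumed to look like "Python 67.4%":

```python
import re
import unicodedata

# abbreviated stand-in for a full stop-word list
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}


def clean_language(raw):
    """Strip the trailing percentage, e.g. 'Python 67.4%' -> 'Python'."""
    return re.sub(r"\s*\d+(\.\d+)?%$", "", raw)


def basic_clean(text):
    """Normalize unicode, lowercase, and keep only letters, digits, and spaces."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text into words, dropping stop words."""
    return [w for w in basic_clean(text).split() if w not in STOPWORDS]
```

For example, `tokenize("The Quick, Brown fox!")` yields `["quick", "brown", "fox"]`.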
explore
- split words into unigrams, bigrams, and trigrams (sets of 1, 2, and 3 words)
- determine significance both visually and statistically
- document and consider the results for modeling
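Splitting words into sets of 1, 2, and 3 amounts to building unigrams, bigrams, and trigrams; a minimal sketch:

```python
def ngrams(words, n):
    """Return the n-word sequences from an ordered word list."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ["machine", "learning", "is", "fun"]
# ngrams(words, 1) -> ["machine", "learning", "is", "fun"]
# ngrams(words, 2) -> ["machine learning", "learning is", "is fun"]
```

Counting these n-grams per language is what lets us test, visually and statistically, which word sequences are significant.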
model and evaluation
- find which features are most influential
- try different algorithms:
- Ridge Classifier
- Random Forest
- Gradient Boost
- evaluate on train
- evaluate on validate
- select best model and test to verify
- create a preprocessing.py and model.py to automate the process
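A sketch of trying the algorithms above, assuming TF-IDF features built from the cleaned README text; the four-document corpus below is a toy stand-in for the real data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# toy stand-in for the cleaned README text and its language labels
docs = [
    "import pandas numpy dataframe python",
    "public static void main string java",
    "function console log const javascript",
    "include iostream cout vector cpp",
]
labels = ["Python", "Java", "JavaScript", "C++"]

# try different algorithms; in the real workflow each is scored on
# train, then validate, and only the best model ever sees test
for clf in (RidgeClassifier(), RandomForestClassifier(random_state=42)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(docs, labels)
    print(type(clf).__name__, model.score(docs, labels))
```

The pipeline keeps vectorizing and classifying together, so the same object can be fit on train and scored on validate without leaking vocabulary between splits.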
conclusion
- summarize findings
- provide next steps
- Download the data CSV from here or use the acquire.py functions
- Prepare the data with the prepare.py functions or use the prepped CSV here
- Run a Jupyter notebook, importing the necessary libraries and functions.
- Follow along in the summary notebook or forge your own exploratory path.