nykolai-d / teilur_wordcount

This notebook identifies the most common words in five large datasets covering the following themes: data engineering, data analytics, data science, software engineering and business analytics, as well as the most common words for the five joined datasets as a whole.

The datasets are CSV files built by scraping several kinds of web pages: GitHub, documentation, Glassdoor and specialized content sites (technical blogs and similar sources). Some preprocessing has already been applied to the datasets, mainly gathering and organizing the scraped texts into a single CSV file per category. The five datasets contain 3,204,121 words in total.

We use several libraries and packages, including NLTK, collections, wordcloud, pandas, matplotlib and openpyxl. This report walks through the steps that were followed, from uploading the datasets to writing the Excel files with the most common words per category. It includes the following sections:

  1. Introduction
  2. Data Preprocess
    2.1 Data Engineering Dataset
    2.2 Software Engineering Dataset
    2.3 Data Science Dataset
    2.4 Data Analytics Dataset
    2.5 Business Analytics Dataset
  3. Get Most Common Words
  4. Appendices
    4.1 Dataframes of Clean Datasets
    4.2 Write the Excel Files
    4.3 Generate the Wordclouds
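
The core step, counting the most common words in one category file, can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: it uses only the standard library, with a regex tokenizer and a tiny hand-written stopword list standing in for NLTK. The column name `text`, the helper names `tokenize` and `most_common_words`, and the sample rows are all hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); the notebook relies on NLTK instead.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def most_common_words(csv_file, text_column, n=10):
    """Count tokens in one column of an open CSV file and return the top n."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts.update(tokenize(row[text_column]))
    return counts.most_common(n)


# In-memory stand-in for one category file; the "text" column is hypothetical.
sample = io.StringIO(
    "text\n"
    "Data pipelines move data to the warehouse\n"
    "The pipelines load data in batches\n"
)
top_words = most_common_words(sample, "text", n=2)
print(top_words)
```

For a real dataset, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`. Counting across the five joined datasets could then be done by summing the per-category `Counter` objects.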

Languages

Language:Jupyter Notebook 100.0%