PaulinaPacyna / IML-Project

Exploring trends in CS/Math papers for a university NLP project

IML Project

Project for the subject 'Introduction to Machine Learning'

Prerequisites

The packages used in this project are listed in requirements.txt. You can install them with pip install -r requirements.txt

Quick Roadmap

To scrape and preprocess your dataset, place it in the root directory, go to the src directory, and run the scripts in this order: scraper.py, date_formatter.py, lemmatizer.py, and finally preprocessing.py.
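The run order above can be sketched as a small driver script. This is a minimal sketch, not part of the repository: it assumes each script runs without command-line arguments and expects src as its working directory, which the README does not confirm. The names PIPELINE, build_commands, and run_pipeline are hypothetical.

```python
import subprocess
import sys

# Script order as given in the roadmap above.
PIPELINE = ["scraper.py", "date_formatter.py", "lemmatizer.py", "preprocessing.py"]


def build_commands(python=sys.executable):
    """Return one command per pipeline step, in roadmap order.

    Assumption: the scripts take no arguments.
    """
    return [[python, script] for script in PIPELINE]


def run_pipeline(src_dir="src"):
    """Run each step from inside src_dir, stopping at the first failure."""
    for command in build_commands():
        # check=True raises CalledProcessError if a step exits with a non-zero status,
        # so later steps never run on incomplete output.
        subprocess.run(command, cwd=src_dir, check=True)
```

Calling run_pipeline() from the repository root would then execute the four steps in sequence.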

Notebooks/scripts:

  • time series LDA analysis - lda.py / lda_modified.ipynb
  • clusters analysis - clusters.ipynb
  • words as time series analysis - time_series_clustering.ipynb
  • amount of mathematics in computer science - math_in_cs.ipynb

Data

Link to clean data: https://drive.google.com/file/d/1pBihRBnGs6VlFalr4BuxMwXw5xL5ZjY6/view?usp=sharing (367 MB zipped, 1.22 GB decompressed)
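Once the archive has been downloaded manually from the link above, it can be unpacked with the standard library. This is a minimal sketch: the helper name extract_archive and the default data destination folder are assumptions, not something the repository defines, and the archive's real file name is whatever the download produces.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="data"):
    """Unpack zip_path into dest_dir and return the sorted member names.

    dest_dir is a hypothetical location; adjust it to wherever the
    notebooks expect the clean data to live.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())
```

Expect roughly 1.22 GB of free disk space to be needed for the decompressed data.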

Link to LDA models and results: https://drive.google.com/drive/folders/1fG-yuzZq_vhh8hk_PJw1vTZMTMAn_xjH

Final Report

The final report is included in IMLReport.pdf. All details and achieved results are presented in the report.

Languages

Jupyter Notebook 80.9%, JavaScript 10.3%, Python 8.7%, CSS 0.1%