NLP project for the Intelligent Systems course, part of the Master Universitario en Ingeniería Informática. It uses the Space News data set, which contains more than 17,000 articles about the space industry (news, commercial, civil, launches, military, and opinion pieces), to find the keywords of each article.
These articles are not classified by topic and do not offer the reader a short list of keywords, so this project tries to solve that.
We performed a series of operations to clean the text and check that it is properly encoded, normalized, etc. We then find the most frequent words and extract the keywords of each article using Term Frequency and Inverse Document Frequency (TF-IDF) techniques.
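As a rough illustration of the idea, the sketch below cleans a few texts and scores each term by TF-IDF. It is written in Python with the standard library only and is not the project's R code: the function names, the tiny stop-word list, and the sample sentences are all hypothetical, and the weighting shown (relative term frequency times the natural log of the inverse document frequency) is one common variant that may differ from what the notebook uses.

```python
import math
import re
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is", "was", "will"}


def clean(text):
    """Normalize Unicode, lowercase, and keep ASCII alphabetic tokens that are not stop words."""
    text = unicodedata.normalize("NFKC", text).lower()
    tokens = re.findall(r"[a-z]+", text)  # assumes English text; drops digits and punctuation
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [clean(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(documents, k=3):
    """Return the k highest-scoring terms per document (ties broken alphabetically)."""
    return [
        [term for term, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for s in tf_idf(documents)
    ]


if __name__ == "__main__":
    # Made-up sample sentences, not taken from the data set.
    docs = [
        "The rocket launch was delayed. The rocket will launch next week.",
        "The satellite operator ordered a new satellite.",
        "Military satellite launch contract awarded.",
    ]
    for keywords in top_keywords(docs):
        print(keywords)
```

A term that occurs in every document gets an inverse document frequency of log(1) = 0, so it never ranks as a keyword, while terms concentrated in a single article score highest.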
The file scapenews-classifier.Rmd contains the code and a brief analysis of each operation. It can be run in RStudio.
The file archive.zip contains the data set in ZIP format. It does not need to be unzipped manually, since the code already does that.