This repository answers the questions assigned in Homework 3 of the Algorithms for Data Mining course, 2022.
The data are collected from the Atlas Obscura website. The repository consists of the following files:
main.ipynb
: a Jupyter Notebook that answers the questions:
- Data collection
- Search Engine
- Define a new score!
- Visualizing the most relevant places
- BONUS: More complex search engine
- Theoretical question
CommandLine.sh
: The code used to answer the Command Line questions.
Files
: Folder that contains:
html_page
: The folder that contains the HTML files of each page. It is composed of 400 subfolders, each containing 18 HTML files.
tsv_files.zip
: The TSV files of each place.
places_lists.txt
: The text file containing the URLs of the places.
places.tsv
: The TSV file that contains the data that we have collected.
inverted_index.pkl
: The file that contains, for each word, the documents in which it appears.
inverted_index_tfidf.pkl
: The file that contains the tf-idf score for each (word, document) pair. (Too big for GitHub, link to the drive: inverted_index_tfidf)
vocabulary.pkl
: The file that contains the mapping of every word in the descriptions.
RankingList1.txt
: The text file produced by solving the theoretical question with the first algorithm.
RankingList2.txt
: The text file produced by solving the theoretical question with the second algorithm.
RankingList3.txt
: The text file produced by solving the theoretical question with the third algorithm.
RankingList4.txt
: The text file produced by solving the theoretical question with the MapReduce algorithm.
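
As a rough illustration of how the three pickle files listed above could relate to each other, here is a minimal sketch. It is not the code from main.ipynb: the function name `build_indexes`, the whitespace tokenizer, the integer term ids, the tf-idf formula (relative term frequency times `log(N / df)`) and the exact shape of the dictionaries are all assumptions made for the example.

```python
import math
import pickle
from collections import Counter, defaultdict


def build_indexes(docs):
    """Build a vocabulary, an inverted index and a tf-idf index.

    `docs` is a hypothetical {doc_id: description_text} dictionary.
    """
    vocabulary = {}                     # word -> integer term id (assumed layout)
    inverted_index = defaultdict(list)  # term id -> [doc_id, ...]
    term_counts = {}                    # doc_id -> Counter of term ids

    for doc_id, text in docs.items():
        counts = Counter()
        for word in text.lower().split():  # naive tokenizer, for illustration only
            term_id = vocabulary.setdefault(word, len(vocabulary))
            counts[term_id] += 1
        term_counts[doc_id] = counts
        for term_id in counts:
            inverted_index[term_id].append(doc_id)

    n_docs = len(docs)
    tfidf_index = {}                    # term id -> [(doc_id, tf-idf score), ...]
    for term_id, doc_ids in inverted_index.items():
        idf = math.log(n_docs / len(doc_ids))
        scores = []
        for doc_id in doc_ids:
            counts = term_counts[doc_id]
            tf = counts[term_id] / sum(counts.values())
            scores.append((doc_id, tf * idf))
        tfidf_index[term_id] = scores

    return vocabulary, dict(inverted_index), tfidf_index


# Toy usage with two made-up descriptions.
docs = {1: "old castle ruins", 2: "old lighthouse"}
vocabulary, inverted_index, tfidf_index = build_indexes(docs)

# Each structure can be serialized with pickle, as the .pkl files suggest.
restored = pickle.loads(pickle.dumps(inverted_index))
print(restored[vocabulary["old"]])
```

In this sketch a conjunctive query would intersect the document lists from the inverted index, while a ranked query would read the scores from the tf-idf index. The actual notebook may organize these steps differently.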