This repository contains code and analysis for ADM Homework 3 on building a search engine for master's degree programs.
The repository contains the following key files and folders:
- `main.ipynb`: Jupyter notebook containing the code and analysis for the homework, including data collection, preprocessing, search engine implementation, new scoring function, and map visualization.
- `GeneratedFiles`: Folder containing generated output from `main.ipynb`, including:
  - `courses_data_processed.tsv`: Preprocessed dataset
  - `folderTSV`: Folder containing parsed TSV files
  - `inverted_index.json`: Inverted index for search engine
  - `tfidf_inverted_index.json`: TF-IDF inverted index
  - `urls.txt`: List of degree webpage URLs
  - `vocabulary.txt`: Vocabulary mapping words to IDs
  - `Map.html`: Interactive map visualization
  - `Description.md`: Metadata markdown file
  - `merged_courses.tsv`: Merged courses data
- `CommandLine.sh`: Bash script containing solution for command line question
- `crawler.py`: Python module containing web scraping functions
- `functions.py`: Python module containing utility functions
- `parser.py`: Python module containing data parsing functions
- `searchEngine.py`: Python module containing search engine implementation
- `searchEngineNew.py`: Python module containing new scoring function
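
To give a feel for what a vocabulary and an inverted index hold, here is a minimal sketch. It is not the code from `searchEngine.py`: the toy corpus, the function names `build_vocabulary` and `build_inverted_index`, and the in-memory layout are assumptions for illustration, and the actual format of `vocabulary.txt` and `inverted_index.json` may differ.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> already preprocessed (stemmed) tokens.
docs = {
    0: ["data", "scienc", "master"],
    1: ["comput", "scienc", "degre"],
    2: ["data", "engin", "master"],
}


def build_vocabulary(docs):
    """Assign each distinct term an integer ID, in order of first appearance."""
    vocabulary = {}
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary)
    return vocabulary


def build_inverted_index(docs, vocabulary):
    """Map each term ID to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            postings[vocabulary[term]].add(doc_id)
    return {term_id: sorted(ids) for term_id, ids in postings.items()}


vocabulary = build_vocabulary(docs)
inverted_index = build_inverted_index(docs, vocabulary)
print(vocabulary["data"], inverted_index[vocabulary["data"]])  # prints: 0 [0, 2]
```

Storing term IDs rather than raw strings as index keys keeps the index compact, at the cost of one extra lookup through the vocabulary per query term.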
The analysis focused on:
- Web scraping degree pages to build a dataset
- Preprocessing text data including stopword removal and stemming
- Implementing an inverted index search engine for conjunctive queries
- Ranking search results by TF-IDF and cosine similarity
- Defining a custom scoring function to rank results
- Visualizing results on a geographic map colored by cost
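
The conjunctive search and TF-IDF ranking steps listed above can be sketched roughly as follows. This is an illustrative example only, not the implementation in `searchEngine.py`: the toy corpus, the function names, and the specific weighting (term frequency normalized by document length, times `log(N / df)`) are assumptions, and the homework code may use a different TF-IDF variant.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document ID -> preprocessed tokens.
docs = {
    0: ["data", "scienc", "master"],
    1: ["comput", "scienc", "degre"],
    2: ["data", "engin", "master", "data"],
}

# Inverted index: term -> set of document IDs containing the term.
index = {}
for doc_id, tokens in docs.items():
    for term in set(tokens):
        index.setdefault(term, set()).add(doc_id)

# Inverse document frequency (assumed variant: log(N / df)).
n_docs = len(docs)
idf = {term: math.log(n_docs / len(ids)) for term, ids in index.items()}


def conjunctive_search(query_terms):
    """Return the documents that contain every query term (AND query)."""
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()


def tfidf_vector(tokens):
    """Sparse TF-IDF vector as a dict: term -> weight."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query_terms):
    """Filter with the conjunctive query, then rank hits by cosine similarity."""
    query_vec = tfidf_vector(query_terms)
    hits = conjunctive_search(query_terms)
    ranked = [(d, cosine(query_vec, tfidf_vector(docs[d]))) for d in hits]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


for doc_id, score in search(["data", "master"]):
    print(doc_id, round(score, 3))
```

In this toy corpus the query `["data", "master"]` matches documents 0 and 2, and document 0 ranks first because all of its weight sits on terms that also appear in the query or share its direction, while document 2 spends weight on the rarer term `engin`.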
The repository contains all code and output to replicate the analysis described in the homework.
- Ambar Chatterjee
- Himel Ghosh
- Erika Ioana Zetu
- Alessandra Colaiocco