duzx16 / SearchTHU

Course project for Fundamentals of Search Engine Technology in Tsinghua University

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

SearchTHU

Source code of the course project for Fundamentals of Search Engine Technology in Tsinghua University by Zhengxiao Du and Zhouxing Shi.

Requirements

  • Python 3.5+
  • Java 1.8
  • Node 10+
  • Npm 6+
  • Tomcat 9.0.21

Deployment Guide

Data

Data of format .html, .pdf, .docx or .doc are crawled with Heritrix. They are too large (>20G) and thus not included here, but can be provided upon request.

Preprocessing

To preprocess files under the <dir> directory, run:

cd preprocess
pip install -r requirements.txt
python DocParser.py <dir>

For each file with path <path>, it outputs a processed JSON file <path>.json.

Backend

First build the index by running java -jar backend/out/artifacts/Indexer_jar/SearchTHU.jar <data_dir> <index_dir>

Then put backend/out/artifacts/SearchTHU_war/SearchTHU_war.war at webapps/ of the Tomcat directory, and then start the Tomcat service.

The API path is http://hostname:port/SearchTHU_war/. See the API details at doc/api.md.

Frontend

To start the frontend service:

cd frontend
npm install
npm start

And then go to http://localhost:8080 to start using the application.

About

Course project for Fundamentals of Search Engine Technology in Tsinghua University


Languages

Language:Python 51.4%Language:OpenEdge ABL 48.1%Language:Java 0.2%Language:JavaScript 0.2%Language:Vue 0.2%Language:HTML 0.0%