projectestac / edu365-text-search

Full-text search utilities for sites with static pages (work in progress!)

edu365-text-search

Full text search utilities for websites with static pages.

This project was initially created for edu365.cat, a portal with educational content promoted by the Department of Education of the Government of Catalonia. The site was originally built with static HTML pages and therefore had no search engine. "edu365-text-search" extracts the text content of selected columns of a shared Google spreadsheet and provides a simple search API.

You will need NodeJS installed on your system in order to build the main application.

The project was built upon the following components:

How it works

The search engine uses Google Spreadsheets to organize information. Basically, it lists the records that contain the searched text in certain columns.

Specifically, it uses 2 Google Spreadsheets:

  • Source: the spreadsheet edited by the user, from which the system extracts information to build the auto-generated spreadsheet. It must follow a specific format.
  • Auto-generated: built by the system from the source spreadsheet, it contains only the information the search engine needs, in the proper format. Users should not modify this spreadsheet.

Google spreadsheet data source

This spreadsheet is the data source for the search engine. The user is expected to create and maintain it.

It must follow some format rules:

  • It must contain at least these 5 pages: 'INFANTIL', 'PRIMÀRIA', 'ESO', 'BATXILLERAT', '+EDU'. Other pages will be ignored.

  • Each page must contain, at least, 4 columns: 'Url', 'Activitat', 'Area', 'Descriptors'. Other columns will be ignored.
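The format rules above can be checked before building the index. The following is a minimal sketch, not the project's own code: the helper name findFormatProblems and the shape of its argument (an object mapping each page name to its header row) are assumptions made for illustration.

```javascript
// Page names and column headers required by the format rules above.
const REQUIRED_PAGES = ['INFANTIL', 'PRIMÀRIA', 'ESO', 'BATXILLERAT', '+EDU'];
const REQUIRED_COLUMNS = ['Url', 'Activitat', 'Area', 'Descriptors'];

/**
 * Hypothetical helper: `pages` maps each page name to its header row
 * (an array of column titles). Returns a list of problems found;
 * an empty list means the spreadsheet follows the rules.
 */
function findFormatProblems(pages) {
  const problems = [];
  for (const page of REQUIRED_PAGES) {
    const headers = pages[page];
    if (!headers) {
      problems.push(`Missing page: ${page}`);
      continue;
    }
    for (const column of REQUIRED_COLUMNS) {
      if (!headers.includes(column)) {
        problems.push(`Page ${page} is missing column: ${column}`);
      }
    }
  }
  return problems;
}
```

Extra pages and extra columns are not reported, since the rules say they are simply ignored.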

The configuration values related to the Google spreadsheet data source are:

  • EDU_MAP_SPREADSHEET_ID: Google spreadsheet ID
  • SEARCH_OPTIONS.keys: Array with the names of the columns the search engine looks into.

Auto-generated Google spreadsheet

The system will automatically generate another Google spreadsheet, extracting only the relevant information from the Google spreadsheet data source. It is not intended to be modified by the user.

This spreadsheet should be regenerated each time the source one changes by calling the 'build-index-page' endpoint.

The configuration values related to the auto-generated Google spreadsheet are:

  • AUTO_SPREADSHEET_ID: Google spreadsheet ID for the auto-generated spreadsheet
  • SPREADSHEET_PAGE: Spreadsheet page name

Configuration

  • Environment specific configuration can be done using .env files.
  • Other configurations should be done using config.js.

Credential settings

You must obtain a set of OAuth2 credentials from the Google API console. These credentials should be downloaded to a file named credentials.json and stored in the credentials folder. You must also enable the Google Sheets API for a user having read and write rights on the spreadsheet.

The next step is to duplicate the file .env-example, naming the copy .env.

Edit .env and set the value of EDU_MAP_SPREADSHEET_ID to the identifier of your spreadsheet (the part between /spreadsheets/d/ and /edit in the spreadsheet URL). You should also set AUTH_SECRET to a random text. Other settings such as APP_PORT, LOG_LEVEL or LOG_FILE are optional.
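As an illustration of the two values mentioned above, here is a small sketch that pulls the identifier out of a spreadsheet URL and produces a random text suitable for AUTH_SECRET. Both function names are hypothetical and not part of the project; the secret length (32 random bytes, hex-encoded) is an arbitrary choice.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: returns the part of a Google spreadsheet URL that
// follows /spreadsheets/d/ (up to the next "/", "?" or "#"), or null.
function extractSpreadsheetId(url) {
  const match = /\/spreadsheets\/d\/([^/?#]+)/.exec(url);
  return match ? match[1] : null;
}

// Hypothetical helper: 32 random bytes rendered as 64 hexadecimal characters.
function generateAuthSecret() {
  return crypto.randomBytes(32).toString('hex');
}

// Example with a made-up identifier:
// extractSpreadsheetId('https://docs.google.com/spreadsheets/d/abc123_XYZ/edit#gid=0')
// returns 'abc123_XYZ'
```

The resulting values can then be pasted into .env as EDU_MAP_SPREADSHEET_ID and AUTH_SECRET.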

NOTE: In order to obtain a "refresh token" that does not expire, you must first revoke the permissions currently granted to your project. This can be done by visiting: https://myaccount.google.com/connections

Build the main application

Install the dependencies using NPM or Yarn:

# Go to the main project directory:
$ cd path/to/edu365-text-search

# Install the required npm components:
$ npm install

Then launch the server using NPM:

$ cd path/to/edu365-text-search
$ npm start

Basic usage

After every edit of any page on the site, open this URL in your browser:

http://%HOST%:%APP_PORT%/build-index-page?auth=%AUTH_SECRET%

... replacing %HOST% with the host name or IP (usually 'localhost' in the development environment), and %APP_PORT% and %AUTH_SECRET% with the actual values of these variables in .env.

To perform a query, just use this URL:

http://%HOST%:%APP_PORT%/q=%QUERY_TEXT%
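The placeholder substitution described above can be sketched in a few lines. The two templates are copied verbatim from this section; the fillTemplate helper, the port 3000 and the sample query are assumptions made for illustration. Values are URL-encoded, so the sketch assumes the host is a plain name or an IPv4 address.

```javascript
// URL templates copied verbatim from the text above.
const BUILD_INDEX_TEMPLATE = 'http://%HOST%:%APP_PORT%/build-index-page?auth=%AUTH_SECRET%';
const QUERY_TEMPLATE = 'http://%HOST%:%APP_PORT%/q=%QUERY_TEXT%';

// Hypothetical helper: replaces every %NAME% placeholder with the
// URL-encoded value of values.NAME, and fails loudly if one is missing.
function fillTemplate(template, values) {
  return template.replace(/%([A-Z_]+)%/g, (placeholder, name) => {
    if (!(name in values)) {
      throw new Error(`No value for ${placeholder}`);
    }
    return encodeURIComponent(values[name]);
  });
}

// Example values (assumed, not taken from the project):
const queryUrl = fillTemplate(QUERY_TEMPLATE, {
  HOST: 'localhost',
  APP_PORT: 3000,
  QUERY_TEXT: 'sistema solar',
});
console.log(queryUrl); // http://localhost:3000/q=sistema%20solar
```

Encoding the query text matters as soon as it contains spaces or accented characters, which is common for Catalan search terms.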

You will find examples of a search form and results page in /test.

Advanced settings

This application uses Fuse.js by Kiro Risk to perform search queries. Fuse has a lot of specific settings that can be adjusted to fit your needs. The settings currently used by edu365-text-search are:

// See file: /config.js
SEARCH_OPTIONS: {
  // Basic
  isCaseSensitive: false,
  includeScore: false,
  includeMatches: false,
  minMatchCharLength: 2,
  shouldSort: true,
  findAllMatches: false,
  keys: ['Activitat', 'Descriptors'],

  // Fuzzy
  location: 0,
  threshold: 0.3,
  distance: 3,

  // Advanced
  useExtendedSearch: false,
},

Please check out Fuse.js for a full description of each option.
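To get a feel for how the three fuzzy settings interact, the sketch below models the rule described in the Fuse.js documentation: a candidate is penalized both for typos and for its distance from location, and it is accepted only if the combined score stays at or below threshold. This is a simplified model written for illustration, not the library's code; the function names are hypothetical, and the Fuse.js documentation remains the authoritative reference.

```javascript
// Fuzzy settings taken from SEARCH_OPTIONS above.
const FUZZY = { location: 0, threshold: 0.3, distance: 3 };

// Simplified model: share of mistyped characters plus a penalty that grows
// with the distance between the match position and the expected location.
function approximateScore(patternLength, errors, matchIndex, { location, distance }) {
  const accuracy = errors / patternLength;
  const proximity = Math.abs(matchIndex - location);
  return accuracy + proximity / distance;
}

// A candidate is kept when its score does not exceed the threshold.
function wouldMatch(patternLength, errors, matchIndex, options = FUZZY) {
  return approximateScore(patternLength, errors, matchIndex, options) <= options.threshold;
}

console.log(wouldMatch(5, 0, 0)); // true: exact match at the expected location
console.log(wouldMatch(5, 1, 0)); // true: 1 typo in 5 characters scores 0.2
console.log(wouldMatch(5, 0, 1)); // false: one position away already scores about 0.33
```

Under this model, a small distance strongly favors matches at the very beginning of a field, so raising distance or threshold is the first thing to try if expected results are missing.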

Launching the server with Docker

In order to run the server in a Docker container, just launch:

$ docker compose up

In production environments, it is better to start the service as a daemon:

$ docker compose up -d

License

"Edu365 text search" is an open source development made by the Department of Education of the Government of Catalonia, released under the terms of the European Union Public Licence v. 1.2.
