Full text search utilities for websites with static pages.
This project was initially created for edu365.cat, a portal with educational content promoted by the Department of Education of the Government of Catalonia. The site was originally built with static HTML pages and therefore had no search engine. "edu365-text-search" extracts the text content of selected columns of a shared Google spreadsheet and provides a simple search API.
You will need NodeJS installed on your system in order to build the main application.
The project is built on the following components:
- Express as an HTTP API server.
- Fuse.js as a search engine.
- Google Sheets API as a backend.
- Winston as an advanced logging system.
- PM2 to launch and monitor the API server.
The search engine uses Google spreadsheets to organize its information. In essence, it lists the records that contain the searched text in certain columns.
Specifically, it uses 2 Google Spreadsheets:
- Source: the spreadsheet edited by the user. The system reads it to extract information and to build the auto-generated spreadsheet. It must follow a specific format.
- Auto-generated: built by the system from the source spreadsheet. It contains only the information the search engine needs, in the proper format. Users should not modify this spreadsheet.
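The basic idea described above, listing the records whose columns contain some text, can be sketched with a plain substring filter. This is only an illustration: the project delegates matching to Fuse.js, which is fuzzy, and the function name `findRecords` and the sample rows are hypothetical.

```javascript
// Illustrative only: a case-insensitive substring filter, not the fuzzy
// matching that the project delegates to Fuse.js.
function findRecords(records, text, keys) {
  const needle = text.trim().toLowerCase();
  if (needle === '') return [];
  return records.filter((record) =>
    keys.some((key) => String(record[key] ?? '').toLowerCase().includes(needle))
  );
}

// Hypothetical sample rows using the column names of the source spreadsheet.
const sampleRecords = [
  { Url: 'https://example.org/a', Activitat: 'Fraccions', Area: 'Matemàtiques', Descriptors: 'nombres, divisió' },
  { Url: 'https://example.org/b', Activitat: "El cicle de l'aigua", Area: 'Ciències', Descriptors: 'aigua, natura' },
];

// Only the listed columns are searched; 'Area' and 'Url' are ignored here.
const hits = findRecords(sampleRecords, 'aigua', ['Activitat', 'Descriptors']);
console.log(hits.map((record) => record.Url));
```

Because only the columns named in `keys` are inspected, a text that appears solely in another column (for example 'Area') produces no hits in this sketch.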
The source spreadsheet is the data source for the search engine, and the user is expected to create and maintain it.
It must follow some format rules:
- It must contain at least the following 5 pages: 'INFANTIL', 'PRIMÀRIA', 'ESO', 'BATXILLERAT', '+EDU'. Other pages will be ignored.
- Each page must contain at least 4 columns: 'Url', 'Activitat', 'Area', 'Descriptors'. Other columns will be ignored.
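The format rules above could be checked before building the index. The following is a minimal sketch, assuming the header row of each page has already been read into a plain object; the function `checkSourceFormat` and the `headersByPage` shape are hypothetical and not part of the project.

```javascript
const REQUIRED_PAGES = ['INFANTIL', 'PRIMÀRIA', 'ESO', 'BATXILLERAT', '+EDU'];
const REQUIRED_COLUMNS = ['Url', 'Activitat', 'Area', 'Descriptors'];

// `headersByPage` is a hypothetical object: page name -> array of header cells.
// Returns a list of human-readable problems; an empty list means the format is fine.
function checkSourceFormat(headersByPage) {
  const problems = [];
  for (const page of REQUIRED_PAGES) {
    const headers = headersByPage[page];
    if (!Array.isArray(headers)) {
      problems.push(`Missing page: ${page}`);
      continue;
    }
    for (const column of REQUIRED_COLUMNS) {
      if (!headers.includes(column)) {
        problems.push(`Page ${page} is missing column: ${column}`);
      }
    }
  }
  return problems;
}

// Extra pages and extra columns are simply not inspected.
console.log(checkSourceFormat({ ESO: ['Url', 'Activitat', 'Notes'] }));
```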
The configuration values related to the Google spreadsheet data source are:
- EDU_MAP_SPREADSHEET_ID: Google spreadsheet ID
- SEARCH_OPTIONS.keys: Array with the names of the columns in which the search engine looks for values.
The system automatically generates a second Google spreadsheet that contains only the relevant information extracted from the source spreadsheet. It is not intended to be modified by the user.
This spreadsheet should be regenerated each time the source one changes by calling the 'build-index-page' endpoint.
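The extraction step can be pictured as a projection that keeps only the needed columns. The sketch below assumes that a page is read as a 2D array whose first row holds the headers, and that the relevant columns are the four required ones; the actual layout of the auto-generated spreadsheet is not described here, so the function `projectRows` is purely hypothetical.

```javascript
const INDEX_COLUMNS = ['Url', 'Activitat', 'Area', 'Descriptors'];

// `rows` is a hypothetical 2D array: first row = headers, remaining rows = data.
// Returns one object per data row containing only the requested columns.
function projectRows(rows, columns = INDEX_COLUMNS) {
  if (rows.length === 0) return [];
  const [headers, ...data] = rows;
  const positions = columns.map((name) => headers.indexOf(name));
  return data.map((row) => {
    const record = {};
    columns.forEach((name, i) => {
      const position = positions[i];
      // Missing columns or empty cells become empty strings.
      record[name] = position >= 0 ? (row[position] ?? '') : '';
    });
    return record;
  });
}

const page = [
  ['Notes', 'Url', 'Activitat', 'Area', 'Descriptors'],
  ['draft', 'https://example.org/a', 'Fraccions', 'Matemàtiques', 'nombres'],
];
console.log(projectRows(page));
```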
The configuration values related to the auto-generated Google spreadsheet are:
- AUTO_SPREADSHEET_ID: Google spreadsheet ID of the auto-generated spreadsheet
- SPREADSHEET_PAGE: Spreadsheet page name
- Environment specific configuration can be done using .env files.
- Other configurations should be done using config.js.
You must obtain a set of OAuth2 credentials from the Google API console. These credentials should be downloaded to a file named credentials.json and stored in the credentials folder. You must also enable the Google Sheets API for a user with read and write rights on this spreadsheet.
The next step is to duplicate the file .env-example and name the copy .env.
Edit .env and set the value of EDU_MAP_SPREADSHEET_ID to the identifier of your spreadsheet (the part between /spreadsheets/d/ and /edit in the spreadsheet URL). You should also set AUTH_SECRET to a random text. Other settings, such as APP_PORT, LOG_LEVEL or LOG_FILE, are optional.
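As a convenience, the two mandatory values could be produced with a small Node.js script. This is a minimal sketch, not part of the project: the helper names `spreadsheetIdFromUrl` and `randomSecret` are hypothetical, the URL is a placeholder, and the secret length of 32 bytes is an arbitrary choice.

```javascript
const { randomBytes } = require('node:crypto');

// Returns the path segment that follows /spreadsheets/d/ in a Google Sheets
// URL, or null when the URL does not have that shape.
function spreadsheetIdFromUrl(url) {
  const match = /\/spreadsheets\/d\/([^/]+)/.exec(url);
  return match ? match[1] : null;
}

// Hypothetical helper: random bytes rendered as a hexadecimal string.
function randomSecret(bytes = 32) {
  return randomBytes(bytes).toString('hex');
}

// Placeholder URL, not a real spreadsheet.
const exampleUrl = 'https://docs.google.com/spreadsheets/d/EXAMPLE_ID_123/edit#gid=0';

// Lines that could be pasted into .env:
console.log(`EDU_MAP_SPREADSHEET_ID=${spreadsheetIdFromUrl(exampleUrl)}`);
console.log(`AUTH_SECRET=${randomSecret()}`);
```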
NOTE: In order to obtain a "refresh token" valid for an indefinite time, you must first revoke the permissions currently granted to your project. This can be done by visiting: https://myaccount.google.com/connections
Install the dependencies using NPM or Yarn:
# Go to the main project directory:
$ cd path/to/edu365-text-search
# Install the required npm components:
$ npm install
Then launch the server using NPM:
$ cd path/to/edu365-text-search
$ npm start
After every edit of any page on the site, open this URL in your browser to rebuild the index:
http://%HOST%:%APP_PORT%/build-index-page?auth=%AUTH_SECRET%
... replacing %HOST% with the host name or IP (usually 'localhost' in the development environment), and %APP_PORT% and %AUTH_SECRET% with the real values of these variables in .env.
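The placeholder substitution described above can be automated. The sketch below is an illustration with a hypothetical helper, `fillUrlTemplate`, and made-up values (port 3000 and the secret are assumptions, not project defaults); it URL-encodes each value so that a secret containing spaces or '&' still yields a valid URL.

```javascript
// Replaces every %NAME% placeholder with the URL-encoded value of `values[NAME]`.
// Throws if the template uses a placeholder that has no value.
function fillUrlTemplate(template, values) {
  return template.replace(/%([A-Z_]+)%/g, (placeholder, name) => {
    if (!(name in values)) throw new Error(`No value for ${placeholder}`);
    return encodeURIComponent(String(values[name]));
  });
}

// Hypothetical values; use the ones from your own .env file.
const url = fillUrlTemplate(
  'http://%HOST%:%APP_PORT%/build-index-page?auth=%AUTH_SECRET%',
  { HOST: 'localhost', APP_PORT: 3000, AUTH_SECRET: 'my secret&text' }
);
console.log(url);
// http://localhost:3000/build-index-page?auth=my%20secret%26text
```

The same helper could fill the query URL template, in which case the encoding step matters most for %QUERY_TEXT%, since search text often contains spaces and accented characters.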
To perform a query, just use this URL:
http://%HOST%:%APP_PORT%/q=%QUERY_TEXT%
You will find examples of a search form and a results page in /test.
This application uses Fuse.js by Kiro Risk to perform search queries. Fuse has a lot of specific settings that can be adjusted to fit your needs. The settings currently used by edu365-text-search are:
// See file: /config.js
SEARCH_OPTIONS: {
  // Basic
  isCaseSensitive: false,
  includeScore: false,
  includeMatches: false,
  minMatchCharLength: 2,
  shouldSort: true,
  findAllMatches: false,
  keys: ['Activitat', 'Descriptors'],
  // Fuzzy
  location: 0,
  threshold: 0.3,
  distance: 3,
  // Advanced
  useExtendedSearch: false,
},
Please check out Fuse.js for a full description of each option.
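To get a feeling for how the fuzzy settings interact, the following sketch models a match score as the sum of an accuracy term (errors relative to pattern length) and a proximity term (offset from `location` relative to `distance`), accepted when it does not exceed `threshold`. This is a simplified model based on how the Fuse.js documentation describes these options; it is not code taken from the library, the names `modelScore` and `accepts` are hypothetical, and actual behavior may differ between Fuse.js versions.

```javascript
// Simplified model of the score of one candidate match (0 = perfect match).
// An assumption-based paraphrase for illustration, not Fuse.js source code.
function modelScore({ patternLength, errors, matchIndex, location, distance }) {
  const accuracy = errors / patternLength;
  const proximity = Math.abs(location - matchIndex);
  if (distance === 0) return proximity === 0 ? accuracy : 1;
  return accuracy + proximity / distance;
}

// Fuzzy settings as listed in config.js.
const settings = { location: 0, threshold: 0.3, distance: 3 };

// A candidate is accepted when its modelled score stays within the threshold.
const accepts = (candidate) =>
  modelScore({ ...candidate, ...settings }) <= settings.threshold;

// Exact match at the start of the text.
console.log(accepts({ patternLength: 5, errors: 0, matchIndex: 0 }));
// Exact match starting one character later.
console.log(accepts({ patternLength: 5, errors: 0, matchIndex: 1 }));
// One wrong character in a 5-character pattern, at the start of the text.
console.log(accepts({ patternLength: 5, errors: 1, matchIndex: 0 }));
```

Under this model, a small `distance` makes the position of the match weigh heavily, so it may be worth experimenting with `distance` and `threshold` together if relevant records are not being returned.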
In order to run the server in a Docker container, just launch:
$ docker compose up
In production environments, it is better to start the service as a daemon:
$ docker compose up -d
"Edu365 text search" is an open source development made by the Department of Education of the Government of Catalonia, released under the terms of the European Union Public Licence v. 1.2.