Data importer for jore-history-graphql. The importer downloads a database export from an FTP server and imports it into a PostgreSQL database.
You need a Postgres database to run the import against. The app will apply the DB schema when starting.
Ensure you are using the dev database (`transitlog-dev-cluster-c.postgres.database.azure.com`) or a local database when running locally.
Build the container:

```
docker build -t hsldevcom/jore-history-graphql-import:production .
```
Use the tags `:dev`, `:stage`, and `:production` to target each environment. This way you can prevent code from being deployed to unintended environments.
Instead of using the above command, you can use the provided deploy scripts. Please ensure you are logged in to Docker Hub before using them, as they will push the built image.
Example:

```
./deploy-env.sh
```

The `deploy-env.sh` script will prompt you for the environment you want to build for. Make a selection, for example `2` for the "development" environment. It will then build the image, tag it appropriately, and push it to Docker Hub.
You can also use the `deploy-all.sh` script to build for all environments.
Please use the `play-update-jore-updater.yml` playbook in Transitlog-IAC to update the JORE importer service. Do this after pushing new image versions.
Check the `.env` files for an idea of which env variables you need to run the app. All env vars in the public `.env` files (e.g. `.env.production`) with the value `secret` need to be set manually when running the app. It is recommended to copy the production env file to `.env`, which is gitignored, and enter all the required information that can't be in the public file. Make sure to add the appropriate credentials for the FTP server (`USERNAME`, `PASSWORD`). Ask the project team members for these.
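As an illustrative sketch, a filled-in `.env` for local development might look roughly like this. All values are placeholders, and only variables mentioned in this README are shown:

```
# Placeholder values only; ask the project team for the real credentials.
USERNAME=ftp-user          # FTP server username
PASSWORD=ftp-password      # FTP server password
ADMIN_PASSWORD=secret      # password for the admin interface
DEBUG=true                 # allows re-importing an already imported export
ENVIRONMENT=local          # the DB dump upload is skipped when running locally
```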
Start the importer with Docker:

```
docker run -p 8000:8000 --name jore-history-graphql-import hsldevcom/jore-history-graphql-import:production
```
Alternatively, when developing, start the importer directly with Node on your computer:

```
yarn run start
```
Upon starting, the app will check that the DB is fully migrated and run any schema changes if needed. You can also run the migrations manually with `yarn run knex migrate:latest`. Check the Knex docs for more information about migrations.
The import runs on a timer and will activate each night at around 3:00 am. Access the admin interface to trigger the import immediately.
If you want to extract the export ZIP file from the importer, bind a volume to the `/tmp/build` directory. The export will then be available to you after the importer has downloaded it. To run the import with a local export file, without using the FTP server, bind a volume to the `/source` directory. Both volume binds look roughly like this:

```
docker run -v ./source:/source:ro -v ./downloads:/tmp/build ...etc
```
When running without Docker, any downloaded files will be in the `downloads` directory of this project.
The app exposes an admin interface at port 8000 (by default) which can be used to start an import, start an import from an uploaded file, select which tables to import, or dump the database.
To access the admin view of an instance running in the cloud (dev or production), use an SSH tunnel to the host that is running the app. Run, for example:

```
ssh -L 8000:localhost:80 10.223.14.12
```

Then you can access the admin view at http://localhost:8000. Substitute the port number after `localhost` with the port you published from Docker, and the private IP address with the one that your instance of the app uses.
The admin interface will ask you for credentials. The username is `admin` and the password is whatever `ADMIN_PASSWORD` you set in the env config. For development, the credentials are `admin` and `secret`.
The admin view exposes a few options. Under normal circumstances you don't need to use it, but it can be useful for development. Remember that all options are only stored in memory, so they will be reset when restarting the app!
You can run the daily import immediately by pressing the "Run import task" button. This will download the latest export from the FTP server. In development, when the `DEBUG` env var is `true`, it is possible to re-import an already imported file. During normal operation this is not possible, as all imported exports are tracked in the `public.import_status` table of the connected database. If an export has been imported successfully, importing it again will be blocked.
Clicking the button will do exactly the same thing as has been scheduled to happen every night.
If you have a compatible export archive you can upload it through the admin interface and import it into the DB. It follows the same rules as the daily import, i.e. a successfully imported export cannot be imported again when `DEBUG != true`. However, all uploaded exports get "-downloaded" appended to the filename, so they can be imported one more time if already successfully imported through the daily scheduled run.
With these options you can select which tables you want to import. This setting is applied to all future imports, scheduled or manual, until the app is restarted. The setting is kept in memory only, so remember to set it again after restarting the app during development. All tables with a checked box will be imported; those without will not. This is useful if you want to test your code and skip large tables like `departure`.
You can also disable or enable importing and removing rows. In addition to the data for each table, each export archive contains a `[table]_removed.dat` file which lists rows that have been removed from JORE since the last export. The "remove" feature of the importer removes these rows from the connected database. Removal is performed before import.
Under normal operation, the importer will take a snapshot of the database and upload it to Azure blob storage. This is skipped when the `ENVIRONMENT` env var is `local`. You can use this button to trigger the dump manually.
The task of this importer is to download export files from an FTP server, open them, read them, and insert each row into the connected database. The export archive is produced every day except weekends at the end of the day, separately from this project.
The export archive contains a number of `.dat` files, each corresponding to a table in the database. It also contains `[table]_removed.dat` files which list all rows that have been removed since the previous export archive was generated.
Each `.dat` file in the archive contains rows which are to be inserted into the database. The fields are not separated, but the schema contains the length of each field so that they can be read into database columns.
Note that you need quite a lot of memory on your machine to run the importer, as it handles a lot of data. The import may crash when running locally. It is a good idea to exclude the `departure` table from any test imports you do, both to lower the memory requirements of the import and to make it complete faster.
The schema of the database and the .dat files is described in the `schema.js` file. In most cases, the schema for a line in the .dat file matches the database schema, but for the geometry table they differ. To accommodate this, the schema for each table can have an additional `lineSchema` property that describes how the lines should be read.
The top-level properties in the schema are table names. Under each table name is an object that describes the schema for the table and, as mentioned earlier, an optional `lineSchema` property describing the lines in the .dat file.
The properties of each table schema are:

```
{
  filename: name of the .dat file containing the data,
  lineSchema (optional): fields in the .dat file if different from the database,
  fields: fields in the database and .dat file,
  primary (optional): fields that constitute the primary key of the table
}
```
Each field description in the `fields` (or `lineSchema`) array is an object with the following properties:
```
{
  length: number, how many characters long the field is in the .dat file,
  name: the name of the database column,
  type: the type of the database column,
  notNullable (optional): boolean, true if the field must not be null,
  index (optional): boolean, true to create an index on this field,
  primary (optional): boolean, true to make this field the primary key
}
```
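To illustrate, a complete table entry combining these pieces might look like the sketch below. The table name, filename, and field definitions here are made up for the example and are not copied from `schema.js`:

```js
// Hypothetical schema.js entry, for illustration only.
stop: {
  filename: "pysakki.dat",
  fields: [
    { length: 7, name: "stop_id", type: "string", notNullable: true, index: true },
    { length: 8, name: "lat", type: "decimal" },
    { length: 8, name: "lon", type: "decimal" },
  ],
  primary: ["stop_id"],
}
```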
During database initialization, the schema file is read and the tables are created in the database based on this information. Find the code for this in the `setup/createDb.js` file.
The files containing the data to be imported are .dat files, with one row per item to insert. The fields are read based on the lengths defined in the `schema.js` file, which have been written based on the documentation of the data files. For example, if field A is defined as being 4 characters long, and field B is defined as 2 characters long, this is the result:
```
// .dat row
aaaabb

// Schema entry
{
  name: 'a',
  type: 'string',
  length: 4
}, {
  name: 'b',
  type: 'string',
  length: 2
}

// Database row (as JS object)
{
  a: 'aaaa',
  b: 'bb'
}
```
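A minimal sketch of this kind of fixed-width parsing could look like the following. This is not the actual code in `util/parseLine.js`, which may differ in details such as trimming and type casting:

```js
// Sketch of fixed-width parsing; the real util/parseLine.js may differ.
function parseLine(line, fields) {
  const item = {};
  let offset = 0;

  for (const { name, length } of fields) {
    // Each field occupies a fixed number of characters on the line.
    item[name] = line.slice(offset, offset + length).trim();
    offset += length;
  }

  return item;
}

// parseLine("aaaabb", [
//   { name: "a", type: "string", length: 4 },
//   { name: "b", type: "string", length: 2 },
// ]) // => { a: "aaaa", b: "bb" }
```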
A primary key has been added to most tables. The primary key is used during both import and removal to identify rows.
During the removal process, the primary key of each line in the remove files is calculated, and any rows in the database matching the primary key are removed.
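The idea can be sketched roughly as follows, assuming a knex connection. This is illustrative only; the importer's actual removal code may be structured differently:

```js
// Sketch: delete rows whose primary key matches a parsed [table]_removed.dat line.
async function removeRows(knex, tableName, primaryKeys, removedItems) {
  for (const item of removedItems) {
    // Build a WHERE clause from the primary key fields of the parsed line.
    const where = {};
    for (const key of primaryKeys) {
      where[key] = item[key];
    }

    await knex(tableName).where(where).del();
  }
}
```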
This is the process that happens when an import is running:
- The cron scheduler triggers the update process, or the process is triggered manually. (`schedule.js`)
- The import process is marked as started. (`index.js`)
- The FTP server is queried and the latest export archive is downloaded. If the file has previously been found to be corrupted, or if it has already been downloaded, nothing is downloaded from the server. (`sources/fetchExportFromFTP.js`)
- The archive is unpacked and each file is sent through the import pipeline. (`import.js`)
- For each file/table, the encoding is fixed to be UTF-8. (`import.js`)
- The file stream is split into lines. (`import.js`)
- Each line is sent through a preprocessor which fixes the line breaks and geometries. (`preprocess.js`)
- Then the stream is piped to the database importer. (`database.js`)
- Each line is parsed into objects that are easier to process with JavaScript. (`util/parseLine.js`)
- Future rows are filtered out, since this is a history database and rows that take place in the future may yet change. (`util/futureFilter.js`)
- The item stream is collected into batches of at most 2000 items (defined in `constants.js`). (`database.js`)
- The batch is sent to the actual import query. (`database.js`)
- The batched import query performs an "upsert" for each row in the batch. An upsert updates the row if it is found by the primary key, or inserts it if not. If no keys or constraints are defined for the table, it just inserts everything. (See the sketch after this list.)
- This process runs as much in parallel as possible for each table, using a queueing system to avoid overwhelming the database connection. When the queue is processed for all tables, the import is finished. (`import.js`)
- If not running locally (e.g. in development), some tables are vacuumed and analyzed after the import. (`import.js`)
- Also if not running locally, the database is dumped and uploaded to Azure Blob Storage. (`util/createDbDump.js` and `util/uploadDbDump.js`)
- Then the import is marked as finished and successful.
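To illustrate the upsert step, a sketch using knex might look like this. It is only meant to show the shape of the query; the actual implementation in `database.js` may differ:

```js
// Sketch of a batched upsert with knex; the real database.js may differ.
async function upsertBatch(knex, tableName, primaryKeys, batch) {
  if (primaryKeys.length === 0) {
    // No keys or constraints defined for the table: plain insert.
    return knex(tableName).insert(batch);
  }

  // Insert each row, or update the existing row when the primary key matches.
  return knex(tableName)
    .insert(batch)
    .onConflict(primaryKeys)
    .merge();
}
```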
The geometry table is handled slightly differently, in that its lines are grouped and combined by route.
Errors and other exceptions are logged to the HSLdevcom Slack (channel #transitlog-monitoring). If the import was left unfinished or failed (after a server crash, for example), it is retried the next time the service restarts.
If you need to reimport a file, you can enter the database and find the `public.import_status` table. This is the table that records all imported archives and marks them as complete. To reimport a file on an instance of the importer that is running without `DEBUG=true`, delete the row where `filename` equals the name of the file you're trying to import, or set `success=false`.
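For example, using knex against the connected database, the reset could look roughly like this (the filename here is a placeholder):

```js
// Sketch: mark an export as unsuccessful so it can be imported again.
await knex("import_status")
  .withSchema("public")
  .where({ filename: "example_export.zip" })
  .update({ success: false });
```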
To check the .dat files for data, use the command line. Many of the files are huge and cannot be opened in normal code editors.
Get an export archive either from the `downloads` directory or from the FTP server directly, and extract the .dat file you want to explore from the archive. The `aikat.dat` file is used here as an example; it contains all departures and is the largest one.
Check the schema definition for how long the fields are; then you can build grep queries to find what the .dat file contains.

List the first 10 lines:

```
head -n 10 aikat.dat
```
Find all lines for route 1001/1 on Wednesdays from stop `1050417`:

```
less aikat.dat | grep "10504171001 1Ke"
```
Once connected to an instance, check the logs:

```
$ docker exec -it 42f257a81b9c sh
$ yarn run forever logs 0 -f
```