Parla (api & database)

This is the API and database for the exploratory project Parla. It is not production ready. We are currently exploring whether the parliamentary documentation published as open data by the Abgeordnetenhaus of Berlin (https://www.parlament-berlin.de/dokumente/open-data) can be made more accessible by embedding all the data and searching it with vector similarity search. The project is heavily based on an example from the Supabase community. Built with Fastify and deployed to render.com using Docker.
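
To illustrate what vector similarity search means here, the following is a minimal in-memory sketch that ranks embedded text chunks by cosine similarity to a query embedding. It is not the project's implementation: the similarity search described in this README runs in the database, and the `EmbeddedChunk` shape and the function names `cosineSimilarity` and `topMatches` are hypothetical, chosen only for this example.

```typescript
// Hypothetical shape of an embedded chunk; not the project's actual schema.
interface EmbeddedChunk {
  id: number;
  content: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors, in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks whose embeddings are most similar to the query embedding.
function topMatches(
  query: number[],
  chunks: EmbeddedChunk[],
  k: number,
): EmbeddedChunk[] {
  return [...chunks]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding) -
        cosineSimilarity(query, x.embedding),
    )
    .slice(0, k);
}
```

In practice the query embedding would come from the embedding model and the ranking would be done by the database index rather than by sorting every chunk in memory.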

Prerequisites

docker
vercel.com account
supabase.com account
running instance of the related frontend https://github.com/technologiestiftung/parla-frontend
running instance of the database, defined in ./supabase
populated database, filled using these tools: https://github.com/technologiestiftung/parla-document-processor

Needed Environment Variables

See also .envrc.sample, which might be more up to date.

export SUPABASE_URL="http://localhost:54321"
export SUPABASE_ANON_KEY="ey..."
# Get your key at https://platform.openai.com/account/api-keys
export OPENAI_KEY="sk-UY..."
export SUPABASE_SERVICE_ROLE_KEY=
# in dev we can use a lesser version to save some coins
export OPENAI_MODEL="gpt-3.5-turbo"
export PORT="8080"
export OPENAI_EMBEDDING_MODEL="text-embedding-3-small"
# should be one of "debug", "info", "warning", "error", "critical"
export LOG_LEVEL="info"
# This is only for testing purpose and should not be allowed in production
# for real real!
export DANGEROUSLY_ALLOW_CORS_FOR_ALL_ORIGINS="FOR_REAL_REAL"

Hint: We use direnv to manage environment variables during development. See https://direnv.net/
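
A small startup check can catch missing or malformed variables early. The following is a minimal sketch, not code from this repository: the function name `checkEnv` is hypothetical, and treating every variable in the list as required is an assumption. The variable names and the allowed log levels are taken from the listing above.

```typescript
// Variable names from the sample above; treating all of them as required
// is an assumption made for this sketch.
const REQUIRED_VARS = [
  "SUPABASE_URL",
  "SUPABASE_ANON_KEY",
  "SUPABASE_SERVICE_ROLE_KEY",
  "OPENAI_KEY",
  "OPENAI_MODEL",
  "OPENAI_EMBEDDING_MODEL",
  "PORT",
  "LOG_LEVEL",
] as const;

// Allowed values as listed in the comment above LOG_LEVEL.
const LOG_LEVELS = ["debug", "info", "warning", "error", "critical"];

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns a list of human-readable problems,
// or an empty list if the environment looks complete.
function checkEnv(env: Env): string[] {
  const problems: string[] = [];

  for (const name of REQUIRED_VARS) {
    if (!env[name]) {
      problems.push(`${name} is missing or empty`);
    }
  }

  const level = env["LOG_LEVEL"];
  if (level && !LOG_LEVELS.includes(level)) {
    problems.push(`LOG_LEVEL must be one of: ${LOG_LEVELS.join(", ")}`);
  }

  const port = env["PORT"];
  if (port) {
    const parsed = Number(port);
    if (!Number.isInteger(parsed) || parsed < 1 || parsed > 65535) {
      problems.push("PORT must be an integer between 1 and 65535");
    }
  }

  return problems;
}
```

In a Node.js process this could be called with process.env before the server starts, aborting with the collected messages if the returned list is not empty.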

Installation

npm ci

Deployment

Currently we deploy using docker on render.com.

Go to render.com
Allow render to access your GitHub repository
Create a new web service (type should be docker)
Populate the environment variables
Deploy

Development

Start up a local database:

npx supabase start

Run the API:

npm run dev

Edit the files in src

See also the Swagger documentation at http://localhost:8080/documentation/static/index.html

Periodically regenerate indices

The indices on the processed_document_chunks and processed_document_summaries tables need to be regenerated when new data arrives. This is because the lists parameter of the index should be adjusted to the number of rows, as recommended in the pgvector documentation: https://github.com/pgvector/pgvector. To do this, we use the pg_cron extension: https://github.com/citusdata/pg_cron. To schedule the regeneration of the indices, we create two jobs that call functions defined in the API and database definition: https://github.com/technologiestiftung/parla-api.

select cron.schedule (
    'regenerate_embedding_indices_for_chunks',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_chunks() $$
);

select cron.schedule (
    'regenerate_embedding_indices_for_summaries',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_summaries() $$
);
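
For orientation, the pgvector README suggests as a starting point for the lists parameter: rows / 1000 for up to one million rows, and sqrt(rows) above that. The following sketch applies that rule of thumb. The function name `recommendedLists` is hypothetical, and whether the regeneration functions in this repository use exactly this formula is an assumption; check the database definition for the actual logic.

```typescript
// Hypothetical helper applying the pgvector starting-point heuristic:
// rows / 1000 for up to 1M rows, sqrt(rows) for more than 1M rows.
function recommendedLists(rowCount: number): number {
  if (!Number.isFinite(rowCount) || rowCount < 0) {
    throw new Error("rowCount must be a non-negative finite number");
  }
  const lists =
    rowCount <= 1_000_000 ? rowCount / 1000 : Math.sqrt(rowCount);
  // An index needs at least one list, so clamp small tables to 1.
  return Math.max(1, Math.round(lists));
}
```

Because the recommended value grows with the table, an index built for a small table becomes less suitable as documents are added, which is why the indices are rebuilt on a schedule.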

Feedback Feature

To seed the feedback types and tags for the initial version, you can use this snippet:

INSERT INTO feedbacks (kind, tag)
VALUES
    ('positive', NULL),
    ('negative', 'Antwort inhaltlich falsch oder missverständlich'),
    ('negative', 'Es gab einen Fehler'),
    ('negative', 'Antwort nicht ausführlich genug'),
    ('negative', 'Dokumente unpassend');