ckan / ckan

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

Home Page: https://ckan.org/


nightly test with realistic data, performance measurements

wardi opened this issue

Let's improve the create-test-data command to generate a more typical count of users, groups, orgs and datasets. It should also upload generated test data with tens of columns and thousands of rows. Next we should create a benchmark-test-data command to exercise the UI and APIs to display, sort and query the generated data.
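As a rough illustration of the "tens of columns and thousands of rows" upload, here is a minimal sketch that generates such a table as CSV using only the standard library. The function names (`generate_rows`, `write_csv`), the column naming scheme and the default sizes are assumptions made for this example; none of this is part of the existing create-test-data command, and the upload step itself is left out.

```python
import csv
import io
import random
from datetime import date, timedelta


def generate_rows(num_rows=5000, num_columns=30, seed=0):
    """Yield a header row, then num_rows rows of mixed-type values.

    Hypothetical helper: a fixed seed keeps the generated data identical
    between runs, so timings from different nights stay comparable.
    """
    rng = random.Random(seed)
    kinds = ["int", "float", "text", "date"]
    # Cycle through the kinds so every type appears in the table.
    columns = [
        (f"col_{i}_{kinds[i % len(kinds)]}", kinds[i % len(kinds)])
        for i in range(num_columns)
    ]
    yield [name for name, _ in columns]

    start = date(2000, 1, 1)
    for _ in range(num_rows):
        row = []
        for _, kind in columns:
            if kind == "int":
                row.append(rng.randint(0, 1_000_000))
            elif kind == "float":
                row.append(round(rng.uniform(0, 1000), 3))
            elif kind == "date":
                offset = timedelta(days=rng.randint(0, 9000))
                row.append((start + offset).isoformat())
            else:
                row.append("".join(rng.choices("abcdefghij ", k=12)))
        yield row


def write_csv(stream, **kwargs):
    """Write the generated table to a text stream; return the data row count."""
    writer = csv.writer(stream)
    written = 0
    for row in generate_rows(**kwargs):
        writer.writerow(row)
        written += 1
    return written - 1  # exclude the header row


if __name__ == "__main__":
    buffer = io.StringIO()
    rows = write_csv(buffer, num_rows=5000, num_columns=30)
    print(f"generated {rows} rows, {len(buffer.getvalue())} characters of CSV")
```

The resulting file could then be uploaded as a resource by whatever mechanism the command ends up using.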

These commands should have an option to generate a detailed report with the time taken by each creation or query task.
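One possible shape for the per-task timing report, sketched under the assumption that each creation or query step can be wrapped in a context manager. `TimingReport` and its JSON layout are hypothetical names chosen for this example, not an existing CKAN API.

```python
import json
import time
from contextlib import contextmanager


class TimingReport:
    """Collects one entry per timed task (hypothetical helper)."""

    def __init__(self):
        self.entries = []

    @contextmanager
    def task(self, name):
        """Time the enclosed block and record whether it raised."""
        start = time.perf_counter()
        ok = True
        try:
            yield
        except Exception:
            ok = False
            raise
        finally:
            # Record the entry even when the task fails, so failures
            # show up in the report instead of silently disappearing.
            self.entries.append(
                {
                    "task": name,
                    "seconds": time.perf_counter() - start,
                    "ok": ok,
                }
            )

    def to_json(self):
        return json.dumps(self.entries, indent=2)


if __name__ == "__main__":
    report = TimingReport()
    with report.task("create_dataset"):
        time.sleep(0.01)  # stand-in for a real creation or query call
    print(report.to_json())
```

Using `time.perf_counter()` rather than wall-clock time avoids skew from system clock adjustments during a long nightly run.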

In our nightly build job we can collect these reports and add them to a GitHub Pages static site repo, along with the commit id and pip freeze output, to track performance for these realistic workloads over time, similar to https://speed.pypy.org/
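A minimal sketch of how the nightly job might capture the commit id and pip freeze output next to a report. `collect_run_metadata` and the dictionary keys are assumptions for illustration; the sketch only relies on `git rev-parse HEAD` and `python -m pip freeze`, and records `None` when either command is unavailable.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def _run(cmd):
    """Return the command's stdout, or None if it cannot be run."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=120
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip()


def collect_run_metadata():
    """Gather the environment details to store with a timing report."""
    freeze = _run([sys.executable, "-m", "pip", "freeze"])
    return {
        "commit": _run(["git", "rev-parse", "HEAD"]),
        "packages": freeze.splitlines() if freeze is not None else None,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    print(json.dumps(collect_run_metadata(), indent=2))
```

Writing this as JSON alongside each report would let the static site attribute a change in timings to a code change, a dependency change, or both.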

This automatic reporting will help us identify changes to CKAN's code, dependencies and environment that help or hurt performance.

Instead of synthetic test data, we should snapshot real-world data from well-known sources, e.g.:

  • World Bank
  • UN
  • NYC's 311 and Taxi Data
  • Boston's CKAN Organizations
  • Canada's Open Data Portal
  • non-English content from other CKAN sites (Saudi Arabia, Singapore, Japan, Africa, Argentina, Finland, etc.)

The sample data snapshot should be curated so that it exercises CKAN subsystems (e.g. different data types, date formats, UTF-8 encoding and languages).
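To make the curation idea concrete, here is a small sketch of a fixture that mixes scripts, date formats and number formats, plus a check that it survives a UTF-8 CSV round trip. The rows in `SAMPLE_ROWS` are invented placeholders for illustration only, not data taken from any of the sources listed above, and `round_trip` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows (not real portal data): each one varies the script,
# the date format and the numeric format, including an empty value.
SAMPLE_ROWS = [
    {"title": "Données ouvertes", "language": "fr",
     "date": "2021-03-04", "value": "1234.5"},
    {"title": "オープンデータ", "language": "ja",
     "date": "04/03/2021", "value": "1,234.5"},
    {"title": "بيانات مفتوحة", "language": "ar",
     "date": "4 March 2021", "value": ""},
]


def round_trip(rows):
    """Serialize rows to UTF-8 encoded CSV bytes and parse them back."""
    text = io.StringIO(newline="")
    writer = csv.DictWriter(text, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    raw = text.getvalue().encode("utf-8")  # what would be uploaded

    reader = csv.DictReader(io.StringIO(raw.decode("utf-8"), newline=""))
    return [dict(row) for row in reader]


if __name__ == "__main__":
    assert round_trip(SAMPLE_ROWS) == SAMPLE_ROWS
    print("fixture survives a UTF-8 CSV round trip")
```

A real curated snapshot would replace the placeholder rows with excerpts from the chosen sources while keeping the same kind of variety.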