sscu-budapest / board-game-dataset

Research Dataset Template

Boilerplate and Guide for Future Datasets

Relies on sscutils and the tooling of sscub.

To Create a New Dataset

1. Add Complete Dataset with Metadata

1.1 Lay Out Structure

  • if you need external namespaces, define them in metadata/imported-namespaces.yaml
    • prefix: {uri: ..., tag (optional): ...} format in yaml
    • import them to code with inv import-namespaces
  • in-script metadata in src/namespace_metadata.py
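
The `prefix: {uri: ..., tag (optional): ...}` format can be checked before running `inv import-namespaces`. The following is a minimal sketch, assuming the YAML file has already been parsed into a plain dict. The function name `check_imported_namespaces`, the example prefixes and the example URI are hypothetical and not part of sscutils.

```python
from typing import Any, Dict, List


def check_imported_namespaces(namespaces: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed imported-namespaces mapping."""
    problems = []
    for prefix, spec in namespaces.items():
        if not isinstance(spec, dict):
            problems.append(f"{prefix}: expected a mapping, got {type(spec).__name__}")
            continue
        # 'uri' is required, 'tag' is optional according to the format above
        if not spec.get("uri"):
            problems.append(f"{prefix}: missing required key 'uri'")
    return problems


# Hypothetical content, shaped like metadata/imported-namespaces.yaml after parsing
example = {
    "games": {"uri": "https://example.com/some/dataset-repo", "tag": "v0.1"},
    "broken": {"tag": "v0.1"},
}
```

Calling `check_imported_namespaces(example)` reports only the `broken` prefix, since it lacks a `uri`.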

1.2 Adding Data

  • load all the data into the complete subset with the update_data function in src/update_data.py
  • use the ScruTable objects created in namespace_metadata to dump the data in the data directory
  • execute it with inv update-data
    • this calls the function defined in the first step
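
The rough shape of an `update_data` function might look like the sketch below. It does not use the real ScruTable interface, which is not shown in this guide: `CsvTable` is a hypothetical stand-in that only illustrates the idea of a table object that knows where to dump its rows. The table name, column names and the placeholder row are also made up.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List


class CsvTable:
    """Hypothetical stand-in for a ScruTable: knows its columns and target file."""

    def __init__(self, name: str, columns: List[str], data_dir: Path) -> None:
        self.columns = columns
        self.path = data_dir / f"{name}.csv"

    def dump(self, rows: Iterable[Dict[str, object]]) -> int:
        """Overwrite the table file with the given rows and return the row count."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        count = 0
        with self.path.open("w", newline="", encoding="utf-8") as fp:
            writer = csv.DictWriter(fp, fieldnames=self.columns)
            writer.writeheader()
            for row in rows:
                writer.writerow(row)
                count += 1
        return count


def update_data(data_dir: Path) -> int:
    """Load the complete subset and dump it into data_dir."""
    game_table = CsvTable("game", ["game_id", "name"], data_dir)
    # Placeholder row; real code would load the full source data here
    rows = [{"game_id": 1, "name": "placeholder game"}]
    return game_table.dump(rows)
```

In the actual template, the table objects would come from src/namespace_metadata.py and the function would be triggered through `inv update-data` rather than called directly.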

1.3 Add/Refine Metadata

  • run inv serialize-inscript-metadata to move the structure defined in the script to .yaml files
  • optionally extend the schema with descriptions where necessary

metadata schema is created with the intention for export to

2. Configure DVC with Remotes

  • add all remotes to main branch
  • set default remotes for the different branches:
    • fill in conf/default-remotes.yaml
    • run inv set-dvc-remotes

3. Configure, Create and Push Environments

  • create and document different environments with different access restrictions (in different dvc remotes)
    • save them to conf/created-envs.yaml using the {name: {branch: ..., kwargs: {...}}} schema
    • these can be either anonymized datasets with the same schema or simply smaller samples for personal workstation use
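
The `{name: {branch: ..., kwargs: {...}}}` schema for conf/created-envs.yaml can be sanity-checked with a few lines of Python. This is a sketch under assumptions: the file is taken as already parsed into a dict, `kwargs` is treated as optional, and the environment names, branch names and the `n_rows` keyword are invented for illustration.

```python
from typing import Any, Dict, List


def check_created_envs(envs: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed created-envs mapping."""
    problems = []
    for name, spec in envs.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: expected a mapping")
            continue
        if not isinstance(spec.get("branch"), str):
            problems.append(f"{name}: 'branch' must be a string")
        # Assumption: a missing 'kwargs' entry is treated as an empty mapping
        if not isinstance(spec.get("kwargs", {}), dict):
            problems.append(f"{name}: 'kwargs' must be a mapping")
    return problems


# Hypothetical environments: a small sample and an anonymized variant
example_envs = {
    "sample": {"branch": "sample-branch", "kwargs": {"n_rows": 100}},
    "anonymized": {"branch": "anon-branch", "kwargs": {}},
}
```

An empty result from `check_created_envs(example_envs)` means every environment has a branch name and a well-formed `kwargs` entry.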

To Update a Dataset

  • run the script that updates the complete subset
  • run the commands creating and uploading the other subsets
    • inv write-envs
    • inv push-envs --git-push
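
The update steps above can be chained in a small helper script. This is a minimal sketch that assumes `inv update-data` is the script updating the complete subset (as described in section 1.2) and that `inv` is available on the PATH. The names `UPDATE_COMMANDS` and `run_update` are hypothetical. By default it runs in dry-run mode and only returns the command lines.

```python
import subprocess
from typing import List, Sequence

# The three commands named in this guide, in the order they should run
UPDATE_COMMANDS: List[List[str]] = [
    ["inv", "update-data"],
    ["inv", "write-envs"],
    ["inv", "push-envs", "--git-push"],
]


def run_update(
    commands: Sequence[Sequence[str]] = UPDATE_COMMANDS, dry_run: bool = True
) -> List[str]:
    """Run each command in order and return the command lines that were handled."""
    handled = []
    for cmd in commands:
        if not dry_run:
            # check=True raises CalledProcessError, so later steps are skipped on failure
            subprocess.run(list(cmd), check=True)
        handled.append(" ".join(cmd))
    return handled
```

Calling `run_update(dry_run=False)` from the repository root would execute the commands for real, stopping at the first one that fails.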

WIP: Basic Presentation of Dataset

some script should present the dataset automatically

  • schema diagram
  • maybe some table profiling
  • simple figures
  • as general as possible
  • with specifications regarding privacy / security

About

License: MIT License