tilman151 / composing-datasets

Compose Datasets, Don't Inherit Them

This is the companion repository to this blog post. It illustrates the design pattern "composition over inheritance" using PyTorch datasets.

The post references several versions of this repository. Each version is marked with a git tag:

Tag        Description
v0.1.0     Single class for the hate speech dataset with a fixed tokenizer
v0.2.0     Hate speech dataset with a string argument to choose the tokenizer
v0.3.0     IMDb dataset added through a superclass
v0.3.1-a   Revtok tokenizer configurable through **kwargs
v0.3.1-b   Revtok tokenizer configurable through its own child class
v0.4.0     All tokenizers configurable through composition
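
To make the end point of this progression concrete, here is a minimal sketch of a dataset that is composed with its tokenizer rather than inheriting one. It is not the repository's code: the names TextDataset, CharTokenizer, and whitespace_tokenizer, as well as the sample data, are hypothetical and chosen only for illustration. The sketch also leaves out torch.utils.data.Dataset to stay dependency-free; it only implements __len__ and __getitem__, the interface of a map-style dataset.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type alias: a tokenizer is any callable from text to tokens.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on whitespace; stands in for a real tokenizer library."""
    return text.split()


class CharTokenizer:
    """Configurable tokenizer: its options live here, not in the dataset."""

    def __init__(self, lowercase: bool = False) -> None:
        self.lowercase = lowercase

    def __call__(self, text: str) -> List[str]:
        if self.lowercase:
            text = text.lower()
        return list(text)


class TextDataset:
    """Map-style dataset that receives its tokenizer instead of building it."""

    def __init__(
        self, samples: Sequence[Tuple[str, int]], tokenizer: Tokenizer
    ) -> None:
        self.samples = samples
        self.tokenizer = tokenizer

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Tuple[List[str], int]:
        text, label = self.samples[index]
        return self.tokenizer(text), label


# Made-up samples for illustration only.
samples = [("Good movie", 1), ("Bad plot", 0)]

word_level = TextDataset(samples, whitespace_tokenizer)
char_level = TextDataset(samples, CharTokenizer(lowercase=True))

print(word_level[0])  # (['Good', 'movie'], 1)
print(char_level[1])  # (['b', 'a', 'd', ' ', 'p', 'l', 'o', 't'], 0)
```

In this arrangement, a new tokenizer or a new tokenizer option needs neither a new dataset subclass nor a **kwargs pass-through, because the dataset only depends on the callable it is given.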

Installation

First, clone the repository:

git clone git@github.com:tilman151/composing-datasets.git
# or
git clone https://github.com/tilman151/composing-datasets.git

This project uses Poetry for dependency management. Refer to the Poetry docs for installation instructions. After installing Poetry, install the dependencies with:

poetry install

Poetry creates a clean virtual environment for this project. Activate it with the following command (in Poetry 2.0 and later, it requires the poetry-plugin-shell plugin):

poetry shell

Choosing a Version

Each version is tested and functional. To choose a specific version, look up its tag in the table above and check it out:

git checkout tags/<version_tag>

To verify your installation and the version, run the tests:

python -m unittest -v
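
The repository's own tests are not reproduced here. As a rough illustration of the kind of test case that unittest discovery picks up (by default from files matching test*.py), the sketch below tests a composed dataset by injecting a fake tokenizer. TextDataset and the test names are hypothetical stand-ins, not the repository's classes.

```python
import unittest


class TextDataset:
    """Hypothetical composed dataset; a stand-in, not the repository's class."""

    def __init__(self, samples, tokenizer):
        self.samples = samples
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        text, label = self.samples[index]
        return self.tokenizer(text), label


class TestTextDataset(unittest.TestCase):
    def setUp(self):
        # Record every text passed to the fake tokenizer.
        self.calls = []

        def fake_tokenizer(text):
            self.calls.append(text)
            return ["<tok>"]

        self.dataset = TextDataset([("some text", 1)], fake_tokenizer)

    def test_length_matches_samples(self):
        self.assertEqual(len(self.dataset), 1)

    def test_item_uses_injected_tokenizer(self):
        tokens, label = self.dataset[0]
        self.assertEqual(tokens, ["<tok>"])
        self.assertEqual(label, 1)
        self.assertEqual(self.calls, ["some text"])
```

Because the tokenizer is passed in, the test needs no real tokenizer library and no downloaded data, which is one practical benefit of the composed design.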

About

License: MIT License

