msamogh / opensubtitles-dataloader

Loads the OpenSubtitles v2018 dataset without loading everything into memory at once. Works well with PyTorch.

opensubtitles-dataloader

pip install opensubtitles-dataloader

Download, preprocess and use sentences from the OpenSubtitles v2018 dataset without ever needing to load all of it into memory.

Download

See possible languages here.

opensubtitles-download en

Download the tokenized version.

opensubtitles-download en --token

Use in Python

Load

opensubtites_dataset = OpenSubtitlesDataset('en')

Load only the first 1 million lines.

opensubtites_dataset = OpenSubtitlesDataset('en', first_n_lines=1_000_000)

Group sentences into groups of 5.

opensubtites_dataset = OpenSubtitlesDataset('en', 5)

Group sentences into groups of varying size, from 2 to 5.

opensubtites_dataset = OpenSubtitlesDataset('en', (2,5))
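
To make the two grouping modes concrete, here is a minimal sketch of chunking consecutive sentences into fixed-size or variable-size groups. The helper `group_sentences` and its `seed` parameter are hypothetical and are not part of this package. Whether the library treats the range as inclusive, or how it handles a short final group, is an assumption here.

```python
import random


def group_sentences(sentences, group_size, seed=None):
    """Hypothetical helper: chunk consecutive sentences into groups.

    group_size is either an int (fixed size) or a (low, high) tuple, in
    which case each group's size is drawn uniformly from low..high
    inclusive (an assumption, not confirmed library behaviour).
    """
    rng = random.Random(seed)
    groups = []
    start = 0
    while start < len(sentences):
        if isinstance(group_size, int):
            size = group_size
        else:
            low, high = group_size
            size = rng.randint(low, high)
        # The final group may be shorter if the sentences run out.
        groups.append(sentences[start:start + size])
        start += size
    return groups
```

Calling `group_sentences(lines, 5)` mirrors the fixed-size case above, and `group_sentences(lines, (2, 5))` mirrors the ranged case.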

Split sentences using "\n".

opensubtites_dataset = OpenSubtitlesDataset('en', delimiter="\n")

Apply a custom preprocessing function.

opensubtites_dataset = OpenSubtitlesDataset('en', preprocess_function=my_preprocessing_function)
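
`my_preprocessing_function` is not defined in this README. As a sketch only, it could look like the function below, assuming the dataset calls it with one sentence string and expects a string back. That calling convention is an assumption and should be checked against the package source.

```python
import re


def my_preprocessing_function(sentence):
    """Example cleanup, assuming a str -> str contract (not confirmed)."""
    # Trim surrounding whitespace and normalise case.
    sentence = sentence.strip().lower()
    # Subtitles often mark a change of speaker with a leading dash.
    sentence = re.sub(r"^-\s*", "", sentence)
    # Collapse runs of whitespace into single spaces.
    sentence = re.sub(r"\s+", " ", sentence)
    return sentence
```

With this definition, `my_preprocessing_function("  - Hello,   World!  ")` returns `"hello, world!"`.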

Split for Training

train, valid, test = opensubtites_dataset.split()

Set the fraction of the original dataset assigned to each split.

train, valid, test = opensubtites_dataset.split([0.7, 0.15, 0.15])

Use a seed.

train, valid, test = opensubtites_dataset.split(seed=42)
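
For intuition about what fractions and a seed control, here is a generic sketch of a reproducible three-way split over item indices. The helper `split_indices`, its default fractions, and the choice to shuffle before slicing are all assumptions for illustration. This is not the package's implementation.

```python
import random


def split_indices(n_items, fractions=(0.7, 0.15, 0.15), seed=None):
    """Hypothetical helper: shuffle indices, then cut them into 3 parts."""
    indices = list(range(n_items))
    # A dedicated Random instance makes the shuffle repeatable per seed
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    train_end = round(fractions[0] * n_items)
    valid_end = train_end + round(fractions[1] * n_items)
    # Whatever remains after train and valid goes to test.
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]
```

Passing the same seed twice yields the same three index lists, which is the property a seeded split is meant to give.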

Access

By index.

train, valid, test = OpenSubtitlesDataset('en').splits()
train[20_000]
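
Indexing without holding the whole corpus in memory can be done in several ways. One common approach, sketched below, is to record the byte offset of every line once and then seek to the requested line on demand. `LazyLineFile` and `demo` are hypothetical names. This illustrates the general technique and makes no claim about how this package is implemented.

```python
import os
import tempfile


class LazyLineFile:
    """Hypothetical helper: random access to lines via byte offsets."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        # One pass over the file records where each line starts; only
        # the integer offsets stay in memory, not the text itself.
        with open(path, "rb") as f:
            position = 0
            for line in f:
                self.offsets.append(position)
                position += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Seek straight to the requested line and read only that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            return f.readline().decode("utf-8").rstrip("\r\n")


def demo():
    # Build a tiny stand-in corpus so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("Hello there.\nHow are you?\nFine, thanks.\n")
        lines = LazyLineFile(path)
        return len(lines), lines[1]
```

Because the class defines `__len__` and `__getitem__`, an object like this follows the same map-style protocol that `torch.utils.data.DataLoader` expects from a dataset.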

With PyTorch.

from torch.utils.data import DataLoader
train, valid, test = OpenSubtitlesDataset('en').splits()
train_loader = DataLoader(train, batch_size=16)
