deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Home Page: https://farm.deepset.ai

Devset split does not work with max_processes=1

johann-petrak opened this issue · comments

Tried this with FARM version 1.8.1:

When the data silo is created with max_processes=1, a text classification processor with dev_split=0.1 or 0.4 fails to create a dev split and raises the following exception:

```
File "/home/johann/.local/lib/python3.7/site-packages/farm/data_handler/data_silo.py", line 420, in random_split_ConcatDataset
    assert idx_dataset >= 1, "Dev_split ratio is too large, there is no data in train set. " \
AssertionError: Dev_split ratio is too large, there is no data in train set. Please lower dev_split = 0.4
```

The problem seems to be that, when the data is read, the instances are grouped into chunks, and the train/dev split can only be made at chunk boundaries. With this setting, ALL instances end up in a single chunk/dataset, so no boundary exists at which to split.
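The failing logic can be sketched as follows (a simplified re-creation of the chunk-boundary split, not FARM's exact code):

```python
from itertools import accumulate

def split_at_chunk_boundary(chunk_sizes, dev_split):
    """Simplified sketch of a dev split that can only cut between whole
    chunks of a ConcatDataset, never inside one (hypothetical helper)."""
    total = sum(chunk_sizes)
    train_len = int(total * (1 - dev_split))
    cumulative = list(accumulate(chunk_sizes))
    # Index of the first chunk whose cumulative size exceeds the train
    # length; everything before it becomes train, the rest becomes dev.
    idx_dataset = next(i for i, c in enumerate(cumulative) if c > train_len)
    assert idx_dataset >= 1, (
        "Dev_split ratio is too large, there is no data in train set."
    )
    return chunk_sizes[:idx_dataset], chunk_sizes[idx_dataset:]

# Three chunks of 100 instances: the split succeeds, but only at a chunk
# boundary (so dev_split=0.1 actually yields a 1/3 dev set).
# A single chunk of 300 instances: idx_dataset is 0 and the assertion
# fires, which is exactly the reported error.
```

This also shows why the splits are only "approximate" even when they succeed: the cut is rounded to the nearest chunk boundary, not to the requested ratio.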

To recreate this bug:

This is related to my work on issue #819: the underlying problem is that dev set splits are made on the basis of the chunk datasets in the ConcatDataset instead of on individual instances, which has a number of implications.

Thanks for bringing this up again. This behaviour is known and documented in #583, but since the issue did not receive any traction, we let it go stale.

The problem is that PyTorch's split function did not operate on ConcatDatasets. ConcatDatasets are used when we chunk the data for each process and combine the chunks again afterwards.
When we do not use multiprocessing, we do not chunk, therefore have no ConcatDatasets, and could use torch.split.
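For contrast, an instance-level split (what an index-level random split does on a flat dataset) can cut anywhere. A plain-Python stand-in, not the torch API itself:

```python
import random

def random_split_instances(n_samples, dev_split, seed=42):
    """Instance-level split in the spirit of torch.utils.data.random_split
    (a simplified stand-in): shuffle all sample indices and cut at any
    position, with no chunk boundaries involved."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_dev = int(n_samples * dev_split)
    return indices[n_dev:], indices[:n_dev]  # train indices, dev indices

# Even when all 300 instances sit in one "chunk", the split succeeds
# and the dev set is exactly the requested 10%.
train, dev = random_split_instances(300, 0.1)
```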

Then splitting would be different for multiprocessing vs single processing.

The cleanest solution would be to split properly on a ConcatDataset - maybe torch 1.9 supports this now (it did not when we looked into #538 in October 2020)?

Sorry I missed #538

I just created this because I wanted to know if people are aware and what their opinion is on this.

I have in the meantime found a solution that would fix this, because the same problem arises when one tries to do dev set stratification (or really any dev set split) properly.
My implementation for #819 will therefore also fix this: it creates better-than-"approximate" dev set splits when stratification is enabled, and I could perform proper dev set splitting when stratification is not enabled as well.
For backwards compatibility, I could also keep the old behaviour (together with this problem) when dev set stratification is not enabled.

My solution basically implements a new ConcatDataset that can replace the PyTorch ConcatDataset; the new implementation allows arbitrary slicing and index lists to retrieve the corresponding tuples of tensors.
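The idea can be sketched roughly like this (hypothetical class and method names, not the actual code that landed in #825):

```python
from bisect import bisect_right
from itertools import accumulate

class SliceableConcatDataset:
    """Concatenates datasets and resolves any global index to the right
    underlying dataset, so splits can use arbitrary index lists (e.g. for
    a stratified dev split) instead of whole-chunk boundaries."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        self.cumulative_sizes = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)
        # Locate the dataset containing the global index, then map the
        # global index to a local offset inside that dataset.
        ds_idx = bisect_right(self.cumulative_sizes, idx)
        offset = self.cumulative_sizes[ds_idx - 1] if ds_idx else 0
        return self.datasets[ds_idx][idx - offset]

    def select(self, indices):
        """Gather the samples for an arbitrary list of global indices."""
        return [self[i] for i in indices]
```

With this, a dev split is just two disjoint index lists over the concatenated data, regardless of how the instances were chunked during multiprocessing.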

fixed by #825