rasbt / deeplearning-models

A collection of various deep learning architectures, models, and tips

TSV to HDF5 converter on a very large dataset

ciodar opened this issue · comments

Hi,
I'm trying to convert several TSV files from the C4 200M dataset into HDF5 format, and I based my conversion on your notebook.

The dataset is composed of 10 files, each containing approximately 18 million records with 2 string columns.
Given the size of the dataset, I expected the conversion to HDF5 to pay off: it would let me know the shape of each file up front and should give a significant performance boost when reading chunks of the dataset.
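
For context, the access pattern I am hoping to get after the conversion is roughly the following, where a slice can be read from disk without scanning the whole file (the file name is just an example; 'input' and 'labels' are the dataset names created by the conversion function below):

import h5py

# Example of the chunked reads I am after. The file name is illustrative;
# 'input' and 'labels' are the dataset names created by the converter below.
with h5py.File('c4_200m_part_00.hf5', 'r') as h5f:
    n_records = h5f['input'].shape[0]        # shape is known without scanning the file
    inputs = h5f['input'].asstr()[0:10_000]  # h5py >= 3.0: decode vlen strings to Python str
    labels = h5f['labels'].asstr()[0:10_000]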

In a first trial I converted 1 million records in about 3 minutes; however, when I tried to convert all 18 million records, it took more than 6 hours per file.
I am currently loading my TSV in the following way:

import pathlib as pl

import h5py
import pandas as pd
from tqdm import tqdm


def csv_to_hf5(csv_path, num_lines=1000000, chunksize=100000, columns=None):
    if columns is None:
        columns = ['input', 'labels']
    csv_path = pl.Path(csv_path)

    hdf_filename = csv_path.parent / csv_path.name.replace('.tsv', '.hf5')

    # suppose this is a large CSV that does not
    # fit into memory:

    # Get the number of lines in the CSV file if it's on your hard drive:
    # num_lines = subprocess.check_output(['wc', '-l', str(csv_path)])
    # num_lines = int(num_lines.split()[0])
    # use 10,000 or 100,000 or so for large files

    # variable-length string dtype for the two text columns
    dt = h5py.special_dtype(vlen=str)

    # this is your HDF5 database:
    with h5py.File(hdf_filename, 'w') as h5f:

        # use num_lines - 1 if the csv file has a column header
        dset1 = h5f.create_dataset('input',
                                   shape=(num_lines,),
                                   compression=9,  # gzip, level 9
                                   dtype=dt
                                   )
        dset2 = h5f.create_dataset('labels',
                                   shape=(num_lines,),
                                   compression=9,  # gzip, level 9
                                   dtype=dt
                                   )

        # change the range start from 0 to 1 if your csv file contains a column header
        for i in tqdm(range(0, num_lines, chunksize)):
            df = pd.read_csv(csv_path,
                             sep='\t',
                             names=columns,
                             header=None,      # no header; column names defined above
                             nrows=chunksize,  # number of rows to read at each iteration
                             skiprows=i,
                             )  # skip rows that were already read

            features = df.input.values.astype(str)
            labels = df.labels.values.astype(str)

            # use i-1 and i-1+chunksize if the csv file has a column header
            dset1[i:i + chunksize] = features
            dset2[i:i + chunksize] = labels

where I set num_lines equal to the total number of lines in each file and chunksize = 10000.
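
For comparison, I wonder whether the repeated skiprows forces pandas to re-parse the file from the beginning on every iteration, which would make the whole conversion quadratic in the number of rows. A single-pass variant built on read_csv's own chunksize iterator might look like this (untested sketch; same column and dataset names as above, file paths illustrative):

import h5py
import pandas as pd


def csv_to_hf5_single_pass(csv_path, hdf_path, num_lines, chunksize=100000,
                           columns=('input', 'labels')):
    # Untested sketch: stream the TSV once with pandas' chunked reader
    # instead of re-reading it with skiprows on every iteration.
    dt = h5py.special_dtype(vlen=str)
    with h5py.File(hdf_path, 'w') as h5f:
        dset1 = h5f.create_dataset('input', shape=(num_lines,), compression=9, dtype=dt)
        dset2 = h5f.create_dataset('labels', shape=(num_lines,), compression=9, dtype=dt)

        offset = 0
        # read_csv with chunksize returns an iterator that makes one pass over the file
        for df in pd.read_csv(csv_path, sep='\t', names=list(columns),
                              header=None, chunksize=chunksize):
            n = len(df)  # last chunk may be smaller than chunksize
            dset1[offset:offset + n] = df[columns[0]].values.astype(str)
            dset2[offset:offset + n] = df[columns[1]].values.astype(str)
            offset += n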

I did not expect this performance degradation. Have you ever tried to use your code to convert a dataset of a similar size?

Thanks in advance.