OmarMohammed88 / AR-Emotion-Recognition

An implementation of the paper titled "Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset" https://journals.scholarpublishing.org/index.php/TMLAI/article/view/11039


Error when running this code: train_dataset = train_dataset.map(....)

Hind-Saleh-Alatawi opened this issue · comments

WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-fb6b689d72250a90/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-210f34888677e925.arrow
     
#0: 0%
0/1 [00:05<?, ?ba/s]
#2: 0%
0/1 [00:05<?, ?ba/s]
#3: 0%
0/1 [00:05<?, ?ba/s]
/usr/local/lib/python3.8/dist-packages/transformers/feature_extraction_utils.py:165: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  tensor = as_tensor(value)
/usr/local/lib/python3.8/dist-packages/transformers/feature_extraction_utils.py:165: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  tensor = as_tensor(value)
/usr/local/lib/python3.8/dist-packages/transformers/feature_extraction_utils.py:165: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  tensor = as_tensor(value)
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 552, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 519, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3259, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 551, in write_batch
    arrays.append(pa.array(typed_sequence))
  File "pyarrow/array.pxi", line 231, in pyarrow.lib.array
    return _handle_arrow_array_protocol(obj, type, mask, size)
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
    res = obj.__arrow_array__(type=type)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 186, in __arrow_array__
    out = list_of_np_array_to_pyarrow_listarray(data)
  File "/usr/local/lib/python3.8/dist-packages/datasets/features/features.py", line 1396, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
  File "/usr/local/lib/python3.8/dist-packages/datasets/features/features.py", line 1389, in list_of_pa_arrays_to_pyarrow_listarray
    values = pa.concat_arrays(l_arr)
  File "pyarrow/array.pxi", line 2889, in pyarrow.lib.concat_arrays
    c_concatenated = GetResultValue(Concatenate(c_arrays, pool))
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: arrays to be concatenated must be identically typed, but float and list<item: float> were encountered.
"""

The above exception was the direct cause of the following exception:

ArrowInvalid                              Traceback (most recent call last)
<ipython-input-45-9ae70b765695> in <module>
----> 1 train_dataset = train_dataset.map(
      2     preprocess_function,
      3     batch_size=100,
      4     batched=True,
      5     num_proc=4

15 frames
/usr/local/lib/python3.8/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
     98 
     99         if status.IsInvalid():
--> 100             raise ArrowInvalid(message)
    101         elif status.IsIOError():
    102             # Note: OSError constructor is

ArrowInvalid: arrays to be concatenated must be identically typed, but float and list<item: float> were encountered.
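
For context: this ArrowInvalid usually means the mapped function returned values with inconsistent nesting across examples, some rows ending up as a bare float while others are a list of floats, so pyarrow cannot build one typed column. A common cause is audio that loads to inconsistent shapes (stereo vs. mono, or an empty/corrupted file). Below is a minimal sketch of a preprocessing function that forces every example into a flat 1-D float array before feature extraction; the loader, the "facebook/wav2vec2-base" checkpoint, and the "name" column are assumptions for illustration, not the repo's exact code.

import numpy as np
import torchaudio
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def to_flat_float_array(path, target_sr=16_000):
    # Load, downmix to mono, resample, and return a flat 1-D float32 array.
    # Mixing scalars, 1-D, and 2-D values in one batch is exactly what
    # produces "float and list<item: float>" on the Arrow side.
    speech, sr = torchaudio.load(path)
    speech = speech.mean(dim=0)  # stereo -> mono
    if sr != target_sr:
        speech = torchaudio.functional.resample(speech, sr, target_sr)
    return np.atleast_1d(speech.numpy().astype(np.float32)).ravel()

def preprocess_function(examples):
    # Every example gets a uniformly shaped array, so Arrow sees one type.
    speech_list = [to_flat_float_array(p) for p in examples["name"]]
    return feature_extractor(speech_list, sampling_rate=16_000)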

In which notebook does that error appear? And would you share the code responsible for loading the dataset?

> In which notebook does that error appear? And would you share the code responsible for loading the dataset?
The HuBERT notebook, and sometimes the Wav2vec2 one. I followed your code exactly, but on my own dataset, which consists of voice recordings. How do I share the CSV file? We have a small dataset of 75 recordings.

WARNING:datasets.builder:Using custom data configuration default-93fa3b7b9717628b
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-93fa3b7b9717628b/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...
Downloading data files: 100%
2/2 [00:00<00:00, 109.10it/s]
Extracting data files: 100%
2/2 [00:00<00:00, 86.94it/s]
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-93fa3b7b9717628b/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.
100%
2/2 [00:00<00:00, 74.85it/s]
Dataset({
    features: ['name', 'emotion'],
    num_rows: 60
})
Dataset({
    features: ['name', 'emotion'],
    num_rows: 16
})

You have to build your CSV file based on your dataset folder, like the following:

| path_file_name | emotion |
| --- | --- |
| '/content/wav_1.wav' | sad |
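
If it helps, here is a small sketch for generating such a CSV from a folder of wav files. The folder path and the position of the label inside the filename are assumptions for illustration; the column names match the Dataset printout above ('name', 'emotion').

import csv
from pathlib import Path

# Hypothetical layout: wav files under /content/CS, with the emotion label
# as the last dash-separated field of the filename (e.g. A-F-28-0.wav -> "0").
rows = []
for wav in sorted(Path("/content/CS").glob("*.wav")):
    label = wav.stem.split("-")[-1]
    rows.append({"name": str(wav), "emotion": label})

with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "emotion"])
    writer.writeheader()
    writer.writerows(rows)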

Then make sure your audio files are not corrupted.

Here is what I get when I print the train dataset:

print(train_dataset[0])
{'name': '/content/CS/A-F-28-0.wav', 'emotion': 0}

I wonder how the CSV file includes the voices? It only includes the file names. What do you mean by a voice being corrupted, and how do I check that, please?

The CSV file contains the file paths of the recordings, not the audio itself. You can check a recording by loading and playing it with the following code:

import torchaudio
import librosa
import IPython.display as ipd

speech, sr = torchaudio.load(path)                 # waveform and its sample rate
speech = speech[0].numpy().squeeze()               # first channel as a 1-D array
speech = librosa.resample(speech, orig_sr=sr, target_sr=16_000)  # resample to 16 kHz
ipd.Audio(data=speech, autoplay=True, rate=16_000) # play it back in the notebook
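
To check every file at once instead of one by one, here is a quick sketch; it assumes your CSV is named train.csv and uses the 'name' column shown in the printout above.

import pandas as pd
import torchaudio

df = pd.read_csv("train.csv")  # assumed file name
for path in df["name"]:
    try:
        speech, sr = torchaudio.load(path)  # a corrupted file raises here
        if speech.numel() == 0:
            print(f"empty audio: {path}")
    except Exception as e:
        print(f"failed to load {path}: {e}")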

If you are still getting errors, would you share your notebook so I can run it on my side?