
Error in Preprocessing the data #5

Open
AchintyaX opened this issue Aug 29, 2021 · 6 comments
@AchintyaX

I was using the Colab notebook to train a Wav2Vec2 classification model. In the preprocessing step, when I run the following code:

train_dataset = train_dataset.map(
    preprocess_function,
    batch_size=10,
    batched=True,
)
eval_dataset = eval_dataset.map(
    preprocess_function,
    batch_size=10,
    batched=True,
)

I get the following error. I guess it has something to do with Hugging Face datasets:

0%|          | 0/1765 [00:00<?, ?ba/s]
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_29222/3011913806.py in <module>
      2 
      3 
----> 4 train_dataset = train_dataset.map(
      5     preprocess_function,
      6     batch_size=10,

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   1667 
   1668         if num_proc is None or num_proc == 1:
-> 1669             return self._map_single(
   1670                 function=function,
   1671                 with_indices=with_indices,

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    183         }
    184         # apply actual function
--> 185         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    186         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    187         # re-apply format to the output

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    395             # Call actual function
    396 
--> 397             out = func(self, *args, **kwargs)
    398 
    399             # Update fingerprint of in-place transforms + update in-place history of transforms

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc)
   2036                             else:
   2037                                 batch = cast_to_python_objects(batch)
-> 2038                                 writer.write_batch(batch)
   2039                 if update_data and writer is not None:
   2040                     writer.finalize()  # close_stream=bool(buf_writer is None))  # We only close if we are writing in a file

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
    401             typed_sequence = OptimizedTypedSequence(batch_examples[col], type=col_type, try_type=col_try_type, col=col)
    402             typed_sequence_examples[col] = typed_sequence
--> 403         pa_table = pa.Table.from_pydict(typed_sequence_examples)
    404         self.write_table(pa_table, writer_batch_size)
    405 

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_writer.py in __arrow_array__(self, type)
    105                 out = numpy_to_pyarrow_listarray(self.data)
    106             else:
--> 107                 out = pa.array(self.data, type=type)
    108             if trying_type and out[0].as_py() != self.data[0]:
    109                 raise TypeError(

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Can only convert 1-dimensional array values
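For context, this ArrowInvalid is pyarrow refusing to build a column out of multi-dimensional values. A minimal sketch that reproduces the same error outside the notebook (the shapes here are illustrative):

import numpy as np
import pyarrow as pa

# A sequence of 2-D arrays (e.g. stereo audio with shape (2, n_samples))
# cannot be converted into an Arrow column of 1-D values:
pa.array([np.zeros((2, 10))])
# ArrowInvalid: Can only convert 1-dimensional array values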
@AchintyaX (Author)

@m3hrdadfi I feel it might have something to do with the version of Hugging Face datasets used in this implementation. Can you share the version? Installing directly from git might result in some conflicts.

@hamzakhalem

I have the same problem here... if you've got the solution, please share it with me.

@Alymostafa

@AchintyaX
@hamzakhalem
The problem is torchaudio.load: it returns a 2-D array, e.g. of shape (2, 82585), when it should be 1-D, of shape (82585,). So the solution is to change this code block:

def speech_file_to_array_fn(path):
    speech_array, sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(sampling_rate, target_sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech

To that:

import librosa

def speech_file_to_array_fn(path):
    speech_array, sampling_rate = torchaudio.load(path)
    # Keep only the first channel so the array is 1-D.
    speech_array = speech_array[0].numpy().squeeze()
    # Resample with librosa instead of torchaudio.
    speech = librosa.resample(speech_array, orig_sr=sampling_rate, target_sr=target_sampling_rate)
    return speech

It works for me.
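As a quick sanity check (a sketch; "audio.wav" is a hypothetical file standing in for one of the training samples), you can confirm the shape mismatch directly:

import torchaudio

speech_array, sampling_rate = torchaudio.load("audio.wav")
print(speech_array.shape)     # e.g. torch.Size([2, 82585]) -- one row per channel
print(speech_array[0].shape)  # torch.Size([82585]) -- the 1-D array the writer expects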

@eaedkbaamtu

I have the same bug, but the first and the second arrays are different. Are you sure that selecting [0] is not splitting the sample?

@Alymostafa commented Jan 11, 2022

Yes, it's not splitting. To be sure, try librosa.load instead of torchaudio.load; it will output a single array that is equal to [0].
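A sketch of that comparison (again with a hypothetical "audio.wav"; sr=None keeps the file's native rate, since librosa.load resamples to 22050 Hz by default). Note that librosa.load returns mono by default by averaging the channels, which can explain small differences from channel [0] on true stereo files:

import librosa
import torchaudio

speech_array, sampling_rate = torchaudio.load("audio.wav")  # shape (n_channels, n_samples)
mono, sr = librosa.load("audio.wav", sr=None)               # 1-D mono array

print(speech_array[0].shape, mono.shape)  # both 1-D, same length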

@eaedkbaamtu

OK, but when I checked, both arrays were different; that's why I asked.
