Hi...
As recommended on GitHub, the best chunk size is 10 to 30 seconds. However, the LibriSpeech dataset is split into utterances of various lengths, starting from around 2 seconds.
My questions are: what is the optimal chunk size? And is it okay to pre-train on audio of varying lengths and fine-tune on chunks of a fixed length, or the opposite (fixed for pre-training and variable for fine-tuning)?
Furthermore, when splitting the audio into chunks (e.g., at a fixed length of 3 s), some spoken words might be cut off at the boundaries. What would be a better approach for splitting the audio, given that relying on silences results in larger chunks?
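For context, this is roughly the kind of silence-based splitting I have in mind (a minimal sketch using librosa; the `top_db` threshold, the 30 s cap, and the file names are just placeholder values, not recommendations):

```python
# Minimal sketch: split an audio file on silences, then cap each non-silent
# span at a maximum duration so chunks stay within the recommended range.
import librosa
import soundfile as sf

def split_on_silence_with_cap(path, max_len_s=30.0, top_db=30):
    y, sr = librosa.load(path, sr=None)                    # keep the original sample rate
    intervals = librosa.effects.split(y, top_db=top_db)    # non-silent (start, end) sample indices
    max_len = int(max_len_s * sr)
    chunks = []
    for start, end in intervals:
        # If a non-silent span exceeds the cap, cut it into fixed-size pieces;
        # this is exactly where words can still be clipped at the boundaries.
        for s in range(start, end, max_len):
            chunks.append(y[s:min(s + max_len, end)])
    return chunks, sr

chunks, sr = split_on_silence_with_cap("sample.flac")
for i, c in enumerate(chunks):
    sf.write(f"chunk_{i:04d}.flac", c, sr)
```

The trade-off I mean is visible here: splitting only on silences can give chunks much longer than 30 s, while the fixed-size cap can cut through speech.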
Thanks in advance.