First of all, huge thanks for publishing this codebase with simple, easy-to-use scripts; this is what moves progress forward.
The ability of neural networks to generate data is something I have always found very exciting, but nowadays pretty much all the research is focused on image generation.
One of the key moments in raw audio modelling of music was Jukebox by OpenAI, but that was two years ago: it requires tons of GPU power to train, and due to its autoregressive nature it takes hours on a beefy GPU (A100-class) to sample a few seconds of audio, as you pointed out in the paper.
Musika significantly reduces the requirements for both training and sampling, which makes raw audio modelling much more accessible without a huge loss of audio quality compared to Jukebox (congratulations on this achievement!).
I used a GTX 1060 6GB; that amount of VRAM was enough to run training with the default settings.
The "datasets" I tried Musika on were quite small, so I just used the provided models for finetuning.
I tried to finetune Musika on Minecraft's soundtrack and on Tangerine Dream's pieces (primarily 1970-1990 live concerts and studio records).
The Minecraft OST run quickly started to overfit, producing very distorted copies of the original dataset (albums Volume Alpha, 2011, and Volume Beta, 2013; about 3 hours of audio, or 279 encoded samples). I used the techno model because I thought it was closest to my needs. 96k iterations.
Here are some examples:
1. Overfit of the Aria Math track
2. Overfit of the Droopy Likes Ricochet track
Altogether I consider this experiment a failure; the dataset was obviously too small.
Training on Tangerine Dream's tracks and albums was a lot more successful; it was also a finetune of the techno model.
It was my first experiment with Musika, so I quickly gathered a tiny dataset, and the same thing happened as with the Minecraft soundtrack: overfitting to some tracks and so on.
Then I gathered some more data. I'm not sure exactly how many hours of audio it is, but since there are 2344 encoded samples, my estimate is roughly 27-28 hours (just checked through AIMP; this is correct). This turned out a lot better. I used your recommended learning rate for finetuning. You can listen to the results here.
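For anyone curious, the dataset size can be back-of-the-envelope checked from the encoded-sample count alone. This is just a sketch, assuming each encoded sample covers a fixed-length window; the seconds-per-sample constant below is back-calculated from the Minecraft numbers in this post (279 samples ≈ 3 hours), not an official Musika figure.

```python
# Rough sanity check: hours of audio implied by the encoded-sample count.
# ASSUMPTION: each encoded sample covers a fixed-length audio window.
# The constant is derived from this post's own numbers (279 samples ~= 3 h),
# so it is approximate, not a documented Musika parameter.
SECONDS_PER_SAMPLE = 3 * 3600 / 279  # ~= 38.7 s per encoded sample

def estimated_hours(num_encoded_samples: int) -> float:
    """Estimate total dataset length in hours from the encoded-sample count."""
    return num_encoded_samples * SECONDS_PER_SAMPLE / 3600

print(round(estimated_hours(2344), 1))  # ~= 25.2, in the same ballpark as 27-28 h
```

The slight gap to the 27-28 hours reported by AIMP is expected, since the per-sample length here is itself an estimate.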
I ran into training instability issues during this run after 112k iterations, even though it was a finetune.
Maybe this was caused by the fact that I wanted to cool the GPU down a bit without hurting performance, so I power-limited it and raised the core clock by +200 (using MSI Afterburner, by the way). Was the instability in the model, or did my own actions cause the issue? Who knows; it could be anything. But it saved a NaN'd model right after I went to bed, so the entire night it was just training the NaN'd model.
I was partly inspired to gather the dataset and try training on Tangerine Dream albums because I had seen, some time ago, that someone trained SampleRNN on it. Take a look.
While Musika, because of its autoencoder, cannot reach the same crisp audio quality as SampleRNN, it handles coherency and long-term dependencies a lot better.
I'm hyped about the upcoming ability to train the autoencoder on our own datasets. Would it be possible to train another, bigger, better autoencoder without needing to re-train the generator? That would be awesome.
I have a question about batch size: I tried the default value of 32, and then 4, but the VRAM usage in both cases was around 5.7 GB.
Changing it definitely had an effect, because iterations per second changed, but why didn't memory consumption change? I even tried building a custom model from scratch with base_channels = 32, but VRAM usage was still about 5.7 GB.
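One likely explanation, assuming Musika uses TensorFlow's default GPU setup: TensorFlow reserves almost all free VRAM at process start, regardless of the batch size, so tools like nvidia-smi show roughly the same usage no matter what you configure. Enabling "memory growth" makes the allocation reflect actual need. A minimal sketch (not a Musika-specific setting; this is the standard TensorFlow API):

```python
import tensorflow as tf

# By default, TensorFlow pre-allocates nearly all free GPU memory at startup,
# so reported VRAM usage looks the same for batch size 4 and 32.
# Memory growth makes TF allocate incrementally, only as tensors are created.
# This must run before any GPU op is executed.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```

With this enabled, the reported usage should actually shrink for smaller batch sizes or smaller base_channels.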
As far as I know, Musika does not have a concept of time relative to the beginning/end of a track. I think Jukebox had a trick to give the model that sense of time, so at the beginning of a track it would start calmly, as if it still had room to play out, and at the end it would, for example, add fades and applause from the audience. Technically, I understand this as the token generator being conditioned during training on information about how soon the sequence should stop. You're planning to add the ability to condition the generator on tempo information; it would also be super nice to have relative time as a conditioning option, if Musika's architecture can support that.
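To make the idea concrete, here is a hypothetical sketch of such a conditioning signal (this is not the Musika API, and the function name and feature choice are my own illustration): each training window gets tagged with its normalized position inside the full track, similar in spirit to Jukebox's timing conditioning, so the generator could in principle learn intro-like versus outro-like behaviour.

```python
import numpy as np

# HYPOTHETICAL sketch, not Musika code: a "relative time" conditioning vector.
# window_start, window_len, track_len are all in seconds.
def relative_time_features(window_start: float, window_len: float,
                           track_len: float) -> np.ndarray:
    pos = window_start / track_len                          # 0.0 at track start
    frac_remaining = 1.0 - (window_start + window_len) / track_len  # 0.0 at end
    return np.array([pos, frac_remaining], dtype=np.float32)

print(relative_time_features(0.0, 10.0, 200.0))    # near the intro: [0.0, 0.95]
print(relative_time_features(185.0, 10.0, 200.0))  # near the outro: [0.925, 0.025]
```

This vector could then be concatenated to whatever conditioning input the generator already receives, the same way tempo conditioning presumably would be.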
Regarding NaN errors, a feature that automatically suspends training, or even restarts it from the last successful model, would be super useful.
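Something like the guard below is what I have in mind. This is a hypothetical sketch, not part of Musika; the checkpoint paths and function name are made up for illustration. The point is simply to detect a NaN/Inf loss before the checkpoint gets overwritten, so the model saved overnight is the last good one rather than the NaN'd one.

```python
import math
import os
import shutil

# HYPOTHETICAL training-loop guard, not Musika code. Paths are illustrative.
def checkpoint_guard(loss: float,
                     ckpt_dir: str = "checkpoints/latest",
                     backup_dir: str = "checkpoints/last_good") -> bool:
    """Return True if training may continue; False if a NaN/Inf loss was caught.

    On a bad loss, roll the live checkpoint back to the last known-good copy
    instead of letting the loop keep saving a broken model.
    """
    if math.isnan(loss) or math.isinf(loss):
        if os.path.isdir(backup_dir):
            shutil.copytree(backup_dir, ckpt_dir, dirs_exist_ok=True)
        return False  # caller should stop or re-initialize the optimizer state
    return True
```

In the training loop one would call this every N iterations, copying the live checkpoint into `last_good` whenever the loss is healthy.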
I'd like to collect some more datasets and train Musika on them in the future; I think something like Aphex Twin, among others, could end up being pretty cool.