My early experiments and experience of using Musika! #18

Open
DEBIHOOD opened this issue Oct 25, 2022 · 0 comments

First of all, huge thanks for publishing this codebase with simple and easy-to-use scripts; this is what moves progress forward.

The ability of neural networks to generate data is something I have always found very exciting, but nowadays pretty much all of the research is focused on image generation.
One of the key milestones in raw audio modelling of music was Jukebox by OpenAI, but that was 2 years ago; it requires tons of GPU power to train, and due to its autoregressive nature it takes hours on a beefy GPU (A100-class) to sample a few seconds of audio, as you pointed out in the paper.
Musika significantly reduced the requirements for both training and sampling, which made raw audio modelling more accessible, without a huge loss of audio quality compared to Jukebox (congratulations on this achievement!).
I used a GTX 1060 6GB; the amount of VRAM was enough to run training with the default settings.
The "datasets" I tried Musika on were quite small, so I just used the provided models for finetuning.
I tried to finetune Musika on Minecraft's soundtrack and on Tangerine Dream's pieces (primarily 1970-1990 live concerts and studio recordings).

The Minecraft OST quickly started to overfit, producing very distorted copies of the original dataset (albums Volume Alpha, 2011 and Volume Beta, 2013; about 3 hours of audio, or 279 encoded samples). I used the techno model because I thought it was closest to my needs. 96k iterations.
Here are some examples:
1. Overfit of Aria Math Track
2. Overfit of Droopy likes ricochet Track
Altogether I consider this experiment a failure; obviously the dataset was too small.

Training on Tangerine Dream's tracks and albums was a lot more successful; it was also a finetune of the techno model.
It was my first experiment with Musika, so I quickly gathered a tiny dataset, and as a result the same thing happened as with the Minecraft soundtrack: overfitting to some tracks and all of that.
Then I gathered some more data. I wasn't sure how many hours of audio it was, but since there are 2344 encoded samples, my calculations estimated roughly ~27-28 hours of audio (yep, just checked through AIMP, this is correct; a rough back-of-the-envelope check is sketched after this paragraph). This turned out a lot better. I used your recommended learning rate for finetuning.
You can listen to the results here.
I ran into training instability issues during this run after 112k iterations, even though it was a finetune.
Maybe this was caused by the fact that I wanted to cool the GPU down a bit without hurting performance, so I power-limited it and raised the core clock by +200 (I was using MSI Afterburner, by the way). Was it instability in the model, or did my actions cause the issue? Who knows, it could be anything, but it saved a NaN model right after I went to bed, so the entire night it was just training the NaN'ed model.
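For what it's worth, here is the rough estimate I mean, using only the numbers from this post; the seconds-per-encoded-sample value is derived from the Minecraft OST figures above, not an official Musika constant:

```python
# Rough estimate of dataset duration from the number of encoded samples.
# Assumption: seconds per encoded sample is derived from the Minecraft OST
# numbers above (279 samples ~ 3 hours); this is not an official Musika value.
seconds_per_sample = 3 * 3600 / 279            # ~38.7 s per encoded sample
tangerine_samples = 2344
estimated_hours = tangerine_samples * seconds_per_sample / 3600
print(f"~{estimated_hours:.1f} hours")         # ~25 hours, same ballpark as 27-28
```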

I was somewhat inspired to gather the dataset and try training on the Tangerine Dream albums because some time ago I saw that someone had trained SampleRNN on it. Take a look.
While Musika, because of its autoencoder, is not able to reach the same crisp audio quality as SampleRNN, it handles coherency and long-term dependencies a lot better.

I'm hyped to soon get the ability to train the autoencoder on our own datasets. Would it be possible to train another, bigger, better autoencoder without the need to re-train the generator? That would be awesome.

I have a question about batch size: I tried the default value of 32 and then 4, but the amount of used VRAM in both cases was around 5.7 GB.
It definitely was doing its thing, because iterations per second changed, but why wasn't memory consumption changing? I even tried to make a custom model from scratch with base_channels = 32, but VRAM was still about 5.7 GB.
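If I understand it correctly, this might just be TensorFlow's default behaviour of reserving nearly all GPU memory at startup regardless of what the model actually needs, so tools like nvidia-smi or Afterburner report roughly the same number no matter the batch size. A minimal sketch of how one could check the real per-batch usage by enabling memory growth; this is standard TensorFlow configuration, not something Musika-specific, and it has to run before any models are built:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing
# almost all of it up front; must be called before any GPU op runs.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# With memory growth enabled, the VRAM reported by nvidia-smi should now
# actually change with batch size and model width (e.g. base_channels).
```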

As far as I know, Musika does not have a concept of time relative to the beginning/end of the track. I think Jukebox had a trick to give the model that sense of time, so at the beginning of a track it would start calmly, as if it still had time to play out, and at the end it would, for example, do fades and applause from the audience. More technically, I understand it like this: during training, the token generator model was conditioned on information about how soon the sequence should stop. You're planning to add the ability to condition the generator on tempo information; it would also be super nice to have relative time as an option for conditioning, if Musika's architecture is able to support that.
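To illustrate what I mean, a minimal sketch of how such conditioning might look; none of these names come from Musika's actual code, it's just the idea of appending a normalized track position to whatever conditioning vector the generator already receives:

```python
import numpy as np

def add_relative_time(cond, start_sec, total_sec):
    """Append a normalized track position (0.0 = start, 1.0 = end) to an
    existing conditioning vector. Purely illustrative; `cond` stands in for
    whatever conditioning Musika's generator would already use (e.g. tempo)."""
    position = np.clip(start_sec / total_sec, 0.0, 1.0)
    return np.concatenate([cond, [position]], axis=-1)

# Example: a hypothetical tempo-only conditioning vector, 30 s into a 300 s track.
cond = np.array([120.0 / 200.0])              # tempo normalized to [0, 1]
print(add_relative_time(cond, 30.0, 300.0))   # -> [0.6, 0.1]
```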

Regarding NaN errors, a feature for automatically suspending, or maybe even restarting training from the last successful model, would be super useful.
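Even a simple check in the training loop might be enough; a minimal sketch of the idea (the save/restore functions here are hypothetical placeholders, not Musika's actual API):

```python
import math

def train_with_nan_guard(train_step, save_checkpoint, restore_checkpoint,
                         num_iterations, save_every=1000):
    """Hypothetical wrapper around a training loop: periodically checkpoint,
    and if the loss ever becomes NaN, roll back to the last good checkpoint
    instead of silently continuing to train a broken model."""
    for it in range(num_iterations):
        loss = train_step()
        if math.isnan(loss):
            print(f"NaN loss at iteration {it}, restoring last checkpoint")
            restore_checkpoint()
            continue
        if it % save_every == 0:
            save_checkpoint()
```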

I'd like to collect some more datasets and train Musika on them in the future; I think something like Aphex Twin and others could end up being pretty cool.
