Commit

Add announcement blog post for Demucs

sevagh committed Dec 8, 2023
1 parent 5731dbf commit 6ef5793

Showing 5 changed files with 258 additions and 4 deletions.
6 changes: 3 additions & 3 deletions docs/_posts/2023-11-24-Music-demixing-terminology.md
@@ -40,20 +40,20 @@ Typical music recordings are mono or stereo mixtures, with multiple sound object
The paper on the Music Demixing Challenge 2021 [[2]](#2) provides a summary of why the audio source separation problem has been interesting to researchers:
>Audio source separation has been studied extensively for decades as it brings benefits in our daily life, driven by many practical applications, e.g., hearing aids, speech diarization, etc. In particular, music source separation (MSS) attracts professional creators because it allows the remixing or reviving of songs to a level never achieved with conventional approaches such as equalizers. Suppressing vocals in songs can also improve the experience of a karaoke application, where people can enjoy singing together on top of the original song (where the vocals were suppressed), instead of relying on content developed specifically for karaoke applications
-Computational source separation has a history of at least 50 years [[1]](#1), originating from the tasks of computational auditory scene analysis (CASA) and blind source separation (BSS). In CASA, the goal is to computationally extract individual streams from recordings of an acoustic scene [[3]](#3), based on the definition of ASA (auditory scene analysis) [[4]](#4). BSS [[5]](#5) solves a subproblem of CASA which aims to recover the sources of a "mixture of multiple, statistically independent sources that are received with separate sensors" [[3]](#3). The term "blind" refers to there being no prior knowledge of what the sources are, and how they were mixed together. <span class="highlight">In CASA and BSS, therefore, the mixed audio contains unknown sources combined in unknown ways that must be separated.</span>
+Computational source separation has a history of at least 50 years [[1]](#1), originating from the tasks of computational auditory scene analysis (CASA) and blind source separation (BSS). In CASA, the goal is to computationally extract individual streams from recordings of an acoustic scene [[3]](#3), based on the definition of ASA (auditory scene analysis) [[4]](#4). BSS [[5]](#5) solves a subproblem of CASA which aims to recover the sources of a "mixture of multiple, statistically independent sources that are received with separate sensors" [[3]](#3). The term "blind" refers to there being no prior knowledge of what the sources are, and how they were mixed together. <span class="blog-highlight">In CASA and BSS, therefore, the mixed audio contains unknown sources combined in unknown ways that must be separated.</span>

By contrast, in music source separation and music demixing, the sources are typically known, or have known characteristics. That is to say, in music source separation, the task is not to separate all of the distinct sources in the mixture, but to extract a predefined set of sources, e.g.: harmonic and percussive sources, or the common four sources defined by the MUSDB18-HQ dataset [[6]](#6): vocals, drums, bass, and other. Music demixing can be considered as the reverse of a simple (no effects) mixing process of *stems* in a recording studio:
<img src="/assets/blog/post3/mixdemix.webp" width="65%" alt="mixing-demixing-diagram"/>

A stem is a grouping of individually recorded instrument tracks that have been combined together in a common category. For example, a drum stem could include all of the tracks of a drum kit (e.g., snare, tom, hihat), and a vocal stem could include all of the vocal tracks from the different singers in the song. [Izotope](https://www.izotope.com/en/learn/stems-and-multitracks-whats-the-difference.html) and [LANDR](https://blog.landr.com/stems-in-music/), two music tech companies, have written about stems and their history.

-<span class="highlight">In this light we can see that music demixing is simply a combination of multiple music source separation subproblems for all of the desired target stems.</span>
+<span class="blog-highlight">In this light we can see that music demixing is simply a combination of multiple music source separation subproblems for all of the desired target stems.</span>

## Music industry: stems and splitters

The theoretical underpinnings of modern AI and deep learning techniques were [beginning to be discovered by 1960](https://people.idsia.ch/~juergen/firstdeeplearner.html), but the computational power available was too low to take advantage of those ideas (nowadays the situation is reversed; the insane levels of compute power available in the world have led to huge and powerful AI models like ChatGPT).

-Being neither a musician, a music producer, nor an audio engineer, I can't speak with authority on the landscape and history of how people or products have approached stem isolation. All I know is that each time I talked about some new algorithm or piece of code I discovered with one of my musician friends, they'd always come back with "oh yeah I have an izotope plugin for that." <span class="highlight">Theory and practice are related but not strictly dependent on one another: real-world products can be created before there is a mathematical proof for how they work.</span>
+Being neither a musician, a music producer, nor an audio engineer, I can't speak with authority on the landscape and history of how people or products have approached stem isolation. All I know is that each time I talked about some new algorithm or piece of code I discovered with one of my musician friends, they'd always come back with "oh yeah I have an izotope plugin for that." <span class="blog-highlight">Theory and practice are related but not strictly dependent on one another: real-world products can be created before there is a mathematical proof for how they work.</span>

Here's a nice story of the [journey of the HitnMix RipX DAW](https://hitnmix.com/2023/07/17/history-of-audio-separation/); they describe how they had been working in the space of commercial music separation offerings since 2001, when I was not yet 10 years old. [Another story from Wired](https://www.wired.com/story/upmixing-audio-recordings-artificial-intelligence/) discusses the industry and how various academics have over time created startups or products for practical uses in the music industry, such as salvaging or cleaning up old Beatles recordings.

254 changes: 254 additions & 0 deletions docs/_posts/2023-12-08-Demucs-has-landed.md
@@ -0,0 +1,254 @@
---
layout: post
title: "Demucs is now available on this site"
category: announcements
tags: [music-demixer, demucs, news]
header_class: post
description: "Freemusicdemixer adds the world-leading AI model Demucs, powered by the same transformer architecture as ChatGPT"
keywords: music demixing, song splitting, song splitter, stem separation, demucs, transformer
intro: "Demucs v4 hybrid transformer is a very powerful AI model for music demixing, and it's on this website 😎"
---

<h2>Table of contents</h2>
* Table of contents
{:toc}

{{ page.intro }}

<img src="/assets/blog/post4/mdxdmc4.webp" alt="demucs-available" width="75%"/>

## Intro

The [Demucs v4 Hybrid Transformer](https://github.com/facebookresearch/demucs) model has world-beating performance. The [IEEE paper](https://research.facebook.com/publications/hybrid-transformers-for-music-source-separation/) published by Facebook Research describes the model. In short, the Transformer architecture, originating from the famous [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper, is at the heart of a lot of powerful AI models, [including ChatGPT](https://www.thoughtspot.com/data-trends/ai/what-is-transformer-architecture-chatgpt).

In various demixing competitions, [such as the Cadenza Challenge](https://cadenzachallenge.org/cadenza2023-workshop/programme.html), Open-Unmix and Demucs are featured side-by-side as the **academic, open-source baselines** due to how they are developed openly and push forward the state of the art. For that reason, I'm happy to include both as the models of this site.

## Demixing running times and scores

The base model available on this site is [Open-Unmix](https://github.com/sigsep/open-unmix-pytorch), with the updated [UMXL](https://github.com/sigsep/open-unmix-pytorch#%EF%B8%8F-news) weights. For a song with a length of ~4 minutes ('Zeno - Signs') from the MUSDB18-HQ test set, here are the separation scores of Demucs compared to Open-Unmix.

Demucs v4 (on this site) takes **20 minutes** (on my workstation) to achieve these scores:
```
vocals ==> SDR: 8.326 SIR: 18.257 ISR: 15.927 SAR: 8.311
drums ==> SDR: 10.041 SIR: 18.413 ISR: 17.054 SAR: 10.692
bass ==> SDR: 3.893 SIR: 12.221 ISR: 7.076 SAR: 3.237
melody ==> SDR: 7.432 SIR: 11.422 ISR: 14.161 SAR: 8.201
```
Open-Unmix UMXL (on this site) takes **7 minutes** to achieve these scores:
```
vocals ==> SDR: 6.830 SIR: 16.421 ISR: 14.044 SAR: 7.104
drums ==> SDR: 7.425 SIR: 14.570 ISR: 12.062 SAR: 8.905
bass ==> SDR: 2.462 SIR: 4.859 ISR: 5.346 SAR: 3.566
melody ==> SDR: 6.197 SIR: 9.437 ISR: 12.519 SAR: 7.627
```

I also tested on the longest track from the MUSDB18-HQ test set, 'Georgia Wonder - Siren' (~7 minutes), to make sure this site can demix users' long tracks without crashing by exceeding the 4 GB memory limit.

Demucs v4 takes **40 minutes** to achieve these scores:
```
vocals ==> SDR: 7.261 SIR: 13.550 ISR: 13.158 SAR: 6.763
drums ==> SDR: 10.629 SIR: 17.819 ISR: 17.373 SAR: 10.829
bass ==> SDR: 10.593 SIR: 19.696 ISR: 12.244 SAR: 10.007
melody ==> SDR: 6.324 SIR: 9.005 ISR: 13.223 SAR: 6.067
```
Open-Unmix UMXL takes **12 minutes** to achieve these scores:
```
vocals ==> SDR: 5.858 SIR: 10.880 ISR: 14.336 SAR: 6.187
drums ==> SDR: 7.654 SIR: 14.933 ISR: 11.459 SAR: 8.466
bass ==> SDR: 7.256 SIR: 12.007 ISR: 10.743 SAR: 6.757
melody ==> SDR: 4.699 SIR: 7.452 ISR: 9.142 SAR: 4.298
```

## Open-Unmix vs. Demucs: stats

This table compares the two models' characteristics and how they are implemented on this site.

| | Open-Unmix | Demucs v4 |
|:-|:-|:-|
| Overall SDR (signal-to-distortion ratio) | 5.3 | 9.0 |
| Architecture | Linear encoder/LSTM/linear decoder | Convolution encoder/Transformer/convolution decoder |
| Input/output | Magnitude spectrogram (STFT) | Time domain (waveform) + complex spectrogram (STFT) |
| Lines of C++ code | 2364 | 4549 |
| Model weight size | 45 MB (quantized + compressed with low impact) | 81 MB (no quantization or compression) |

An interesting consequence of the LSTM of Open-Unmix vs. the Transformer of Demucs is that <span class="blog-highlight">Demucs is parallelizable in a way that Open-Unmix isn't.</span> This means that future releases of this site can implement a parallel, multi-threaded version of Demucs, since separating an isolated subsection of the track **has no bearing on demixing other subsections of the track.**
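
To illustrate the idea, here's a minimal sketch (not the actual demucs.cpp code) of how independent segments could be dispatched to worker threads; `demix_segment` is a hypothetical stand-in for the per-segment Demucs inference:
```
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-segment inference: takes one chunk of the waveform and
// returns its separated output. This placeholder just echoes the input back;
// a real implementation would run the Demucs forward pass here.
std::vector<float> demix_segment(const std::vector<float> &segment)
{
    return segment;
}

// Split the track into fixed-size segments and demix them in parallel.
// Each segment is independent, so no worker needs another worker's result.
std::vector<std::vector<float>> demix_parallel(const std::vector<float> &track,
                                               size_t segment_len)
{
    size_t n_segments = (track.size() + segment_len - 1) / segment_len;
    std::vector<std::vector<float>> outputs(n_segments);
    std::vector<std::thread> workers;

    for (size_t i = 0; i < n_segments; ++i)
    {
        workers.emplace_back([&outputs, &track, segment_len, i]() {
            size_t begin = i * segment_len;
            size_t end = std::min(begin + segment_len, track.size());
            std::vector<float> segment(track.begin() + begin, track.begin() + end);
            outputs[i] = demix_segment(segment);
        });
    }
    for (auto &w : workers)
    {
        w.join();
    }
    // In practice the segments would be recombined with overlap-add.
    return outputs;
}
```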

The LSTM (long short-term memory) of Open-Unmix requires that the song be fed forward and backward through the network for full demixing quality. To that end, I did implement a [streaming LSTM](https://github.com/sevagh/free-music-demixer/commit/60ad98b9dd0d6b26c6287e2fa99cf6e18bec6f88#diff-58a2429a6b28bcb2571090d39842387d9facad944e85f80fc7c19d248673dd07), such that the entire track does not need to be held in memory at once (which would crash the site by consuming more than 4 GB). However, it still requires the song to be processed in order, forward and backward.

## Debugging during the development process

Writing the code for [demucs.cpp](https://github.com/sevagh/demucs.cpp) was more challenging than writing [umx.cpp](https://github.com/sevagh/umx.cpp), given Demucs's more complex architecture.

Over the course of development, the most important tool in my toolbox was printing **detailed stats of the tensors at every step of the network:**
```
Debugging tensor!: x_enc_1
shape: (1, 96, 128, 336)
min: -0.28594133257865906
max: 0.2303868532180786
mean: -2.4751065211603418e-05
stddev: 0.05556473508477211
sum: -102.19140625
min idx: (0, 55, 0, 1)
max idx: (0, 9, 0, 1)
FINISHED DEBUG FOR TENSOR x_enc_1
```

These let me eyeball the values passing through the network at each step and figure out where to track down errors during the implementation phase.
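
As an illustration of the kind of helper behind that output (a simplified sketch, not the exact demucs.cpp code), statistics over a flattened Eigen vector could be computed like this:
```
#include <Eigen/Dense>
#include <cmath>
#include <iostream>
#include <string>

// Print summary statistics for a tensor's data, viewed as a flat 1-D vector.
// A simplified sketch of the per-step debug output shown above.
void debug_tensor_stats(const std::string &name, const Eigen::VectorXf &flat)
{
    Eigen::Index min_idx = 0, max_idx = 0;
    float min_val = flat.minCoeff(&min_idx);
    float max_val = flat.maxCoeff(&max_idx);
    float mean = flat.mean();
    float stddev = std::sqrt((flat.array() - mean).square().mean());

    std::cout << "Debugging tensor!: " << name << "\n"
              << "  size (flattened): " << flat.size() << "\n"
              << "  min: " << min_val << " at flat idx " << min_idx << "\n"
              << "  max: " << max_val << " at flat idx " << max_idx << "\n"
              << "  mean: " << mean << "\n"
              << "  stddev: " << stddev << "\n"
              << "  sum: " << flat.sum() << "\n";
}
```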

## GEMM everywhere (especially in the convolutions)

Generalized Matrix Multiply (GEMM) is, [as stated by this NVIDIA post](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html):
>GEMMs (General Matrix Multiplications) are a fundamental building block for many operations in neural networks, for example fully-connected layers, recurrent layers such as RNNs, LSTMs or GRUs, and convolutional layers.

In short, the best way of ensuring a neural network runs fast is by representing every operation as a matrix multiplication.
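
As a toy illustration of that point (not code from demucs.cpp), a fully-connected layer over a batch of inputs boils down to a single GEMM plus a bias broadcast:
```
#include <Eigen/Dense>

// A fully-connected layer expressed as one GEMM: Y = X * W^T + b.
// X is (batch x in_features), W is (out_features x in_features), b is (out_features).
Eigen::MatrixXf linear_layer(const Eigen::MatrixXf &X, const Eigen::MatrixXf &W,
                             const Eigen::VectorXf &b)
{
    Eigen::MatrixXf Y = X * W.transpose(); // the GEMM
    Y.rowwise() += b.transpose();          // broadcast the bias across the batch
    return Y;
}
```
A highly-tuned GEMM (Eigen's, or a BLAS library) then takes care of cache blocking and SIMD, which is hard to match with hand-written loops.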

The original code for my convolution function used a naive nested `for-loop` approach:
```
template<int in_channels, int out_channels, int kernel_height, int kernel_width, int stride_height, int stride_width, int pad_height, int pad_width, int dilation_height, int dilation_width>
Eigen::Tensor3dXf conv2d(const Eigen::Tensor3dXf &x, const Eigen::Tensor4dXf &w, const Eigen::Tensor1dXf &b)
{
    int in_height = x.dimension(1);
    int in_width = x.dimension(2);
    int out_height = static_cast<int>(std::floor(in_height + 2 * pad_height - kernel_height) / stride_height) + 1;
    int out_width = static_cast<int>(std::floor(in_width + 2 * pad_width - kernel_width) / stride_width) + 1;

    Eigen::Tensor3dXf y_out(out_channels, out_height, out_width);

    // Initialize y_out to b
    for (int chout = 0; chout < out_channels; ++chout)
    {
        y_out.chip<0>(chout).setConstant(b(chout));
    }

    // 2d convolution loop
    for (int n = 0; n < kernel_width; ++n)
    {
        for (int m = 0; m < kernel_height; ++m)
        {
            for (int chin = 0; chin < in_channels; ++chin)
            {
                for (int j = 0; j < out_width; ++j)
                {
                    for (int i = 0; i < out_height; ++i)
                    {
                        for (int chout = 0; chout < out_channels; ++chout)
                        {
                            int ih = i * stride_height + m * dilation_height - pad_height;
                            int jw = j * stride_width + n * dilation_width - pad_width;
                            if (ih >= 0 && ih < in_height && jw >= 0 && jw < in_width)
                            {
                                y_out(chout, i, j) += x(chin, ih, jw) * w(chout, chin, m, n);
                            }
                        }
                    }
                }
            }
        }
    }
    return y_out;
}
```

Notice the template parameters: since the different convolution calls in Demucs reuse the same small sets of parameters, passing them as template arguments lets the compiler generate specialized, optimized versions of the code. This got some of my convolution runtimes down from 5 seconds to 2 seconds during benchmarking.
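
For example, a call site might look like the following (the numbers here are illustrative, not the actual Demucs layer shapes), and each distinct set of template arguments gets its own compiled specialization:
```
// Hypothetical call: 4 input channels, 48 output channels, 8x1 kernel,
// 4x1 stride, 2x0 padding, 1x1 dilation. The compiler emits a conv2d
// specialized for exactly these constants, with fully known loop bounds.
Eigen::Tensor3dXf y = conv2d<4, 48, 8, 1, 4, 1, 2, 0, 1, 1>(x, w, b);
```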

Eventually, I had to figure out the [GEMM implementation of convolution](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#imp-gemm-dim), and that was crucial in getting Demucs running in under 1 hour.

The first function is [`im2col`](https://www.mathworks.com/help/images/ref/im2col.html), which unrolls the input patches under each kernel position into the rows of a matrix that can then be multiplied by the weights:
```
template<int kernel_height, int kernel_width, int stride_height, int stride_width, int pad_height, int pad_width, int dilation_height, int dilation_width>
inline Eigen::MatrixXf im2col(const Eigen::Tensor3dXf& input) {
    // Adjust the calculation of height_col and width_col for dilation
    int in_channels = input.dimension(0);
    int height_col = (input.dimension(1) + 2 * pad_height - dilation_height * (kernel_height - 1) - 1) / stride_height + 1;
    int width_col = (input.dimension(2) + 2 * pad_width - dilation_width * (kernel_width - 1) - 1) / stride_width + 1;
    int in_height = input.dimension(1);
    int in_width = input.dimension(2);

    Eigen::MatrixXf output(height_col * width_col, in_channels * kernel_height * kernel_width);
    output.setZero();

    for (int c = 0; c < in_channels; c++) {
        for (int kh = 0; kh < kernel_height; kh++) {
            for (int kw = 0; kw < kernel_width; kw++) {
                for (int h = 0; h < height_col; h++) {
                    for (int w = 0; w < width_col; w++) {
                        int h_pad = h * stride_height + kh * dilation_height - pad_height;
                        int w_pad = w * stride_width + kw * dilation_width - pad_width;
                        if (h_pad >= 0 && h_pad < in_height && w_pad >= 0 && w_pad < in_width) {
                            output(h * width_col + w, c * kernel_height * kernel_width + kh * kernel_width + kw) = input(c, h_pad, w_pad);
                        }
                    }
                }
            }
        }
    }
    return output;
}
```

After im2col, the actual convolution multiplication becomes an easy one-liner:
```
template<int in_channels, int out_channels, int kernel_height, int kernel_width, int stride_height, int stride_width, int pad_height, int pad_width, int dilation_height, int dilation_width>
Eigen::Tensor3dXf conv2d_gemm(const Eigen::Tensor3dXf &x, const Eigen::Tensor4dXf &w, const Eigen::Tensor1dXf &b) {
    int in_height = x.dimension(1);
    int in_width = x.dimension(2);

    // Calculate output dimensions
    int out_height = static_cast<int>(std::floor(in_height + 2 * pad_height - kernel_height) / stride_height) + 1;
    int out_width = static_cast<int>(std::floor(in_width + 2 * pad_width - kernel_width) / stride_width) + 1;

    // Apply im2col
    Eigen::MatrixXf im2col_matrix = im2col<kernel_height, kernel_width, stride_height, stride_width, pad_height, pad_width, dilation_height, dilation_width>(x);

    // Reshape weights
    // reverse last 3 axes (out chan x in chan x kernel height x kernel width -> out chan x (kernel width x kernel height x in chan))
    Eigen::Tensor4dXf w_swapped = w.shuffle(Eigen::array<int, 4>({0, 3, 2, 1}));
    // then flatten to the last axis
    Eigen::Tensor2dXf reshaped_weights_tensor = w_swapped.reshape(Eigen::array<int, 2>{out_channels, in_channels * kernel_width * kernel_height});
    Eigen::MatrixXf reshaped_weights = Eigen::Map<Eigen::MatrixXf>(reshaped_weights_tensor.data(), reshaped_weights_tensor.dimension(0), reshaped_weights_tensor.dimension(1));

    // Perform matrix multiplication with GEMM
    Eigen::MatrixXf result = im2col_matrix * reshaped_weights.transpose();

    // Add bias to each column of the result matrix
    for (int chout = 0; chout < out_channels; ++chout) {
        result.col(chout).array() += b(chout);
    }

    // Reshape result to 3D output tensor
    Eigen::Tensor3dXf y_out(out_channels, out_height, out_width);
    y_out.setZero();
    for (int chout = 0; chout < out_channels; ++chout) {
        for (int h = 0; h < out_height; ++h) {
            for (int w = 0; w < out_width; ++w) {
                int row_idx = h * out_width + w;
                // Assign the value from the GEMM output to the output tensor
                y_out(chout, h, w) = result(row_idx, chout);
            }
        }
    }
    return y_out;
}
```

<span class="blog-highlight">An expected tradeoff of representing operations with GEMM is higher memory usage, but we could afford it, so it was worth it.</span>
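
To make that tradeoff concrete, here is a rough back-of-the-envelope estimate with made-up (but plausible) layer dimensions, not measured numbers from demucs.cpp:
```
// Hypothetical layer: input 48 ch x 512 x 336, kernel 8x1, stride 4x1, pad 2x0.
constexpr long in_ch = 48, in_h = 512, in_w = 336;
constexpr long kh = 8, kw = 1, sh = 4, sw = 1, ph = 2, pw = 0;
constexpr long out_h = (in_h + 2 * ph - kh) / sh + 1; // 128
constexpr long out_w = (in_w + 2 * pw - kw) / sw + 1; // 336
constexpr long input_bytes  = in_ch * in_h * in_w * sizeof(float);                 // ~33 MB
constexpr long im2col_bytes = (out_h * out_w) * (in_ch * kh * kw) * sizeof(float); // ~66 MB
// The unrolled im2col matrix is roughly (kernel area / stride area) times the
// input size (here 8/4 = 2x), and grows much faster for large kernels with
// unit stride.
```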

One of the last profiling runs of demucs.cpp shows that most of the time is spent in GEMM, which is a good thing: I'm basically crunching as many numbers as possible on the CPU, and not wasting time on other operations:

<img src="/assets/blog/post4/kcachegrind.webp" alt="kcachegrind" width="50%"/>
Binary file added docs/assets/blog/post4/kcachegrind.webp
Binary file not shown.
Binary file added docs/assets/blog/post4/mdxdmc4.webp
Binary file not shown.