Threaded inference (#10)
sevagh authored Mar 3, 2024
1 parent 5cf8cb7 commit 6de86de
Showing 17 changed files with 864 additions and 63 deletions.
2 changes: 1 addition & 1 deletion .clang-format
@@ -3,4 +3,4 @@ IndentWidth: 4
BreakBeforeBraces: Allman
AllowShortIfStatementsOnASingleLine: false
IndentCaseLabels: false
ColumnLimit: 80
ColumnLimit: 80
50 changes: 50 additions & 0 deletions .github/SDR_scores.md
@@ -59,3 +59,53 @@ drums ==> SDR: 10.463 SIR: 19.782 ISR: 17.144 SAR: 11.132
bass ==> SDR: 4.584 SIR: 9.359 ISR: 9.068 SAR: 4.885
other ==> SDR: 7.426 SIR: 12.793 ISR: 12.975 SAR: 7.830
```

### Performance of multi-threaded inference

Track: Zeno - Signs. Model: Demucs 4s (4-source), multi-threaded with the same strategy used in <https://freemusicdemixer.com>.

Best performance on my 16-core 5950X: `export OMP_NUM_THREADS=4` plus 4 threads via CLI args, so that 4 x 4 = 16 physical cores are in use.

The SDR should be identical to the single-threaded run, but it is still worth testing, since splitting the waveform into large per-thread segments may affect demixing quality:
```
vocals ==> SDR: 8.317 SIR: 18.089 ISR: 15.887 SAR: 8.391
drums ==> SDR: 9.987 SIR: 18.579 ISR: 16.997 SAR: 10.755
bass ==> SDR: 4.039 SIR: 12.531 ISR: 6.822 SAR: 3.090
other ==> SDR: 7.405 SIR: 11.246 ISR: 14.186 SAR: 8.099
```

Multi-threaded fine-tuned:
```
```

### Time measurements

Regular, big threads = 1, OMP threads = 16:
```
real 10m23.201s
user 29m42.190s
sys 4m17.248s
```

Fine-tuned, big threads = 1, OMP threads = 16: probably about 4x the above, since it simply runs 4 Demucs models.

Mt, big threads = 4, OMP threads = 4 (4x4 = 16):
```
real 4m9.331s
user 18m59.731s
sys 3m28.465s
```

Ft Mt, big threads = 4, OMP threads = 4 (4x4 = 16):
```
real 16m30.252s
user 74m27.250s
sys 14m40.643s
```

Mt, big threads = 8, OMP threads = 16:
```
real 4m9.304s
user 43m21.830s
sys 10m15.712s
```
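
To compare these measurements, here is a minimal sketch that converts `time`-style durations to seconds and derives a wall-clock speedup and an average busy-core count. The helper names (`to_seconds`, `speedup`, `avg_busy_cores`) are hypothetical and not part of this repository; they only restate the arithmetic on the numbers above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert a `time`-style duration such as "10m23.201s" to seconds.
// Assumes the "<minutes>m<seconds>s" layout printed by the shell.
double to_seconds(const std::string &t)
{
    const std::size_t m = t.find('m');
    assert(m != std::string::npos);
    const double minutes = std::stod(t.substr(0, m));
    // std::stod stops parsing at the trailing 's'
    const double seconds = std::stod(t.substr(m + 1));
    return minutes * 60.0 + seconds;
}

// Wall-clock speedup of the `fast` run relative to the `slow` run.
double speedup(const std::string &slow_real, const std::string &fast_real)
{
    return to_seconds(slow_real) / to_seconds(fast_real);
}

// Average number of busy cores over the run: (user + sys) / real.
double avg_busy_cores(const std::string &real, const std::string &user,
                      const std::string &sys)
{
    return (to_seconds(user) + to_seconds(sys)) / to_seconds(real);
}
```

Applied to the numbers above, the Mt 4x4 run is roughly 2.5x faster in wall-clock time than the regular run, and the Ft Mt run takes close to 4x as long as the Mt run, which is consistent with running 4 models. The 8x16 run has essentially the same wall-clock time as the 4x4 run but keeps about 13 cores busy on average instead of about 5, so oversubscribing threads appears to cost CPU time without reducing run time.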
16 changes: 13 additions & 3 deletions CMakeLists.txt
@@ -16,7 +16,7 @@ endif()
set(CMAKE_CXX_FLAGS "-Wall -Wextra")
set(CMAKE_CXX_FLAGS_DEBUG "-g -DEIGEN_FAST_MATH=0 -O0")

set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -march=native -fno-unsafe-math-optimizations -fassociative-math -freciprocal-math -fno-signed-zeros")
set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -march=native -fno-unsafe-math-optimizations -freciprocal-math -fno-signed-zeros")

# define a macro NDEBUG for Eigen3 release builds
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -DNDEBUG")
@@ -91,14 +91,24 @@ add_executable(demucs_ft.cpp.main "cli-apps/demucs_ft.cpp")
target_include_directories(demucs_ft.cpp.main PRIVATE vendor/libnyquist/include)
target_link_libraries(demucs_ft.cpp.main demucs.cpp.lib libnyquist)

file(GLOB SOURCES_TO_LINT "src/*.cpp" "src/*.hpp" "cli-apps/*.cpp")
add_executable(demucs_mt.cpp.main "cli-apps/demucs_mt.cpp")
target_include_directories(demucs_mt.cpp.main PRIVATE vendor/libnyquist/include)
target_include_directories(demucs_mt.cpp.main PRIVATE cli-apps)
target_link_libraries(demucs_mt.cpp.main demucs.cpp.lib libnyquist)

add_executable(demucs_ft_mt.cpp.main "cli-apps/demucs_ft_mt.cpp")
target_include_directories(demucs_ft_mt.cpp.main PRIVATE vendor/libnyquist/include)
target_include_directories(demucs_ft_mt.cpp.main PRIVATE cli-apps)
target_link_libraries(demucs_ft_mt.cpp.main demucs.cpp.lib libnyquist)

file(GLOB SOURCES_TO_LINT "src/*.cpp" "src/*.hpp" "cli-apps/*.cpp" "cli-apps/*.hpp")

# add target to run standard lints and formatters
add_custom_target(lint
COMMAND clang-format -i ${SOURCES_TO_LINT} --style=file
# add clang-tidy command
# add include dirs to clang-tidy
COMMAND cppcheck --enable=all --suppress=missingIncludeSystem ${SOURCES_TO_LINT} --std=c++17
COMMAND cppcheck -I"src/" -I"cli-apps/" --enable=all --suppress=missingIncludeSystem ${SOURCES_TO_LINT} --std=c++17
COMMAND scan-build -o ${CMAKE_BINARY_DIR}/scan-build-report make -C ${CMAKE_BINARY_DIR}
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
)
42 changes: 39 additions & 3 deletions README.md
@@ -2,15 +2,17 @@

C++17 implementation of the [Demucs v4 hybrid transformer](https://github.com/facebookresearch/demucs), a PyTorch neural network for music demixing. Similar project to [umx.cpp](https://github.com/sevagh/umx.cpp). This code powers my site <https://freemusicdemixer.com>.

It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference.
It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference. There are also programs for multi-threaded Demucs inference using C++11's `std::thread`.

**All Hybrid-Transformer weights** (4-source, 6-source, fine-tuned) are supported. See the [Convert weights](#convert-weights) section below. Demixing quality is nearly identical to PyTorch as shown in the [SDR scores doc](./.github/SDR_scores.md).

### Directory structure

`src` contains the library for Demucs inference, and `cli-apps` contains two driver programs, which compile to:
`src` contains the library for Demucs inference, and `cli-apps` contains four driver programs, which compile to:
1. `demucs.cpp.main`: run a single model (4s, 6s, or a single fine-tuned model)
2. `demucs_ft.cpp.main`: run all 4 fine-tuned models for `htdemucs_ft` inference, same as the BagOfModels idea of PyTorch Demucs
1. `demucs_ft.cpp.main`: run all four fine-tuned models for `htdemucs_ft` inference, same as the BagOfModels idea of PyTorch Demucs
1. `demucs_mt.cpp.main`: run a single model, multi-threaded
1. `demucs_ft_mt.cpp.main`: run all four fine-tuned models, multi-threaded

### Multi-core, OpenMP, BLAS, etc.

@@ -21,6 +23,40 @@ If you have OpenMP and OpenBLAS installed, OpenBLAS might automatically use all
See the [BLAS benchmarks doc](./.github/BLAS_benchmarks.md) for more details.

### Multi-threading

There are two new programs, `demucs_mt.cpp.main` and `demucs_ft_mt.cpp.main`, that use C++11 [`std::thread`](https://en.cppreference.com/w/cpp/thread/thread).

In the single-threaded programs:

* User supplies a waveform of length N seconds
* Waveform is split into 7.8-second segments for Demucs inference
* Segments are processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`

In the multi-threaded programs:
* User supplies a waveform of length N seconds and a `num_threads` argument
* Waveform is split into `num_threads` sub-waveforms (of length M < N), with a 0.75-second overlap between them, to process in parallel
* We always need overlapping segments in audio applications to eliminate [boundary artifacts](https://freemusicdemixer.com/under-the-hood/2024/02/23/Demucs-segmentation#boundary-artifacts-and-the-overlap-add-method)
* `num_threads` threads are launched to perform Demucs inference on the sub-waveforms in parallel
* Within each thread, the sub-waveform is split into 7.8-second segments
* Segments within a thread are still processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
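
The multi-threaded steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `cli-apps`: the `Chunk` struct and the functions `split_with_overlap`, `process_chunk`, and `run_parallel` are hypothetical names, the overlap is assumed to be added on both sides of each sub-waveform, and `process_chunk` is a placeholder that copies samples where real Demucs inference would run. Stitching the overlapping outputs back together is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Half-open sample range [start, end) of one sub-waveform.
struct Chunk
{
    std::size_t start;
    std::size_t end;
};

// Split n_samples into num_threads contiguous chunks, then widen each chunk
// by `overlap` samples on both sides, clamped to the waveform bounds.
// For a 0.75-second overlap at an assumed 44100 Hz, overlap would be 33075.
std::vector<Chunk> split_with_overlap(std::size_t n_samples,
                                      std::size_t num_threads,
                                      std::size_t overlap)
{
    assert(num_threads > 0);
    std::vector<Chunk> chunks;
    const std::size_t base = n_samples / num_threads;
    for (std::size_t i = 0; i < num_threads; ++i)
    {
        std::size_t start = i * base;
        // the last chunk absorbs any remainder samples
        std::size_t end = (i + 1 == num_threads) ? n_samples : start + base;
        start = (start > overlap) ? start - overlap : 0;
        end = std::min(n_samples, end + overlap);
        chunks.push_back({start, end});
    }
    return chunks;
}

// Placeholder for per-thread inference (hypothetical): copies the samples.
void process_chunk(const std::vector<float> &waveform, const Chunk &c,
                   std::vector<float> &out)
{
    out.assign(waveform.begin() + static_cast<std::ptrdiff_t>(c.start),
               waveform.begin() + static_cast<std::ptrdiff_t>(c.end));
}

// Launch one std::thread per sub-waveform and wait for all of them.
std::vector<std::vector<float>> run_parallel(const std::vector<float> &waveform,
                                             std::size_t num_threads,
                                             std::size_t overlap)
{
    const std::vector<Chunk> chunks =
        split_with_overlap(waveform.size(), num_threads, overlap);
    std::vector<std::vector<float>> outputs(chunks.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i)
    {
        // each thread writes only to its own output slot
        workers.emplace_back(
            [&, i] { process_chunk(waveform, chunks[i], outputs[i]); });
    }
    for (std::thread &w : workers)
    {
        w.join();
    }
    return outputs;
}
```

Each thread writes only to its own entry in `outputs`, so the sketch needs no locking; only the read-only input waveform is shared.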

For the single-threaded `demucs.cpp.main`, my suggestion is `OMP_NUM_THREADS=$num_physical_cores`. On my 5950X system with 16 cores, execution time for a 4-minute song:
```
real 10m23.201s
user 29m42.190s
sys 4m17.248s
```

For the multi-threaded `demucs_mt.cpp.main`, using 4 `std::thread`s and `OMP_NUM_THREADS=4` (4x4 = 16 physical cores):
```
real 4m9.331s
user 18m59.731s
sys 3m28.465s
```

About 2.5x faster in wall-clock time with 4 threads (4m9s vs. 10m23s). This is inspired by the parallelism strategy used in <https://freemusicdemixer.com>.

## Instructions

### Build C++ code
5 changes: 4 additions & 1 deletion cli-apps/demucs_ft.cpp
@@ -133,7 +133,6 @@ int main(int argc, const char **argv)

// iterate over all files in model_dir
// and load the model
std::string model_file;
for (const auto &entry : std::filesystem::directory_iterator(model_dir))
{
bool ret = false;
@@ -167,6 +166,10 @@ int main(int argc, const char **argv)
std::cout << "Loading ft model " << entry.path().string()
<< " for vocals" << std::endl;
}
else
{
continue;
}

// debug some members of model
std::cout << "demucs_model_load returned " << (ret ? "true" : "false")