You can extract exact laughter segments from various kinds of speech audio using the trained model and code provided here. You can also train your own model.
Code, annotations, and the model are described in the following paper: Taisei Omine, Kenta Akita, and Reiji Tsuruno, "Robust Laughter Segmentation with Automatic Diverse Data Synthesis," Interspeech 2024. (To be published in a few months.)
```bash
git clone https://github.com/omine-me/LaughterSegmentation.git
cd LaughterSegmentation
python -m pip install -r requirements.txt
# ↓ Depends on your environment. See https://pytorch.org/get-started/locally/
python -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
```
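After installing, it may help to confirm that the CUDA-enabled PyTorch build was picked up; a minimal check (nothing repo-specific here):

```python
import torch

# Expect something like "2.1.2+cu121" and True on a correctly
# set up GPU machine; False means the CPU-only build was installed.
print(torch.__version__)
print(torch.cuda.is_available())
```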
Running in a venv environment is recommended. Also, download `model.safetensors` from Hugging Face (1.26 GB), place it in the `models` directory, and make sure the file is named `model.safetensors`.
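If you prefer to script the download, here is a sketch using `huggingface_hub`; the repo ID below is a placeholder, so substitute the actual model repository on Hugging Face:

```python
from huggingface_hub import hf_hub_download

# Hypothetical repo ID -- replace with the model repository this README links to.
path = hf_hub_download(
    repo_id="<user>/<laughter-model>",
    filename="model.safetensors",
    local_dir="models",  # inference expects models/model.safetensors
)
print("saved to", path)
```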
Tested on Windows 11 with GeForce RTX 2060 SUPER.
- Prepare an audio file.
- Open a terminal and go to the directory where `inference.py` is located.
- Run `python inference.py --audio_path audio.wav`, replacing `audio.wav` with the path to your own audio file. Common audio formats such as `mp3`, `wav`, and `opus` are supported; 16kHz wav audio is processed faster (see the resampling sketch after this list).
- To change the output directory, use the `--output_dir` option. To use your own model, use the `--model_path` option.
- Results are saved to the output directory in JSON format. To visualize the results, you can use this site (not perfect because it's for debugging); a sketch for reading the JSON programmatically follows below.
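Since 16kHz wav input is faster, you can resample beforehand with `torchaudio` (installed above); a minimal sketch, with placeholder file names:

```python
import torchaudio
import torchaudio.functional as F

# Load the original file and write a 16 kHz wav copy for faster inference.
waveform, sr = torchaudio.load("input.wav")        # your input file
waveform = F.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("audio_16k.wav", waveform, 16000)  # pass this to --audio_path
```

The README does not spell out the JSON schema, so the path and field names below are assumptions; a sketch for loading the detected segments:

```python
import json

# Hypothetical output path and keys -- adjust to your --output_dir
# and the actual schema of the generated file.
with open("output/audio.json") as f:
    segments = json.load(f)

for seg in segments:
    print(f'laughter: {seg["start"]:.2f}s - {seg["end"]:.2f}s')
```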
Training instructions will be added about one or two weeks later.
Read the README in the `evaluation` directory.
This repository is MIT-licensed, but the publicly released trained model is currently available for research use only.
Use GitHub Issues or reach out via my X (Twitter) account.