Elephant Rumble Inference (using AVES/HuBERT-based transformer models)

From Central African soundscapes like this ...

Elephant rumbles in noise

... this model isolates the elephant-related sounds like this ...

Results of the trained classifier.

... by efficiently classifying each 1/3-second slice of 24-hour audio files as "elephant" or "not elephant", and generating a structured "Raven Selection File" for the elephant-related time intervals.

This is challenging because:

  • Such forests have complex background sounds from a wide variety of animals.
  • Forests muffle sounds, and elephants far from the recording devices may produce sounds barely above the noise floor.
  • High-quality off-the-shelf audio classifiers tend to do best on human voices, birds, or man-made devices.

Usage

    pip install git+https://github.com/ramayer/elephant-rumble-inference
    elephant-rumble-inference test.wav --visualizations=5 --save-raven *.wav
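
For reference, the --save-raven output is a Raven selection table: a tab-separated text file with one row per detected interval. The sketch below is illustrative only; the columns shown follow the common Raven selection-table layout, the frequency bounds are placeholder values, and the function name is hypothetical rather than the tool's actual writer.

    # Hypothetical sketch of writing detections as a Raven selection table.
    # Columns follow the common Raven layout; frequency bounds are placeholders.
    import csv

    def save_raven_selection_table(detections, path):
        # detections: list of (begin_seconds, end_seconds) tuples
        columns = ["Selection", "View", "Channel",
                   "Begin Time (s)", "End Time (s)", "Low Freq (Hz)", "High Freq (Hz)"]
        with open(path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            writer.writerow(columns)
            for i, (begin, end) in enumerate(detections, start=1):
                writer.writerow([i, "Spectrogram 1", 1, begin, end, 0, 250])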

Installation note:

  • TorchAudio depends on FFmpeg versions below 7, according to the TorchAudio docs.
    • On MacOS, you can get that using brew install ffmpeg@6.
    • On Windows, someone reported luck with conda install -c conda-forge 'ffmpeg<7'.

Example Notebooks:

More detailed usage examples below.

Detailed Design

  • The model is built on AVES (Animal Vocalization Encoder based on Self-Supervision): https://arxiv.org/abs/2210.14493
  • AVES, in turn, is based on HuBERT, a self-supervised transformer architecture for modeling raw waveforms (not spectrograms), originally developed for human speech.
  • AVES-bio was pre-trained on a wide range of unlabeled biological sounds, and performs well when fine-tuned for specific animals (cows, crows, bats, whales, mosquitos, fruit-bats, …) compared against spectrogram-based models like ResNets.

Unlike most other approaches in this challenge, this model directly analyzes the audio waveforms, without ever generating spectrogram images or using any image-classification or object-detection models.
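
As a rough illustration of that waveform-to-embedding path, the sketch below uses torchaudio's generic HuBERT architecture as a stand-in for the AVES-bio checkpoint; the real pipeline loads AVES weights (and the pitch-shifted audio described later), so treat this as an assumption-laden sketch rather than this repository's code.

    # Sketch: raw waveform in, one embedding vector per ~20 ms frame out.
    # torchaudio.models.hubert_base() is the same architecture family as AVES;
    # the real project would load the AVES-bio checkpoint instead.
    import torch
    import torchaudio

    model = torchaudio.models.hubert_base()
    model.eval()

    waveform, sample_rate = torchaudio.load("test.wav")   # raw audio, no spectrogram
    waveform = torch.mean(waveform, dim=0, keepdim=True)  # mix down to mono

    with torch.inference_mode():
        features, _ = model.extract_features(waveform)    # list of per-layer tensors
    embeddings = features[-1]                              # shape: (1, num_frames, 768)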

Challenge - Identifying which clusters are elephants.

Because AVES is a self-supervised model trained on unannotated audio, it needs to be either fine-tuned or augmented with additional classifiers trained to recognize the animals of interest.

  • AVES is great at creating clusters of similar biological sounds
  • But it doesn’t know which clusters go with which animals.
  • Visualizing AVES embeddings shows this is a promising approach.

The image above shows a UMAP visualization of AVES embeddings of our data:

  • Red = elephants in our labeled data.
  • Blue = non-elephant sound clips from the same .wav files.
  • (larger, interactive version here)

Observe that there are multiple distinct clusters. Listening to the sounds associated with each cluster helps identify which ones correspond to elephant calls and which to other sources.

While a simpler Support Vector Machine should likely have been able to separate those clusters, all my attempts failed, perhaps because elephant trumpeting is acoustically closer to other animals than to elephant rumbles, and a kernel trick would be needed for an SVM to put the wide variety of elephant sounds into a single class. Instead, I gave up on that approach and added a simple model with two fully-connected layers, which performed well when given an appropriate mix of positive and negative training data.
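
For concreteness, a model of that shape might look like the sketch below; the hidden size, output layout, and class name are illustrative assumptions, not the exact values used in this repository.

    # Hypothetical two-fully-connected-layer head on top of frozen AVES embeddings.
    import torch.nn as nn

    class RumbleClassifier(nn.Module):
        def __init__(self, embedding_dim=768, hidden_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(embedding_dim, hidden_dim),   # first fully-connected layer
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),               # second layer: one "elephant" logit
            )

        def forward(self, embeddings):
            # embeddings: (batch, embedding_dim) AVES features for one audio slice
            return self.net(embeddings).squeeze(-1)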

Naïve cosine similarity

Results of the trained classifier.

Challenges

  • Challenge - AVES didn’t have our frequencies or timescales in mind:

    • AVES seems easily distracted by high-pitched animals in the background noise. HuBERT, designed to recognize syllables in speech, generates feature vectors at a 20 ms timescale, which would be expensive on our 24-hour-long clips.
  • Solution - resample and pitch-shift by re-tagging the sample rate (a sketch follows below).

    • Up-shifting the audio by 3-4 octaves moves elephant rumbles into human hearing ranges, where most audio software operates best.
    • Speeding the audio up 16x turns a ½-second elephant rumble syllable into ~31 ms (close to HuBERT’s timescale).
    • Speeding up the audio 16x (equivalently, up-shifting it 4 octaves) also reduces compute by a factor of 16.

And as a side-effect, it shifts elephant speech into human hearing ranges, and they sound awesome!!!
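
A minimal sketch of that re-tagging trick is below, assuming the recordings' native sample rate comes from torchaudio.load and that the downstream model expects 16 kHz input, as HuBERT-style models do; it is not the repository's exact implementation.

    # Keep the samples unchanged but declare them 16x faster: the audio is now
    # sped up 16x and pitch-shifted up 4 octaves "for free".
    import torchaudio

    waveform, original_sr = torchaudio.load("test.wav")
    speedup = 16                                  # 2**4 -> 4 octaves up
    retagged_sr = original_sr * speedup

    # Resample the re-tagged audio to the 16 kHz that HuBERT/AVES models expect.
    resample = torchaudio.transforms.Resample(orig_freq=retagged_sr, new_freq=16_000)
    fast_high_pitched = resample(waveform)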

Metrics

The version of the model being tested here was:

  • Trained on every single labeled rumble from FruitPunch.AI's
    01. Data/cornell_data/Rumble/Training
    and an equal duration of unlabeled segments from the same audio files, assumed to be non-elephant-related.
  • Tested entirely against every labeled rumble from
    01. Data/cornell_data/Rumble/Testing and an equal duration of unlabeled sounds from the same files.

Unfortunately these datasets are not publicly available, but you might reach out to The Elephant Listening Project and/or FruitPunch.ai.

In both cases:

  • the things being classified are 1/3-of-a-second slices of the audio.
  • every 1/3-of-a-second slice that was labeled in the Raven files was considered a "positive".
  • about the same number of 1/3-second slices from the same files were used as "negative" labels; those were picked to be near in time to the "positive" samples to capture similar background noise (buffalos, airplanes, bugs). A labeling sketch follows below.
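
A hedged sketch of that labeling scheme: it assumes Raven selection tables are tab-separated with "Begin Time (s)" / "End Time (s)" columns (the usual Raven export), and the function name is illustrative rather than taken from this repository.

    # Split a recording into 1/3-second slices and mark a slice positive if it
    # overlaps any labeled rumble interval from the Raven selection table.
    import pandas as pd

    SLICE = 1.0 / 3.0  # seconds per classified slice

    def label_slices(raven_selection_path, total_seconds):
        selections = pd.read_csv(raven_selection_path, sep="\t")
        intervals = list(zip(selections["Begin Time (s)"], selections["End Time (s)"]))

        labels = []
        t = 0.0
        while t < total_seconds:
            positive = any(begin < t + SLICE and end > t for begin, end in intervals)
            labels.append(1 if positive else 0)
            t += SLICE
        return labels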

(Figures: visualized per-slice metrics, and the scikit-learn metrics report for the test set.)
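
The per-slice metrics themselves can be reproduced with scikit-learn; the snippet below is a generic sketch with placeholder labels, not the notebook's actual evaluation code.

    # Per-slice evaluation: compare ground-truth 0/1 labels with predictions.
    from sklearn.metrics import classification_report, confusion_matrix

    true_labels      = [1, 1, 0, 0, 1, 0]   # placeholder ground truth per 1/3-s slice
    predicted_labels = [1, 0, 0, 0, 1, 1]   # placeholder classifier output per slice

    print(confusion_matrix(true_labels, predicted_labels))
    print(classification_report(true_labels, predicted_labels,
                                target_names=["not elephant", "elephant"]))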

Many of the wrong guesses seem to be related to:

  • Differences in opinion about where a rumble starts or ends.
    If the Raven file thinks a rumble starts or ends 2/3 of a second sooner than this classifier does, that difference in overlap counts as both a false negative and a false positive.
  • Some unusual background noises that I haven't yet added to the manual "negative" classes. If a sound is "animal-like", AVES may consider it "more elephant-like than not" unless the companion model is given appropriate negative training samples.

I'd welcome a code review of this evaluation to confirm it is being done fairly. The notebook with the entire training run that produced this particular model can be found here: https://github.com/ramayer/elephant-rumble-inference/blob/main/notebooks/training_notebook.ipynb

Performance Notes

This really wants a GPU.

  • It processes 24 hours of audio in 22 seconds on a 2060 GPU.
  • It processes 24 hours of audio in 28 minutes on a CPU.

so roughly a 75x speedup on a GPU (28 minutes ≈ 1,680 seconds vs. 22 seconds).

This has not been tested on a GPU with less than 6 GB of VRAM.

Windows instructions

  • I was only able to make this work using conda
  • On Windows, someone reported luck with conda install -c conda-forge 'ffmpeg<7'.
  • Documentation may assume Linux-like paths, and examples may need to be adjusted.

MacOS instructions

  • As per: https://pytorch.org/audio/stable/installation.html "TorchAudio official binary distributions are compatible with FFmpeg version 6, 5 and 4. (>=4.4, <7)." -- so it specifically needs an older version of ffmpeg.
  • I needed to brew install ffmpeg@6 for it to run properly.

Detailed usage instructions

for inference

  • TODO - add this.

for training

  • TODO - add this.

Future work

  • Certain audio files not in the test dataset had curious sounds in the background (bugs? different airplanes?) causing a lot of false positives. Adding such examples to the training data as "not a rumble" and retraining should improve performance on those files.
  • Despite performing well on the audio recorders in both this project's test and train datasets, the pretrained version of the model performs poorly in environments with different background noises, like these South African elephant recordings. It regains its performance when re-trained with negative "not an elephant" examples of background noise from those other regions. Training a version with more diverse not-elephant sounds would produce a more robust model.

Bug fixes and feature requests