This project is a proof-of-concept implementation of Audio Event Recognition (AER) based on a Convolutional Neural Network (CNN), written in Keras. Keep in mind that this is the author's first version; as it stands, it is not the best model available for this purpose. Work on the model is in progress, and refinements will be pushed to the repository.
Python 3.6 was used during development, and the following libraries are required to run the code provided in the notebook:
- keras 2.x
- numpy
- librosa
The ESC-50 dataset is a public labeled set of 2000 environmental recordings (50 classes, 40 clips per class, approximately 5 seconds per clip) suitable for environmental sound classification tasks.
See ESC: Dataset for Environmental Sound Classification - paper replication data for the full paper with a more thorough analysis.
The available sound classes, arranged alphabetically, are given below:
- 1 - Airplane
- 2 - Breathing
- 3 - Brushing teeth
- 4 - Can opening
- 5 - Cat
- 6 - Car horn
- 7 - Chainsaw
- 8 - Chirping birds
- 9 - Church bells
- 10 - Clapping
- 11 - Clock alarm
- 12 - Clock tick
- 13 - Coughing
- 14 - Cow
- 15 - Crackling fire
- 16 - Crickets
- 17 - Crow
- 18 - Crying baby
- 19 - Dog
- 20 - Door knock
- 21 - Door - wood creaks
- 22 - Drinking - sipping
- 23 - Engine
- 24 - Fireworks
- 25 - Footsteps
- 26 - Frog
- 27 - Glass breaking
- 28 - Hand saw
- 29 - Helicopter
- 30 - Hen
- 31 - Insects (flying)
- 32 - Keyboard typing
- 33 - Laughing
- 34 - Mouse click
- 35 - Pig
- 36 - Pouring water
- 37 - Rain
- 38 - Rooster
- 39 - Sea waves
- 40 - Sheep
- 41 - Siren
- 42 - Sneezing
- 43 - Snoring
- 44 - Thunderstorm
- 45 - Toilet flush
- 46 - Train
- 47 - Vacuum cleaner
- 48 - Washing machine
- 49 - Water drops
- 50 - Wind
First of all, we renamed the files in each class to numbers from 1 to 40. Then every file was read and its dB-scaled Mel spectrogram was computed with n_mels = 128; all other parameters of librosa.feature.melspectrogram were left at their defaults. The files are of different lengths, so to make sure the preprocessed data has the same size for every file, we kept only 300 frames per spectrogram.
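The frame-fixing step can be sketched as follows. Note that `fix_length` is a hypothetical helper name, and zero-padding clips shorter than 300 frames is an assumption; the repository's scripts may handle short clips differently.

```python
import numpy as np

# dB-scaled Mel spectrogram as described above (shown for context; all
# melspectrogram parameters except n_mels are librosa defaults):
#   S = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
#   S_db = librosa.power_to_db(S, ref=np.max)

N_FRAMES = 300  # fixed number of spectrogram frames kept per file

def fix_length(spec, n_frames=N_FRAMES):
    """Truncate or zero-pad a (n_mels, time) array to exactly n_frames columns."""
    n_mels, t = spec.shape
    if t >= n_frames:
        return spec[:, :n_frames]          # truncate long clips
    padded = np.zeros((n_mels, n_frames), dtype=spec.dtype)
    padded[:, :t] = spec                   # zero-pad short clips (assumption)
    return padded

# A 128 x 220 spectrogram (short clip) becomes 128 x 300:
print(fix_length(np.random.rand(128, 220)).shape)  # (128, 300)
```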
We trained our model on all 50 classes. The full dataset is shuffled to mix the classes and remove ordering patterns, then divided into two subsets: 80% for training and 20% for testing. The training data is further divided by randomly selecting approximately 80% for training and the rest for validation. In the end we have 400 instances for testing (8 files per class), approximately 1280 instances for training, and 320 instances for validation.
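The split described above can be sketched with NumPy as follows. The fixed seed and the exact split mechanics are assumptions for illustration; the variable names mirror those used by the repository's training script.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed is an assumption, for reproducibility

# Stand-in data: 2000 clips (50 classes x 40 files) with integer labels 0..49.
X = np.arange(2000)             # placeholders for the preprocessed spectrograms
y = np.repeat(np.arange(50), 40)

# Shuffle to mix the classes and remove ordering patterns.
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]

# First split: 80% train+validation / 20% test.
n_test = int(0.2 * len(X))      # 400 instances (roughly 8 per class)
X_test, Y_test = X[:n_test], y[:n_test]
X_rest, Y_rest = X[n_test:], y[n_test:]

# Second split: ~80% train / ~20% validation of the remainder.
n_val = int(0.2 * len(X_rest))  # 320 instances
X_validation, Y_validation = X_rest[:n_val], Y_rest[:n_val]
X_train, Y_train = X_rest[n_val:], Y_rest[n_val:]

print(len(X_train), len(X_validation), len(X_test))  # 1280 320 400
```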
We tested the model on all classes and obtained an overall average accuracy of 52%. We found that the model performs differently on different classes, so we grouped them into three categories: Very Good Performance (accuracy above 75%), Medium Performance (accuracy between 60% and 75%), and Bad Performance (accuracy of 50% or less). The model's performance on each class is reported below.
- Siren : 100.0 %
- DoorKnock : 100.0 %
- Clapping : 100.0 %
- Helicopter : 100.0 %
- Rain : 87.5 %
- Rooster : 87.5 %
- ClockAlarm : 87.5 %
- CanOpening : 87.5 %
- PouringWater : 87.5 %
- HandSaw : 87.5 %
On these classes the accuracy of the model is 92.5% on average.
- VacuumCleaner : 75.0 %
- Dog : 75.0 %
- Train : 62.5 %
- CarHorn : 62.5 %
- Crow : 62.5 %
- Engine : 62.5 %
- BrushingTeeth : 62.5 %
- Frog : 62.5 %
- Cow : 62.5 %
- KeyboardTyping : 62.5 %
- Insects : 62.5 %
- SeaWaves : 62.5 %
- ChurchBells : 62.5 %
- Sheep : 62.5 %
Average performance for these classes is 64.29%.
- Crickets : 50.0 %
- GlassBreaking : 50.0 %
- Coughing : 50.0 %
- Pig : 50.0 %
- Thunderstorm : 50.0 %
- CracklingFire : 50.0 %
- ToiletFlush : 50.0 %
- WaterDrops : 37.5 %
- CryingBaby : 37.5 %
- Fireworks : 37.5 %
- Hen : 37.5 %
- Cat : 37.5 %
- DrinkingSipping : 37.5 %
- Laughing : 25.0 %
- Chainsaw : 25.0 %
- Breathing : 25.0 %
- Sneezing : 25.0 %
- WashingMachine : 25.0 %
- Snoring : 12.5 %
- ClockTick : 12.5 %
- DoorWoodCreaks : 12.5 %
- ChirpingBirds : 12.5 %
- MouseClick : 12.5 %
- Footsteps : 12.5 %
- Wind : 0.0 %
- Airplane : 0.0 %
Average performance for these classes is 29.80%.
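As a quick consistency check, the overall 52% accuracy reported above matches the three group averages weighted by the number of classes in each group:

```python
# (number of classes, average accuracy) for the three performance groups above
groups = [(10, 92.5), (14, 64.29), (26, 29.80)]

# Weighted average over all 50 classes
overall = sum(n * acc for n, acc in groups) / sum(n for n, _ in groups)
print(round(overall, 2))  # 52.0
```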
Follow these steps to use the code:
- Download the dataset and unzip it into the `Samples` directory.
- Keep only the 50 subdirectories for the different events and delete all other files in `Samples`.
- Run `rename.py` to rename the files in the subdirectories to `1.wav` through `40.wav`.
- Run `preprocess_data.py` to preprocess the data; this generates the files and directories in the `Preproc` subdirectory.
- Finally, run `train_network.py`. It loads the preprocessed data from the `Preproc` directory and creates the `X_train`, `Y_train`, `X_validation`, `Y_validation`, `X_test` and `Y_test` variables for training. It then trains the network and saves the `X_test` variable along with `Y_test`, the pre-trained model `model.h5`, and the class labels `Class_names.npy`.
- `evaluate_network.py` evaluates the pretrained network and prints the performance for each class.