info.md

Hou-Ning Hu edited this page Mar 26, 2017 · 2 revisions

Conclusion

Because librosa uses an FFMPEG backend while torch audio uses a SoX backend, the two read files that have a start offset slightly differently.

Converting all the mp3 files with the following command does the trick:

sox input.mp3 output.mp3 trim 0
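
To apply the same conversion to a whole directory, a small wrapper could look like the sketch below. It assumes `sox` is on the `PATH` and was built with mp3 support. The function names and the directory names in the usage comment are hypothetical and do not come from these notes.

```python
import subprocess
from pathlib import Path


def sox_trim_command(src, dst):
    """Build the command shown above: sox input.mp3 output.mp3 trim 0."""
    return ["sox", str(src), str(dst), "trim", "0"]


def convert_all(src_dir, dst_dir):
    """Re-encode every mp3 in src_dir into dst_dir (hypothetical layout)."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.mp3")):
        # check=True raises CalledProcessError if sox exits with an error
        subprocess.run(sox_trim_command(src, dst_dir / src.name), check=True)


# Usage (hypothetical directories):
# convert_all("data", "converted")
```

Writing to a separate output directory keeps the original files around for comparison.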

FFMPEG seems to take the stream's start offset into account, so librosa cannot find the start point we want, and even a small shift causes a large sample-by-sample difference in amplitude. Converting the files is a temporary cure: if we look at the fine detail, there is still a small gap between the values from librosa and those from the Lua audio version.
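
To see why a shift of only a few samples matters so much, the sketch below compares a synthetic tone with a shifted copy of itself. The tone frequency and the shift size are assumptions chosen for illustration, not values measured from the files in these notes.

```python
import math

SR = 22050      # sample rate of most files in these notes
FREQ = 1000.0   # assumed test tone, in Hz
SHIFT = 2       # assumed misalignment, in samples


def tone(n, offset=0):
    """Return n samples of a sine tone, starting `offset` samples in."""
    return [math.sin(2 * math.pi * FREQ * (i + offset) / SR) for i in range(n)]


a = tone(SR)
b = tone(SR, offset=SHIFT)  # same signal, read with a small shift

diffs = [x - y for x, y in zip(a, b)]
max_abs = max(abs(d) for d in diffs)
avg_abs = sum(abs(d) for d in diffs) / len(diffs)
print(f"max |diff| = {max_abs:.4f}, avg |diff| = {avg_abs:.4f}")
```

Both lists hold the same waveform, yet the element-wise difference is a sizeable fraction of full scale, which is the kind of mismatch the comparisons below show.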

My attempts

Original mp3 file

Input #0, mp3, from 'fireworks_shift_1274_6.mp3':
  Metadata:
    encoder         : Lavf56.4.101
  Duration: 00:00:04.08, start: 0.050113, bitrate: 64 kb/s
    Stream #0:0: Audio: mp3, 22050 Hz, stereo, s16p, 64 kb/s

S**tty result.

rosa.max=0.999969482422, rosa.min=-1.0, lua.max=1.0, lua.min=-1.0
Path data/fireworks_shift_1274_6.mp3: rosa.shape=(89856,), lua.shape=(89854,)
Librosa:
[ 0.          0.          0.         ...,  0.19934082  0.22732544
  0.21987915]
Torch:
[ 0.          0.          0.         ...,  0.04422661  0.04547431
  0.16926192]
Round to 4 decimals
Total Diff: 1865.41
Avg Diff: 0.0207604
Max Diff: 0.3136
Min Diff: -0.3041
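
The comparison script itself is not part of these notes. The sketch below is one guess at how such statistics could be computed: truncate both signals to the shorter length, round to 4 decimals, and take the total as the sum of absolute differences. That guess is consistent with the numbers above, where the average is roughly the total divided by the shorter length (1865.41 / 89854 ≈ 0.02076), but the function name and exact definitions are assumptions.

```python
def diff_stats(rosa, lua, decimals=4):
    """Compare two sample sequences after rounding (assumed definitions)."""
    n = min(len(rosa), len(lua))  # the two decoders may return different lengths
    diffs = [
        round(float(r), decimals) - round(float(l), decimals)
        for r, l in zip(rosa[:n], lua[:n])
    ]
    total = sum(abs(d) for d in diffs)  # assumed: sum of absolute differences
    return {
        "total": total,
        "avg": total / n,
        "max": max(diffs),  # signed, like the Max Diff line above
        "min": min(diffs),  # signed, like the Min Diff line above
    }


# Toy input, not data from the mp3 files
stats = diff_stats([0.0, 0.5, -0.25], [0.0, 0.25, 0.0, 0.9])
print(stats)
```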

Increase the bitrate to 192k

Input #0, mp3, from 'fire.mp3':
  Metadata:
    encoder         : Lavf57.40.100
  Duration: 00:00:04.10, start: 0.050113, bitrate: 161 kb/s
    Stream #0:0: Audio: mp3, 22050 Hz, stereo, s16p, 160 kb/s

Seems to have helped.

rosa.max=0.999969482422, rosa.min=-1.0, lua.max=0.956068873405, lua.min=-1.0
Path data/fire.mp3: rosa.shape=(90432,), lua.shape=(90432,)
Librosa:
[ 0.          0.          0.         ..., -0.00106812 -0.0007019
  0.00125122]
Torch:
[ 0.          0.          0.         ..., -0.00245564 -0.00034991
  0.00209353]
Round to 4 decimals
Total Diff: 1780.92
Avg Diff: 0.0196935
Max Diff: 0.2994
Min Diff: -0.2904

Convert to mono mode

Input #0, mp3, from 'fire2.mp3':
  Metadata:
    encoder         : Lavf57.40.100
  Duration: 00:00:04.10, start: 0.050113, bitrate: 64 kb/s
    Stream #0:0: Audio: mp3, 22050 Hz, mono, s16p, 64 kb/s

Oh, mono mode is closer.

rosa.max=0.960571289062, rosa.min=-1.0, lua.max=0.943222105503, lua.min=-0.981544613838
Path data/fire2.mp3: rosa.shape=(90432,), lua.shape=(90432,)
Librosa:
[ 0.          0.          0.         ...,  0.00140381  0.00143433
  0.0027771 ]
Torch:
[ 0.          0.          0.         ...,  0.00152187  0.00204326
  0.00288566]
Round to 4 decimals
Total Diff: 1671.49
Avg Diff: 0.0184834
Max Diff: 0.2857
Min Diff: -0.2775

Double the sample rate, in mono mode

Input #0, mp3, from 'fire3.mp3':
  Metadata:
    encoder         : Lavf57.40.100
  Duration: 00:00:04.08, start: 0.025057, bitrate: 64 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, mono, s16p, 64 kb/s

Sample rate is not helping.

rosa.max=0.962280273438, rosa.min=-0.998626708984, lua.max=0.975194513798, lua.min=-0.960788428783
Path data/fire3.mp3: rosa.shape=(179712,), lua.shape=(179712,)
Librosa:
[  0.00000000e+00   0.00000000e+00   0.00000000e+00 ...,   0.00000000e+00
   0.00000000e+00   3.05175781e-05]
Torch:
[  0.00000000e+00   0.00000000e+00   0.00000000e+00 ...,  -9.49576497e-06
   7.80075788e-06   1.77286565e-05]
Round to 4 decimals
Total Diff: 2531.38
Avg Diff: 0.0140858
Max Diff: 0.2518
Min Diff: -0.2508
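
The reported start halves when the sample rate doubles (0.050113 s at 22050 Hz, 0.025057 s at 44100 Hz). Converting both to samples shows that they describe the same offset, which may explain why changing the sample rate does not remove the mismatch. The helper name below is hypothetical, and the notes do not establish where the offset comes from.

```python
def offset_in_samples(start_seconds, sample_rate):
    """Convert a reported start offset in seconds to a whole number of samples."""
    return round(start_seconds * sample_rate)


print(offset_in_samples(0.050113, 22050))  # 22050 Hz files -> 1105
print(offset_in_samples(0.025057, 44100))  # fire3.mp3 at 44100 Hz -> 1105
```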

Trim off the start from 0.023991 by FFMPEG

Input #0, mp3, from 'fire4.mp3':
  Metadata:
    encoder         : Lavf57.40.100
  Duration: 00:00:04.02, start: 0.023991, bitrate: 64 kb/s
    Stream #0:0: Audio: mp3, 22050 Hz, stereo, s16p, 64 kb/s

What is wrong with ffmpeg?

rosa.max=0.999969482422, rosa.min=-1.0, lua.max=1.0, lua.min=-1.0
Path data/fire4.mp3: rosa.shape=(88704,), lua.shape=(88704,)
Librosa:
[ 0.          0.          0.         ...,  0.19934082  0.22732544
  0.21987915]
Torch:
[ 0.          0.          0.         ...,  0.16926192  0.21725897
  0.22740155]
Round to 4 decimals
Total Diff: 1865.31
Avg Diff: 0.0210285
Max Diff: 0.3136
Min Diff: -0.3041

Convert with SoX so that the file starts from 0.000000

Input #0, mp3, from 'fire5.mp3':
  Duration: 00:00:04.13, start: 0.000000, bitrate: 64 kb/s
    Stream #0:0: Audio: mp3, 22050 Hz, stereo, s16p, 64 kb/s

OH MY! I guess this is it!

rosa.max=0.999969482422, rosa.min=-1.0, lua.max=1.0, lua.min=-1.0
Path data/fire5.mp3: rosa.shape=(91008,), lua.shape=(90432,)
Librosa:
[ 0.          0.          0.         ...,  0.00692749 -0.00219727
 -0.00576782]
Torch:
[ 0.          0.          0.         ..., -0.25617316 -0.2670404
 -0.26867551]
Round to 4 decimals
Total Diff: 0.901599
Avg Diff: 9.96991e-06
Max Diff: 0.000100017
Min Diff: -0.000100017
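
To check whether a file still carries a non-zero start offset, the `start:` field can be read from the stream info text shown in each attempt. The sketch below only parses that text. How the text is captured is left out, and the function names and the tolerance are assumptions.

```python
import re

# Matches the "start: 0.050113" field of the Duration line
START_RE = re.compile(r"start:\s*(-?\d+(?:\.\d+)?)")


def parse_start(info_text):
    """Return the start offset in seconds, or None if the field is missing."""
    match = START_RE.search(info_text)
    return float(match.group(1)) if match else None


def has_start_offset(info_text, tol=1e-6):
    """True if the info text reports a start offset larger than tol seconds."""
    start = parse_start(info_text)
    return start is not None and abs(start) > tol


original = "  Duration: 00:00:04.08, start: 0.050113, bitrate: 64 kb/s"
converted = "  Duration: 00:00:04.13, start: 0.000000, bitrate: 64 kb/s"
print(has_start_offset(original), has_start_offset(converted))  # True False
```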