CQT2010 problematic output #85

Closed
KinWaiCheuk opened this issue Mar 1, 2021 · 13 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@KinWaiCheuk
Owner

KinWaiCheuk commented Mar 1, 2021

It seems I messed up something when updating nnAudio from 0.1.15 to 0.2.0.
The output of CQT2010 is very different from CQT2010v2. I suspect something is wrong during downsampling, but I don't have time to debug at the moment, so I am posting this issue to remind me later. If anyone knows the solution to this problem, a pull request is welcome.

[Image: comparison of CQT2010 and CQT2010v2 outputs]

The following code (using nnAudio 0.2.2) reproduces the above-mentioned issue:

import torch
import torch.nn as nn
from torch.nn.functional import conv1d, conv2d, fold, unfold

import numpy as np
from time import time
import math
from scipy.signal import get_window, chirp, sweep_poly
from scipy import signal
from scipy import fft
import warnings
import librosa                    # needed for the librosa.cqt reference below
import matplotlib.pyplot as plt   # needed for the comparison plot below
import nnAudio.Spectrogram as Spectrogram_old
from nnAudio import Spectrogram

# Linear sweep case
fs = 44100
t = 1
f0 = 55
f1 = 22050
s = np.linspace(0, t, fs*t)
x = chirp(s, f0, 1, f1, method='linear')
x = x.astype(dtype=np.float32)
device='cpu'

n_bins = 100
bins_per_octave=12
window = 'hann'
filter_scale = 2
# window='hann'
normalization_type = 'wrap'

# nnAudio CQT2010v2 (reference output)
stft2 = Spectrogram.CQT2010v2(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window)
X2 = stft2(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X2 = torch.log(X2 + 1e-2)

#     np.save("tests/ground-truths/linear-sweep-cqt-2010-mag-ground-truth", X.cpu()) 


X3 = librosa.cqt(x, sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window)
X3 = np.log(abs(X3) + 1e-2)

stft1 = Spectrogram.CQT2010(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window, pad_mode='constant')
X1 = stft1(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X1 = torch.log(X1 + 1e-2)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), dpi=200)
axes[0].imshow(X1[0,:,:], aspect='auto', origin='lower')
axes[0].set_title('CQT2010')
axes[1].imshow(X2[0,:,:], aspect='auto', origin='lower')
axes[1].set_title('CQT2010v2')
# axes[1,0].imshow(X3[:,:], aspect='auto', origin='lower')
@KinWaiCheuk added the bug label Mar 1, 2021
@KinWaiCheuk added the help wanted label Apr 8, 2021
@migperfer
Contributor

Hi! I would like to contribute. Do you think the problem is in CQT2010 or CQT2010v2? I was also thinking of adding a valid option for the padding (i.e. no padding, instead of being forced to pad with either constant or reflect) for those functions, so I guess it would be two birds with one stone.

@KinWaiCheuk
Owner Author

@migperfer Thanks for your interest in contributing! I think the problem is in the downsampling functions inside CQT2010. CQT2010v2 is correct and it does not use any downsampling, so it is very likely that I messed something up in the downsampling function of the CQT2010 class during my previous code refactoring. It would be a great help if you could find and solve the problem!

As for padding, there is a center argument at line 1359 of Spectrogram.py. When center=False, there is no padding. So I guess your valid option would be the same as center=False, if I am not wrong?

@migperfer
Contributor

Aaahh I see, thanks! But there is no center argument for either CQT2010 class, is that right? I guess it is something related to the phase.

Anyway, I think we can add more details to the documentation to specify that no padding is done when center=False. Although I think these should be somewhat independent: you may not want to center your window but still want to pad your audio signal.

@KinWaiCheuk
Owner Author

Aaahh I see, thanks! But there is no center argument for either CQT2010 class, is that right? I guess it is something related to the phase.

Yes, there is no center argument for CQT2010. The reason is that CQT2010v2 is better than CQT2010 in terms of both output quality and speed, so I didn't continue developing CQT2010. But I guess in some cases people might still want to use CQT2010, so it is still worth making it work again.

I am still thinking about whether there is a case where center=False and padding could happen at the same time. Correct me if I am wrong, but it seems center=False already implies no padding, and vice versa.

Consider the following case, where we have an audio clip with 5 time steps [x1, x2, x3, x4, x5], and n_fft=3 (i.e. a 1D convolutional kernel of size 3). If center=False, the kernel will convolve with [x1,x2,x3], [x2,x3,x4], [x3,x4,x5]. Padding does not make sense in this case.

When center=True, in order to place x1 and x5 at the center of the kernel, we need to pad the sequence to [0, x1, x2, x3, x4, x5, 0]. This way, the kernel will convolve with [0, x1, x2], [x1, x2, x3], [x2, x3, x4], [x3, x4, x5], [x4, x5, 0]. Padding is needed in this case.

So it seems the center argument already determines whether padding happens or not. Again, correct me if I am wrong.
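
To make the toy example above concrete, here is a minimal sketch using torch.nn.functional.conv1d with a kernel of size 3 (the kernel values are arbitrary; only the number of output frames matters):

import torch
from torch.nn.functional import conv1d, pad

x = torch.tensor([[[1., 2., 3., 4., 5.]]])  # [x1, ..., x5], shape (batch, channel, time)
kernel = torch.ones(1, 1, 3)                # kernel of size 3, i.e. n_fft=3

# center=False: no padding, only full windows are used -> 3 frames
print(conv1d(x, kernel).shape)              # torch.Size([1, 1, 3])

# center=True: pad one zero on each side so x1 and x5 can sit at the kernel center -> 5 frames
x_padded = pad(x, (1, 1), mode='constant', value=0)
print(conv1d(x_padded, kernel).shape)       # torch.Size([1, 1, 5])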

@migperfer
Contributor

Yes, sorry I wasn't thinking clearly 😅

@KinWaiCheuk
Owner Author

Yes, sorry I wasn't thinking clearly 😅

But anyhow, I would appreciate it if you are willing to fix the downsampling issue in CQT2010.

@migperfer mentioned this issue May 2, 2021
@migperfer
Contributor

Hi! I think it's solved now. I have a few questions though:

  1. What scipy version are you using? Mine is 1.6.2 and I had to change this line to from scipy.fft import fft because the original one wasn't working for me.

  2. I realised that in some lines the dtype is forced to float32. Sometimes I use float16 because it takes less space in memory, and some GPUs are optimized for 16-bit precision. Would you consider adding an argument or something similar for this? I'm not sure if PyTorch can handle this by itself, to be honest, but whenever I tell PyTorch Lightning to use 16-bit precision while using CQTs with forced float32 I get an error related to dtypes.

@KinWaiCheuk
Owner Author

Thanks! I am reviewing your pull request and it seems I simply messed up the sin and cos kernels.

  1. I am using scipy version 1.5.4. I have tested from scipy import fft, and it also works. Then I think we can just change it to from scipy import fft.

  2. Oh, I didn't realize that people would also use float16. Then yes, I think it would be great if we could allow people to choose whether they want to use float16 or float32 (a rough sketch of one possible approach is below).

I will be running the unit test and if everything is okay, I will accept your pull request #94. For the two points above, maybe you can make another pull request later?
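
Regarding point 2, one possible approach is to register the kernels as module buffers so that .half(), .float() and .double() cast them together with the rest of the module, and to cast the input to the buffers' dtype in forward. This is only a sketch of the general pattern (ToyCQT and its kernel shape are made up for illustration), not how nnAudio currently stores its kernels:

import torch
import torch.nn as nn
from torch.nn.functional import conv1d

class ToyCQT(nn.Module):
    """Sketch: kernels stored as buffers follow .half()/.float()/.double()."""
    def __init__(self, kernels):
        # kernels: numpy array of shape (n_bins, 1, kernel_length), typically float64 from scipy
        super().__init__()
        self.register_buffer('kernels', torch.tensor(kernels, dtype=torch.float32))

    def forward(self, x):
        # x: (batch, time); cast the input to whatever dtype the kernels currently have
        x = x.to(self.kernels.dtype).unsqueeze(1)
        return conv1d(x, self.kernels)

With this layout, ToyCQT(kernels).half() runs the convolution in float16, which is roughly what a 16-bit PyTorch Lightning run expects.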

@migperfer
Contributor

Sure! Number 1 is already done; let me experiment with how we can do number 2 without breaking things that currently work.

@migperfer
Contributor

Ok. I did point 2, but I couldn't try it for float16, because PyTorch's 2D float16 convolutions don't work on CPU, and I don't have a GPU on my personal laptop. In case you have a GPU, would you mind if I do the pull request and you try it on yours? It is working for float64, but just to double-check.

The code I used for float16/half-precision:

x = chirp(s, f0, 1, f1, method='linear')
x = x.astype(dtype=np.float16)
device='cpu'

n_bins = 100
bins_per_octave=12
window = 'hann'
filter_scale = 2
# window='hann'
normalization_type = 'wrap'

# Complex
stft2 = Spectrogram.CQT2010v2(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window, pad_mode='constant').half()
X2 = stft2(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X2 = torch.log(X2 + 1e-2)

The code I used for float64/double-precision:

x = chirp(s, f0, 1, f1, method='linear')
x = x.astype(dtype=np.float64)
device='cpu'

n_bins = 100
bins_per_octave=12
window = 'hann'
filter_scale = 2
# window='hann'
normalization_type = 'wrap'

# Complex
stft2 = Spectrogram.CQT2010v2(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window, pad_mode='constant').double()
X2 = stft2(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X2 = torch.log(X2 + 1e-2)

@KinWaiCheuk
Owner Author

would you mind if I do the pull request and you try it on yours?

Definitely, a pull request is welcome. I will try it on my GPU later.

@KinWaiCheuk
Owner Author

  1. What scipy version are you using? Mine is 1.6.2 and I had to change this line to from scipy.fft import fft because the original one wasn't working for me.

Can you check if from scipy.fftpack import fft works in your version of Scipy?
I am thinking about a way for nnAudio to support as many versions of Scipy as possible.

It seems from scipy.fftpack import fft is even better than from scipy.fft import fft, since the latter won't work in very old versions of scipy, say 1.2.0.
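
A version-tolerant import (just a sketch) could also fall back between the two locations:

try:
    from scipy.fftpack import fft  # present in both old and current scipy versions
except ImportError:
    from scipy.fft import fft      # newer module, not available in very old scipy (e.g. 1.2.0)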

I have finished testing your new pull request, everything seems okay so far. The final thing to improve is the scipy import statement.

@KinWaiCheuk
Owner Author

KinWaiCheuk commented May 4, 2021

  1. I realised that in some lines the dtype is forced to float32. Sometimes I use float16 because it takes less space in memory, and some GPUs are optimized for 16-bit precision. Would you consider adding an argument or something similar for this? I'm not sure if PyTorch can handle this by itself, to be honest, but whenever I tell PyTorch Lightning to use 16-bit precision while using CQTs with forced float32 I get an error related to dtypes.

Can you double-check whether PyTorch Lightning runs at 16-bit precision with your modified code? I have just double-checked that even with my old code, I can simply do Spectrogram.CQT2010v2.half() without any problem.

But after your modification, the CFP and Combined_Frequency_Periodicity classes are broken due to a float64/float32 type mismatch between the scipy-generated arrays (float64) and the pytorch tensors (float32).
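
For reference, the usual fix for that kind of mismatch (a sketch of the general pattern, not the actual nnAudio code) is to cast the scipy-generated array explicitly when building the tensor:

import numpy as np
import torch
from scipy.signal import get_window

window = get_window('hann', 2048, fftbins=True)            # scipy returns float64 by default
window_tensor = torch.tensor(window, dtype=torch.float32)  # cast so it matches float32 weights/inputs
# equivalently: torch.tensor(window.astype(np.float32))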
