CQT2010 problematic output #85

Closed
KinWaiCheuk opened this issue Mar 1, 2021 · 13 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@KinWaiCheuk
Owner

KinWaiCheuk commented Mar 1, 2021

It seems I messed up something when updating nnAudio from 0.1.15 to 0.2.0.
The output of CQT2010 is very different from CQT2010v2. I suspect something is wrong during downsampling, but I don't have time to debug at the moment, so I am posting this issue to remind me later. If anyone knows the solution to this problem, a pull request is welcome.

[Image: comparison of CQT2010 and CQT2010v2 outputs]

The following code (using nnAudio 0.2.2) reproduces the above-mentioned issue:

import torch
import torch.nn as nn
from torch.nn.functional import conv1d, conv2d, fold, unfold

import numpy as np
from time import time
import math
from scipy.signal import get_window, chirp, sweep_poly
from scipy import signal
from scipy import fft
import warnings
import librosa                    # needed for the librosa.cqt reference below
import matplotlib.pyplot as plt   # needed for the comparison plot below
import nnAudio.Spectrogram as Spectrogram_old
from nnAudio import Spectrogram

# Linear sweep case
fs = 44100
t = 1
f0 = 55
f1 = 22050
s = np.linspace(0, t, fs*t)
x = chirp(s, f0, 1, f1, method='linear')
x = x.astype(dtype=np.float32)
device='cpu'

n_bins = 100
bins_per_octave=12
window = 'hann'
filter_scale = 2
# window='hann'
normalization_type = 'wrap'

# nnAudio CQT2010v2 (reference output)
stft2 = Spectrogram.CQT2010v2(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window)
X2 = stft2(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X2 = torch.log(X2 + 1e-2)

#     np.save("tests/ground-truths/linear-sweep-cqt-2010-mag-ground-truth", X.cpu()) 


X3 = librosa.cqt(x, sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window)
X3 = np.log(abs(X3) + 1e-2)

stft1 = Spectrogram.CQT2010(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window, pad_mode='constant')
X1 = stft1(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X1 = torch.log(X1 + 1e-2)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), dpi=200)
axes[0].imshow(X1[0,:,:], aspect='auto', origin='lower')
axes[0].set_title('CQT2010')
axes[1].imshow(X2[0,:,:], aspect='auto', origin='lower')
axes[1].set_title('CQT2010v2')
# axes[1,0].imshow(X3[:,:], aspect='auto', origin='lower')
@KinWaiCheuk added the bug label Mar 1, 2021
@KinWaiCheuk added the help wanted label Apr 8, 2021
@migperfer
Contributor

Hi! I would like to contribute. Do you think the problem is in CQT2010 or CQT2010v2? I was also thinking of adding a valid option for the padding (i.e. no padding, instead of being forced to pad with either constant or reflect) for those functions, so I guess it would be two birds with one stone.

@KinWaiCheuk
Owner Author

@migperfer Thanks for your interest in contributing! I think the problem is in the downsampling functions inside CQT2010. CQT2010v2 is correct and it does not use any downsampling, so it is very likely that I messed something up in the downsampling function of the CQT2010 class during my previous code refactoring. It would be a great help if you could find and solve the problem!

As for padding, there is a center argument at line 1359 of Spectrogram.py. When center=False, there is no padding. So I guess your valid option would be the same as center=False, if I am not wrong?

@migperfer
Contributor

Aaahh I see, thanks! But there is no center argument for either CQT2010 class, is that right? I guess it is something related to the phase.

Anyway, I think we can add more details to the documentation to specify that no padding is done when center=False. Although I think these should be somewhat independent: you may not want to center your window but still want to pad your audio signal.

@KinWaiCheuk
Owner Author

Aaahh I see, thanks! But there is no center argument for either CQT2010 class, is that right? I guess it is something related to the phase.

Yes, there is no center argument for CQT2010. The reason is that CQT2010v2 is better than CQT2010 in terms of both output quality and speed, so I didn't continue developing CQT2010. But I guess in some cases people might still want to use CQT2010, so it is still worth making it work again.

I am still thinking about whether there is a case where center=False and padding could happen at the same time. Correct me if I am wrong, but it seems center=False already implies no padding, and vice versa.

Consider the following case, where we have an audio clip with 5 time steps [x1, x2, x3, x4, x5], and n_fft=3 (i.e. a 1D convolutional kernel of size 3). If center=False, the kernel will convolve with [x1,x2,x3], [x2,x3,x4], [x3,x4,x5]. Padding does not make sense in this case.

When center=True, in order to place x1 and x5 at the center of the kernel, we need to pad the sequence to [0, x1, x2, x3, x4, x5, 0]. This way, the kernel will convolve with [0, x1, x2], [x1, x2, x3], [x2, x3, x4], [x3, x4, x5], [x4, x5, 0]. Padding is needed in this case.

So it seems the center argument already determines whether padding happens or not. Again, correct me if I am wrong.
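
To make the toy example above concrete, here is a minimal sketch using torch.nn.functional.conv1d with a kernel of size 3 (the kernel values are arbitrary; only the number of output frames matters):

import torch
from torch.nn.functional import conv1d, pad

x = torch.tensor([[[1., 2., 3., 4., 5.]]])  # [x1, ..., x5], shape (batch, channel, time)
kernel = torch.ones(1, 1, 3)                # kernel of size 3, i.e. n_fft=3

# center=False: no padding, only full windows are used -> 3 frames
print(conv1d(x, kernel).shape)              # torch.Size([1, 1, 3])

# center=True: pad one zero on each side so x1 and x5 can sit at the kernel center -> 5 frames
x_padded = pad(x, (1, 1), mode='constant', value=0)
print(conv1d(x_padded, kernel).shape)       # torch.Size([1, 1, 5])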

@migperfer
Contributor

Yes, sorry I wasn't thinking clearly 😅

@KinWaiCheuk
Owner Author

Yes, sorry I wasn't thinking clearly 😅

But anyhow, I would appreciate it if you are willing to fix the downsampling issue in CQT2010.

@migperfer mentioned this issue May 2, 2021
@migperfer
Contributor

Hi! I think it's solved now. I have a few questions though:

  1. What scipy version are you using? Mine is 1.6.2 and I had to change this line to from scipy.fft import fft because the original one wasn't working for me.

  2. I realised that in some lines the dtype is forced to float32. Sometimes I use float16 because it takes less space in memory, and some GPUs are optimized for 16-bit precision. Would you consider adding an argument or something similar for this? I'm not sure if PyTorch can handle this by itself, to be honest, but whenever I tell PyTorch Lightning to use 16-bit precision while using CQTs with forced float32 I get an error related to dtypes.

@KinWaiCheuk
Owner Author

Thanks! I am reviewing your pull request and it seems I simply messed up the sin and cos kernels.

  1. I am using scipy version 1.5.4. I have tested from scipy import fft, and it also works. Then I think we can just change it to from scipy import fft.

  2. Oh, I didn't realize that people would also use float16. Then yes, I think it would be great if we could allow people to choose whether they want to use float16 or float32 (a rough sketch of one possible approach is below).

I will be running the unit test and if everything is okay, I will accept your pull request #94. For the two points above, maybe you can make another pull request later?
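
Regarding point 2, one possible approach is to register the kernels as module buffers so that .half(), .float() and .double() cast them together with the rest of the module, and to cast the input to the buffers' dtype in forward. This is only a sketch of the general pattern (ToyCQT and its kernel shape are made up for illustration), not how nnAudio currently stores its kernels:

import torch
import torch.nn as nn
from torch.nn.functional import conv1d

class ToyCQT(nn.Module):
    """Sketch: kernels stored as buffers follow .half()/.float()/.double()."""
    def __init__(self, kernels):
        # kernels: numpy array of shape (n_bins, 1, kernel_length), typically float64 from scipy
        super().__init__()
        self.register_buffer('kernels', torch.tensor(kernels, dtype=torch.float32))

    def forward(self, x):
        # x: (batch, time); cast the input to whatever dtype the kernels currently have
        x = x.to(self.kernels.dtype).unsqueeze(1)
        return conv1d(x, self.kernels)

With this layout, ToyCQT(kernels).half() runs the convolution in float16, which is roughly what a 16-bit PyTorch Lightning run expects.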

@migperfer
Contributor

Sure! Number 1 is already done; let me experiment with how we can do number 2 without breaking things that currently work.

@migperfer
Contributor

Ok. I did point 2, but I couldn't try it for float16, because PyTorch's 2D float16 convolutions don't work on CPU, and I don't have a GPU on my personal laptop. In case you have a GPU, would you mind if I do the pull request and you try it on yours? It is working for float64, but just to double-check.

The code I used for float16/half-precision:

x = chirp(s, f0, 1, f1, method='linear')
x = x.astype(dtype=np.float16)
device='cpu'

n_bins = 100
bins_per_octave=12
window = 'hann'
filter_scale = 2
# window='hann'
normalization_type = 'wrap'

# Complex
stft2 = Spectrogram.CQT2010v2(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window, pad_mode='constant').half()
X2 = stft2(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X2 = torch.log(X2 + 1e-2)

The code I used for float64/double-precision:

x = chirp(s, f0, 1, f1, method='linear')
x = x.astype(dtype=np.float64)
device='cpu'

n_bins = 100
bins_per_octave=12
window = 'hann'
filter_scale = 2
# window='hann'
normalization_type = 'wrap'

# Complex
stft2 = Spectrogram.CQT2010v2(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window, pad_mode='constant').double()
X2 = stft2(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X2 = torch.log(X2 + 1e-2)

@KinWaiCheuk
Owner Author

would you mind if I do the pull request and you try it on yours?

Definitely, a pull request is welcome. I will try it on my GPU later.

@KinWaiCheuk
Owner Author

  1. What scipy version are you using? Mine is 1.6.2 and I had to change this line to from scipy.fft import fft because the original one wasn't working for me.

Can you check if from scipy.fftpack import fft works in your version of Scipy?
I am thinking about a way for nnAudio to support as many versions of Scipy as possible.

It seems from scipy.fftpack import fft is even better than from scipy.fft import fft, since the latter won't work in very old versions of scipy, say 1.2.0.
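
A version-tolerant import (just a sketch) could also fall back between the two locations:

try:
    from scipy.fftpack import fft  # present in both old and current scipy versions
except ImportError:
    from scipy.fft import fft      # newer module, not available in very old scipy (e.g. 1.2.0)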

I have finished testing your new pull request, everything seems okay so far. The final thing to improve is the scipy import statement.

@KinWaiCheuk
Owner Author

KinWaiCheuk commented May 4, 2021

  1. I realised that in some lines the dtype is forced to float32. Sometimes I use float16 because it takes less space in memory, and some GPUs are optimized for 16-bit precision. Would you consider adding an argument or something similar for this? I'm not sure if PyTorch can handle this by itself, to be honest, but whenever I tell PyTorch Lightning to use 16-bit precision while using CQTs with forced float32 I get an error related to dtypes.

Can you double-check whether PyTorch Lightning runs at 16-bit precision with your modified code? I have just double-checked that even with my old code, I can simply do Spectrogram.CQT2010v2.half() without any problem.

But after your modification, the CFP and Combined_Frequency_Periodicity classes are broken due to a float64/float32 type mismatch between the scipy-generated arrays (float64) and the pytorch tensors (float32).
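
For reference, the usual fix for that kind of mismatch (a sketch of the general pattern, not the actual nnAudio code) is to cast the scipy-generated array explicitly when building the tensor:

import numpy as np
import torch
from scipy.signal import get_window

window = get_window('hann', 2048, fftbins=True)            # scipy returns float64 by default
window_tensor = torch.tensor(window, dtype=torch.float32)  # cast so it matches float32 weights/inputs
# equivalently: torch.tensor(window.astype(np.float32))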
