Inference with noisy source #21
Comments
The normalization may have amplified the noise slightly, but the point of the log mel spectrogram is the opposite: it tries to emphasize the speech rather than the noise. The arbitrary mean and standard deviation may have some side effects, but if you make the model noise-robust during training, it should have no problem taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, the model should be able to deal with noisy input. Just make sure you separate your |
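To make the effect of that normalization line concrete, here is a minimal pure-Python sketch that applies it to single mel-bin energies. The constants mean = -4 and std = 4 and the three example energies are assumptions chosen for illustration; the thread only describes the mean and standard deviation as arbitrary.

```python
import math

# Assumed illustrative constants; the thread only calls them "arbitrary".
MEAN = -4.0
STD = 4.0


def normalize_log_mel(energy: float, mean: float = MEAN, std: float = STD) -> float:
    """Apply (log(1e-5 + energy) - mean) / std to one mel-bin energy."""
    return (math.log(1e-5 + energy) - mean) / std


# Hypothetical energies: a loud speech bin, a quiet noise bin, digital silence.
speech_bin = normalize_log_mel(1.0)
noise_bin = normalize_log_mel(1e-3)
silence_bin = normalize_log_mel(0.0)

# Ratio of speech to noise energy on a linear scale, and the gap that
# remains between the same two bins after log compression and scaling.
linear_ratio = 1.0 / 1e-3
normalized_gap = speech_bin - noise_bin
```

Under these assumptions, a 1000x energy ratio shrinks to a gap of less than 2 units after normalization, so a quiet noise bin sits much closer to a speech bin than it does on a linear scale. Whether that actually hurts conversion quality is a separate question that this sketch does not settle.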
@yl4579 Hi. Thank you for your reply. |
@Charlottecuc Sorry for the late reply. I was pretty busy at the end of the year. You can make all |
@yl4579 Thank you for your reply. Just to make sure: by "making the model noise-robust during training", do you mean corrupting only the inputs of the generator's cycle-consistency loss, or corrupting the inputs of the whole adversarial training process (e.g. adding something like a "denoising loss" so that the discriminator can classify clean versus noisy inputs and force the generator to produce clean outputs)? Could you give more details? Thank you very much. |
@Charlottecuc Sorry for the late reply; this issue was closed, so I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder while asking the model to reconstruct the clean (uncorrupted) version. |
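A minimal sketch of that pairing, assuming the corruption is applied to the waveform before feature extraction: the model input is sometimes corrupted, while the reconstruction target always stays clean. The names `make_training_pair` and `add_hum` are hypothetical, and the default probability of 0.3 is borrowed from the code posted later in this thread rather than from the author.

```python
import random
from typing import Callable, List, Optional, Tuple

Wave = List[float]  # plain list of samples, standing in for an array or tensor


def make_training_pair(
    clean: Wave,
    corrupt: Callable[[Wave], Wave],
    p: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Tuple[Wave, Wave]:
    """Return (model_input, target); only the input is ever corrupted."""
    rng = rng or random.Random()
    if rng.random() < p:
        model_input = corrupt(list(clean))
    else:
        model_input = list(clean)
    # The target is always the clean signal, so the loss rewards denoising.
    return model_input, list(clean)


def add_hum(wave: Wave) -> Wave:
    """Hypothetical corruption: a constant offset standing in for real noise."""
    return [v + 0.05 for v in wave]


clean_wave = [0.0, 0.1, -0.1, 0.2]
noisy_input, clean_target = make_training_pair(clean_wave, add_hum, p=1.0)
```

In a real pipeline `corrupt` would be an augmentation chain such as the Audiomentations one mentioned above, and both outputs would be converted to mel spectrograms before reaching the model.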
mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std |
@yl4579
|
@skol101 You need to pass in a noisy version here, call it |
I see, because I thought reverb and noise should be added right in StyleEncoder as per #6 (comment) |
@Charlottecuc @yl4579 Are the noisy inputs added only when training the generator, or both the generator and the discriminator? Thank you! |
The style encoder is called several times in the generator, but the only time it is called with the x_real param is in the cycle-consistency loss. So I guess that's where x_input should be used (but only 30% of the time). What do you think @Charlottecuc ? |
I'm either doing something wrong or adding reverbs and background noises does nothing. When the source (like VCTK p303_013.wav) has breathing, the converted speech has distortions. Maybe the issue is with the HifiGan vocoder, and I shall try a vocoder more tolerant of breathing/noises.

    # cycle-consistency loss
    s_org = nets.style_encoder(x_input, y_org)
    x_rec = nets.generator(x_fake, s_org, masks=None, F0=GAN_F0_fake)
    loss_cyc = torch.mean(torch.abs(x_rec - x_real))

In meldataset.py

    def __getitem__(self, idx):
        data = self.data_list[idx]
        mel_tensor, label = self._load_data(data)
        ref_data = random.choice(self.data_list)
        ref_mel_tensor, ref_label = self._load_data(ref_data)
        ref2_data = random.choice(self.data_list_per_class[ref_label])
        ref2_mel_tensor, _ = self._load_data(ref2_data)
        x_input, _ = self._load_data(data, True)  # x_input is the same as mel_tensor (aka x_real) but with augmenter corruptions
        return mel_tensor, label, ref_mel_tensor, ref2_mel_tensor, ref_label, x_input

    def _load_tensor(self, data, corrupt_x_input=False):
        wave_path, label = data
        label = int(label)
        wave, sr = sf.read(wave_path)
        if corrupt_x_input and random.uniform(0, 1) <= 0.3:
            augmenter = Compose(
                [
                    RoomSimulator(
                        p=0.8,
                        leave_length_unchanged=True,
                    ),
                    AddBackgroundNoise(
                        sounds_path=BACKGROUND_NOISE_FILES,
                        min_snr_in_db=20,
                        max_snr_in_db=35,
                        p=0.5,
                    )
                ]
            )
            try:
                wave = augmenter(samples=wave, sample_rate=sr)
            except IndexError as error:
                print('error index error with wav file', wave_path)
            except ValueError as errorValue:
                print('error value error with wav file', wave_path)
        wave_tensor = torch.from_numpy(wave).float()
        return wave_tensor, label
|
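For readers unsure what the min_snr_in_db / max_snr_in_db range above implies, here is a minimal sketch of mixing noise into speech at a target signal-to-noise ratio. It is a plain-Python illustration of the idea, not the Audiomentations implementation; `rms` and `mix_at_snr` are hypothetical helpers, the signals are synthetic, and both signals are assumed to have the same length.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech_rms / noise_rms matches snr_db, then add it."""
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(speech)  # nothing to add
    # SNR in dB is 20 * log10(speech_rms / noise_rms) for amplitude levels.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


rng = random.Random(0)
speech = [math.sin(0.01 * i) for i in range(4000)]  # synthetic stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]  # synthetic background noise
snr_db = rng.uniform(20.0, 35.0)  # same range as the augmenter settings above
mixed = mix_at_snr(speech, noise, snr_db)
```

At 20 dB the noise amplitude is a tenth of the speech amplitude, and at 35 dB it is under 2%, so this range is fairly mild compared with the phone recordings described in the original report.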
@Charlottecuc this issue should be reopened to discuss further. |
Wow, is this really a mystery @yl4579 ? |
I don't think it's a good idea to add data augmentation in the style encoder, since the source audio does not flow into the style encoder at inference time. |
The issue was closed by @yl4579 , and I am not able to reopen it. |
Training a denoising HiFi-GAN cannot substantially improve the results for this issue, because if you look at the mel-spectrograms, you can see that some parts are vague and unclear when the quality of the source wave is low. |
Here it was reported that adding reverb/background noise did help: #6 (comment). Maybe the solution is to denoise the input wave before inference, so something like the Facebook denoiser (https://github.com/facebookresearch/denoiser) could be used, although this suggestion points to using a noise-trained vocoder instead: #6 (comment) |
If you would like to train an any-to-any model, then adding data augmentation to the style encoder will help. |
@Charlottecuc Sorry, I'm pretty busy with my other paper submissions so I can't join the discussion at this point, but I have reopened the issue for further discussion and will provide some feedback after I finish my work. |
@Charlottecuc I do have some time now to discuss this problem. I have noticed similar problems with noisy input and have not yet come up with a good solution. The major problem with the GAN-based model is that it is difficult to design denoising loss functions, because the target is not as clear as in PPG- or TTS-based VC models (where you have an L1 reconstruction loss directly). Not sure if you have found a good solution to this problem, but I would suggest adding noise in the time-frequency domain by inverting the mel scale and then recomputing the mel spectrogram (or you can train a model end-to-end if you prefer). The key here is to add noise to the converted speech and force the model to convert the converted speech back to the clean output. One problem I noticed is that even if you add noise to the input during training, the model sometimes does not produce good converted examples. It somehow finds a way to trick the loss function so that the converted speech is not clear, but the second conversion back to the source domain works quite well, so the cycle-consistency loss is still low. Adding noise to the converted results forces the model to denoise the noisy speech directly. Another way is to add a denoising loss directly, where the input is noisy speech with the source style vector and the output is clean speech. This might make the model overfit, however, so the converted speech might not sound similar to the target. This is in general a challenge in this field and there's still a lot of work to be done. |
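The structure of that suggestion can be sketched as a cycle-consistency loss in which noise is injected into the converted features before they are mapped back. This is a framework-free illustration under simplifying assumptions: features are flat lists of floats, the noise is additive Gaussian applied directly to those features rather than via an inverted mel scale, and `noisy_cycle_loss` and `identity_generator` are hypothetical names.

```python
import random
from typing import Callable, List

Features = List[float]  # flattened mel features, standing in for a tensor
Generator = Callable[[Features, float], Features]  # (features, style) -> features


def l1_mean(a: Features, b: Features) -> float:
    """Mean absolute error between two equally sized feature lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def noisy_cycle_loss(
    generator: Generator,
    x_real: Features,
    s_org: float,
    s_trg: float,
    noise_std: float,
    rng: random.Random,
) -> float:
    """Convert to the target style, add noise, convert back, compare to clean."""
    x_fake = generator(x_real, s_trg)
    # Assumed simplification: additive Gaussian noise on the features themselves.
    x_fake_noisy = [v + rng.gauss(0.0, noise_std) for v in x_fake]
    x_rec = generator(x_fake_noisy, s_org)
    # The target is the clean source, so the way back must also denoise.
    return l1_mean(x_rec, x_real)


def identity_generator(x: Features, style: float) -> Features:
    """Toy stand-in that ignores the style and cannot denoise."""
    return list(x)


x = [0.1, -0.2, 0.3, 0.4]
loss_clean = noisy_cycle_loss(identity_generator, x, 0.0, 1.0, 0.0, random.Random(0))
loss_noisy = noisy_cycle_loss(identity_generator, x, 0.0, 1.0, 0.5, random.Random(0))
```

With the toy generator the loss is zero without noise and positive with it, which is the point of the design: a generator that cannot remove the injected noise can no longer keep the cycle-consistency loss low.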
This does a pretty good job of removing noises from the speech https://github.com/Rikorose/DeepFilterNet |
Another approach that works is to first train the model on a clean dataset and, once the model is trained, freeze the model parameters and add two enhancement blocks to the encoder and the style encoder to enhance the noisy voice. You can refer to our paper https://arxiv.org/pdf/2210.11096.pdf, which shows the figures and results on distorted/noisy samples using the StarGANv2-vc model architecture. |
Hi @mayank-git-hub , I have a similar application to your idea: I want to convert whispered or distorted speech. I do not have much knowledge of fine-tuning models; can you help me out? |
Hi. I tested the model with various kinds of wave files as the source. I noticed that at inference time the model performs well with clean source files, but for less clean audio (e.g. 24 kHz speech recorded on a mobile phone with air-conditioning noise in the background, or heavy breathing, which is quite common in real-life applications), the converted speech is sometimes incomprehensible and usually contains annoying noise.
I also tried denoising these noisy source files (e.g. using Audition or other speech enhancement tools), but the converted speech became even worse.
Besides, do you think this line
mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std
to some extent enlarges the noise? Could you please share some ideas for making the model more robust to noisy data? Thank you very much.