Hou-Ning Hu edited this page Mar 26, 2017 · 1 revision

Welcome to the SoundNet-tensorflow wiki!

When a Torch model is read with torchfile, the dict behind a loaded object can be fetched through instance._obj

mydict = o['modules'][0]._obj
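
When several nested objects need unwrapping, the same `_obj` access can be applied recursively. The sketch below is only an illustration and is not part of torchfile: `unwrap` and `FakeTorchObject` are hypothetical names, and the stand-in class models nothing but the `_obj` attribute so that the example runs without a Torch file.

```python
def unwrap(node):
    """Recursively replace objects exposing `_obj` with the value they wrap."""
    if hasattr(node, "_obj"):
        return unwrap(node._obj)
    if isinstance(node, dict):
        return {key: unwrap(value) for key, value in node.items()}
    if isinstance(node, (list, tuple)):
        return [unwrap(item) for item in node]
    return node


class FakeTorchObject:
    """Hypothetical stand-in for a loaded object; only `_obj` is modelled."""

    def __init__(self, obj):
        self._obj = obj


# Hypothetical nested structure mimicking o['modules'][0]._obj.
layer = FakeTorchObject({"weight": [1.0, 2.0], "bias": [0.5]})
model = FakeTorchObject({"modules": [layer]})

plain = unwrap(model)
mydict = plain["modules"][0]
print(sorted(mydict))
```

After unwrapping, `plain` contains only ordinary dicts and lists, so the parameters of each layer can be read without touching `_obj` again.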

Assign a value to a variable and make it trainable

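def get_var(self, initial_value, name, idx, var_name):
    # Use the pretrained value when one was loaded for this layer;
    # otherwise fall back to the supplied initial value.
    if self.data_dict is not None and name in self.data_dict:
        value = self.data_dict[name][idx]
    else:
        value = initial_value

    # A trainable model wraps the value in a variable; otherwise it is a constant.
    if self.trainable:
        var = tf.Variable(value, name=var_name)
    else:
        var = tf.constant(value, dtype=tf.float32, name=var_name)

    # Keep a handle on the tensor, keyed by layer name and parameter index.
    self.var_dict[(name, idx)] = var

    # print(var_name, var.get_shape().as_list())
    # The loaded value must have the same shape as the initial value.
    assert var.get_shape() == initial_value.get_shape()

    return var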

Load audio using librosa and torch audio

librosa.core.load(path, sr=22050, mono=True, offset=0.0, duration=None, dtype=<class 'numpy.float32'>, res_type='kaiser_best')
# By default, librosa resamples the signal to 22050 Hz and returns values in the range (-1., 1.).
# To load the signal at its raw sample rate, set sr=None.
# To keep stereo (two channels), set mono=False.
sound_sample, sr = librosa.load(audio_path, sr=None, mono=False)
loads an audio file into a torch.Tensor
usage:
audio.load(
           string                              -- path to file
            )

returns torch.Tensor of size NSamples x NChannels, sample_rate
-- By default, audio loads the signal at its raw sample rate, in stereo (two channels), with values in the range (-2^31, 2^31)
sound_sample, sample_rate = audio.load(audio_path)
if sound_sample:size(2) > 1 then sound_sample = sound_sample:select(2,1):clone() end -- select first channel
sound_sample:mul(2^-31)                                         -- make range [-1, 1]
  • NOTE: To keep the value differences between the two loaders small, re-encode every mp3 with sox input.mp3 output.mp3 trim 0. The differences are mainly caused by the loaders' different reading patterns, so a better solution is to convert the files to wav. For more comparisons, please refer to info.md
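
To compare the two loaders on the same footing, the torch-style samples first need the channel selection and 2^-31 scaling shown in the Lua snippet. The following is a minimal pure-Python sketch of that step plus a simple difference measure. The function names and the sample values are hypothetical and stand in for real decoded audio.

```python
INT32_SCALE = 2.0 ** -31  # same factor as sound_sample:mul(2^-31)


def first_channel(samples):
    """Return channel 0 from rows laid out as NSamples x NChannels."""
    return [row[0] for row in samples]


def to_unit_range(samples, scale=INT32_SCALE):
    """Scale raw integer-range samples into the [-1, 1] range."""
    return [s * scale for s in samples]


def max_abs_diff(a, b):
    """Largest absolute per-sample difference between two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


# Hypothetical values standing in for decoded audio.
torch_style = [[2 ** 30, 0], [-2 ** 31, 5], [0, 7]]  # NSamples x NChannels
librosa_style = [0.5, -1.0, 0.0]                     # mono, already in [-1, 1]

mono = to_unit_range(first_channel(torch_style))
print(mono)
print(max_abs_diff(mono, librosa_style))
```

A nonzero `max_abs_diff` on real files would indicate the kind of decoder mismatch the note describes.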