Spectrogram example notebook #181

Merged
merged 9 commits into from
Aug 28, 2024
281 changes: 281 additions & 0 deletions docs/examples/notebooks/spectrogram.ipynb
@@ -0,0 +1,281 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c28810ab",
"metadata": {},
"source": [
"# Audio spectrogram in rocAL \n",
"\n",
"This example presents a simple rocAL pipeline that loads and decodes audio data along with the calculation of a spectrogram. We use MIVISIONX-data which contains audio data samples in wav format. Illustrated below how to create a pipeline, set_outputs, build, run the pipeline and enumerate over the results.\n",
"\n",
"## Reference implementation\n",
"\n",
"To verify the correctness of rocAL's implementation, we will compare it against librosa. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "862b24ff",
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import os\n",
"%matplotlib inline\n",
"\n",
"import librosa.display\n",
"import librosa as librosa\n",
"\n",
"import numpy as np\n",
"import torch\n",
"torch.set_printoptions(threshold=10_000)\n",
"\n",
"import amd.rocal.types as types\n",
"import amd.rocal.fn as fn\n",
"from amd.rocal.pipeline import Pipeline, pipeline_def\n",
"from amd.rocal.plugin.pytorch import ROCALAudioIterator"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aef16554",
"metadata": {},
"outputs": [],
"source": [
"def show_spectrogram(spec, title, sr, hop_length, y_axis='log', x_axis='time'):\n",
" librosa.display.specshow(\n",
" spec, sr=16000, y_axis=y_axis, x_axis=x_axis, hop_length=hop_length)\n",
" plt.title(title)\n",
" plt.colorbar(format='%+2.0f dB')\n",
" plt.tight_layout()\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"id": "0a7ab152",
"metadata": {},
"source": [
"## Librosa implementation\n",
"Librosa provides an API to calculate the STFT, producing a complex output (i.e. complex numbers). It is then trivial to calculate the power spectrum from the complex STFT by the following.\n",
"Here we load and decoder the audio file and applied spectrogram to it using librosa.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84f22a01",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Set the ROCAL_DATA_PATH env variable before running the botebook\n",
"rocal_audio_data_path = os.path.join(os.environ['ROCAL_DATA_PATH'], \"rocal_data\", \"audio\")\n",
"data_path = f\"{rocal_audio_data_path}/wav/19-198-0000.wav\"\n",
"\n",
"y, sr = librosa.load(data_path, sr=16000)\n",
"\n",
"# Size of the FFT, which will also be used as the window length\n",
"n_fft = 2048\n",
"\n",
"# Step or stride between windows. If the step is smaller than the window length, the windows will overlap\n",
"hop_length = 512\n",
"\n",
"# Calculate the spectrogram as the square of the complex magnitude of the STFT\n",
"spectrogram_librosa = np.abs(librosa.stft(\n",
" y, n_fft=n_fft, hop_length=hop_length, win_length=n_fft, window='hann', pad_mode='reflect')) ** 2\n",
"\n",
"# We can now transform the spectrogram output to a logarithmic scale by transforming the amplitude to decibels.\n",
"spectrogram_librosa_db = librosa.power_to_db(spectrogram_librosa, ref=np.max)\n",
"\n",
"# The last step is to display the spectrogram\n",
"show_spectrogram(spectrogram_librosa_db,\n",
" 'Reference power spectrogram', sr, hop_length)"
]
},
{
"cell_type": "markdown",
"id": "5265f98f",
"metadata": {},
"source": [
"## Configuring rocAL pipeline\n",
"Configure the pipeline paramters as required by the user."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "437f4ae7",
"metadata": {},
"outputs": [],
"source": [
"file_list = f\"{rocal_audio_data_path}/wav_file_list.txt\"\n",
"seed = 1000\n",
"nfft = 2048\n",
"window_length = 2048\n",
"window_step = 512\n",
"num_shards = 1\n",
"rocal_cpu = True\n",
"\n",
"audio_pipeline = Pipeline(\n",
" batch_size=1, num_threads=8, rocal_cpu=rocal_cpu)"
]
},
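{
"cell_type": "markdown",
"id": "7f3a9c21",
"metadata": {},
"source": [
"As a quick sanity check, we can confirm that these values match the STFT parameters used in the librosa reference above. The mapping assumed here is nfft to n_fft, window_length to win_length, and window_step to hop_length."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c4b2d55",
"metadata": {},
"outputs": [],
"source": [
"# Assumed mapping between rocAL's spectrogram parameters and librosa's STFT parameters:\n",
"# nfft <-> n_fft, window_length <-> win_length, window_step <-> hop_length\n",
"assert nfft == n_fft\n",
"assert window_length == n_fft  # win_length was set to n_fft in the librosa cell\n",
"assert window_step == hop_length"
]
},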
{
"cell_type": "markdown",
"id": "d4e93277",
"metadata": {},
"source": [
"## Audio pipeline \n",
"Here we use the file reader followed by audio decoder. Then the decoded audio data is passed to spectrogram. We enable the output for spectrogram using set_output"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b782404",
"metadata": {},
"outputs": [],
"source": [
"with audio_pipeline:\n",
" audio, labels = fn.readers.file(file_root=rocal_audio_data_path, file_list=file_list)\n",
" decoded_audio = fn.decoders.audio(\n",
" audio,\n",
" file_root=rocal_audio_data_path,\n",
" file_list_path=file_list,\n",
" downmix=False,\n",
" shard_id=0,\n",
" num_shards=1,\n",
" stick_to_shard=False)\n",
" spec = fn.spectrogram(\n",
" decoded_audio,\n",
" nfft=2048,\n",
" window_length=2048,\n",
" window_step=512,\n",
" output_dtype=types.FLOAT)\n",
" audio_pipeline.set_outputs(spec)"
]
},
{
"cell_type": "markdown",
"id": "a51c23c4",
"metadata": {},
"source": [
"## Building the Pipeline\n",
"Here we are creating the pipeline. In order to use our Pipeline, we need to build it. This is achieved by calling the build function. Then iterator object is created with ROCALAudioIterator(audio_pipeline)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a21212a",
"metadata": {},
"outputs": [],
"source": [
"audio_pipeline.build()\n",
"audioIteratorPipeline = ROCALAudioIterator(audio_pipeline)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4688be0",
"metadata": {},
"outputs": [],
"source": [
"for i, output_list in enumerate(audioIteratorPipeline):\n",
" for x in range(len(output_list[0])):\n",
" for audio_tensor, label, roi in zip(output_list[0][x], output_list[1], output_list[2]):\n",
" print(\"Audio shape\", audio_tensor.shape)\n",
" print(\"Label\", label)\n",
" print(\"Roi\", roi)\n",
"audioIteratorPipeline.reset()"
]
},
{
"cell_type": "markdown",
"id": "bc4d1c4e",
"metadata": {},
"source": [
"## Visualizing outputs\n",
"\n",
"We have plotted the output of the spectrogram to visually compare it with librosa output."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd6a096c",
"metadata": {},
"outputs": [],
"source": [
"for i, it in enumerate(audioIteratorPipeline):\n",
" output = it[0]\n",
" # Augmentation outputs are stored in list[(batch_size, output_shape)] so we index to get each output\n",
" spec_output = output[0][0].numpy()\n",
" roi = it[2][0].numpy()\n",
" # We slice the padded output using the ROI dimensions\n",
" spec_roi_output = spec_output[:roi[0], :roi[1]]\n",
" spectrogram_db = librosa.power_to_db(spec_roi_output, ref=np.max)\n",
" show_spectrogram(spectrogram_db, ' rocal spectrogram', 16000, hop_length)\n",
"audioIteratorPipeline.reset()"
]
},
{
"cell_type": "markdown",
"id": "0c27ad7a",
"metadata": {},
"source": [
"As a last check, we can verify that the numerical difference between the reference implementation and rocAL's is insignificant"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0e27ec9",
"metadata": {},
"outputs": [],
"source": [
"output, label, roi_tensor = next(audioIteratorPipeline)\n",
"# Augmentation outputs are stored in list[(batch_size, output_shape)] so we index to get each output\n",
"spec_output = output[0][0].numpy()\n",
"roi = roi_tensor[0].numpy()\n",
"# We slice the padded output using the ROI dimensions\n",
"spec_roi_output = spec_output[:roi[0], :roi[1]]\n",
"spectrogram_db = librosa.power_to_db(spec_roi_output, ref=np.max)\n",
"print(\"Average error: {0:.5f} dB\".format(\n",
" np.mean(np.abs(spectrogram_db - spectrogram_librosa_db))))\n",
"assert (np.allclose(spectrogram_db, spectrogram_librosa_db, atol=2))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}