From 763b3c8cc0a8b29d482af984a4ed941a1bf1cb1b Mon Sep 17 00:00:00 2001 From: selamw1 Date: Mon, 2 Dec 2024 15:34:19 -0800 Subject: [PATCH 01/14] md_and_ipynb_files_paired --- docs/data_loaders_on_cpu_with_jax.ipynb | 3570 +++++++++++++++++++++++ docs/data_loaders_on_cpu_with_jax.md | 685 +++++ 2 files changed, 4255 insertions(+) create mode 100644 docs/data_loaders_on_cpu_with_jax.ipynb create mode 100644 docs/data_loaders_on_cpu_with_jax.md diff --git a/docs/data_loaders_on_cpu_with_jax.ipynb b/docs/data_loaders_on_cpu_with_jax.ipynb new file mode 100644 index 0000000..21bd599 --- /dev/null +++ b/docs/data_loaders_on_cpu_with_jax.ipynb @@ -0,0 +1,3570 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "PUFGZggH49zp" + }, + "source": [ + "# Introduction to Data Loaders on CPU with JAX" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3ia4PKEV5Dr8" + }, + "source": [ + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jax-ml/jax-ai-stack/blob/main/docs/data_loaders_on_cpu_with_jax.ipynb)\n", + "\n", + "This tutorial explores different data loading strategies for using **JAX** on a single [**CPU**](https://jax.readthedocs.io/en/latest/glossary.html#term-CPU). While JAX doesn't include a built-in data loader, it seamlessly integrates with popular data loading libraries, including:\n", + "\n", + "- [**PyTorch DataLoader**](https://github.com/pytorch/data)\n", + "- [**TensorFlow Datasets (TFDS)**](https://github.com/tensorflow/datasets)\n", + "- [**Grain**](https://github.com/google/grain)\n", + "- [**Hugging Face**](https://huggingface.co/docs/datasets/en/use_with_jax#data-loading)\n", + "\n", + "You'll see how to use each of these libraries to efficiently load data for a simple image classification task using the MNIST dataset." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pEsb135zE-Jo" + }, + "source": [ + "## Setting JAX to Use CPU Only\n", + "\n", + "First, you'll restrict JAX to use only the CPU, even if a GPU is available. This keeps results consistent across machines and lets you focus on CPU-based data loading. Set the environment variable before importing JAX so that the setting takes effect." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "vqP6xyObC0_9" + }, + "outputs": [], + "source": [ + "import os\n", + "os.environ['JAX_PLATFORM_NAME'] = 'cpu'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-rsMgVtO6asW" + }, + "source": [ + "Import the JAX API." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "tDJNQ6V-Dg5g" + }, + "outputs": [], + "source": [ + "import jax\n", + "import jax.numpy as jnp\n", + "from jax import random, grad, jit, vmap" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsFdlkSZKp9S" + }, + "source": [ + "### CPU Setup Verification" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "N3sqvaF3KJw1", + "outputId": "449c83d9-d050-4b15-9a8d-f71e340501f2" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[CpuDevice(id=0)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jax.devices()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qyJ_WTghDnIc" + }, + "source": [ + "## Setting Hyperparameters and Initializing Parameters\n", + "\n", + "You'll define hyperparameters for your model and data loading, including layer sizes, learning rate, batch size, and the data directory. You'll also initialize the weights and biases for a fully-connected neural network." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "qLNOSloFDka_" + }, + "outputs": [], + "source": [ + "# A helper function to randomly initialize weights and biases\n", + "# for a dense neural network layer\n", + "def random_layer_params(m, n, key, scale=1e-2):\n", + " w_key, b_key = random.split(key)\n", + " return scale * random.normal(w_key, (n, m)), scale * random.normal(b_key, (n,))\n", + "\n", + "# Function to initialize network parameters for all layers based on defined sizes\n", + "def init_network_params(sizes, key):\n", + " keys = random.split(key, len(sizes))\n", + " return [random_layer_params(m, n, k) for m, n, k in zip(sizes[:-1], sizes[1:], keys)]\n", + "\n", + "layer_sizes = [784, 512, 512, 10] # Units per layer: input, two hidden layers, output\n", + "step_size = 0.01 # Learning rate for optimization\n", + "num_epochs = 8 # Number of training epochs\n", + "batch_size = 128 # Batch size for training\n", + "n_targets = 10 # Number of classes (digits 0-9)\n", + "num_pixels = 28 * 28 # Input size (MNIST images are 28x28 pixels)\n", + "data_dir = '/tmp/mnist_dataset' # Directory for storing the dataset\n", + "\n", + "# Initialize network parameters using the defined layer sizes and a random seed\n", + "params = init_network_params(layer_sizes, random.PRNGKey(0))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Ci_CqW7q6XM" + }, + "source": [ + "## Model Prediction with Auto-Batching\n", + "\n", + "In this section, you'll define the `predict` function for your neural network. This function computes the log-probabilities over the 10 classes for a single input image.\n", + "\n", + "To efficiently process multiple images simultaneously, you'll use [`vmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html#jax.vmap), which allows you to vectorize the `predict` function and apply it across a batch of inputs. This technique, called auto-batching, replaces a Python loop over examples with a single vectorized computation over the whole batch." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "bKIYPSkvD1QV" + }, + "outputs": [], + "source": [ + "from jax.scipy.special import logsumexp\n", + "\n", + "def relu(x):\n", + " return jnp.maximum(0, x)\n", + "\n", + "def predict(params, image):\n", + " # Per-example prediction: `image` is a single flattened image\n", + " activations = image\n", + " for w, b in params[:-1]:\n", + " outputs = jnp.dot(w, activations) + b\n", + " activations = relu(outputs)\n", + "\n", + " final_w, final_b = params[-1]\n", + " logits = jnp.dot(final_w, activations) + final_b\n", + " return logits - logsumexp(logits)\n", + "\n", + "# Make a batched version of `predict`: params are shared (None), images are batched along axis 0\n", + "batched_predict = vmap(predict, in_axes=(None, 0))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "niTSr34_sDZi" + }, + "source": [ + "## Utility and Loss Functions\n", + "\n", + "You'll now define utility functions for:\n", + "\n", + "- One-hot encoding: Converts class indices to binary vectors.\n", + "- Accuracy calculation: Measures the fraction of examples the model classifies correctly.\n", + "- Loss computation: Computes a cross-entropy-style loss from the predicted log-probabilities and the one-hot targets.\n", + "\n", + "To compute gradients and speed up training:\n", + "\n", + "- [`grad`](https://jax.readthedocs.io/en/latest/_autosummary/jax.grad.html#jax.grad) is used to compute gradients of the loss function with respect to network parameters.\n", + "- [`jit`](https://jax.readthedocs.io/en/latest/_autosummary/jax.jit.html#jax.jit) compiles the update function, enabling faster execution by leveraging JAX's [XLA](https://openxla.org/xla) compilation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "sA0a06raEQfS" + }, + "outputs": [], + "source": [ + "import time\n", + "\n", + "def one_hot(x, k, dtype=jnp.float32):\n", + " \"\"\"Create a one-hot encoding of x of size k.\"\"\"\n", + " return jnp.array(x[:, None] == jnp.arange(k), dtype)\n", + "\n", + "def accuracy(params, images, targets):\n", + " \"\"\"Calculate the accuracy of predictions.\"\"\"\n", + " target_class = jnp.argmax(targets, axis=1)\n", + " predicted_class = jnp.argmax(batched_predict(params, images), axis=1)\n", + " return jnp.mean(predicted_class == target_class)\n", + "\n", + "def loss(params, images, targets):\n", + " \"\"\"Calculate the loss between predictions and targets.\"\"\"\n", + " preds = batched_predict(params, images)\n", + " return -jnp.mean(preds * targets)\n", + "\n", + "@jit\n", + "def update(params, x, y):\n", + " \"\"\"Update the network parameters using gradient descent.\"\"\"\n", + " grads = grad(loss)(params, x, y)\n", + " return [(w - step_size * dw, b - step_size * db)\n", + " for (w, b), (dw, db) in zip(params, grads)]\n", + "\n", + "def reshape_and_one_hot(x, y):\n", + " \"\"\"Reshape and one-hot encode the inputs.\"\"\"\n", + " x = jnp.reshape(x, (len(x), num_pixels))\n", + " y = one_hot(y, n_targets)\n", + " return x, y\n", + "\n", + "def train_model(num_epochs, params, training_generator, data_loader_type='streamed'):\n", + " \"\"\"Train the model for a given number of epochs.\"\"\"\n", + " for epoch in range(num_epochs):\n", + " start_time = time.time()\n", + " for x, y in training_generator() if data_loader_type == 'streamed' else training_generator:\n", + " x, y = reshape_and_one_hot(x, y)\n", + " params = update(params, x, y)\n", + "\n", + " print(f\"Epoch {epoch + 1} in {time.time() - start_time:.2f} sec: \"\n", + " f\"Train Accuracy: {accuracy(params, train_images, train_labels):.4f}, \"\n", + " f\"Test Accuracy: {accuracy(params, test_images, test_labels):.4f}\")" + ] + }, + { 
+ "cell_type": "markdown", + "metadata": { + "id": "Hsionp5IYsQ9" + }, + "source": [ + "## Loading Data with PyTorch DataLoader\n", + "\n", + "This section shows how to load the MNIST dataset using PyTorch's DataLoader, convert the data to NumPy arrays, and apply transformations to flatten and cast images." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "jmsfrWrHxIhC", + "outputId": "33dfeada-a763-4d26-f778-a27966e34d55" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.1+cu121)\n", + "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.20.1+cu121)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.16.1)\n", + "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2024.10.0)\n", + "Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)\n", + "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision) (1.26.4)\n", + "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision) (11.0.0)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) 
(3.0.2)\n" + ] + } + ], + "source": [ + "!pip install torch torchvision" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "kO5_WzwY59gE" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from jax.tree_util import tree_map\n", + "from torch.utils import data\n", + "from torchvision.datasets import MNIST" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "6f6qU8PCc143" + }, + "outputs": [], + "source": [ + "def numpy_collate(batch):\n", + " \"\"\"Convert a batch of PyTorch data to NumPy arrays.\"\"\"\n", + " return tree_map(np.asarray, data.default_collate(batch))\n", + "\n", + "class NumpyLoader(data.DataLoader):\n", + " \"\"\"Custom DataLoader to return NumPy arrays from a PyTorch Dataset.\"\"\"\n", + " def __init__(self, dataset, batch_size=1, shuffle=False, **kwargs):\n", + " super().__init__(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=numpy_collate, **kwargs)\n", + "\n", + "class FlattenAndCast(object):\n", + " \"\"\"Transform class to flatten and cast images to float32.\"\"\"\n", + " def __call__(self, pic):\n", + " return np.ravel(np.array(pic, dtype=jnp.float32))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mfSnfJND6I8G" + }, + "source": [ + "### Load Dataset with Transformations\n", + "\n", + "Standardize the data by flattening the images, casting them to `float32`, and ensuring consistent data types." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Kxbl6bcx6crv", + "outputId": "372bbf4c-3ad5-4fd8-cc5d-27b50f5e4f38" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz\n", + "Failed to download (trying next):\n", + "HTTP Error 403: Forbidden\n", + "\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/train-images-idx3-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 9.91M/9.91M [00:00<00:00, 49.4MB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/mnist_dataset/MNIST/raw/train-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", + "\n", + "Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz\n", + "Failed to download (trying next):\n", + "HTTP Error 403: Forbidden\n", + "\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/train-labels-idx1-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 28.9k/28.9k [00:00<00:00, 2.09MB/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/mnist_dataset/MNIST/raw/train-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", + "\n", + "Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Failed to download (trying 
next):\n", + "HTTP Error 403: Forbidden\n", + "\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/t10k-images-idx3-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1.65M/1.65M [00:00<00:00, 13.3MB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/mnist_dataset/MNIST/raw/t10k-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", + "\n", + "Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz\n", + "Failed to download (trying next):\n", + "HTTP Error 403: Forbidden\n", + "\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 4.54k/4.54k [00:00<00:00, 8.81MB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/mnist_dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", + "\n" + ] + } + ], + "source": [ + "mnist_dataset = MNIST(data_dir, download=True, transform=FlattenAndCast())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kbdsqvPZGrsa" + }, + "source": [ + "### Full Training Dataset for Accuracy Checks\n", + "\n", + "Convert the entire training dataset to JAX arrays." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "c9ZCJq_rzPck" + }, + "outputs": [], + "source": [ + "train_images = jnp.array(mnist_dataset.data.numpy().reshape(len(mnist_dataset.data), -1), dtype=jnp.float32)\n", + "train_labels = one_hot(np.array(mnist_dataset.targets), n_targets)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WXUh0BwvG8Ko" + }, + "source": [ + "### Get Full Test Dataset\n", + "\n", + "Load and process the full test dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "brlLG4SqGphm" + }, + "outputs": [], + "source": [ + "mnist_dataset_test = MNIST(data_dir, download=True, train=False)\n", + "test_images = jnp.array(mnist_dataset_test.data.numpy().reshape(len(mnist_dataset_test.data), -1), dtype=jnp.float32)\n", + "test_labels = one_hot(np.array(mnist_dataset_test.targets), n_targets)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Oz-UVnCxG5E8", + "outputId": "abbaa26d-491a-4e63-e8c9-d3c571f53a28" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train: (60000, 784) (60000, 10)\n", + "Test: (10000, 784) (10000, 10)\n" + ] + } + ], + "source": [ + "print('Train:', train_images.shape, train_labels.shape)\n", + "print('Test:', test_images.shape, test_labels.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m3zfxqnMiCbm" + }, + "source": [ + "### Training Data Generator\n", + "\n", + "Define a generator function using PyTorch's DataLoader for batch training. Setting `num_workers > 0` enables multi-process data loading, which can accelerate data loading for larger datasets or intensive preprocessing tasks. 
Experiment with different values to find the optimal setting for your hardware and workload.\n", + "\n", + "Note: With `num_workers > 0`, you may see the warning `RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.` You can safely ignore this warning here, because the data loaders do not use JAX within the forked processes." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "B-fES82EiL6Z" + }, + "outputs": [], + "source": [ + "def pytorch_training_generator(mnist_dataset):\n", + " return NumpyLoader(mnist_dataset, batch_size=batch_size, num_workers=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xzt2x9S1HC3T" + }, + "source": [ + "### Training Loop (PyTorch DataLoader)\n", + "\n", + "The training loop uses the PyTorch DataLoader to iterate through batches and update model parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "vtUjHsh-rJs8", + "outputId": "4766333e-4366-493b-995a-102778d1345a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1 in 28.93 sec: Train Accuracy: 0.9158, Test Accuracy: 0.9196\n", + "Epoch 2 in 8.33 sec: Train Accuracy: 0.9372, Test Accuracy: 0.9384\n", + "Epoch 3 in 6.99 sec: Train Accuracy: 0.9492, Test Accuracy: 0.9468\n", + "Epoch 4 in 7.01 sec: Train Accuracy: 0.9569, Test Accuracy: 0.9532\n", + "Epoch 5 in 8.17 sec: Train Accuracy: 0.9630, Test Accuracy: 0.9579\n", + "Epoch 6 in 8.27 sec: Train Accuracy: 0.9674, Test Accuracy: 0.9615\n", + "Epoch 7 in 8.32 sec: Train Accuracy: 0.9708, Test Accuracy: 0.9650\n", + "Epoch 8 in 8.07 sec: Train Accuracy: 0.9737, Test Accuracy: 0.9671\n" + ] + } + ], + "source": [ + "train_model(num_epochs, params, pytorch_training_generator(mnist_dataset), data_loader_type='iterable')" + ] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "Nm45ZTo6yrf5" + }, + "source": [ + "## Loading Data with TensorFlow Datasets (TFDS)\n", + "\n", + "This section demonstrates how to load the MNIST dataset using TFDS, fetch the full dataset for evaluation, and define a training generator for batch processing. GPU usage is explicitly disabled for TensorFlow." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "sGaQAk1DHMUx" + }, + "outputs": [], + "source": [ + "import tensorflow_datasets as tfds\n", + "import tensorflow as tf\n", + "\n", + "# Ensuring CPU-Only Execution, disable any GPU usage(if applicable) for TF\n", + "tf.config.set_visible_devices([], device_type='GPU')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3xdQY7H6wr3n" + }, + "source": [ + "### Fetch Full Dataset for Evaluation\n", + "\n", + "Load the dataset with `tfds.load`, convert it to NumPy arrays, and process it for evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 104, + "referenced_widgets": [ + "b8cdabf5c05848f38f03850cab08b56f", + "a8b76d5f93004c089676e5a2a9b3336c", + "119ac8428f9441e7a25eb0afef2fbb2a", + "76a9815e5c2b4764a13409cebaf66821", + "45ce8dd5c4b949afa957ec8ffb926060", + "05b7145fd62d4581b2123c7680f11cdd", + "b96267f014814ec5b96ad7e6165104b1", + "bce34bdbfbd64f1f8353a4e8515cee0b", + "93b8206f8c5841a692cdce985ae301d8", + "c95f592620c64da595cc787567b2c4db", + "8a97071f862c4ec3b4b4140d2e34eda2" + ] + }, + "id": "1hOamw_7C8Pb", + "outputId": "ca166490-22db-4732-b29f-866b7593e489" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /tmp/mnist_dataset/mnist/3.0.1...\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "b8cdabf5c05848f38f03850cab08b56f", + 
"version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Dl Completed...: 0%| | 0/5 [00:00=9.1.0 in /usr/local/lib/python3.10/dist-packages (from grain) (10.5.0)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from grain) (1.26.4)\n", + "Requirement already satisfied: typing_extensions in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (4.12.2)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (2024.10.0)\n", + "Requirement already satisfied: importlib_resources in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (6.4.5)\n", + "Requirement already satisfied: zipp in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (3.21.0)\n", + "Downloading grain-0.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (418 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m419.0/419.0 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading jaxtyping-0.2.36-py3-none-any.whl (55 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.8/55.8 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: jaxtyping, grain\n", + "Successfully installed grain-0.2.2 jaxtyping-0.2.36\n" + ] + } + ], + "source": [ + "!pip install grain" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "66bH3ZDJ7Iat" + }, + "source": [ + "Import Required Libraries (import MNIST dataset from torchvision)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "mS62eVL9Ifmz" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import grain.python as pygrain\n", + "from torchvision.datasets import MNIST" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0h6mwVrspPA-" + }, 
+ "source": [ + "### Define Dataset Class\n", + "\n", + "Create a custom dataset class that wraps torchvision's MNIST and exposes `__len__` and `__getitem__`, the random-access interface that Grain's `DataLoader` reads from." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "bnrhac5Hh7y1" + }, + "outputs": [], + "source": [ + "class Dataset:\n", + " def __init__(self, data_dir, train=True):\n", + " self.data_dir = data_dir\n", + " self.train = train\n", + " self.load_data()\n", + "\n", + " def load_data(self):\n", + " self.dataset = MNIST(self.data_dir, download=True, train=self.train)\n", + "\n", + " def __len__(self):\n", + " return len(self.dataset)\n", + "\n", + " def __getitem__(self, index):\n", + " img, label = self.dataset[index]\n", + " return np.ravel(np.array(img, dtype=np.float32)), label" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "53mf8bWEsyTr" + }, + "source": [ + "### Initialize the Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "pN3oF7-ostGE" + }, + "outputs": [], + "source": [ + "mnist_dataset = Dataset(data_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GqD-ycgBuwv9" + }, + "source": [ + "### Get the Full Train and Test Datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "f1VnTuX3u_kL" + }, + "outputs": [], + "source": [ + "# Convert training data to JAX arrays and encode labels as one-hot vectors\n", + "train_images = jnp.array([mnist_dataset[i][0] for i in range(len(mnist_dataset))], dtype=jnp.float32)\n", + "train_labels = one_hot(np.array([mnist_dataset[i][1] for i in range(len(mnist_dataset))]), n_targets)\n", + "\n", + "# Load test dataset and process it\n", + "mnist_dataset_test = MNIST(data_dir, download=True, train=False)\n", + "test_images = jnp.array([np.ravel(np.array(mnist_dataset_test[i][0], dtype=np.float32)) for i in range(len(mnist_dataset_test))], dtype=jnp.float32)\n", + "test_labels = one_hot(np.array([mnist_dataset_test[i][1] for i in 
range(len(mnist_dataset_test))]), n_targets)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "a2NHlp9klrQL", + "outputId": "14be58c0-851e-4a44-dfcc-d02f0718dab5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train: (60000, 784) (60000, 10)\n", + "Test: (10000, 784) (10000, 10)\n" + ] + } + ], + "source": [ + "print(\"Train:\", train_images.shape, train_labels.shape)\n", + "print(\"Test:\", test_images.shape, test_labels.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fETnWRo2crhf" + }, + "source": [ + "### Initialize PyGrain DataLoader\n", + "\n", + "Set up a PyGrain DataLoader for sequential batch sampling." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "id": "9RuFTcsCs2Ac" + }, + "outputs": [], + "source": [ + "sampler = pygrain.SequentialSampler(\n", + " num_records=len(mnist_dataset),\n", + " shard_options=pygrain.NoSharding()) # Single-device, no sharding\n", + "\n", + "def pygrain_training_generator():\n", + " \"\"\"Grain DataLoader generator for training.\"\"\"\n", + " return pygrain.DataLoader(\n", + " data_source=mnist_dataset,\n", + " sampler=sampler,\n", + " operations=[pygrain.Batch(batch_size)],\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GvpJPHAbeuHW" + }, + "source": [ + "### Training Loop (Grain)\n", + "\n", + "Run the training loop using the Grain DataLoader." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cjxJRtiTadEI", + "outputId": "3f624366-b683-4d20-9d0a-777d345b0e21" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1 in 15.39 sec: Train Accuracy: 0.9158, Test Accuracy: 0.9196\n", + "Epoch 2 in 15.27 sec: Train Accuracy: 0.9372, Test Accuracy: 0.9384\n", + "Epoch 3 in 12.61 sec: Train Accuracy: 0.9492, Test Accuracy: 0.9468\n", + "Epoch 4 in 12.62 sec: Train Accuracy: 0.9569, Test Accuracy: 0.9532\n", + "Epoch 5 in 12.39 sec: Train Accuracy: 0.9630, Test Accuracy: 0.9579\n", + "Epoch 6 in 12.19 sec: Train Accuracy: 0.9674, Test Accuracy: 0.9615\n", + "Epoch 7 in 12.56 sec: Train Accuracy: 0.9708, Test Accuracy: 0.9650\n", + "Epoch 8 in 13.04 sec: Train Accuracy: 0.9737, Test Accuracy: 0.9671\n" + ] + } + ], + "source": [ + "train_model(num_epochs, params, pygrain_training_generator)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oixvOI816qUn" + }, + "source": [ + "## Loading Data with Hugging Face\n", + "\n", + "This section demonstrates loading MNIST data using the Hugging Face `datasets` library. You'll format the dataset for JAX compatibility, prepare flattened images and one-hot-encoded labels, and define a training generator." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o51P6lr86wz-" + }, + "source": [ + "Install the Hugging Face `datasets` library." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "19ipxPhI6oSN", + "outputId": "684e445f-d23e-4924-9e76-2c2c9359f0be" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting datasets\n", + " Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.16.1)\n", + "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.26.4)\n", + "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)\n", + "Collecting dill<0.3.9,>=0.3.0 (from datasets)\n", + " Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)\n", + "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)\n", + "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.32.3)\n", + "Requirement already satisfied: tqdm>=4.66.3 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.6)\n", + "Collecting xxhash (from datasets)\n", + " Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n", + "Collecting multiprocess<0.70.17 (from datasets)\n", + " Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)\n", + "Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)\n", + " Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)\n", + "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.11.2)\n", + "Requirement already satisfied: huggingface-hub>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.26.2)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages 
(from datasets) (24.2)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.2)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.3)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.5.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.1.0)\n", + "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (0.2.0)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.17.2)\n", + "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.0->datasets) (4.12.2)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.4.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2.2.3)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2024.8.30)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in 
/usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)\n", + "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", + "Downloading datasets-3.1.0-py3-none-any.whl (480 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (179 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.3/179.3 kB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading multiprocess-0.70.16-py310-none-any.whl (134 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m9.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m15.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: xxhash, fsspec, dill, multiprocess, datasets\n", + " Attempting uninstall: fsspec\n", + " Found existing installation: fsspec 2024.10.0\n", + " Uninstalling fsspec-2024.10.0:\n", + " Successfully uninstalled fsspec-2024.10.0\n", + 
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0mSuccessfully installed datasets-3.1.0 dill-0.3.8 fsspec-2024.9.0 multiprocess-0.70.16 xxhash-3.5.0\n" + ] + } + ], + "source": [ + "!pip install datasets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "be0h_dZv0593" + }, + "source": [ + "Import Library" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "id": "8v1N59p76zn0" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Gaj11tO7C86" + }, + "source": [ + "### Load and Format MNIST Dataset\n", + "\n", + "Load the MNIST dataset from Hugging Face and format it as `numpy` arrays for quick access or `jax` to get JAX arrays." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 301, + "referenced_widgets": [ + "32f6132a31aa4c508d3c3c5ef70348bb", + "d7c2ffa6b143463c91cbf8befca6ca01", + "fd964ecd3926419d92927c67f955d5d0", + "60feca3fde7c4447ad8393b0542eb999", + "3354a0baeca94d18bc6b2a8b8b465b58", + "a0d0d052772b46deac7657ad052991a4", + "fb34783b9cba462e9b690e0979c4b07a", + "8d8170c1ed99490589969cd753c40748", + "f1ecb6db00a54e088f1e09164222d637", + "3cf5dd8d29aa4619b39dc2542df7e42e", + "2e5d42ca710441b389895f2d3b611d0a", + "5d8202da24244dc896e9a8cba6a4ed4f", + "a6d64c953631412b8bd8f0ba53ae4d32", + "69240c5cbfbb4e91961f5b49812a26f0", + "865f38532b784a7c971f5d33b87b443e", + "ceb1c004191947cdaa10af9b9c03c80d", + "64c6041037914779b5e8e9cf5a80ad04", + "562fa6a0e7b846a180ac4b423c5511c5", + "b3b922288f9c4df2a4088279ff6d1531", + "75a1a8ffda554318890cf74c345ed9a9", + "3bae06cacf394a5998c2326199da94f5", + "ff6428a3daa5496c81d5e664aba01f97", + "1ba3f86870724f55b94a35cb6b4173af", + "b3e163fd8b8a4f289d5a25611cb66d23", + "abd2daba215e4f7c9ddabde04d6eb382", + "e22ee019049144d5aba573cdf4dbe4fc", + "6ac765dac67841a69218140785f024c6", + "7b057411a54e434fb74804b90daa8d44", + "563f71b3c67d47c3ab1100f5dc1b98f3", + "d81a657361ab4bba8bcc0cf309d2ff64", + "20316312ab88471ba90cbb954be3e964", + "698fda742f834473a23fb7e5e4cf239c", + "289b52c5a38146b8b467a5f4678f6271", + "d07c2f37cf914894b1551a8104e6cb70", + "5b55c73d551d483baaa6a1411c2597b1", + "2308f77723f54ac898588f48d1853b65", + "54d2589714d04b2e928b816258cb0df4", + "f84b795348c04c7a950165301a643671", + "bc853a4a8d3c4dbda23d183f0a3b4f27", + "1012ddc0343842d8b913a7d85df8ab8f", + "771a73a8f5084a57afc5654d72e022f0", + "311a43449f074841b6df4130b0871ac9", + "cd4d29cb01134469b52d6936c35eb943", + "013cf89ee6174d29bb3f4fdff7b36049", + "9237d877d84e4b3ab69698ecf56915bb", + "337ef4d37e6b4ff6bf6e8bd4ca93383f", + "b4096d3837b84ccdb8f1186435c87281", + "7259d3b7e11b4736b4d2aa8e9c55e994", + 
"1ad1f8e99a864fc4a2bc532d9a4ff110", + "b2b50451eabd40978ef46db5e7dd08c4", + "2dad5c5541e243128e23c3dd3e420ac2", + "a3de458b61e5493081d6bb9cf7e923db", + "37760f8a7b164e6f9c1a23d621e9fe6b", + "745a2aedcfab491fb9cffba19958b0c5", + "2f6c670640d048d2af453638cfde3a1e" + ] + }, + "id": "a22kTvgk6_fJ", + "outputId": "35fc38b9-a6ab-4b02-ffa4-ab27fac69df4" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n", + "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", + "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", + "You will be able to reuse this secret in all of your notebooks.\n", + "Please note that authentication is recommended but still optional to access public models or datasets.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "32f6132a31aa4c508d3c3c5ef70348bb", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "README.md: 0%| | 0.00/6.97k [00:00 0` enables multi-process data loading, which can accelerate data loading for larger datasets or intensive preprocessing tasks. Experiment with different values to find the optimal setting for your hardware and workload. + +Note: When setting `num_workers > 0`, you may see the following `RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.` This warning can be safely ignored since data loaders do not use JAX within the forked processes. 
+ +```{code-cell} +:id: B-fES82EiL6Z + +def pytorch_training_generator(mnist_dataset): + return NumpyLoader(mnist_dataset, batch_size=batch_size, num_workers=0) +``` + ++++ {"id": "Xzt2x9S1HC3T"} + +### Training Loop (PyTorch DataLoader) + +The training loop uses the PyTorch DataLoader to iterate through batches and update model parameters. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: vtUjHsh-rJs8 +outputId: 4766333e-4366-493b-995a-102778d1345a +--- +train_model(num_epochs, params, pytorch_training_generator(mnist_dataset), data_loader_type='iterable') +``` + ++++ {"id": "Nm45ZTo6yrf5"} + +## Loading Data with TensorFlow Datasets (TFDS) + +This section demonstrates how to load the MNIST dataset using TFDS, fetch the full dataset for evaluation, and define a training generator for batch processing. GPU usage is explicitly disabled for TensorFlow. + +```{code-cell} +:id: sGaQAk1DHMUx + +import tensorflow_datasets as tfds +import tensorflow as tf + +# Ensure CPU-only execution: hide any GPUs (if present) from TensorFlow +tf.config.set_visible_devices([], device_type='GPU') +``` + ++++ {"id": "3xdQY7H6wr3n"} + +### Fetch Full Dataset for Evaluation + +Load the dataset with `tfds.load`, convert it to NumPy arrays, and process it for evaluation.
+ +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ + height: 104 + referenced_widgets: [b8cdabf5c05848f38f03850cab08b56f, a8b76d5f93004c089676e5a2a9b3336c, + 119ac8428f9441e7a25eb0afef2fbb2a, 76a9815e5c2b4764a13409cebaf66821, 45ce8dd5c4b949afa957ec8ffb926060, + 05b7145fd62d4581b2123c7680f11cdd, b96267f014814ec5b96ad7e6165104b1, bce34bdbfbd64f1f8353a4e8515cee0b, + 93b8206f8c5841a692cdce985ae301d8, c95f592620c64da595cc787567b2c4db, 8a97071f862c4ec3b4b4140d2e34eda2] +id: 1hOamw_7C8Pb +outputId: ca166490-22db-4732-b29f-866b7593e489 +--- +# tfds.load returns tf.Tensors (or tf.data.Datasets if batch_size != -1) +mnist_data, info = tfds.load(name="mnist", batch_size=-1, data_dir=data_dir, with_info=True) +mnist_data = tfds.as_numpy(mnist_data) +train_data, test_data = mnist_data['train'], mnist_data['test'] + +# Full train set +train_images, train_labels = train_data['image'], train_data['label'] +train_images = jnp.reshape(train_images, (len(train_images), num_pixels)) +train_labels = one_hot(train_labels, n_targets) + +# Full test set +test_images, test_labels = test_data['image'], test_data['label'] +test_images = jnp.reshape(test_images, (len(test_images), num_pixels)) +test_labels = one_hot(test_labels, n_targets) +``` + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: Td3PiLdmEf7z +outputId: 96403b0f-6079-43ce-df16-d4583f09906b +--- +print('Train:', train_images.shape, train_labels.shape) +print('Test:', test_images.shape, test_labels.shape) +``` + ++++ {"id": "UWRSaalfdyDX"} + +### Define the Training Generator + +Create a generator function to yield batches of data for training. 
+ +```{code-cell} +:id: vX59u8CqEf4J + +def training_generator(): + # as_supervised=True gives us the (image, label) as a tuple instead of a dict + ds = tfds.load(name='mnist', split='train', as_supervised=True, data_dir=data_dir) + # You can build up an arbitrary tf.data input pipeline + ds = ds.batch(batch_size).prefetch(1) + # tfds.as_numpy converts the tf.data.Dataset into an iterable of NumPy arrays + return tfds.as_numpy(ds) +``` + ++++ {"id": "EAWeUdnuFNBY"} + +### Training Loop (TFDS) + +Use the training generator in a custom training loop. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: h2sO13XDGvq1 +outputId: a150246e-ceb5-46ac-db71-2a8177a9d04d +--- +train_model(num_epochs, params, training_generator) +``` + ++++ {"id": "-ryVkrAITS9Z"} + +## Loading Data with Grain + +This section demonstrates how to load MNIST data using Grain, a data-loading library. You'll define a custom dataset class for Grain and set up a Grain DataLoader for efficient training. + ++++ {"id": "waYhUMUGmhH-"} + +Install Grain + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: L78o7eeyGvn5 +outputId: 76d16565-0d9e-4f5f-c6b1-4cf4a683d0e7 +--- +!pip install grain +``` + ++++ {"id": "66bH3ZDJ7Iat"} + +Import Required Libraries (import MNIST dataset from torchvision) + +```{code-cell} +:id: mS62eVL9Ifmz + +import numpy as np +import grain.python as pygrain +from torchvision.datasets import MNIST +``` + ++++ {"id": "0h6mwVrspPA-"} + +### Define Dataset Class + +Create a custom dataset class to load MNIST data for Grain.
+ +```{code-cell} +:id: bnrhac5Hh7y1 + +class Dataset: + def __init__(self, data_dir, train=True): + self.data_dir = data_dir + self.train = train + self.load_data() + + def load_data(self): + self.dataset = MNIST(self.data_dir, download=True, train=self.train) + + def __len__(self): + return len(self.dataset) + + def __getitem__(self, index): + img, label = self.dataset[index] + return np.ravel(np.array(img, dtype=np.float32)), label +``` + ++++ {"id": "53mf8bWEsyTr"} + +### Initialize the Dataset + +```{code-cell} +:id: pN3oF7-ostGE + +mnist_dataset = Dataset(data_dir) +``` + ++++ {"id": "GqD-ycgBuwv9"} + +### Get the full train and test dataset + +```{code-cell} +:id: f1VnTuX3u_kL + +# Convert training data to JAX arrays and encode labels as one-hot vectors +train_images = jnp.array([mnist_dataset[i][0] for i in range(len(mnist_dataset))], dtype=jnp.float32) +train_labels = one_hot(np.array([mnist_dataset[i][1] for i in range(len(mnist_dataset))]), n_targets) + +# Load test dataset and process it +mnist_dataset_test = MNIST(data_dir, download=True, train=False) +test_images = jnp.array([np.ravel(np.array(mnist_dataset_test[i][0], dtype=np.float32)) for i in range(len(mnist_dataset_test))], dtype=jnp.float32) +test_labels = one_hot(np.array([mnist_dataset_test[i][1] for i in range(len(mnist_dataset_test))]), n_targets) +``` + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: a2NHlp9klrQL +outputId: 14be58c0-851e-4a44-dfcc-d02f0718dab5 +--- +print("Train:", train_images.shape, train_labels.shape) +print("Test:", test_images.shape, test_labels.shape) +``` + ++++ {"id": "fETnWRo2crhf"} + +### Initialize PyGrain DataLoader + +Set up a PyGrain DataLoader for sequential batch sampling. 
+ +```{code-cell} +:id: 9RuFTcsCs2Ac + +sampler = pygrain.SequentialSampler( + num_records=len(mnist_dataset), + shard_options=pygrain.NoSharding()) # Single-device, no sharding + +def pygrain_training_generator(): + """Grain DataLoader generator for training.""" + return pygrain.DataLoader( + data_source=mnist_dataset, + sampler=sampler, + operations=[pygrain.Batch(batch_size)], + ) +``` + ++++ {"id": "GvpJPHAbeuHW"} + +### Training Loop (Grain) + +Run the training loop using the Grain DataLoader. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: cjxJRtiTadEI +outputId: 3f624366-b683-4d20-9d0a-777d345b0e21 +--- +train_model(num_epochs, params, pygrain_training_generator) +``` + ++++ {"id": "oixvOI816qUn"} + +## Loading Data with Hugging Face + +This section demonstrates loading MNIST data using the Hugging Face `datasets` library. You'll format the dataset for JAX compatibility, prepare flattened images and one-hot-encoded labels, and define a training generator. + ++++ {"id": "o51P6lr86wz-"} + +Install the Hugging Face `datasets` library. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: 19ipxPhI6oSN +outputId: 684e445f-d23e-4924-9e76-2c2c9359f0be +--- +!pip install datasets +``` + ++++ {"id": "be0h_dZv0593"} + +Import Library + +```{code-cell} +:id: 8v1N59p76zn0 + +from datasets import load_dataset +``` + ++++ {"id": "8Gaj11tO7C86"} + +### Load and Format MNIST Dataset + +Load the MNIST dataset from Hugging Face and format it as `numpy` arrays for quick access or `jax` to get JAX arrays. 
+ +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ + height: 301 + referenced_widgets: [32f6132a31aa4c508d3c3c5ef70348bb, d7c2ffa6b143463c91cbf8befca6ca01, + fd964ecd3926419d92927c67f955d5d0, 60feca3fde7c4447ad8393b0542eb999, 3354a0baeca94d18bc6b2a8b8b465b58, + a0d0d052772b46deac7657ad052991a4, fb34783b9cba462e9b690e0979c4b07a, 8d8170c1ed99490589969cd753c40748, + f1ecb6db00a54e088f1e09164222d637, 3cf5dd8d29aa4619b39dc2542df7e42e, 2e5d42ca710441b389895f2d3b611d0a, + 5d8202da24244dc896e9a8cba6a4ed4f, a6d64c953631412b8bd8f0ba53ae4d32, 69240c5cbfbb4e91961f5b49812a26f0, + 865f38532b784a7c971f5d33b87b443e, ceb1c004191947cdaa10af9b9c03c80d, 64c6041037914779b5e8e9cf5a80ad04, + 562fa6a0e7b846a180ac4b423c5511c5, b3b922288f9c4df2a4088279ff6d1531, 75a1a8ffda554318890cf74c345ed9a9, + 3bae06cacf394a5998c2326199da94f5, ff6428a3daa5496c81d5e664aba01f97, 1ba3f86870724f55b94a35cb6b4173af, + b3e163fd8b8a4f289d5a25611cb66d23, abd2daba215e4f7c9ddabde04d6eb382, e22ee019049144d5aba573cdf4dbe4fc, + 6ac765dac67841a69218140785f024c6, 7b057411a54e434fb74804b90daa8d44, 563f71b3c67d47c3ab1100f5dc1b98f3, + d81a657361ab4bba8bcc0cf309d2ff64, 20316312ab88471ba90cbb954be3e964, 698fda742f834473a23fb7e5e4cf239c, + 289b52c5a38146b8b467a5f4678f6271, d07c2f37cf914894b1551a8104e6cb70, 5b55c73d551d483baaa6a1411c2597b1, + 2308f77723f54ac898588f48d1853b65, 54d2589714d04b2e928b816258cb0df4, f84b795348c04c7a950165301a643671, + bc853a4a8d3c4dbda23d183f0a3b4f27, 1012ddc0343842d8b913a7d85df8ab8f, 771a73a8f5084a57afc5654d72e022f0, + 311a43449f074841b6df4130b0871ac9, cd4d29cb01134469b52d6936c35eb943, 013cf89ee6174d29bb3f4fdff7b36049, + 9237d877d84e4b3ab69698ecf56915bb, 337ef4d37e6b4ff6bf6e8bd4ca93383f, b4096d3837b84ccdb8f1186435c87281, + 7259d3b7e11b4736b4d2aa8e9c55e994, 1ad1f8e99a864fc4a2bc532d9a4ff110, b2b50451eabd40978ef46db5e7dd08c4, + 2dad5c5541e243128e23c3dd3e420ac2, a3de458b61e5493081d6bb9cf7e923db, 37760f8a7b164e6f9c1a23d621e9fe6b, + 745a2aedcfab491fb9cffba19958b0c5, 
2f6c670640d048d2af453638cfde3a1e] +id: a22kTvgk6_fJ +outputId: 35fc38b9-a6ab-4b02-ffa4-ab27fac69df4 +--- +mnist_dataset = load_dataset("mnist").with_format("numpy") +``` + ++++ {"id": "IFjTyGxY19b0"} + +### Extract images and labels + +Get image shape and flatten for model input + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: NHrKatD_7HbH +outputId: deec1739-2fc0-4e71-8567-f2e0c9db198b +--- +train_images = mnist_dataset["train"]["image"] +train_labels = mnist_dataset["train"]["label"] +test_images = mnist_dataset["test"]["image"] +test_labels = mnist_dataset["test"]["label"] + +# Flatten images and one-hot encode labels +image_shape = train_images.shape[1:] +num_features = image_shape[0] * image_shape[1] + +train_images = train_images.reshape(-1, num_features) +test_images = test_images.reshape(-1, num_features) + +train_labels = one_hot(train_labels, n_targets) +test_labels = one_hot(test_labels, n_targets) + +print('Train:', train_images.shape, train_labels.shape) +print('Test:', test_images.shape, test_labels.shape) +``` + ++++ {"id": "kk_4zJlz7T1E"} + +### Define Training Generator + +Set up a generator to yield batches of images and labels for training. + +```{code-cell} +:id: -zLJhogj7RL- + +def hf_training_generator(): + """Yield batches for training.""" + for batch in mnist_dataset["train"].iter(batch_size): + x, y = batch["image"], batch["label"] + yield x, y +``` + ++++ {"id": "HIsGfkLI7dvZ"} + +### Training Loop (Hugging Face Datasets) + +Run the training loop using the Hugging Face training generator. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: RhloYGsw6nPf +outputId: d49c1cd2-a546-46a6-84fb-d9507c38f4ca +--- +train_model(num_epochs, params, hf_training_generator) +``` + ++++ {"id": "qXylIOwidWI3"} + +## Summary + +This notebook has guided you through efficient methods for loading data on a CPU when using JAX. 
You’ve learned how to leverage popular libraries such as PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets to streamline the data loading process for your machine learning tasks. Each of these methods offers unique advantages and considerations, allowing you to choose the best approach based on the specific needs of your project. From fd8d2112b11231b4a045dbe1e1dce844cc1f8aca Mon Sep 17 00:00:00 2001 From: selamw1 Date: Tue, 3 Dec 2024 10:23:27 -0800 Subject: [PATCH 02/14] notebook_added_to_config --- docs/conf.py | 2 ++ docs/tutorials.md | 1 + 2 files changed, 3 insertions(+) diff --git a/docs/conf.py b/docs/conf.py index ceec641..b27a425 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -61,6 +61,7 @@ 'JAX_image_captioning.md', 'JAX_time_series_classification.md', 'JAX_transformer_text_classification.md', + 'data_loaders_on_cpu_with_jax.md', ] suppress_warnings = [ @@ -96,4 +97,5 @@ 'JAX_image_captioning.ipynb', 'JAX_time_series_classification.ipynb', 'JAX_transformer_text_classification.ipynb', + 'data_loaders_on_cpu_with_jax.ipynb', ] diff --git a/docs/tutorials.md b/docs/tutorials.md index dab201f..3370597 100644 --- a/docs/tutorials.md +++ b/docs/tutorials.md @@ -21,6 +21,7 @@ JAX_visualizing_models_metrics JAX_image_captioning JAX_time_series_classification JAX_transformer_text_classification +data_loaders_on_cpu_with_jax ``` Once you've gone through this content, you can refer to package-specific From 16107cdc26ed2e9b980b5f8dd54f7b954ffdf7d0 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Tue, 3 Dec 2024 14:53:11 -0800 Subject: [PATCH 03/14] =?UTF-8?q?=E2=80=9Creferece=5Ftutorial=5Flinks=5Fad?= =?UTF-8?q?ded=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/data_loaders_on_cpu_with_jax.ipynb | 79 +++++++++++++------------ docs/data_loaders_on_cpu_with_jax.md | 10 +++- 2 files changed, 50 insertions(+), 39 deletions(-) diff --git a/docs/data_loaders_on_cpu_with_jax.ipynb 
b/docs/data_loaders_on_cpu_with_jax.ipynb index 21bd599..0ba897e 100644 --- a/docs/data_loaders_on_cpu_with_jax.ipynb +++ b/docs/data_loaders_on_cpu_with_jax.ipynb @@ -24,7 +24,13 @@ "- [**Grain**](https://github.com/google/grain)\n", "- [**Hugging Face**](https://huggingface.co/docs/datasets/en/use_with_jax#data-loading)\n", "\n", - "You'll see how to use each of these libraries to efficiently load data for a simple image classification task using the MNIST dataset." + "In this tutorial, you'll learn how to efficiently load data using these libraries for a simple image classification task based on the MNIST dataset.\n", + "\n", + "Compared to GPU or multi-device setups, CPU-based data loading is straightforward as it avoids challenges like GPU memory management and data synchronization across devices. This makes it ideal for smaller-scale tasks or scenarios where data resides exclusively on the CPU.\n", + "\n", + "If you're looking for GPU-specific data loading advice, see [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html).\n", + "\n", + "If you're looking for a multi-device data loading strategy, see [Data Loaders on Multi-Device Setups](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_for_multi_device_setups_with_jax.html)." 
] }, { @@ -40,7 +46,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": { "id": "vqP6xyObC0_9" }, @@ -61,7 +67,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": { "id": "tDJNQ6V-Dg5g" }, @@ -83,7 +89,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -120,7 +126,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": { "id": "qLNOSloFDka_" }, @@ -164,7 +170,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": { "id": "bKIYPSkvD1QV" }, @@ -212,7 +218,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": { "id": "sA0a06raEQfS" }, @@ -274,7 +280,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -308,7 +314,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": { "id": "kO5_WzwY59gE" }, @@ -322,7 +328,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "metadata": { "id": "6f6qU8PCc143" }, @@ -356,7 +362,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -486,7 +492,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": { "id": "c9ZCJq_rzPck" }, @@ -509,7 +515,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": { "id": "brlLG4SqGphm" }, @@ -522,7 +528,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -560,7 +566,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": null, "metadata": { "id": "B-fES82EiL6Z" }, @@ 
-583,7 +589,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -624,7 +630,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "metadata": { "id": "sGaQAk1DHMUx" }, @@ -650,7 +656,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", @@ -721,7 +727,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -757,7 +763,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": null, "metadata": { "id": "vX59u8CqEf4J" }, @@ -785,7 +791,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -835,7 +841,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -887,7 +893,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "metadata": { "id": "mS62eVL9Ifmz" }, @@ -911,7 +917,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, "metadata": { "id": "bnrhac5Hh7y1" }, @@ -945,7 +951,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": null, "metadata": { "id": "pN3oF7-ostGE" }, @@ -965,7 +971,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": { "id": "f1VnTuX3u_kL" }, @@ -983,7 +989,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -1019,7 +1025,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": null, "metadata": { "id": "9RuFTcsCs2Ac" }, @@ -1051,7 +1057,7 @@ }, { "cell_type": "code", - "execution_count": 28, + 
"execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -1101,7 +1107,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -1187,7 +1193,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": null, "metadata": { "id": "8v1N59p76zn0" }, @@ -1209,7 +1215,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", @@ -1376,7 +1382,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -1427,7 +1433,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": null, "metadata": { "id": "-zLJhogj7RL-" }, @@ -1453,7 +1459,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -1489,13 +1495,12 @@ "source": [ "## Summary\n", "\n", - "This notebook has guided you through efficient methods for loading data on a CPU when using JAX. You’ve learned how to leverage popular libraries such as PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets to streamline the data loading process for your machine learning tasks. Each of these methods offers unique advantages and considerations, allowing you to choose the best approach based on the specific needs of your project." + "This notebook has introduced efficient strategies for data loading on a CPU with JAX, demonstrating how to integrate popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets. Each library offers distinct advantages, enabling you to streamline the data loading process for machine learning tasks. 
By understanding the strengths of these methods, you can select the approach that best suits your project's specific requirements." ] } ], "metadata": { "colab": { - "name": "data_loaders_on_cpu_with_jax.ipynb", "provenance": [] }, "jupytext": { diff --git a/docs/data_loaders_on_cpu_with_jax.md b/docs/data_loaders_on_cpu_with_jax.md index f565d1d..d26c687 100644 --- a/docs/data_loaders_on_cpu_with_jax.md +++ b/docs/data_loaders_on_cpu_with_jax.md @@ -26,7 +26,13 @@ This tutorial explores different data loading strategies for using **JAX** on a - [**Grain**](https://github.com/google/grain) - [**Hugging Face**](https://huggingface.co/docs/datasets/en/use_with_jax#data-loading) -You'll see how to use each of these libraries to efficiently load data for a simple image classification task using the MNIST dataset. +In this tutorial, you'll learn how to efficiently load data using these libraries for a simple image classification task based on the MNIST dataset. + +Compared to GPU or multi-device setups, CPU-based data loading is straightforward as it avoids challenges like GPU memory management and data synchronization across devices. This makes it ideal for smaller-scale tasks or scenarios where data resides exclusively on the CPU. + +If you're looking for GPU-specific data loading advice, see [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html). + +If you're looking for a multi-device data loading strategy, see [Data Loaders on Multi-Device Setups](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_for_multi_device_setups_with_jax.html). +++ {"id": "pEsb135zE-Jo"} @@ -682,4 +688,4 @@ train_model(num_epochs, params, hf_training_generator) ## Summary -This notebook has guided you through efficient methods for loading data on a CPU when using JAX. 
You’ve learned how to leverage popular libraries such as PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets to streamline the data loading process for your machine learning tasks. Each of these methods offers unique advantages and considerations, allowing you to choose the best approach based on the specific needs of your project. +This notebook has introduced efficient strategies for data loading on a CPU with JAX, demonstrating how to integrate popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets. Each library offers distinct advantages, enabling you to streamline the data loading process for machine learning tasks. By understanding the strengths of these methods, you can select the approach that best suits your project's specific requirements. From 15cc56fbe179460e6eb111fe91bb0abf8301c353 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Tue, 26 Nov 2024 12:36:38 -0800 Subject: [PATCH 04/14] file_conflict_resolved --- docs/source/conf.py | 3 +++ docs/source/tutorials.md | 1 + 2 files changed, 4 insertions(+) diff --git a/docs/source/conf.py b/docs/source/conf.py index aad2b2f..45b5040 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -67,10 +67,12 @@ 'JAX_transformer_text_classification.md', 'data_loaders_on_cpu_with_jax.md', 'data_loaders_on_gpu_with_jax.md', + 'data_loaders_for_multi_device_setups_with_jax.md', ] suppress_warnings = [ 'misc.highlighting_failure', # Suppress warning in exception in digits_vae + 'mystnb.unknown_mime_type', # Suppress warning for unknown mime type (e.g. 
colab-display-data+json) ] # -- Options for myst ---------------------------------------------- @@ -104,4 +106,5 @@ 'JAX_transformer_text_classification.ipynb', 'data_loaders_on_cpu_with_jax.ipynb', 'data_loaders_on_gpu_with_jax.ipynb', + 'data_loaders_for_multi_device_setups_with_jax.ipynb', ] diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 23cdd58..071343b 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -25,6 +25,7 @@ JAX_time_series_classification JAX_transformer_text_classification data_loaders_on_cpu_with_jax data_loaders_on_gpu_with_jax +data_loaders_for_multi_device_setups_with_jax ``` Once you've gone through this content, you can refer to package-specific From d37c643a796a3ca30e978ddaff6e6b28fb22f232 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Wed, 27 Nov 2024 10:26:07 -0800 Subject: [PATCH 05/14] missed_notebook_files_added --- ...ers_for_multi_device_setups_with_jax.ipynb | 3761 +++++++++++++++++ ...oaders_for_multi_device_setups_with_jax.md | 719 ++++ 2 files changed, 4480 insertions(+) create mode 100644 docs/data_loaders_for_multi_device_setups_with_jax.ipynb create mode 100644 docs/data_loaders_for_multi_device_setups_with_jax.md diff --git a/docs/data_loaders_for_multi_device_setups_with_jax.ipynb b/docs/data_loaders_for_multi_device_setups_with_jax.ipynb new file mode 100644 index 0000000..749dd7c --- /dev/null +++ b/docs/data_loaders_for_multi_device_setups_with_jax.ipynb @@ -0,0 +1,3761 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "PUFGZggH49zp" + }, + "source": [ + "# Introduction to Data Loaders for Multi-Device Training with JAX" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3ia4PKEV5Dr8" + }, + "source": [ + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jax-ml/jax-ai-stack/blob/main/docs/data_loaders_for_multi_device_setups_with_jax.ipynb)\n", + "\n", + "This tutorial explores 
various data loading strategies for **JAX** in **multi-device distributed** environments, leveraging [**TPUs**](https://jax.readthedocs.io/en/latest/pallas/tpu/details.html#what-is-a-tpu). While JAX doesn't include a built-in data loader, it seamlessly integrates with popular data loading libraries, including:\n", + "* [**PyTorch DataLoader**](https://github.com/pytorch/data)\n", + "* [**TensorFlow Datasets (TFDS)**](https://github.com/tensorflow/datasets)\n", + "* [**Grain**](https://github.com/google/grain)\n", + "* [**Hugging Face**](https://huggingface.co/docs/datasets/en/use_with_jax#data-loading)\n", + "\n", + "You'll see how to use each of these libraries to efficiently load data for a simple image classification task using the MNIST dataset.\n", + "\n", + "Building on the [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html) tutorial, this guide introduces optimizations for distributed training across multiple GPUs or TPUs. It focuses on data sharding with `Mesh` and `NamedSharding` to efficiently partition and synchronize data across devices. By leveraging multi-device setups, you'll maximize resource utilization for large datasets in distributed environments." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-rsMgVtO6asW" + }, + "source": [ + "Import JAX API" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "tDJNQ6V-Dg5g" + }, + "outputs": [], + "source": [ + "import jax\n", + "import jax.numpy as jnp\n", + "from jax import grad, jit, vmap, random, device_put\n", + "from jax.sharding import Mesh, PartitionSpec, NamedSharding" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsFdlkSZKp9S" + }, + "source": [ + "### Checking TPU Availability for JAX" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "N3sqvaF3KJw1", + "outputId": "ee3286d0-b75f-46c5-8548-b57e3d895dd7" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0),\n", + " TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1),\n", + " TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0),\n", + " TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1),\n", + " TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0),\n", + " TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1),\n", + " TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0),\n", + " TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jax.devices()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qyJ_WTghDnIc" + }, + "source": [ + "### Setting Hyperparameters and Initializing Parameters\n", + "\n", + "You'll define hyperparameters for your model and data loading, including layer sizes, learning rate, batch size, and the data directory. You'll also initialize the weights and biases for a fully-connected neural network." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "qLNOSloFDka_" + }, + "outputs": [], + "source": [ + "# A helper function to randomly initialize weights and biases\n", + "# for a dense neural network layer\n", + "def random_layer_params(m, n, key, scale=1e-2):\n", + " w_key, b_key = random.split(key)\n", + " return scale * random.normal(w_key, (n, m)), scale * random.normal(b_key, (n,))\n", + "\n", + "# Function to initialize network parameters for all layers based on defined sizes\n", + "def init_network_params(sizes, key):\n", + " keys = random.split(key, len(sizes))\n", + " return [random_layer_params(m, n, k) for m, n, k in zip(sizes[:-1], sizes[1:], keys)]\n", + "\n", + "layer_sizes = [784, 512, 512, 10] # Layers of the network\n", + "step_size = 0.01 # Learning rate\n", + "num_epochs = 8 # Number of training epochs\n", + "batch_size = 128 # Batch size for training\n", + "n_targets = 10 # Number of classes (digits 0-9)\n", + "num_pixels = 28 * 28 # Each MNIST image is 28x28 pixels\n", + "data_dir = '/tmp/mnist_dataset' # Directory for storing the dataset\n", + "\n", + "# Initialize network parameters using the defined layer sizes and a random seed\n", + "params = init_network_params(layer_sizes, random.PRNGKey(0))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rHLdqeI7D2WZ" + }, + "source": [ + "### Model Prediction with Auto-Batching\n", + "\n", + "In this section, you'll define the `predict` function for your neural network. This function computes the output of the network for a single input image.\n", + "\n", + "To efficiently process multiple images simultaneously, you'll use [`vmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html#jax.vmap), which allows you to vectorize the `predict` function and apply it across a batch of inputs. This technique, called auto-batching, improves computational efficiency by leveraging hardware acceleration." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "bKIYPSkvD1QV" + }, + "outputs": [], + "source": [ + "from jax.scipy.special import logsumexp\n", + "\n", + "def relu(x):\n", + " return jnp.maximum(0, x)\n", + "\n", + "def predict(params, image):\n", + " # per-example predictions\n", + " activations = image\n", + " for w, b in params[:-1]:\n", + " outputs = jnp.dot(w, activations) + b\n", + " activations = relu(outputs)\n", + "\n", + " final_w, final_b = params[-1]\n", + " logits = jnp.dot(final_w, activations) + final_b\n", + " return logits - logsumexp(logits)\n", + "\n", + "# Make a batched version of the `predict` function\n", + "batched_predict = vmap(predict, in_axes=(None, 0))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AMWmxjVEpH2D" + }, + "source": [ + "Multi-device setup using a Mesh of devices" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "4Jc5YLFnpE-_" + }, + "outputs": [], + "source": [ + "# Get the number of available devices (GPUs/TPUs) for sharding\n", + "num_devices = len(jax.devices())\n", + "\n", + "# Multi-device setup using a Mesh of devices\n", + "devices = jax.devices()\n", + "mesh = Mesh(devices, ('device',))\n", + "\n", + "# Define the sharding specification - split the data along the first axis (batch)\n", + "sharding_spec = PartitionSpec('device')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rLqfeORsERek" + }, + "source": [ + "### Utility and Loss Functions\n", + "\n", + "You'll now define utility functions for:\n", + "- One-hot encoding: Converts class indices to binary vectors.\n", + "- Accuracy calculation: Measures the performance of the model on the dataset.\n", + "- Loss computation: Calculates the difference between predictions and targets.\n", + "\n", + "To optimize performance:\n", + "- [`grad`](https://jax.readthedocs.io/en/latest/_autosummary/jax.grad.html#jax.grad) is used to compute gradients of the loss 
function with respect to network parameters.\n", + "- [`jit`](https://jax.readthedocs.io/en/latest/_autosummary/jax.jit.html#jax.jit) compiles the update function, enabling faster execution by leveraging JAX's [XLA](https://openxla.org/xla) compilation.\n", + "\n", + "- [`device_put`](https://jax.readthedocs.io/en/latest/_autosummary/jax.device_put.html) places each batch on the devices in the mesh, distributing it across TPU cores." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "sA0a06raEQfS" + }, + "outputs": [], + "source": [ + "import time\n", + "\n", + "def one_hot(x, k, dtype=jnp.float32):\n", + " \"\"\"Create a one-hot encoding of x of size k.\"\"\"\n", + " return jnp.array(x[:, None] == jnp.arange(k), dtype)\n", + "\n", + "def accuracy(params, images, targets):\n", + " \"\"\"Calculate the accuracy of predictions.\"\"\"\n", + " target_class = jnp.argmax(targets, axis=1)\n", + " predicted_class = jnp.argmax(batched_predict(params, images), axis=1)\n", + " return jnp.mean(predicted_class == target_class)\n", + "\n", + "def loss(params, images, targets):\n", + " \"\"\"Calculate the loss between predictions and targets.\"\"\"\n", + " preds = batched_predict(params, images)\n", + " return -jnp.mean(preds * targets)\n", + "\n", + "@jit\n", + "def update(params, x, y):\n", + " \"\"\"Update the network parameters using gradient descent.\"\"\"\n", + " grads = grad(loss)(params, x, y)\n", + " return [(w - step_size * dw, b - step_size * db)\n", + " for (w, b), (dw, db) in zip(params, grads)]\n", + "\n", + "def reshape_and_one_hot(x, y):\n", + " \"\"\"Reshape and one-hot encode the inputs.\"\"\"\n", + " x = jnp.reshape(x, (len(x), num_pixels))\n", + " y = one_hot(y, n_targets)\n", + " return x, y\n", + "\n", + "def train_model(num_epochs, params, training_generator, data_loader_type='streamed'):\n", + " \"\"\"Train the model for a given number of epochs, sharding each batch across devices with device_put.\"\"\"\n", + " for epoch in range(num_epochs):\n", + " start_time = time.time()\n",
+ " for x, y in training_generator() if data_loader_type == 'streamed' else training_generator:\n", + " x, y = reshape_and_one_hot(x, y)\n", + " x, y = device_put(x, NamedSharding(mesh, sharding_spec)), device_put(y, NamedSharding(mesh, sharding_spec))\n", + " params = update(params, x, y)\n", + "\n", + " print(f\"Epoch {epoch + 1} in {time.time() - start_time:.2f} sec: \"\n", + " f\"Train Accuracy: {accuracy(params, train_images, train_labels):.4f},\"\n", + " f\"Test Accuracy: {accuracy(params, test_images, test_labels):.4f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hsionp5IYsQ9" + }, + "source": [ + "## Loading Data with PyTorch DataLoader\n", + "\n", + "This section shows how to load the MNIST dataset using PyTorch's DataLoader, convert the data to NumPy arrays, and apply transformations to flatten and cast images." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "33Wyf77WzNjA", + "outputId": "a2378431-79f2-4dc4-aa1a-d98704657d26" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.1+cpu)\n", + "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.20.1+cpu)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.16.1)\n", + "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2024.10.0)\n", + "Requirement already satisfied: sympy==1.13.1 in 
/usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)\n", + "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision) (1.26.4)\n", + "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision) (11.0.0)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (3.0.2)\n" + ] + } + ], + "source": [ + "!pip install torch torchvision" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "kO5_WzwY59gE" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from jax.tree_util import tree_map\n", + "from torch.utils import data\n", + "from torchvision.datasets import MNIST" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "6f6qU8PCc143" + }, + "outputs": [], + "source": [ + "def numpy_collate(batch):\n", + " \"\"\"Collate function to convert a batch of PyTorch data into NumPy arrays.\"\"\"\n", + " return tree_map(np.asarray, data.default_collate(batch))\n", + "\n", + "class NumpyLoader(data.DataLoader):\n", + " \"\"\"Custom DataLoader to return NumPy arrays from a PyTorch Dataset.\"\"\"\n", + " def __init__(self, dataset, batch_size=1,\n", + " shuffle=False, sampler=None,\n", + " batch_sampler=None, num_workers=0,\n", + " pin_memory=False, drop_last=False,\n", + " timeout=0, worker_init_fn=None):\n", + " super().__init__(dataset,\n", + " batch_size=batch_size,\n", + " shuffle=shuffle,\n", + " sampler=sampler,\n", + " batch_sampler=batch_sampler,\n", + " num_workers=num_workers,\n", + " collate_fn=numpy_collate,\n", + " pin_memory=pin_memory,\n", + " drop_last=drop_last,\n", + " timeout=timeout,\n", + " worker_init_fn=worker_init_fn)\n", + "class FlattenAndCast(object):\n", + " 
\"\"\"Transform class to flatten and cast images to float32.\"\"\"\n", + " def __call__(self, pic):\n", + " return np.ravel(np.array(pic, dtype=jnp.float32))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ec-MHhv6hYsK" + }, + "source": [ + "### Load Dataset with Transformations\n", + "\n", + "Standardize the data by flattening the images, casting them to `float32`, and ensuring consistent data types." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "nSviwX9ohhUh", + "outputId": "0bb3bc04-11ac-4fb6-8854-76a3f5e725a5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz\n", + "Failed to download (trying next):\n", + "HTTP Error 403: Forbidden\n", + "\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/train-images-idx3-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 9.91M/9.91M [00:00<00:00, 36.1MB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/mnist_dataset/MNIST/raw/train-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", + "\n", + "Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz\n", + "Failed to download (trying next):\n", + "HTTP Error 403: Forbidden\n", + "\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/train-labels-idx1-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 28.9k/28.9k [00:00<00:00, 1.13MB/s]\n" + ] + }, + { + "name": 
"stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/mnist_dataset/MNIST/raw/train-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", + "\n", + "Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz\n", + "Failed to download (trying next):\n", + "HTTP Error 403: Forbidden\n", + "\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/t10k-images-idx3-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1.65M/1.65M [00:00<00:00, 10.1MB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/mnist_dataset/MNIST/raw/t10k-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", + "\n", + "Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz\n", + "Failed to download (trying next):\n", + "HTTP Error 403: Forbidden\n", + "\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz\n", + "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 4.54k/4.54k [00:00<00:00, 6.34MB/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting /tmp/mnist_dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "mnist_dataset = MNIST(data_dir, download=True, transform=FlattenAndCast())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kbdsqvPZGrsa" + }, + "source": [ + "### Full Training Dataset for Accuracy Checks\n", + "\n", + "Convert the entire training dataset to JAX arrays." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "c9ZCJq_rzPck" + }, + "outputs": [], + "source": [ + "train_images = jnp.array(mnist_dataset.data.numpy().reshape(len(mnist_dataset.data), -1), dtype=jnp.float32)\n", + "train_labels = one_hot(np.array(mnist_dataset.targets), n_targets)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WXUh0BwvG8Ko" + }, + "source": [ + "### Get Full Test Dataset\n", + "\n", + "Load and process the full test dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "brlLG4SqGphm" + }, + "outputs": [], + "source": [ + "mnist_dataset_test = MNIST(data_dir, download=True, train=False)\n", + "test_images = jnp.array(mnist_dataset_test.data.numpy().reshape(len(mnist_dataset_test.data), -1), dtype=jnp.float32)\n", + "test_labels = one_hot(np.array(mnist_dataset_test.targets), n_targets)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Oz-UVnCxG5E8", + "outputId": "0f44cb63-b12c-47a7-8bd5-ed773e2b2ec5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train: (60000, 784) (60000, 10)\n", + "Test: (10000, 784) (10000, 10)\n" + ] + } + ], + "source": [ + "print('Train:', train_images.shape, train_labels.shape)\n", + "print('Test:', test_images.shape, test_labels.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mfSnfJND6I8G" + }, + "source": [ + "### Training Data Generator\n", + "\n", + "Define a generator function using PyTorch's DataLoader for batch training.\n", + "Setting `num_workers > 0` enables multi-process data loading, which can accelerate data loading for larger datasets or intensive preprocessing tasks. 
Experiment with different values to find the optimal setting for your hardware and workload.\n", + "\n", + "Note: When setting `num_workers > 0`, you may see the following `RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.`\n", + "This warning can be safely ignored since data loaders do not use JAX within the forked processes." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "Kxbl6bcx6crv" + }, + "outputs": [], + "source": [ + "def pytorch_training_generator(mnist_dataset):\n", + " return NumpyLoader(mnist_dataset, batch_size=batch_size, num_workers=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xzt2x9S1HC3T" + }, + "source": [ + "### Training Loop (PyTorch DataLoader)\n", + "\n", + "The training loop uses the PyTorch DataLoader to iterate through batches and update model parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MUrJxpjvUyOm", + "outputId": "629a19b1-acba-418a-f04b-3b78d7909de1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1 in 5.65 sec: Train Accuracy: 0.9159,Test Accuracy: 0.9197\n", + "Epoch 2 in 4.26 sec: Train Accuracy: 0.9371,Test Accuracy: 0.9383\n", + "Epoch 3 in 4.39 sec: Train Accuracy: 0.9493,Test Accuracy: 0.9468\n", + "Epoch 4 in 4.16 sec: Train Accuracy: 0.9568,Test Accuracy: 0.9536\n", + "Epoch 5 in 4.04 sec: Train Accuracy: 0.9632,Test Accuracy: 0.9576\n", + "Epoch 6 in 4.06 sec: Train Accuracy: 0.9674,Test Accuracy: 0.9617\n", + "Epoch 7 in 4.06 sec: Train Accuracy: 0.9708,Test Accuracy: 0.9649\n", + "Epoch 8 in 4.07 sec: Train Accuracy: 0.9737,Test Accuracy: 0.9672\n" + ] + } + ], + "source": [ + "train_model(num_epochs, params, pytorch_training_generator(mnist_dataset), data_loader_type='iterable')" + ] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "ACy1PoSVa3zH" + }, + "source": [ + "## Loading Data with TensorFlow Datasets (TFDS)\n", + "\n", + "This section demonstrates how to load the MNIST dataset using TFDS, fetch the full dataset for evaluation, and define a training generator for batch processing. GPU usage is explicitly disabled for TensorFlow." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tcJRzpyOveWK" + }, + "source": [ + "Ensure you have the latest versions of both TensorFlow and TensorFlow Datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "_f55HPGAZu6P", + "outputId": "838c8f76-aa07-49d5-986d-3c88ed516b22" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: tensorflow in /usr/local/lib/python3.10/dist-packages (2.15.0)\n", + "Collecting tensorflow\n", + " Downloading tensorflow-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)\n", + "Requirement already satisfied: tensorflow-datasets in /usr/local/lib/python3.10/dist-packages (4.9.7)\n", + "Requirement already satisfied: absl-py>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.4.0)\n", + "Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.6.3)\n", + "Requirement already satisfied: flatbuffers>=24.3.25 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (24.3.25)\n", + "Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (0.6.0)\n", + "Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (0.2.0)\n", + "Requirement already satisfied: libclang>=13.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (18.1.1)\n", + "Requirement 
already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (3.4.0)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from tensorflow) (24.2)\n", + "Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (4.25.5)\n", + "Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (2.32.3)\n", + "Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from tensorflow) (75.1.0)\n", + "Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.16.0)\n", + "Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (2.5.0)\n", + "Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (4.12.2)\n", + "Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.14.1)\n", + "Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.68.0)\n", + "Collecting tensorboard<2.19,>=2.18 (from tensorflow)\n", + " Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)\n", + "Collecting keras>=3.5.0 (from tensorflow)\n", + " Downloading keras-3.6.0-py3-none-any.whl.metadata (5.8 kB)\n", + "Requirement already satisfied: numpy<2.1.0,>=1.26.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (1.26.4)\n", + "Requirement already satisfied: h5py>=3.11.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (3.12.1)\n", + "Collecting ml-dtypes<0.5.0,>=0.4.0 (from tensorflow)\n", + " Downloading ml_dtypes-0.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)\n", + "Requirement already satisfied: 
tensorflow-io-gcs-filesystem>=0.23.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow) (0.37.1)\n", + "Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (8.1.7)\n", + "Requirement already satisfied: dm-tree in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (0.1.8)\n", + "Requirement already satisfied: immutabledict in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (4.2.1)\n", + "Requirement already satisfied: promise in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (2.3)\n", + "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (5.9.5)\n", + "Requirement already satisfied: pyarrow in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (18.0.0)\n", + "Requirement already satisfied: simple-parsing in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (0.1.6)\n", + "Requirement already satisfied: tensorflow-metadata in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (1.13.1)\n", + "Requirement already satisfied: toml in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (0.10.2)\n", + "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (4.66.6)\n", + "Requirement already satisfied: array-record>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets) (0.5.1)\n", + "Requirement already satisfied: etils>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < \"3.11\"->tensorflow-datasets) (1.10.0)\n", + "Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from astunparse>=1.6.0->tensorflow) (0.45.0)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < 
\"3.11\"->tensorflow-datasets) (2024.10.0)\n", + "Requirement already satisfied: importlib_resources in /usr/local/lib/python3.10/dist-packages (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < \"3.11\"->tensorflow-datasets) (6.4.5)\n", + "Requirement already satisfied: zipp in /usr/local/lib/python3.10/dist-packages (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < \"3.11\"->tensorflow-datasets) (3.21.0)\n", + "Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras>=3.5.0->tensorflow) (13.9.4)\n", + "Collecting namex (from keras>=3.5.0->tensorflow)\n", + " Downloading namex-0.0.8-py3-none-any.whl.metadata (246 bytes)\n", + "Collecting optree (from keras>=3.5.0->tensorflow)\n", + " Downloading optree-0.13.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (47 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m47.8/47.8 kB\u001b[0m \u001b[31m1.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorflow) (3.4.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorflow) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorflow) (2.2.3)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorflow) (2024.8.30)\n", + "Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (3.7)\n", + "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (0.7.2)\n", + "Requirement already satisfied: werkzeug>=1.0.1 in 
/usr/local/lib/python3.10/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (3.1.3)\n", + "Requirement already satisfied: docstring-parser<1.0,>=0.15 in /usr/local/lib/python3.10/dist-packages (from simple-parsing->tensorflow-datasets) (0.16)\n", + "Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow-metadata->tensorflow-datasets) (1.66.0)\n", + "Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.10/dist-packages (from werkzeug>=1.0.1->tensorboard<2.19,>=2.18->tensorflow) (3.0.2)\n", + "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.5.0->tensorflow) (3.0.0)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.5.0->tensorflow) (2.18.0)\n", + "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras>=3.5.0->tensorflow) (0.1.2)\n", + "Downloading tensorflow-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (615.3 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m615.3/615.3 MB\u001b[0m \u001b[31m626.4 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading keras-3.6.0-py3-none-any.whl (1.2 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.2/1.2 MB\u001b[0m \u001b[31m49.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading ml_dtypes-0.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.2/2.2 MB\u001b[0m \u001b[31m77.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading tensorboard-2.18.0-py3-none-any.whl (5.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m70.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading namex-0.0.8-py3-none-any.whl (5.8 kB)\n", + "Downloading optree-0.13.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (381 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m381.3/381.3 kB\u001b[0m \u001b[31m27.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: namex, optree, ml-dtypes, tensorboard, keras, tensorflow\n", + " Attempting uninstall: ml-dtypes\n", + " Found existing installation: ml-dtypes 0.2.0\n", + " Uninstalling ml-dtypes-0.2.0:\n", + " Successfully uninstalled ml-dtypes-0.2.0\n", + " Attempting uninstall: tensorboard\n", + " Found existing installation: tensorboard 2.15.2\n", + " Uninstalling tensorboard-2.15.2:\n", + " Successfully uninstalled tensorboard-2.15.2\n", + " Attempting uninstall: keras\n", + " Found existing installation: keras 2.15.0\n", + " Uninstalling keras-2.15.0:\n", + " Successfully uninstalled keras-2.15.0\n", + " Attempting uninstall: tensorflow\n", + " Found existing installation: tensorflow 2.15.0\n", + " Uninstalling tensorflow-2.15.0:\n", + " Successfully uninstalled tensorflow-2.15.0\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", + "tensorflow-text 2.15.0 requires tensorflow<2.16,>=2.15.0; platform_machine != \"arm64\" or platform_system != \"Darwin\", but you have tensorflow 2.18.0 which is incompatible.\n", + "tf-keras 2.15.1 requires tensorflow<2.16,>=2.15, but you have tensorflow 2.18.0 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0mSuccessfully installed keras-3.6.0 ml-dtypes-0.4.1 namex-0.0.8 optree-0.13.1 tensorboard-2.18.0 tensorflow-2.18.0\n" + ] + }, + { + "data": { + "application/vnd.colab-display-data+json": { + "id": "62e7ae5195964acea7f16ab1423ff920", + "pip_warning": { + "packages": [ + "ml_dtypes" + ] + } + } + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "!pip install --upgrade tensorflow tensorflow-datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "sGaQAk1DHMUx" + }, + "outputs": [], + "source": [ + "import tensorflow_datasets as tfds" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6OlzaDqwe4p" + }, + "source": [ + "### Fetch Full Dataset for Evaluation\n", + "\n", + "Load the dataset with `tfds.load`, convert it to NumPy arrays, and process it for evaluation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 104, + "referenced_widgets": [ + "43d95e3e6b704cb5ae941541862e35fe", + "fca543b71352477db00545b3990d44fa", + "d3c971a3507249c9a22cad026e46d739", + "6da776e94f7740b9aae06f298c1e03cd", + "b4aec5e3895e4a19912c74777e9ea835", + "ef4dc5b756d74129bd2d643d99a1ab2e", + "30243b81748e497eb526b25404e95826", + "3bb9b93e595d4a0ca973ded476c0a5d0", + "b770951ecace4b02ad1575fe9eb9e640", + "79009c4ea2bf46b1a3a2c6558fa6ec2f", + "5cb081d3a038482583350d018a768bd4" + ] + }, + "id": "1hOamw_7C8Pb", + "outputId": "0e3805dc-1bfd-4222-9052-0b2111ea3091" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /tmp/mnist_dataset/mnist/3.0.1...\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "43d95e3e6b704cb5ae941541862e35fe", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Dl Completed...: 0%| | 0/5 [00:00=9.1.0 (from grain)\n", + " Downloading more_itertools-10.5.0-py3-none-any.whl.metadata (36 kB)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from grain) (1.26.4)\n", + "Requirement already satisfied: typing_extensions in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (4.12.2)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (2024.10.0)\n", + "Requirement already satisfied: importlib_resources in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (6.4.5)\n", + "Requirement already satisfied: zipp in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (3.21.0)\n", + "Downloading grain-0.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (418 kB)\n", + "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m419.0/419.0 kB\u001b[0m \u001b[31m7.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading more_itertools-10.5.0-py3-none-any.whl (60 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m61.0/61.0 kB\u001b[0m \u001b[31m3.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading jaxtyping-0.2.36-py3-none-any.whl (55 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.8/55.8 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: more-itertools, jaxtyping, grain\n", + " Attempting uninstall: more-itertools\n", + " Found existing installation: more-itertools 8.10.0\n", + " Uninstalling more-itertools-8.10.0:\n", + " Successfully uninstalled more-itertools-8.10.0\n", + "Successfully installed grain-0.2.2 jaxtyping-0.2.36 more-itertools-10.5.0\n" + ] + } + ], + "source": [ + "!pip install grain" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "66bH3ZDJ7Iat" + }, + "source": [ + "Import Required Libraries (import MNIST dataset from torchvision)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "mS62eVL9Ifmz" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import grain.python as pygrain\n", + "from torchvision.datasets import MNIST" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0h6mwVrspPA-" + }, + "source": [ + "### Define Dataset Class\n", + "\n", + "Create a custom dataset class to load MNIST data for Grain." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "bnrhac5Hh7y1" + }, + "outputs": [], + "source": [ + "class Dataset:\n", + " def __init__(self, data_dir, train=True):\n", + " self.data_dir = data_dir\n", + " self.train = train\n", + " self.load_data()\n", + "\n", + " def load_data(self):\n", + " # Load the MNIST dataset with torchvision; Grain reads it as a random-access data source\n", + " self.dataset = MNIST(self.data_dir, download=True, train=self.train)\n", + "\n", + " def __len__(self):\n", + " return len(self.dataset)\n", + "\n", + " def __getitem__(self, index):\n", + " img, label = self.dataset[index]\n", + " return np.ravel(np.array(img, dtype=np.float32)), label" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "53mf8bWEsyTr" + }, + "source": [ + "### Initialize the Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "pN3oF7-ostGE" + }, + "outputs": [], + "source": [ + "mnist_dataset = Dataset(data_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GqD-ycgBuwv9" + }, + "source": [ + "### Get the full train and test dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "id": "f1VnTuX3u_kL" + }, + "outputs": [], + "source": [ + "train_images = jnp.array([mnist_dataset[i][0] for i in range(len(mnist_dataset))], dtype=jnp.float32)\n", + "train_labels = one_hot(np.array([mnist_dataset[i][1] for i in range(len(mnist_dataset))]), n_targets)\n", + "\n", + "mnist_dataset_test = MNIST(data_dir, download=True, train=False)\n", + "\n", + "# Convert test images to JAX arrays and encode test labels as one-hot vectors\n", + "test_images = jnp.array([np.ravel(np.array(mnist_dataset_test[i][0], dtype=np.float32)) for i in range(len(mnist_dataset_test))], dtype=jnp.float32)\n", + "test_labels = one_hot(np.array([mnist_dataset_test[i][1] for i in range(len(mnist_dataset_test))]), n_targets)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { +
"colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "a2NHlp9klrQL", + "outputId": "cc9e0958-8484-4669-a2d1-abac36a3097f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train: (60000, 784) (60000, 10)\n", + "Test: (10000, 784) (10000, 10)\n" + ] + } + ], + "source": [ + "print(\"Train:\", train_images.shape, train_labels.shape)\n", + "print(\"Test:\", test_images.shape, test_labels.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1QPbXt7O0JN-" + }, + "source": [ + "### Initialize PyGrain DataLoader" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "9RuFTcsCs2Ac" + }, + "outputs": [], + "source": [ + "sampler = pygrain.SequentialSampler(\n", + " num_records=len(mnist_dataset),\n", + " shard_options=pygrain.ShardByJaxProcess()) # Shard across TPU cores\n", + "\n", + "def pygrain_training_generator():\n", + " return pygrain.DataLoader(\n", + " data_source=mnist_dataset,\n", + " sampler=sampler,\n", + " operations=[pygrain.Batch(batch_size)],\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GvpJPHAbeuHW" + }, + "source": [ + "### Training Loop (Grain)\n", + "\n", + "Run the training loop using the Grain DataLoader." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cjxJRtiTadEI", + "outputId": "a620e9f7-7a01-4ba8-fe16-6f988401c7c1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1 in 8.05 sec: Train Accuracy: 0.9159,Test Accuracy: 0.9197\n", + "Epoch 2 in 8.14 sec: Train Accuracy: 0.9371,Test Accuracy: 0.9383\n", + "Epoch 3 in 8.99 sec: Train Accuracy: 0.9493,Test Accuracy: 0.9468\n", + "Epoch 4 in 9.00 sec: Train Accuracy: 0.9568,Test Accuracy: 0.9536\n", + "Epoch 5 in 8.40 sec: Train Accuracy: 0.9632,Test Accuracy: 0.9576\n", + "Epoch 6 in 8.28 sec: Train Accuracy: 0.9674,Test Accuracy: 0.9617\n", + "Epoch 7 in 8.20 sec: Train Accuracy: 0.9708,Test Accuracy: 0.9649\n", + "Epoch 8 in 8.24 sec: Train Accuracy: 0.9737,Test Accuracy: 0.9672\n" + ] + } + ], + "source": [ + "train_model(num_epochs, params, pygrain_training_generator)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oixvOI816qUn" + }, + "source": [ + "## Loading Data with Hugging Face\n", + "\n", + "This section demonstrates loading MNIST data using the Hugging Face `datasets` library. You'll format the dataset for JAX compatibility, prepare flattened images and one-hot-encoded labels, and define a training generator." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o51P6lr86wz-" + }, + "source": [ + "Install the Hugging Face `datasets` library." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "19ipxPhI6oSN", + "outputId": "e0d52dfb-6c60-4539-a043-574d2533a744" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting datasets\n", + " Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.16.1)\n", + "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.26.4)\n", + "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (18.0.0)\n", + "Collecting dill<0.3.9,>=0.3.0 (from datasets)\n", + " Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)\n", + "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)\n", + "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.32.3)\n", + "Requirement already satisfied: tqdm>=4.66.3 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.6)\n", + "Collecting xxhash (from datasets)\n", + " Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n", + "Collecting multiprocess<0.70.17 (from datasets)\n", + " Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)\n", + "Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)\n", + " Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)\n", + "Collecting aiohttp (from datasets)\n", + " Downloading aiohttp-3.11.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)\n", + "Requirement already satisfied: huggingface-hub>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.26.2)\n", + "Requirement already satisfied: packaging in 
/usr/local/lib/python3.10/dist-packages (from datasets) (24.2)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.2)\n", + "Collecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n", + " Downloading aiohappyeyeballs-2.4.3-py3-none-any.whl.metadata (6.1 kB)\n", + "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n", + " Downloading aiosignal-1.3.1-py3-none-any.whl.metadata (4.0 kB)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)\n", + "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n", + " Downloading frozenlist-1.5.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)\n", + "Collecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n", + " Downloading multidict-6.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.0 kB)\n", + "Collecting propcache>=0.2.0 (from aiohttp->datasets)\n", + " Downloading propcache-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)\n", + "Collecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n", + " Downloading yarl-1.17.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (66 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m66.6/66.6 kB\u001b[0m \u001b[31m1.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting async-timeout<6.0,>=4.0 (from aiohttp->datasets)\n", + " Downloading async_timeout-5.0.1-py3-none-any.whl.metadata (5.1 kB)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.0->datasets) (4.12.2)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.4.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in 
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2.2.3)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2024.8.30)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.9.0.post0)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)\n", + "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", + "Downloading datasets-3.1.0-py3-none-any.whl (480 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m7.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m10.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (179 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.3/179.3 kB\u001b[0m \u001b[31m15.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading aiohttp-3.11.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m30.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading multiprocess-0.70.16-py310-none-any.whl (134 kB)\n", + 
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m9.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m15.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading aiohappyeyeballs-2.4.3-py3-none-any.whl (14 kB)\n", + "Downloading aiosignal-1.3.1-py3-none-any.whl (7.6 kB)\n", + "Downloading async_timeout-5.0.1-py3-none-any.whl (6.2 kB)\n", + "Downloading frozenlist-1.5.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (241 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m241.9/241.9 kB\u001b[0m \u001b[31m18.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading multidict-6.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (124 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m124.6/124.6 kB\u001b[0m \u001b[31m10.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading propcache-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (208 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m208.9/208.9 kB\u001b[0m \u001b[31m15.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading yarl-1.17.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (319 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m319.2/319.2 kB\u001b[0m \u001b[31m23.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: xxhash, propcache, multidict, fsspec, frozenlist, dill, async-timeout, aiohappyeyeballs, yarl, multiprocess, aiosignal, aiohttp, 
datasets\n", + " Attempting uninstall: fsspec\n", + " Found existing installation: fsspec 2024.10.0\n", + " Uninstalling fsspec-2024.10.0:\n", + " Successfully uninstalled fsspec-2024.10.0\n", + "Successfully installed aiohappyeyeballs-2.4.3 aiohttp-3.11.6 aiosignal-1.3.1 async-timeout-5.0.1 datasets-3.1.0 dill-0.3.8 frozenlist-1.5.0 fsspec-2024.9.0 multidict-6.1.0 multiprocess-0.70.16 propcache-0.2.0 xxhash-3.5.0 yarl-1.17.2\n" + ] + } + ], + "source": [ + "!pip install datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "id": "8v1N59p76zn0" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Gaj11tO7C86" + }, + "source": [ + "Load the MNIST dataset from Hugging Face and format it as `numpy` arrays for quick access or `jax` to get JAX arrays." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 301, + "referenced_widgets": [ + "86617153e14143c6900da3535b74ef07", + "8de57c9ecba14aa5b1f642af5c7e9094", + "515fe154b1b74ed981e877aef503aa99", + "4e291a8b028847328ea1d9a650c20beb", + "87a0c8cdc0ad423daba7082b985cbd2b", + "4764b5b806b94734b760cf6cc2fc224d", + "5307bf3142804235bb688694c517d80c", + "6a2fd6755667443abe7710ad607a79cc", + "91bc1755904e40db8d758db4d09754e3", + "69c38d75960542fb83fa087cae761957", + "dc31cb349c9b4c3580b2b77cbad1325c", + "d451224a0ce540648b0c28d433d85803", + "52f2f12dcffe4507ab92286fd3810db6", + "6ab919475c80413e94afa66304b05338", + "305d05093c6e411cb438a0bbf122d574", + "aa11f21e68994a8d9ddead215f2f4920", + "59a7233abf61461b8b3feeb31b2f544f", + "9d909399be9a4fa48bc3d781905c7f5a", + "5b6172eb4e0541a3b07d4f82de77a303", + "bc3bec617b0040f487f80134537a3068", + "9fe417f8159244f8ac808f2844922cf3", + "c4748e35e8574bb286a527295df98c8e", + "f50572e8058c4864bb8143c364d191f9", + "436955f611674e27b4ddf3e040cc5ce9", + 
"048231bf788c447091b8ef0174101f42", + "97009f7e20d84c7c9d89f7497efc494c", + "84e2844437884f6c89683e6545a2262e", + "df3019cc6aa44a4cbcb62096444769a7", + "ce17fe81850c49cd924297d21ecda621", + "422117e32e0b4a95bed7925c99fd9f78", + "56ab1fa0212a43a4a70838e440be0e9c", + "1c5483472cea483bbf2a8fe2a9182ce0", + "00034cb6a66143d8a87922befb1da7a6", + "368b51d79aed4184854f155e2951da81", + "eb9de18be48d4a0db1034a38a0287ea6", + "dbec1d9b196849a5ad79a5f083dbe64e", + "66db6915d27b4fb49e1b44f70cb61654", + "80f3e3a30dc24d3fa54bb72dc1c60182", + "c320096ba1e74c7bbbd9509cc11c22e9", + "a664dd9c446040e8b175bb91d1c051db", + "66c7826ff9b4455db9f7e9717a432f73", + "74ec8cec0f3c4c04b76f5fb87ea2d9bb", + "ea4537aef1e247378de1935ad50ef76c", + "a9cffb2f5e194dfaba516bb4c8c47e3f", + "4f17b7ab6ae94ce3b122561bcd8d4427", + "3c0bdc06fe07412bacc00daa6f1eec34", + "1ba273ced1484bcf9855366ff0dc3645", + "7413d8bab616446ba6b820a3f874f6a0", + "53c160c26c634b53a914be18ed91016c", + "ebc4ad2fae264e72a5307a0481a97ab3", + "83ab5e7617fb45898c259bc20f71e958", + "21f1138e807e4946953e3074d72d9a27", + "86d7357878634706b9e214103efa262a", + "3713a0e1880a43bc8b23225dbb8b4c45", + "f9f85ce1cbf34a7da27804ce7cc6444e" + ] + }, + "id": "a22kTvgk6_fJ", + "outputId": "53e1d208-5360-479b-c097-0c03c7fac3e8" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n", + "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", + "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", + "You will be able to reuse this secret in all of your notebooks.\n", + "Please note that authentication is recommended but still optional to access public models or datasets.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": 
"86617153e14143c6900da3535b74ef07", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "README.md: 0%| | 0.00/6.97k [00:00 0` enables multi-process data loading, which can accelerate data loading for larger datasets or intensive preprocessing tasks. Experiment with different values to find the optimal setting for your hardware and workload. + +Note: When setting `num_workers > 0`, you may see the following `RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.` +This warning can be safely ignored since data loaders do not use JAX within the forked processes. + +```{code-cell} +:id: Kxbl6bcx6crv + +def pytorch_training_generator(mnist_dataset): + return NumpyLoader(mnist_dataset, batch_size=batch_size, num_workers=0) +``` + ++++ {"id": "Xzt2x9S1HC3T"} + +### Training Loop (PyTorch DataLoader) + +The training loop uses the PyTorch DataLoader to iterate through batches and update model parameters. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: MUrJxpjvUyOm +outputId: 629a19b1-acba-418a-f04b-3b78d7909de1 +--- +train_model(num_epochs, params, pytorch_training_generator(mnist_dataset), data_loader_type='iterable') +``` + ++++ {"id": "ACy1PoSVa3zH"} + +## Loading Data with TensorFlow Datasets (TFDS) + +This section demonstrates how to load the MNIST dataset using TFDS, fetch the full dataset for evaluation, and define a training generator for batch processing. GPU usage is explicitly disabled for TensorFlow. 
+ ++++ {"id": "tcJRzpyOveWK"} + +Ensure you have the latest versions of both TensorFlow and TensorFlow Datasets + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ + height: 1000 +id: _f55HPGAZu6P +outputId: 838c8f76-aa07-49d5-986d-3c88ed516b22 +--- +!pip install --upgrade tensorflow tensorflow-datasets +``` + +```{code-cell} +:id: sGaQAk1DHMUx + +import tensorflow_datasets as tfds +``` + ++++ {"id": "F6OlzaDqwe4p"} + +### Fetch Full Dataset for Evaluation + +Load the dataset with `tfds.load`, convert it to NumPy arrays, and process it for evaluation. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ + height: 104 + referenced_widgets: [43d95e3e6b704cb5ae941541862e35fe, fca543b71352477db00545b3990d44fa, + d3c971a3507249c9a22cad026e46d739, 6da776e94f7740b9aae06f298c1e03cd, b4aec5e3895e4a19912c74777e9ea835, + ef4dc5b756d74129bd2d643d99a1ab2e, 30243b81748e497eb526b25404e95826, 3bb9b93e595d4a0ca973ded476c0a5d0, + b770951ecace4b02ad1575fe9eb9e640, 79009c4ea2bf46b1a3a2c6558fa6ec2f, 5cb081d3a038482583350d018a768bd4] +id: 1hOamw_7C8Pb +outputId: 0e3805dc-1bfd-4222-9052-0b2111ea3091 +--- +# tfds.load returns tf.Tensors (or tf.data.Datasets if batch_size != -1) +mnist_data, info = tfds.load(name="mnist", batch_size=-1, data_dir=data_dir, with_info=True) +mnist_data = tfds.as_numpy(mnist_data) +train_data, test_data = mnist_data['train'], mnist_data['test'] + +# Full train set +train_images, train_labels = train_data['image'], train_data['label'] +train_images = jnp.reshape(train_images, (len(train_images), num_pixels)) +train_labels = one_hot(train_labels, n_targets) + +# Full test set +test_images, test_labels = test_data['image'], test_data['label'] +test_images = jnp.reshape(test_images, (len(test_images), num_pixels)) +test_labels = one_hot(test_labels, n_targets) +``` + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: Td3PiLdmEf7z +outputId: 464da4f6-f028-4667-889d-a812382739b0 +--- +print('Train:', 
train_images.shape, train_labels.shape) +print('Test:', test_images.shape, test_labels.shape) +``` + ++++ {"id": "yy9PunCJdI-G"} + +### Define the Training Generator + +Create a generator function to yield batches of data for training. + +```{code-cell} +:id: vX59u8CqEf4J + +def training_generator(): + # as_supervised=True gives us the (image, label) as a tuple instead of a dict + ds = tfds.load(name='mnist', split='train', as_supervised=True, data_dir=data_dir) + # You can build up an arbitrary tf.data input pipeline + ds = ds.batch(batch_size).prefetch(1) + # tfds.as_numpy converts the tf.data.Dataset into an iterable of NumPy arrays + return tfds.as_numpy(ds) +``` + ++++ {"id": "EAWeUdnuFNBY"} + +### Training Loop (TFDS) + +Use the training generator in a custom training loop. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: AsFKboVRaV6r +outputId: 9cb33f79-1b17-439d-88d3-61cd984124f6 +--- +train_model(num_epochs, params, training_generator) +``` + ++++ {"id": "-ryVkrAITS9Z"} + +## Loading Data with Grain + +This section demonstrates how to load MNIST data using Grain, a data-loading library. You'll define a custom dataset class for Grain and set up a Grain DataLoader for efficient training. + ++++ {"id": "waYhUMUGmhH-"} + +Install Grain. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: L78o7eeyGvn5 +outputId: 8f32bb0f-9a73-48a9-dbcd-4eb93ba3f606 +--- +!pip install grain +``` + ++++ {"id": "66bH3ZDJ7Iat"} + +Import the required libraries, including the MNIST dataset class from torchvision. + +```{code-cell} +:id: mS62eVL9Ifmz + +import numpy as np +import grain.python as pygrain +from torchvision.datasets import MNIST +``` + ++++ {"id": "0h6mwVrspPA-"} + +### Define Dataset Class + +Create a custom dataset class to load MNIST data for Grain.
+ +```{code-cell} +:id: bnrhac5Hh7y1 + +class Dataset: + def __init__(self, data_dir, train=True): + self.data_dir = data_dir + self.train = train + self.load_data() + + def load_data(self): + # Load the MNIST dataset with torchvision; Grain reads it as a random-access data source + self.dataset = MNIST(self.data_dir, download=True, train=self.train) + + def __len__(self): + return len(self.dataset) + + def __getitem__(self, index): + img, label = self.dataset[index] + return np.ravel(np.array(img, dtype=np.float32)), label +``` + ++++ {"id": "53mf8bWEsyTr"} + +### Initialize the Dataset + +```{code-cell} +:id: pN3oF7-ostGE + +mnist_dataset = Dataset(data_dir) +``` + ++++ {"id": "GqD-ycgBuwv9"} + +### Get the full train and test dataset + +```{code-cell} +:id: f1VnTuX3u_kL + +train_images = jnp.array([mnist_dataset[i][0] for i in range(len(mnist_dataset))], dtype=jnp.float32) +train_labels = one_hot(np.array([mnist_dataset[i][1] for i in range(len(mnist_dataset))]), n_targets) + +mnist_dataset_test = MNIST(data_dir, download=True, train=False) + +# Convert test images to JAX arrays and encode test labels as one-hot vectors +test_images = jnp.array([np.ravel(np.array(mnist_dataset_test[i][0], dtype=np.float32)) for i in range(len(mnist_dataset_test))], dtype=jnp.float32) +test_labels = one_hot(np.array([mnist_dataset_test[i][1] for i in range(len(mnist_dataset_test))]), n_targets) +``` + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: a2NHlp9klrQL +outputId: cc9e0958-8484-4669-a2d1-abac36a3097f +--- +print("Train:", train_images.shape, train_labels.shape) +print("Test:", test_images.shape, test_labels.shape) +``` + ++++ {"id": "1QPbXt7O0JN-"} + +### Initialize PyGrain DataLoader + +```{code-cell} +:id: 9RuFTcsCs2Ac + +sampler = pygrain.SequentialSampler( + num_records=len(mnist_dataset), + shard_options=pygrain.ShardByJaxProcess()) # Shard by JAX process (a single shard on one CPU host) + +def pygrain_training_generator(): + return pygrain.DataLoader( + data_source=mnist_dataset, + sampler=sampler, +
operations=[pygrain.Batch(batch_size)], + ) +``` + ++++ {"id": "GvpJPHAbeuHW"} + +### Training Loop (Grain) + +Run the training loop using the Grain DataLoader. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: cjxJRtiTadEI +outputId: a620e9f7-7a01-4ba8-fe16-6f988401c7c1 +--- +train_model(num_epochs, params, pygrain_training_generator) +``` + ++++ {"id": "oixvOI816qUn"} + +## Loading Data with Hugging Face + +This section demonstrates loading MNIST data using the Hugging Face `datasets` library. You'll format the dataset for JAX compatibility, prepare flattened images and one-hot-encoded labels, and define a training generator. + ++++ {"id": "o51P6lr86wz-"} + +Install the Hugging Face `datasets` library. + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: 19ipxPhI6oSN +outputId: e0d52dfb-6c60-4539-a043-574d2533a744 +--- +!pip install datasets +``` + +```{code-cell} +:id: 8v1N59p76zn0 + +from datasets import load_dataset +``` + ++++ {"id": "8Gaj11tO7C86"} + +Load the MNIST dataset from Hugging Face and format it as `numpy` arrays for quick access or `jax` to get JAX arrays. 
+ +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ + height: 301 + referenced_widgets: [86617153e14143c6900da3535b74ef07, 8de57c9ecba14aa5b1f642af5c7e9094, + 515fe154b1b74ed981e877aef503aa99, 4e291a8b028847328ea1d9a650c20beb, 87a0c8cdc0ad423daba7082b985cbd2b, + 4764b5b806b94734b760cf6cc2fc224d, 5307bf3142804235bb688694c517d80c, 6a2fd6755667443abe7710ad607a79cc, + 91bc1755904e40db8d758db4d09754e3, 69c38d75960542fb83fa087cae761957, dc31cb349c9b4c3580b2b77cbad1325c, + d451224a0ce540648b0c28d433d85803, 52f2f12dcffe4507ab92286fd3810db6, 6ab919475c80413e94afa66304b05338, + 305d05093c6e411cb438a0bbf122d574, aa11f21e68994a8d9ddead215f2f4920, 59a7233abf61461b8b3feeb31b2f544f, + 9d909399be9a4fa48bc3d781905c7f5a, 5b6172eb4e0541a3b07d4f82de77a303, bc3bec617b0040f487f80134537a3068, + 9fe417f8159244f8ac808f2844922cf3, c4748e35e8574bb286a527295df98c8e, f50572e8058c4864bb8143c364d191f9, + 436955f611674e27b4ddf3e040cc5ce9, 048231bf788c447091b8ef0174101f42, 97009f7e20d84c7c9d89f7497efc494c, + 84e2844437884f6c89683e6545a2262e, df3019cc6aa44a4cbcb62096444769a7, ce17fe81850c49cd924297d21ecda621, + 422117e32e0b4a95bed7925c99fd9f78, 56ab1fa0212a43a4a70838e440be0e9c, 1c5483472cea483bbf2a8fe2a9182ce0, + 00034cb6a66143d8a87922befb1da7a6, 368b51d79aed4184854f155e2951da81, eb9de18be48d4a0db1034a38a0287ea6, + dbec1d9b196849a5ad79a5f083dbe64e, 66db6915d27b4fb49e1b44f70cb61654, 80f3e3a30dc24d3fa54bb72dc1c60182, + c320096ba1e74c7bbbd9509cc11c22e9, a664dd9c446040e8b175bb91d1c051db, 66c7826ff9b4455db9f7e9717a432f73, + 74ec8cec0f3c4c04b76f5fb87ea2d9bb, ea4537aef1e247378de1935ad50ef76c, a9cffb2f5e194dfaba516bb4c8c47e3f, + 4f17b7ab6ae94ce3b122561bcd8d4427, 3c0bdc06fe07412bacc00daa6f1eec34, 1ba273ced1484bcf9855366ff0dc3645, + 7413d8bab616446ba6b820a3f874f6a0, 53c160c26c634b53a914be18ed91016c, ebc4ad2fae264e72a5307a0481a97ab3, + 83ab5e7617fb45898c259bc20f71e958, 21f1138e807e4946953e3074d72d9a27, 86d7357878634706b9e214103efa262a, + 3713a0e1880a43bc8b23225dbb8b4c45, 
f9f85ce1cbf34a7da27804ce7cc6444e] +id: a22kTvgk6_fJ +outputId: 53e1d208-5360-479b-c097-0c03c7fac3e8 +--- +mnist_dataset = load_dataset("mnist", cache_dir=data_dir).with_format("numpy") +``` + ++++ {"id": "tgI7dIaX7JzM"} + +### Extract images and labels + +Get image shape and flatten for model input. + +```{code-cell} +:id: NHrKatD_7HbH + +train_images = mnist_dataset["train"]["image"] +train_labels = mnist_dataset["train"]["label"] +test_images = mnist_dataset["test"]["image"] +test_labels = mnist_dataset["test"]["label"] + +# Extract image shape +image_shape = train_images.shape[1:] +num_features = image_shape[0] * image_shape[1] + +# Flatten the images +train_images = train_images.reshape(-1, num_features) +test_images = test_images.reshape(-1, num_features) + +# One-hot encode the labels +train_labels = one_hot(train_labels, n_targets) +test_labels = one_hot(test_labels, n_targets) +``` + +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: dITh435Z7Nwb +outputId: cd77ebf6-7d44-420f-f8d8-4357f915c956 +--- +print('Train:', train_images.shape, train_labels.shape) +print('Test:', test_images.shape, test_labels.shape) +``` + ++++ {"id": "kk_4zJlz7T1E"} + +### Define Training Generator + +Set up a generator to yield batches of images and labels for training. + +```{code-cell} +:id: -zLJhogj7RL- + +def hf_training_generator(): + """Yield batches for training.""" + for batch in mnist_dataset["train"].iter(batch_size): + x, y = batch["image"], batch["label"] + yield x, y +``` + ++++ {"id": "HIsGfkLI7dvZ"} + +### Training Loop (Hugging Face Datasets) + +Run the training loop using the Hugging Face training generator. 
+ +```{code-cell} +--- +colab: + base_uri: https://localhost:8080/ +id: Ui6aLiZP7aLe +outputId: 48347baf-30f2-443d-b3bf-b12100d96b8f +--- +train_model(num_epochs, params, hf_training_generator) +``` + ++++ {"id": "_JR0V1Aix9Id"} + +## Summary + +This notebook has introduced efficient methods for multi-device distributed data loading on TPUs with JAX. You explored how to leverage popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets to streamline the data loading process for machine learning tasks. Each library offers distinct advantages, allowing you to select the best approach for your specific project needs. + +For more detailed strategies on distributed data loading with JAX, including global data pipelines and per-device processing, refer to the [Distributed Data Loading Guide](https://jax.readthedocs.io/en/latest/distributed_data_loading.html). From 9020bd52146169df681cf1ce9bd8ccc413639fda Mon Sep 17 00:00:00 2001 From: selamw1 Date: Tue, 3 Dec 2024 16:05:09 -0800 Subject: [PATCH 06/14] =?UTF-8?q?=E2=80=9Creferece=5Ftutorial=5Flinks=5Fad?= =?UTF-8?q?ded=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...ers_for_multi_device_setups_with_jax.ipynb | 22 +++++++++++-------- ...oaders_for_multi_device_setups_with_jax.md | 22 +++++++++++-------- 2 files changed, 26 insertions(+), 18 deletions(-) diff --git a/docs/data_loaders_for_multi_device_setups_with_jax.ipynb b/docs/data_loaders_for_multi_device_setups_with_jax.ipynb index 749dd7c..6c4a7e0 100644 --- a/docs/data_loaders_for_multi_device_setups_with_jax.ipynb +++ b/docs/data_loaders_for_multi_device_setups_with_jax.ipynb @@ -23,9 +23,13 @@ "* [**Grain**](https://github.com/google/grain)\n", "* [**Hugging Face**](https://huggingface.co/docs/datasets/en/use_with_jax#data-loading)\n", "\n", - "You'll see how to use each of these libraries to efficiently load data for a simple image classification task using the MNIST 
dataset.\n", + "You'll learn how to use each of these libraries to efficiently load data for an image classification task using the MNIST dataset.\n", "\n", - "Building on the [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html) tutorial, this guide introduces optimizations for distributed training across multiple GPUs or TPUs. It focuses on data sharding with `Mesh` and `NamedSharding` to efficiently partition and synchronize data across devices. By leveraging multi-device setups, you'll maximize resource utilization for large datasets in distributed environments." + "Building on the [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html) tutorial, this guide covers advanced strategies for multi-device setups, such as data sharding with `Mesh` and `NamedSharding` to partition and synchronize data across devices. These techniques allow you to partition and synchronize data across multiple devices, balancing the complexities of distributed systems while optimizing resource usage for large-scale datasets.\n", + "\n", + "If you're looking for CPU-specific data loading advice, see [Data Loaders on CPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_cpu_with_jax.html).\n", + "\n", + "If you're looking for GPU-specific data loading advice, see [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html)." ] }, { @@ -57,7 +61,7 @@ "id": "TsFdlkSZKp9S" }, "source": [ - "### Checking TPU Availability for JAX" + "## Checking TPU Availability for JAX" ] }, { @@ -99,7 +103,7 @@ "id": "qyJ_WTghDnIc" }, "source": [ - "### Setting Hyperparameters and Initializing Parameters\n", + "## Setting Hyperparameters and Initializing Parameters\n", "\n", "You'll define hyperparameters for your model and data loading, including layer sizes, learning rate, batch size, and the data directory. 
You'll also initialize the weights and biases for a fully-connected neural network." ] @@ -141,7 +145,7 @@ "id": "rHLdqeI7D2WZ" }, "source": [ - "### Model Prediction with Auto-Batching\n", + "## Model Prediction with Auto-Batching\n", "\n", "In this section, you'll define the `predict` function for your neural network. This function computes the output of the network for a single input image.\n", "\n", @@ -182,7 +186,7 @@ "id": "AMWmxjVEpH2D" }, "source": [ - "Multi-device setup using a Mesh of devices" + "## Multi-device setup using a Mesh of devices" ] }, { @@ -210,7 +214,7 @@ "id": "rLqfeORsERek" }, "source": [ - "### Utility and Loss Functions\n", + "## Utility and Loss Functions\n", "\n", "You'll now define utility functions for:\n", "- One-hot encoding: Converts class indices to binary vectors.\n", @@ -1676,9 +1680,9 @@ "source": [ "## Summary\n", "\n", - "This notebook has introduced efficient methods for multi-device distributed data loading on TPUs with JAX. You explored how to leverage popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets to streamline the data loading process for machine learning tasks. Each library offers distinct advantages, allowing you to select the best approach for your specific project needs.\n", + "This notebook introduced efficient methods for multi-device distributed data loading on TPUs with JAX. You explored how to leverage popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets to optimize the data loading process for machine learning tasks. Each library offers unique advantages, enabling you to choose the best approach based on your project’s requirements.\n", "\n", - "For more detailed strategies on distributed data loading with JAX, including global data pipelines and per-device processing, refer to the [Distributed Data Loading Guide](https://jax.readthedocs.io/en/latest/distributed_data_loading.html)." 
+ "For more in-depth strategies on distributed data loading with JAX, including global data pipelines and per-device processing, refer to the [Distributed Data Loading Guide](https://jax.readthedocs.io/en/latest/distributed_data_loading.html)." ] } ], diff --git a/docs/data_loaders_for_multi_device_setups_with_jax.md b/docs/data_loaders_for_multi_device_setups_with_jax.md index afeec1b..4494b37 100644 --- a/docs/data_loaders_for_multi_device_setups_with_jax.md +++ b/docs/data_loaders_for_multi_device_setups_with_jax.md @@ -25,9 +25,13 @@ This tutorial explores various data loading strategies for **JAX** in **multi-de * [**Grain**](https://github.com/google/grain) * [**Hugging Face**](https://huggingface.co/docs/datasets/en/use_with_jax#data-loading) -You'll see how to use each of these libraries to efficiently load data for a simple image classification task using the MNIST dataset. +You'll learn how to use each of these libraries to efficiently load data for an image classification task using the MNIST dataset. -Building on the [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html) tutorial, this guide introduces optimizations for distributed training across multiple GPUs or TPUs. It focuses on data sharding with `Mesh` and `NamedSharding` to efficiently partition and synchronize data across devices. By leveraging multi-device setups, you'll maximize resource utilization for large datasets in distributed environments. +Building on the [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html) tutorial, this guide covers advanced strategies for multi-device setups, such as data sharding with `Mesh` and `NamedSharding` to partition and synchronize data across devices. These techniques allow you to partition and synchronize data across multiple devices, balancing the complexities of distributed systems while optimizing resource usage for large-scale datasets. 
+ +If you're looking for CPU-specific data loading advice, see [Data Loaders on CPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_cpu_with_jax.html). + +If you're looking for GPU-specific data loading advice, see [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html). +++ {"id": "-rsMgVtO6asW"} @@ -44,7 +48,7 @@ from jax.sharding import Mesh, PartitionSpec, NamedSharding +++ {"id": "TsFdlkSZKp9S"} -### Checking TPU Availability for JAX +## Checking TPU Availability for JAX ```{code-cell} --- @@ -58,7 +62,7 @@ jax.devices() +++ {"id": "qyJ_WTghDnIc"} -### Setting Hyperparameters and Initializing Parameters +## Setting Hyperparameters and Initializing Parameters You'll define hyperparameters for your model and data loading, including layer sizes, learning rate, batch size, and the data directory. You'll also initialize the weights and biases for a fully-connected neural network. @@ -90,7 +94,7 @@ params = init_network_params(layer_sizes, random.PRNGKey(0)) +++ {"id": "rHLdqeI7D2WZ"} -### Model Prediction with Auto-Batching +## Model Prediction with Auto-Batching In this section, you'll define the `predict` function for your neural network. This function computes the output of the network for a single input image. @@ -121,7 +125,7 @@ batched_predict = vmap(predict, in_axes=(None, 0)) +++ {"id": "AMWmxjVEpH2D"} -Multi-device setup using a Mesh of devices +## Multi-device setup using a Mesh of devices ```{code-cell} :id: 4Jc5YLFnpE-_ @@ -139,7 +143,7 @@ sharding_spec = PartitionSpec('device') +++ {"id": "rLqfeORsERek"} -### Utility and Loss Functions +## Utility and Loss Functions You'll now define utility functions for: - One-hot encoding: Converts class indices to binary vectors. @@ -714,6 +718,6 @@ train_model(num_epochs, params, hf_training_generator) ## Summary -This notebook has introduced efficient methods for multi-device distributed data loading on TPUs with JAX. 
You explored how to leverage popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets to streamline the data loading process for machine learning tasks. Each library offers distinct advantages, allowing you to select the best approach for your specific project needs. +This notebook introduced efficient methods for multi-device distributed data loading on TPUs with JAX. You explored how to leverage popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets to optimize the data loading process for machine learning tasks. Each library offers unique advantages, enabling you to choose the best approach based on your project’s requirements. -For more detailed strategies on distributed data loading with JAX, including global data pipelines and per-device processing, refer to the [Distributed Data Loading Guide](https://jax.readthedocs.io/en/latest/distributed_data_loading.html). +For more in-depth strategies on distributed data loading with JAX, including global data pipelines and per-device processing, refer to the [Distributed Data Loading Guide](https://jax.readthedocs.io/en/latest/distributed_data_loading.html). 
From 3e4b4aac4be5690dcc089dda6626fe9de3131ca1 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Wed, 4 Dec 2024 14:24:31 -0800 Subject: [PATCH 07/14] files_rebased_from_docs_to_dosc_source --- .../data_loaders_for_multi_device_setups_with_jax.ipynb | 0 .../data_loaders_for_multi_device_setups_with_jax.md | 0 docs/source/tutorials.md | 2 +- 3 files changed, 1 insertion(+), 1 deletion(-) rename docs/{ => source}/data_loaders_for_multi_device_setups_with_jax.ipynb (100%) rename docs/{ => source}/data_loaders_for_multi_device_setups_with_jax.md (100%) diff --git a/docs/data_loaders_for_multi_device_setups_with_jax.ipynb b/docs/source/data_loaders_for_multi_device_setups_with_jax.ipynb similarity index 100% rename from docs/data_loaders_for_multi_device_setups_with_jax.ipynb rename to docs/source/data_loaders_for_multi_device_setups_with_jax.ipynb diff --git a/docs/data_loaders_for_multi_device_setups_with_jax.md b/docs/source/data_loaders_for_multi_device_setups_with_jax.md similarity index 100% rename from docs/data_loaders_for_multi_device_setups_with_jax.md rename to docs/source/data_loaders_for_multi_device_setups_with_jax.md diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 071343b..2fa663a 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -32,4 +32,4 @@ Once you've gone through this content, you can refer to package-specific documentation for resources that go into more depth on various topics: - [JAX tutorials](https://jax.readthedocs.io/en/latest/tutorials.html) -- [FLAX user guides](https://flax.readthedocs.io/en/latest/guides/index.html) +- [FLAX user guides](https://flax.readthedocs.io/en/latest/guides/index.html) \ No newline at end of file From 3bb829429486c9c178493691266654984c8430ca Mon Sep 17 00:00:00 2001 From: selamw1 Date: Wed, 4 Dec 2024 14:27:49 -0800 Subject: [PATCH 08/14] old_files_removed_from_docs --- docs/data_loaders_on_cpu_with_jax.ipynb | 3575 ----------------------- docs/data_loaders_on_cpu_with_jax.md 
| 691 ----- 2 files changed, 4266 deletions(-) delete mode 100644 docs/data_loaders_on_cpu_with_jax.ipynb delete mode 100644 docs/data_loaders_on_cpu_with_jax.md diff --git a/docs/data_loaders_on_cpu_with_jax.ipynb b/docs/data_loaders_on_cpu_with_jax.ipynb deleted file mode 100644 index 0ba897e..0000000 --- a/docs/data_loaders_on_cpu_with_jax.ipynb +++ /dev/null @@ -1,3575 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "PUFGZggH49zp" - }, - "source": [ - "# Introduction to Data Loaders on CPU with JAX" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3ia4PKEV5Dr8" - }, - "source": [ - "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jax-ml/jax-ai-stack/blob/main/docs/data_loaders_on_cpu_with_jax.ipynb)\n", - "\n", - "This tutorial explores different data loading strategies for using **JAX** on a single [**CPU**](https://jax.readthedocs.io/en/latest/glossary.html#term-CPU). While JAX doesn't include a built-in data loader, it seamlessly integrates with popular data loading libraries, including:\n", - "\n", - "- [**PyTorch DataLoader**](https://github.com/pytorch/data)\n", - "- [**TensorFlow Datasets (TFDS)**](https://github.com/tensorflow/datasets)\n", - "- [**Grain**](https://github.com/google/grain)\n", - "- [**Hugging Face**](https://huggingface.co/docs/datasets/en/use_with_jax#data-loading)\n", - "\n", - "In this tutorial, you'll learn how to efficiently load data using these libraries for a simple image classification task based on the MNIST dataset.\n", - "\n", - "Compared to GPU or multi-device setups, CPU-based data loading is straightforward as it avoids challenges like GPU memory management and data synchronization across devices. 
This makes it ideal for smaller-scale tasks or scenarios where data resides exclusively on the CPU.\n", - "\n", - "If you're looking for GPU-specific data loading advice, see [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html).\n", - "\n", - "If you're looking for a multi-device data loading strategy, see [Data Loaders on Multi-Device Setups](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_for_multi_device_setups_with_jax.html)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pEsb135zE-Jo" - }, - "source": [ - "## Setting JAX to Use CPU Only\n", - "\n", - "First, you'll restrict JAX to use only the CPU, even if a GPU is available. This ensures consistency and allows you to focus on CPU-based data loading." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "vqP6xyObC0_9" - }, - "outputs": [], - "source": [ - "import os\n", - "os.environ['JAX_PLATFORM_NAME'] = 'cpu'" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-rsMgVtO6asW" - }, - "source": [ - "Import JAX API" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tDJNQ6V-Dg5g" - }, - "outputs": [], - "source": [ - "import jax\n", - "import jax.numpy as jnp\n", - "from jax import random, grad, jit, vmap" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TsFdlkSZKp9S" - }, - "source": [ - "### CPU Setup Verification" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "N3sqvaF3KJw1", - "outputId": "449c83d9-d050-4b15-9a8d-f71e340501f2" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[CpuDevice(id=0)]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "jax.devices()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qyJ_WTghDnIc" - }, - "source": [ - "## Setting 
Hyperparameters and Initializing Parameters\n", - "\n", - "You'll define hyperparameters for your model and data loading, including layer sizes, learning rate, batch size, and the data directory. You'll also initialize the weights and biases for a fully-connected neural network." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qLNOSloFDka_" - }, - "outputs": [], - "source": [ - "# A helper function to randomly initialize weights and biases\n", - "# for a dense neural network layer\n", - "def random_layer_params(m, n, key, scale=1e-2):\n", - " w_key, b_key = random.split(key)\n", - " return scale * random.normal(w_key, (n, m)), scale * random.normal(b_key, (n,))\n", - "\n", - "# Function to initialize network parameters for all layers based on defined sizes\n", - "def init_network_params(sizes, key):\n", - " keys = random.split(key, len(sizes))\n", - " return [random_layer_params(m, n, k) for m, n, k in zip(sizes[:-1], sizes[1:], keys)]\n", - "\n", - "layer_sizes = [784, 512, 512, 10] # Layers of the network\n", - "step_size = 0.01 # Learning rate for optimization\n", - "num_epochs = 8 # Number of training epochs\n", - "batch_size = 128 # Batch size for training\n", - "n_targets = 10 # Number of classes (digits 0-9)\n", - "num_pixels = 28 * 28 # Input size (MNIST images are 28x28 pixels)\n", - "data_dir = '/tmp/mnist_dataset' # Directory for storing the dataset\n", - "\n", - "# Initialize network parameters using the defined layer sizes and a random seed\n", - "params = init_network_params(layer_sizes, random.PRNGKey(0))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6Ci_CqW7q6XM" - }, - "source": [ - "## Model Prediction with Auto-Batching\n", - "\n", - "In this section, you'll define the `predict` function for your neural network. 
This function computes the output of the network for a single input image.\n", - "\n", - "To efficiently process multiple images simultaneously, you'll use [`vmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html#jax.vmap), which allows you to vectorize the `predict` function and apply it across a batch of inputs. This technique, called auto-batching, improves computational efficiency by leveraging hardware acceleration." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "bKIYPSkvD1QV" - }, - "outputs": [], - "source": [ - "from jax.scipy.special import logsumexp\n", - "\n", - "def relu(x):\n", - " return jnp.maximum(0, x)\n", - "\n", - "def predict(params, image):\n", - " # per-example prediction\n", - " activations = image\n", - " for w, b in params[:-1]:\n", - " outputs = jnp.dot(w, activations) + b\n", - " activations = relu(outputs)\n", - "\n", - " final_w, final_b = params[-1]\n", - " logits = jnp.dot(final_w, activations) + final_b\n", - " return logits - logsumexp(logits)\n", - "\n", - "# Make a batched version of the `predict` function\n", - "batched_predict = vmap(predict, in_axes=(None, 0))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "niTSr34_sDZi" - }, - "source": [ - "## Utility and Loss Functions\n", - "\n", - "You'll now define utility functions for:\n", - "\n", - "- One-hot encoding: Converts class indices to binary vectors.\n", - "- Accuracy calculation: Measures the performance of the model on the dataset.\n", - "- Loss computation: Calculates the difference between predictions and targets.\n", - "\n", - "To optimize performance:\n", - "\n", - "- [`grad`](https://jax.readthedocs.io/en/latest/_autosummary/jax.grad.html#jax.grad) is used to compute gradients of the loss function with respect to network parameters.\n", - "- [`jit`](https://jax.readthedocs.io/en/latest/_autosummary/jax.jit.html#jax.jit) compiles the update function, enabling faster execution by leveraging JAX's 
[XLA](https://openxla.org/xla) compilation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "sA0a06raEQfS" - }, - "outputs": [], - "source": [ - "import time\n", - "\n", - "def one_hot(x, k, dtype=jnp.float32):\n", - " \"\"\"Create a one-hot encoding of x of size k.\"\"\"\n", - " return jnp.array(x[:, None] == jnp.arange(k), dtype)\n", - "\n", - "def accuracy(params, images, targets):\n", - " \"\"\"Calculate the accuracy of predictions.\"\"\"\n", - " target_class = jnp.argmax(targets, axis=1)\n", - " predicted_class = jnp.argmax(batched_predict(params, images), axis=1)\n", - " return jnp.mean(predicted_class == target_class)\n", - "\n", - "def loss(params, images, targets):\n", - " \"\"\"Calculate the loss between predictions and targets.\"\"\"\n", - " preds = batched_predict(params, images)\n", - " return -jnp.mean(preds * targets)\n", - "\n", - "@jit\n", - "def update(params, x, y):\n", - " \"\"\"Update the network parameters using gradient descent.\"\"\"\n", - " grads = grad(loss)(params, x, y)\n", - " return [(w - step_size * dw, b - step_size * db)\n", - " for (w, b), (dw, db) in zip(params, grads)]\n", - "\n", - "def reshape_and_one_hot(x, y):\n", - " \"\"\"Reshape and one-hot encode the inputs.\"\"\"\n", - " x = jnp.reshape(x, (len(x), num_pixels))\n", - " y = one_hot(y, n_targets)\n", - " return x, y\n", - "\n", - "def train_model(num_epochs, params, training_generator, data_loader_type='streamed'):\n", - " \"\"\"Train the model for a given number of epochs.\"\"\"\n", - " for epoch in range(num_epochs):\n", - " start_time = time.time()\n", - " for x, y in training_generator() if data_loader_type == 'streamed' else training_generator:\n", - " x, y = reshape_and_one_hot(x, y)\n", - " params = update(params, x, y)\n", - "\n", - " print(f\"Epoch {epoch + 1} in {time.time() - start_time:.2f} sec: \"\n", - " f\"Train Accuracy: {accuracy(params, train_images, train_labels):.4f}, \"\n", - " f\"Test Accuracy: {accuracy(params, 
test_images, test_labels):.4f}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Hsionp5IYsQ9" - }, - "source": [ - "## Loading Data with PyTorch DataLoader\n", - "\n", - "This section shows how to load the MNIST dataset using PyTorch's DataLoader, convert the data to NumPy arrays, and apply transformations to flatten and cast images." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "jmsfrWrHxIhC", - "outputId": "33dfeada-a763-4d26-f778-a27966e34d55" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.1+cu121)\n", - "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.20.1+cu121)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.16.1)\n", - "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\n", - "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)\n", - "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\n", - "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2024.10.0)\n", - "Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)\n", - "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)\n", - "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision) (1.26.4)\n", - "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision) (11.0.0)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in 
/usr/local/lib/python3.10/dist-packages (from jinja2->torch) (3.0.2)\n" - ] - } - ], - "source": [ - "!pip install torch torchvision" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kO5_WzwY59gE" - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "from jax.tree_util import tree_map\n", - "from torch.utils import data\n", - "from torchvision.datasets import MNIST" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6f6qU8PCc143" - }, - "outputs": [], - "source": [ - "def numpy_collate(batch):\n", - " \"\"\"Convert a batch of PyTorch data to NumPy arrays.\"\"\"\n", - " return tree_map(np.asarray, data.default_collate(batch))\n", - "\n", - "class NumpyLoader(data.DataLoader):\n", - " \"\"\"Custom DataLoader to return NumPy arrays from a PyTorch Dataset.\"\"\"\n", - " def __init__(self, dataset, batch_size=1, shuffle=False, **kwargs):\n", - " super().__init__(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=numpy_collate, **kwargs)\n", - "\n", - "class FlattenAndCast(object):\n", - " \"\"\"Transform class to flatten and cast images to float32.\"\"\"\n", - " def __call__(self, pic):\n", - " return np.ravel(np.array(pic, dtype=jnp.float32))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mfSnfJND6I8G" - }, - "source": [ - "### Load Dataset with Transformations\n", - "\n", - "Standardize the data by flattening the images, casting them to `float32`, and ensuring consistent data types." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "Kxbl6bcx6crv", - "outputId": "372bbf4c-3ad5-4fd8-cc5d-27b50f5e4f38" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz\n", - "Failed to download (trying next):\n", - "HTTP Error 403: Forbidden\n", - "\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/train-images-idx3-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 9.91M/9.91M [00:00<00:00, 49.4MB/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Extracting /tmp/mnist_dataset/MNIST/raw/train-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", - "\n", - "Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz\n", - "Failed to download (trying next):\n", - "HTTP Error 403: Forbidden\n", - "\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/train-labels-idx1-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 28.9k/28.9k [00:00<00:00, 2.09MB/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Extracting /tmp/mnist_dataset/MNIST/raw/train-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", - "\n", - "Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Failed to download (trying 
next):\n", - "HTTP Error 403: Forbidden\n", - "\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/t10k-images-idx3-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 1.65M/1.65M [00:00<00:00, 13.3MB/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Extracting /tmp/mnist_dataset/MNIST/raw/t10k-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", - "\n", - "Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz\n", - "Failed to download (trying next):\n", - "HTTP Error 403: Forbidden\n", - "\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 4.54k/4.54k [00:00<00:00, 8.81MB/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Extracting /tmp/mnist_dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", - "\n" - ] - } - ], - "source": [ - "mnist_dataset = MNIST(data_dir, download=True, transform=FlattenAndCast())" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kbdsqvPZGrsa" - }, - "source": [ - "### Full Training Dataset for Accuracy Checks\n", - "\n", - "Convert the entire training dataset to JAX arrays." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "c9ZCJq_rzPck" - }, - "outputs": [], - "source": [ - "train_images = jnp.array(mnist_dataset.data.numpy().reshape(len(mnist_dataset.data), -1), dtype=jnp.float32)\n", - "train_labels = one_hot(np.array(mnist_dataset.targets), n_targets)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WXUh0BwvG8Ko" - }, - "source": [ - "### Get Full Test Dataset\n", - "\n", - "Load and process the full test dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "brlLG4SqGphm" - }, - "outputs": [], - "source": [ - "mnist_dataset_test = MNIST(data_dir, download=True, train=False)\n", - "test_images = jnp.array(mnist_dataset_test.data.numpy().reshape(len(mnist_dataset_test.data), -1), dtype=jnp.float32)\n", - "test_labels = one_hot(np.array(mnist_dataset_test.targets), n_targets)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "Oz-UVnCxG5E8", - "outputId": "abbaa26d-491a-4e63-e8c9-d3c571f53a28" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train: (60000, 784) (60000, 10)\n", - "Test: (10000, 784) (10000, 10)\n" - ] - } - ], - "source": [ - "print('Train:', train_images.shape, train_labels.shape)\n", - "print('Test:', test_images.shape, test_labels.shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "m3zfxqnMiCbm" - }, - "source": [ - "### Training Data Generator\n", - "\n", - "Define a generator function using PyTorch's DataLoader for batch training. Setting `num_workers > 0` enables multi-process data loading, which can accelerate data loading for larger datasets or intensive preprocessing tasks. 
Experiment with different values to find the optimal setting for your hardware and workload.\n", - "\n", - "Note: When setting `num_workers > 0`, you may see the following `RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.` This warning can be safely ignored since data loaders do not use JAX within the forked processes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "B-fES82EiL6Z" - }, - "outputs": [], - "source": [ - "def pytorch_training_generator(mnist_dataset):\n", - " return NumpyLoader(mnist_dataset, batch_size=batch_size, num_workers=0)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Xzt2x9S1HC3T" - }, - "source": [ - "### Training Loop (PyTorch DataLoader)\n", - "\n", - "The training loop uses the PyTorch DataLoader to iterate through batches and update model parameters." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "vtUjHsh-rJs8", - "outputId": "4766333e-4366-493b-995a-102778d1345a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Epoch 1 in 28.93 sec: Train Accuracy: 0.9158, Test Accuracy: 0.9196\n", - "Epoch 2 in 8.33 sec: Train Accuracy: 0.9372, Test Accuracy: 0.9384\n", - "Epoch 3 in 6.99 sec: Train Accuracy: 0.9492, Test Accuracy: 0.9468\n", - "Epoch 4 in 7.01 sec: Train Accuracy: 0.9569, Test Accuracy: 0.9532\n", - "Epoch 5 in 8.17 sec: Train Accuracy: 0.9630, Test Accuracy: 0.9579\n", - "Epoch 6 in 8.27 sec: Train Accuracy: 0.9674, Test Accuracy: 0.9615\n", - "Epoch 7 in 8.32 sec: Train Accuracy: 0.9708, Test Accuracy: 0.9650\n", - "Epoch 8 in 8.07 sec: Train Accuracy: 0.9737, Test Accuracy: 0.9671\n" - ] - } - ], - "source": [ - "train_model(num_epochs, params, pytorch_training_generator(mnist_dataset), data_loader_type='iterable')" - ] - }, - { - 
"cell_type": "markdown", - "metadata": { - "id": "Nm45ZTo6yrf5" - }, - "source": [ - "## Loading Data with TensorFlow Datasets (TFDS)\n", - "\n", - "This section demonstrates how to load the MNIST dataset using TFDS, fetch the full dataset for evaluation, and define a training generator for batch processing. GPU usage is explicitly disabled for TensorFlow." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "sGaQAk1DHMUx" - }, - "outputs": [], - "source": [ - "import tensorflow_datasets as tfds\n", - "import tensorflow as tf\n", - "\n", - "# Ensuring CPU-Only Execution, disable any GPU usage(if applicable) for TF\n", - "tf.config.set_visible_devices([], device_type='GPU')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3xdQY7H6wr3n" - }, - "source": [ - "### Fetch Full Dataset for Evaluation\n", - "\n", - "Load the dataset with `tfds.load`, convert it to NumPy arrays, and process it for evaluation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 104, - "referenced_widgets": [ - "b8cdabf5c05848f38f03850cab08b56f", - "a8b76d5f93004c089676e5a2a9b3336c", - "119ac8428f9441e7a25eb0afef2fbb2a", - "76a9815e5c2b4764a13409cebaf66821", - "45ce8dd5c4b949afa957ec8ffb926060", - "05b7145fd62d4581b2123c7680f11cdd", - "b96267f014814ec5b96ad7e6165104b1", - "bce34bdbfbd64f1f8353a4e8515cee0b", - "93b8206f8c5841a692cdce985ae301d8", - "c95f592620c64da595cc787567b2c4db", - "8a97071f862c4ec3b4b4140d2e34eda2" - ] - }, - "id": "1hOamw_7C8Pb", - "outputId": "ca166490-22db-4732-b29f-866b7593e489" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /tmp/mnist_dataset/mnist/3.0.1...\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": 
"b8cdabf5c05848f38f03850cab08b56f", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Dl Completed...: 0%| | 0/5 [00:00=9.1.0 in /usr/local/lib/python3.10/dist-packages (from grain) (10.5.0)\n", - "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from grain) (1.26.4)\n", - "Requirement already satisfied: typing_extensions in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (4.12.2)\n", - "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (2024.10.0)\n", - "Requirement already satisfied: importlib_resources in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (6.4.5)\n", - "Requirement already satisfied: zipp in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (3.21.0)\n", - "Downloading grain-0.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (418 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m419.0/419.0 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading jaxtyping-0.2.36-py3-none-any.whl (55 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.8/55.8 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hInstalling collected packages: jaxtyping, grain\n", - "Successfully installed grain-0.2.2 jaxtyping-0.2.36\n" - ] - } - ], - "source": [ - "!pip install grain" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "66bH3ZDJ7Iat" - }, - "source": [ - "Import Required Libraries (import MNIST dataset from torchvision)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "mS62eVL9Ifmz" - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "import grain.python as pygrain\n", - "from torchvision.datasets import MNIST" - ] - }, - { - "cell_type": "markdown", - 
"metadata": { - "id": "0h6mwVrspPA-" - }, - "source": [ - "### Define Dataset Class\n", - "\n", - "Create a custom dataset class to load MNIST data for Grain." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "bnrhac5Hh7y1" - }, - "outputs": [], - "source": [ - "class Dataset:\n", - " def __init__(self, data_dir, train=True):\n", - " self.data_dir = data_dir\n", - " self.train = train\n", - " self.load_data()\n", - "\n", - " def load_data(self):\n", - " self.dataset = MNIST(self.data_dir, download=True, train=self.train)\n", - "\n", - " def __len__(self):\n", - " return len(self.dataset)\n", - "\n", - " def __getitem__(self, index):\n", - " img, label = self.dataset[index]\n", - " return np.ravel(np.array(img, dtype=np.float32)), label" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "53mf8bWEsyTr" - }, - "source": [ - "### Initialize the Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pN3oF7-ostGE" - }, - "outputs": [], - "source": [ - "mnist_dataset = Dataset(data_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GqD-ycgBuwv9" - }, - "source": [ - "### Get the full train and test dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "f1VnTuX3u_kL" - }, - "outputs": [], - "source": [ - "# Convert training data to JAX arrays and encode labels as one-hot vectors\n", - "train_images = jnp.array([mnist_dataset[i][0] for i in range(len(mnist_dataset))], dtype=jnp.float32)\n", - "train_labels = one_hot(np.array([mnist_dataset[i][1] for i in range(len(mnist_dataset))]), n_targets)\n", - "\n", - "# Load test dataset and process it\n", - "mnist_dataset_test = MNIST(data_dir, download=True, train=False)\n", - "test_images = jnp.array([np.ravel(np.array(mnist_dataset_test[i][0], dtype=np.float32)) for i in range(len(mnist_dataset_test))], dtype=jnp.float32)\n", - "test_labels = 
one_hot(np.array([mnist_dataset_test[i][1] for i in range(len(mnist_dataset_test))]), n_targets)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "a2NHlp9klrQL", - "outputId": "14be58c0-851e-4a44-dfcc-d02f0718dab5" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train: (60000, 784) (60000, 10)\n", - "Test: (10000, 784) (10000, 10)\n" - ] - } - ], - "source": [ - "print(\"Train:\", train_images.shape, train_labels.shape)\n", - "print(\"Test:\", test_images.shape, test_labels.shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fETnWRo2crhf" - }, - "source": [ - "### Initialize PyGrain DataLoader\n", - "\n", - "Set up a PyGrain DataLoader for sequential batch sampling." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9RuFTcsCs2Ac" - }, - "outputs": [], - "source": [ - "sampler = pygrain.SequentialSampler(\n", - " num_records=len(mnist_dataset),\n", - " shard_options=pygrain.NoSharding()) # Single-device, no sharding\n", - "\n", - "def pygrain_training_generator():\n", - " \"\"\"Grain DataLoader generator for training.\"\"\"\n", - " return pygrain.DataLoader(\n", - " data_source=mnist_dataset,\n", - " sampler=sampler,\n", - " operations=[pygrain.Batch(batch_size)],\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GvpJPHAbeuHW" - }, - "source": [ - "### Training Loop (Grain)\n", - "\n", - "Run the training loop using the Grain DataLoader." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "cjxJRtiTadEI", - "outputId": "3f624366-b683-4d20-9d0a-777d345b0e21" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Epoch 1 in 15.39 sec: Train Accuracy: 0.9158, Test Accuracy: 0.9196\n", - "Epoch 2 in 15.27 sec: Train Accuracy: 0.9372, Test Accuracy: 0.9384\n", - "Epoch 3 in 12.61 sec: Train Accuracy: 0.9492, Test Accuracy: 0.9468\n", - "Epoch 4 in 12.62 sec: Train Accuracy: 0.9569, Test Accuracy: 0.9532\n", - "Epoch 5 in 12.39 sec: Train Accuracy: 0.9630, Test Accuracy: 0.9579\n", - "Epoch 6 in 12.19 sec: Train Accuracy: 0.9674, Test Accuracy: 0.9615\n", - "Epoch 7 in 12.56 sec: Train Accuracy: 0.9708, Test Accuracy: 0.9650\n", - "Epoch 8 in 13.04 sec: Train Accuracy: 0.9737, Test Accuracy: 0.9671\n" - ] - } - ], - "source": [ - "train_model(num_epochs, params, pygrain_training_generator)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oixvOI816qUn" - }, - "source": [ - "## Loading Data with Hugging Face\n", - "\n", - "This section demonstrates loading MNIST data using the Hugging Face `datasets` library. You'll format the dataset for JAX compatibility, prepare flattened images and one-hot-encoded labels, and define a training generator." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o51P6lr86wz-" - }, - "source": [ - "Install the Hugging Face `datasets` library." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "19ipxPhI6oSN", - "outputId": "684e445f-d23e-4924-9e76-2c2c9359f0be" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collecting datasets\n", - " Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.16.1)\n", - "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.26.4)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)\n", - "Collecting dill<0.3.9,>=0.3.0 (from datasets)\n", - " Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)\n", - "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)\n", - "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.32.3)\n", - "Requirement already satisfied: tqdm>=4.66.3 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.6)\n", - "Collecting xxhash (from datasets)\n", - " Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n", - "Collecting multiprocess<0.70.17 (from datasets)\n", - " Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)\n", - "Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)\n", - " Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)\n", - "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.11.2)\n", - "Requirement already satisfied: huggingface-hub>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.26.2)\n", - "Requirement already satisfied: packaging in 
/usr/local/lib/python3.10/dist-packages (from datasets) (24.2)\n", - "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.2)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.3)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.5.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.1.0)\n", - "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (0.2.0)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.17.2)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.0->datasets) (4.12.2)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.4.0)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.10)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2.2.3)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2024.8.30)\n", - "Requirement 
already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)\n", - "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", - "Downloading datasets-3.1.0-py3-none-any.whl (480 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (179 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.3/179.3 kB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading multiprocess-0.70.16-py310-none-any.whl (134 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m9.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m15.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hInstalling collected packages: xxhash, fsspec, dill, multiprocess, datasets\n", - " Attempting uninstall: fsspec\n", - " Found existing installation: fsspec 2024.10.0\n", - " Uninstalling fsspec-2024.10.0:\n", - " 
Successfully uninstalled fsspec-2024.10.0\n", - "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", - "gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.\u001b[0m\u001b[31m\n", - "\u001b[0mSuccessfully installed datasets-3.1.0 dill-0.3.8 fsspec-2024.9.0 multiprocess-0.70.16 xxhash-3.5.0\n" - ] - } - ], - "source": [ - "!pip install datasets" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "be0h_dZv0593" - }, - "source": [ - "Import Library" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "8v1N59p76zn0" - }, - "outputs": [], - "source": [ - "from datasets import load_dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8Gaj11tO7C86" - }, - "source": [ - "### Load and Format MNIST Dataset\n", - "\n", - "Load the MNIST dataset from Hugging Face and format it as `numpy` arrays for quick access or `jax` to get JAX arrays." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 301, - "referenced_widgets": [ - "32f6132a31aa4c508d3c3c5ef70348bb", - "d7c2ffa6b143463c91cbf8befca6ca01", - "fd964ecd3926419d92927c67f955d5d0", - "60feca3fde7c4447ad8393b0542eb999", - "3354a0baeca94d18bc6b2a8b8b465b58", - "a0d0d052772b46deac7657ad052991a4", - "fb34783b9cba462e9b690e0979c4b07a", - "8d8170c1ed99490589969cd753c40748", - "f1ecb6db00a54e088f1e09164222d637", - "3cf5dd8d29aa4619b39dc2542df7e42e", - "2e5d42ca710441b389895f2d3b611d0a", - "5d8202da24244dc896e9a8cba6a4ed4f", - "a6d64c953631412b8bd8f0ba53ae4d32", - "69240c5cbfbb4e91961f5b49812a26f0", - "865f38532b784a7c971f5d33b87b443e", - "ceb1c004191947cdaa10af9b9c03c80d", - "64c6041037914779b5e8e9cf5a80ad04", - "562fa6a0e7b846a180ac4b423c5511c5", - "b3b922288f9c4df2a4088279ff6d1531", - "75a1a8ffda554318890cf74c345ed9a9", - "3bae06cacf394a5998c2326199da94f5", - "ff6428a3daa5496c81d5e664aba01f97", - "1ba3f86870724f55b94a35cb6b4173af", - "b3e163fd8b8a4f289d5a25611cb66d23", - "abd2daba215e4f7c9ddabde04d6eb382", - "e22ee019049144d5aba573cdf4dbe4fc", - "6ac765dac67841a69218140785f024c6", - "7b057411a54e434fb74804b90daa8d44", - "563f71b3c67d47c3ab1100f5dc1b98f3", - "d81a657361ab4bba8bcc0cf309d2ff64", - "20316312ab88471ba90cbb954be3e964", - "698fda742f834473a23fb7e5e4cf239c", - "289b52c5a38146b8b467a5f4678f6271", - "d07c2f37cf914894b1551a8104e6cb70", - "5b55c73d551d483baaa6a1411c2597b1", - "2308f77723f54ac898588f48d1853b65", - "54d2589714d04b2e928b816258cb0df4", - "f84b795348c04c7a950165301a643671", - "bc853a4a8d3c4dbda23d183f0a3b4f27", - "1012ddc0343842d8b913a7d85df8ab8f", - "771a73a8f5084a57afc5654d72e022f0", - "311a43449f074841b6df4130b0871ac9", - "cd4d29cb01134469b52d6936c35eb943", - "013cf89ee6174d29bb3f4fdff7b36049", - "9237d877d84e4b3ab69698ecf56915bb", - "337ef4d37e6b4ff6bf6e8bd4ca93383f", - "b4096d3837b84ccdb8f1186435c87281", - "7259d3b7e11b4736b4d2aa8e9c55e994", 
- "1ad1f8e99a864fc4a2bc532d9a4ff110", - "b2b50451eabd40978ef46db5e7dd08c4", - "2dad5c5541e243128e23c3dd3e420ac2", - "a3de458b61e5493081d6bb9cf7e923db", - "37760f8a7b164e6f9c1a23d621e9fe6b", - "745a2aedcfab491fb9cffba19958b0c5", - "2f6c670640d048d2af453638cfde3a1e" - ] - }, - "id": "a22kTvgk6_fJ", - "outputId": "35fc38b9-a6ab-4b02-ffa4-ab27fac69df4" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n", - "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", - "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", - "You will be able to reuse this secret in all of your notebooks.\n", - "Please note that authentication is recommended but still optional to access public models or datasets.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "32f6132a31aa4c508d3c3c5ef70348bb", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "README.md: 0%| | 0.00/6.97k [00:00 0` enables multi-process data loading, which can accelerate data loading for larger datasets or intensive preprocessing tasks. Experiment with different values to find the optimal setting for your hardware and workload. - -Note: When setting `num_workers > 0`, you may see the following `RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.` This warning can be safely ignored since data loaders do not use JAX within the forked processes. 
- -```{code-cell} -:id: B-fES82EiL6Z - -def pytorch_training_generator(mnist_dataset): - return NumpyLoader(mnist_dataset, batch_size=batch_size, num_workers=0) -``` - -+++ {"id": "Xzt2x9S1HC3T"} - -### Training Loop (PyTorch DataLoader) - -The training loop uses the PyTorch DataLoader to iterate through batches and update model parameters. - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: vtUjHsh-rJs8 -outputId: 4766333e-4366-493b-995a-102778d1345a ---- -train_model(num_epochs, params, pytorch_training_generator(mnist_dataset), data_loader_type='iterable') -``` - -+++ {"id": "Nm45ZTo6yrf5"} - -## Loading Data with TensorFlow Datasets (TFDS) - -This section demonstrates how to load the MNIST dataset using TFDS, fetch the full dataset for evaluation, and define a training generator for batch processing. GPU usage is explicitly disabled for TensorFlow. - -```{code-cell} -:id: sGaQAk1DHMUx - -import tensorflow_datasets as tfds -import tensorflow as tf - -# Ensuring CPU-Only Execution, disable any GPU usage(if applicable) for TF -tf.config.set_visible_devices([], device_type='GPU') -``` - -+++ {"id": "3xdQY7H6wr3n"} - -### Fetch Full Dataset for Evaluation - -Load the dataset with `tfds.load`, convert it to NumPy arrays, and process it for evaluation. 
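As a library-independent sketch of the processing described here, the snippet below flattens a small synthetic batch of image-shaped arrays and one-hot encodes integer labels using only NumPy. The names `flatten_images` and `one_hot_np` and the synthetic data are assumptions made for illustration; the tutorial's own `one_hot` helper is defined earlier and may differ in detail.

```python
import numpy as np

def flatten_images(images):
    """Reshape (N, H, W, C) or (N, H, W) images into (N, H*W*C) float32 rows."""
    images = np.asarray(images)
    return images.reshape(len(images), -1).astype(np.float32)

def one_hot_np(labels, num_classes):
    """Return an (N, num_classes) float32 array with 1.0 at each label index."""
    labels = np.asarray(labels)
    return (labels[:, None] == np.arange(num_classes)).astype(np.float32)

# Synthetic stand-in for MNIST (hypothetical data): 4 images of 28x28x1.
fake_images = np.zeros((4, 28, 28, 1), dtype=np.uint8)
fake_labels = np.array([0, 3, 9, 3])

flat = flatten_images(fake_images)
encoded = one_hot_np(fake_labels, 10)
print(flat.shape, encoded.shape)  # (4, 784) (4, 10)
```

The printed shapes mirror, at small scale, the `(60000, 784)` and `(60000, 10)` shapes the tutorial reports for the full training set.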
- -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ - height: 104 - referenced_widgets: [b8cdabf5c05848f38f03850cab08b56f, a8b76d5f93004c089676e5a2a9b3336c, - 119ac8428f9441e7a25eb0afef2fbb2a, 76a9815e5c2b4764a13409cebaf66821, 45ce8dd5c4b949afa957ec8ffb926060, - 05b7145fd62d4581b2123c7680f11cdd, b96267f014814ec5b96ad7e6165104b1, bce34bdbfbd64f1f8353a4e8515cee0b, - 93b8206f8c5841a692cdce985ae301d8, c95f592620c64da595cc787567b2c4db, 8a97071f862c4ec3b4b4140d2e34eda2] -id: 1hOamw_7C8Pb -outputId: ca166490-22db-4732-b29f-866b7593e489 ---- -# tfds.load returns tf.Tensors (or tf.data.Datasets if batch_size != -1) -mnist_data, info = tfds.load(name="mnist", batch_size=-1, data_dir=data_dir, with_info=True) -mnist_data = tfds.as_numpy(mnist_data) -train_data, test_data = mnist_data['train'], mnist_data['test'] - -# Full train set -train_images, train_labels = train_data['image'], train_data['label'] -train_images = jnp.reshape(train_images, (len(train_images), num_pixels)) -train_labels = one_hot(train_labels, n_targets) - -# Full test set -test_images, test_labels = test_data['image'], test_data['label'] -test_images = jnp.reshape(test_images, (len(test_images), num_pixels)) -test_labels = one_hot(test_labels, n_targets) -``` - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: Td3PiLdmEf7z -outputId: 96403b0f-6079-43ce-df16-d4583f09906b ---- -print('Train:', train_images.shape, train_labels.shape) -print('Test:', test_images.shape, test_labels.shape) -``` - -+++ {"id": "UWRSaalfdyDX"} - -### Define the Training Generator - -Create a generator function to yield batches of data for training. 
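To make the idea of a batch-yielding generator concrete without depending on TFDS, here is a minimal NumPy-only sketch. `batch_generator` and the toy arrays are hypothetical names introduced for this illustration, not part of the tutorial's code.

```python
import numpy as np

def batch_generator(images, labels, batch_size):
    """Yield consecutive (images, labels) batches; the last one may be smaller."""
    for start in range(0, len(images), batch_size):
        end = start + batch_size
        yield images[start:end], labels[start:end]

# Toy data (assumption): 10 examples with 4 features each.
images = np.arange(10 * 4, dtype=np.float32).reshape(10, 4)
labels = np.arange(10)

batch_shapes = [(x.shape, y.shape) for x, y in batch_generator(images, labels, 4)]
print(batch_shapes)  # [((4, 4), (4,)), ((4, 4), (4,)), ((2, 4), (2,))]
```

A pipeline built with `tf.data` adds features such as prefetching on top of this basic pattern, but the consumer still sees an iterable of `(images, labels)` pairs.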
- -```{code-cell} -:id: vX59u8CqEf4J - -def training_generator(): - # as_supervised=True gives us the (image, label) as a tuple instead of a dict - ds = tfds.load(name='mnist', split='train', as_supervised=True, data_dir=data_dir) - # You can build up an arbitrary tf.data input pipeline - ds = ds.batch(batch_size).prefetch(1) - # tfds.dataset_as_numpy converts the tf.data.Dataset into an iterable of NumPy arrays - return tfds.as_numpy(ds) -``` - -+++ {"id": "EAWeUdnuFNBY"} - -### Training Loop (TFDS) - -Use the training generator in a custom training loop. - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: h2sO13XDGvq1 -outputId: a150246e-ceb5-46ac-db71-2a8177a9d04d ---- -train_model(num_epochs, params, training_generator) -``` - -+++ {"id": "-ryVkrAITS9Z"} - -## Loading Data with Grain - -This section demonstrates how to load MNIST data using Grain, a data-loading library. You'll define a custom dataset class for Grain and set up a Grain DataLoader for efficient training. - -+++ {"id": "waYhUMUGmhH-"} - -Install Grain - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: L78o7eeyGvn5 -outputId: 76d16565-0d9e-4f5f-c6b1-4cf4a683d0e7 ---- -!pip install grain -``` - -+++ {"id": "66bH3ZDJ7Iat"} - -Import Required Libraries (import MNIST dataset from torchvision) - -```{code-cell} -:id: mS62eVL9Ifmz - -import numpy as np -import grain.python as pygrain -from torchvision.datasets import MNIST -``` - -+++ {"id": "0h6mwVrspPA-"} - -### Define Dataset Class - -Create a custom dataset class to load MNIST data for Grain. 
- -```{code-cell} -:id: bnrhac5Hh7y1 - -class Dataset: - def __init__(self, data_dir, train=True): - self.data_dir = data_dir - self.train = train - self.load_data() - - def load_data(self): - self.dataset = MNIST(self.data_dir, download=True, train=self.train) - - def __len__(self): - return len(self.dataset) - - def __getitem__(self, index): - img, label = self.dataset[index] - return np.ravel(np.array(img, dtype=np.float32)), label -``` - -+++ {"id": "53mf8bWEsyTr"} - -### Initialize the Dataset - -```{code-cell} -:id: pN3oF7-ostGE - -mnist_dataset = Dataset(data_dir) -``` - -+++ {"id": "GqD-ycgBuwv9"} - -### Get the full train and test dataset - -```{code-cell} -:id: f1VnTuX3u_kL - -# Convert training data to JAX arrays and encode labels as one-hot vectors -train_images = jnp.array([mnist_dataset[i][0] for i in range(len(mnist_dataset))], dtype=jnp.float32) -train_labels = one_hot(np.array([mnist_dataset[i][1] for i in range(len(mnist_dataset))]), n_targets) - -# Load test dataset and process it -mnist_dataset_test = MNIST(data_dir, download=True, train=False) -test_images = jnp.array([np.ravel(np.array(mnist_dataset_test[i][0], dtype=np.float32)) for i in range(len(mnist_dataset_test))], dtype=jnp.float32) -test_labels = one_hot(np.array([mnist_dataset_test[i][1] for i in range(len(mnist_dataset_test))]), n_targets) -``` - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: a2NHlp9klrQL -outputId: 14be58c0-851e-4a44-dfcc-d02f0718dab5 ---- -print("Train:", train_images.shape, train_labels.shape) -print("Test:", test_images.shape, test_labels.shape) -``` - -+++ {"id": "fETnWRo2crhf"} - -### Initialize PyGrain DataLoader - -Set up a PyGrain DataLoader for sequential batch sampling. 
- -```{code-cell} -:id: 9RuFTcsCs2Ac - -sampler = pygrain.SequentialSampler( - num_records=len(mnist_dataset), - shard_options=pygrain.NoSharding()) # Single-device, no sharding - -def pygrain_training_generator(): - """Grain DataLoader generator for training.""" - return pygrain.DataLoader( - data_source=mnist_dataset, - sampler=sampler, - operations=[pygrain.Batch(batch_size)], - ) -``` - -+++ {"id": "GvpJPHAbeuHW"} - -### Training Loop (Grain) - -Run the training loop using the Grain DataLoader. - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: cjxJRtiTadEI -outputId: 3f624366-b683-4d20-9d0a-777d345b0e21 ---- -train_model(num_epochs, params, pygrain_training_generator) -``` - -+++ {"id": "oixvOI816qUn"} - -## Loading Data with Hugging Face - -This section demonstrates loading MNIST data using the Hugging Face `datasets` library. You'll format the dataset for JAX compatibility, prepare flattened images and one-hot-encoded labels, and define a training generator. - -+++ {"id": "o51P6lr86wz-"} - -Install the Hugging Face `datasets` library. - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: 19ipxPhI6oSN -outputId: 684e445f-d23e-4924-9e76-2c2c9359f0be ---- -!pip install datasets -``` - -+++ {"id": "be0h_dZv0593"} - -Import Library - -```{code-cell} -:id: 8v1N59p76zn0 - -from datasets import load_dataset -``` - -+++ {"id": "8Gaj11tO7C86"} - -### Load and Format MNIST Dataset - -Load the MNIST dataset from Hugging Face and format it as `numpy` arrays for quick access or `jax` to get JAX arrays. 
- -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ - height: 301 - referenced_widgets: [32f6132a31aa4c508d3c3c5ef70348bb, d7c2ffa6b143463c91cbf8befca6ca01, - fd964ecd3926419d92927c67f955d5d0, 60feca3fde7c4447ad8393b0542eb999, 3354a0baeca94d18bc6b2a8b8b465b58, - a0d0d052772b46deac7657ad052991a4, fb34783b9cba462e9b690e0979c4b07a, 8d8170c1ed99490589969cd753c40748, - f1ecb6db00a54e088f1e09164222d637, 3cf5dd8d29aa4619b39dc2542df7e42e, 2e5d42ca710441b389895f2d3b611d0a, - 5d8202da24244dc896e9a8cba6a4ed4f, a6d64c953631412b8bd8f0ba53ae4d32, 69240c5cbfbb4e91961f5b49812a26f0, - 865f38532b784a7c971f5d33b87b443e, ceb1c004191947cdaa10af9b9c03c80d, 64c6041037914779b5e8e9cf5a80ad04, - 562fa6a0e7b846a180ac4b423c5511c5, b3b922288f9c4df2a4088279ff6d1531, 75a1a8ffda554318890cf74c345ed9a9, - 3bae06cacf394a5998c2326199da94f5, ff6428a3daa5496c81d5e664aba01f97, 1ba3f86870724f55b94a35cb6b4173af, - b3e163fd8b8a4f289d5a25611cb66d23, abd2daba215e4f7c9ddabde04d6eb382, e22ee019049144d5aba573cdf4dbe4fc, - 6ac765dac67841a69218140785f024c6, 7b057411a54e434fb74804b90daa8d44, 563f71b3c67d47c3ab1100f5dc1b98f3, - d81a657361ab4bba8bcc0cf309d2ff64, 20316312ab88471ba90cbb954be3e964, 698fda742f834473a23fb7e5e4cf239c, - 289b52c5a38146b8b467a5f4678f6271, d07c2f37cf914894b1551a8104e6cb70, 5b55c73d551d483baaa6a1411c2597b1, - 2308f77723f54ac898588f48d1853b65, 54d2589714d04b2e928b816258cb0df4, f84b795348c04c7a950165301a643671, - bc853a4a8d3c4dbda23d183f0a3b4f27, 1012ddc0343842d8b913a7d85df8ab8f, 771a73a8f5084a57afc5654d72e022f0, - 311a43449f074841b6df4130b0871ac9, cd4d29cb01134469b52d6936c35eb943, 013cf89ee6174d29bb3f4fdff7b36049, - 9237d877d84e4b3ab69698ecf56915bb, 337ef4d37e6b4ff6bf6e8bd4ca93383f, b4096d3837b84ccdb8f1186435c87281, - 7259d3b7e11b4736b4d2aa8e9c55e994, 1ad1f8e99a864fc4a2bc532d9a4ff110, b2b50451eabd40978ef46db5e7dd08c4, - 2dad5c5541e243128e23c3dd3e420ac2, a3de458b61e5493081d6bb9cf7e923db, 37760f8a7b164e6f9c1a23d621e9fe6b, - 745a2aedcfab491fb9cffba19958b0c5, 
2f6c670640d048d2af453638cfde3a1e] -id: a22kTvgk6_fJ -outputId: 35fc38b9-a6ab-4b02-ffa4-ab27fac69df4 ---- -mnist_dataset = load_dataset("mnist").with_format("numpy") -``` - -+++ {"id": "IFjTyGxY19b0"} - -### Extract images and labels - -Get image shape and flatten for model input - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: NHrKatD_7HbH -outputId: deec1739-2fc0-4e71-8567-f2e0c9db198b ---- -train_images = mnist_dataset["train"]["image"] -train_labels = mnist_dataset["train"]["label"] -test_images = mnist_dataset["test"]["image"] -test_labels = mnist_dataset["test"]["label"] - -# Flatten images and one-hot encode labels -image_shape = train_images.shape[1:] -num_features = image_shape[0] * image_shape[1] - -train_images = train_images.reshape(-1, num_features) -test_images = test_images.reshape(-1, num_features) - -train_labels = one_hot(train_labels, n_targets) -test_labels = one_hot(test_labels, n_targets) - -print('Train:', train_images.shape, train_labels.shape) -print('Test:', test_images.shape, test_labels.shape) -``` - -+++ {"id": "kk_4zJlz7T1E"} - -### Define Training Generator - -Set up a generator to yield batches of images and labels for training. - -```{code-cell} -:id: -zLJhogj7RL- - -def hf_training_generator(): - """Yield batches for training.""" - for batch in mnist_dataset["train"].iter(batch_size): - x, y = batch["image"], batch["label"] - yield x, y -``` - -+++ {"id": "HIsGfkLI7dvZ"} - -### Training Loop (Hugging Face Datasets) - -Run the training loop using the Hugging Face training generator. 
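Note that the training call passes the function `hf_training_generator` itself rather than the result of calling it. The definition of `train_model` is not shown in this part of the tutorial, so the following is only a sketch of the likely motivation: a Python generator object is exhausted after one pass, so a multi-epoch loop needs a fresh one per epoch. `make_batches` is a hypothetical stand-in for the real generator.

```python
def make_batches():
    """Hypothetical generator function that yields three dummy batches."""
    for i in range(3):
        yield i

# Reusing one generator *object*: the second epoch sees no data.
gen = make_batches()
seen_with_object = [list(gen) for _ in range(2)]

# Calling the generator *function* once per epoch: every epoch sees all data.
seen_with_factory = [list(make_batches()) for _ in range(2)]

print(seen_with_object)   # [[0, 1, 2], []]
print(seen_with_factory)  # [[0, 1, 2], [0, 1, 2]]
```

By contrast, a re-iterable object, such as the PyTorch-based loader used earlier in the tutorial, can be passed directly because each `for` loop over it starts a new pass.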
- -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: RhloYGsw6nPf -outputId: d49c1cd2-a546-46a6-84fb-d9507c38f4ca ---- -train_model(num_epochs, params, hf_training_generator) -``` - -+++ {"id": "qXylIOwidWI3"} - -## Summary - -This notebook has introduced efficient strategies for data loading on a CPU with JAX, demonstrating how to integrate popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets. Each library offers distinct advantages, enabling you to streamline the data loading process for machine learning tasks. By understanding the strengths of these methods, you can select the approach that best suits your project's specific requirements. From 48675e89a7bb600ee445f7511ce70df6becf6921 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Wed, 4 Dec 2024 15:46:05 -0800 Subject: [PATCH 09/14] new_line_added_at_the_end --- docs/source/tutorials.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 2fa663a..071343b 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -32,4 +32,4 @@ Once you've gone through this content, you can refer to package-specific documentation for resources that go into more depth on various topics: - [JAX tutorials](https://jax.readthedocs.io/en/latest/tutorials.html) -- [FLAX user guides](https://flax.readthedocs.io/en/latest/guides/index.html) \ No newline at end of file +- [FLAX user guides](https://flax.readthedocs.io/en/latest/guides/index.html) From 6600b2e86ec1619fe991d69f04e85d7f7cb654f7 Mon Sep 17 00:00:00 2001 From: Selam Waktola Date: Wed, 4 Dec 2024 15:47:35 -0800 Subject: [PATCH 10/14] Adding tutorial for data loaders on gpu with jax (#109) --- docs/source/conf.py | 2 ++ docs/source/tutorials.md | 1 + 2 files changed, 3 insertions(+) diff --git a/docs/source/conf.py b/docs/source/conf.py index 45b5040..7826921 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -68,6 +68,7 @@ 
'data_loaders_on_cpu_with_jax.md', 'data_loaders_on_gpu_with_jax.md', 'data_loaders_for_multi_device_setups_with_jax.md', + 'data_loaders_on_gpu_with_jax.md', ] suppress_warnings = [ @@ -107,4 +108,5 @@ 'data_loaders_on_cpu_with_jax.ipynb', 'data_loaders_on_gpu_with_jax.ipynb', 'data_loaders_for_multi_device_setups_with_jax.ipynb', + 'data_loaders_on_gpu_with_jax.ipynb', ] diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 071343b..36cd40f 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -26,6 +26,7 @@ JAX_transformer_text_classification data_loaders_on_cpu_with_jax data_loaders_on_gpu_with_jax data_loaders_for_multi_device_setups_with_jax +data_loaders_on_gpu_with_jax ``` Once you've gone through this content, you can refer to package-specific From 59a066f6d36ae8d1eb86565781542aed0e1ab2d3 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Tue, 26 Nov 2024 12:36:38 -0800 Subject: [PATCH 11/14] file_conflict_resolved --- docs/source/conf.py | 2 ++ docs/source/tutorials.md | 1 + 2 files changed, 3 insertions(+) diff --git a/docs/source/conf.py b/docs/source/conf.py index 7826921..2421f84 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -69,6 +69,7 @@ 'data_loaders_on_gpu_with_jax.md', 'data_loaders_for_multi_device_setups_with_jax.md', 'data_loaders_on_gpu_with_jax.md', + 'data_loaders_for_multi_device_setups_with_jax.md', ] suppress_warnings = [ @@ -109,4 +110,5 @@ 'data_loaders_on_gpu_with_jax.ipynb', 'data_loaders_for_multi_device_setups_with_jax.ipynb', 'data_loaders_on_gpu_with_jax.ipynb', + 'data_loaders_for_multi_device_setups_with_jax.ipynb', ] diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 36cd40f..20e3dae 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -27,6 +27,7 @@ data_loaders_on_cpu_with_jax data_loaders_on_gpu_with_jax data_loaders_for_multi_device_setups_with_jax data_loaders_on_gpu_with_jax +data_loaders_for_multi_device_setups_with_jax ``` Once you've 
gone through this content, you can refer to package-specific From 950043fd5dbe88313a8445e327e625b996c7bec9 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Wed, 4 Dec 2024 14:24:31 -0800 Subject: [PATCH 12/14] files_rebased_from_docs_to_dosc_source --- docs/source/tutorials.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 20e3dae..6eb0fab 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -34,4 +34,4 @@ Once you've gone through this content, you can refer to package-specific documentation for resources that go into more depth on various topics: - [JAX tutorials](https://jax.readthedocs.io/en/latest/tutorials.html) -- [FLAX user guides](https://flax.readthedocs.io/en/latest/guides/index.html) +- [FLAX user guides](https://flax.readthedocs.io/en/latest/guides/index.html) \ No newline at end of file From e88d55c335189d383aaeae59493f58d74bddccb5 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Wed, 4 Dec 2024 17:35:09 -0800 Subject: [PATCH 13/14] file_conflict_resolved_and_old_files_removed --- docs/source/conf.py | 4 - .../source/data_loaders_on_cpu_with_jax.ipynb | 3576 ----------------- docs/source/data_loaders_on_cpu_with_jax.md | 691 ---- docs/source/tutorials.md | 2 - 4 files changed, 4273 deletions(-) delete mode 100644 docs/source/data_loaders_on_cpu_with_jax.ipynb delete mode 100644 docs/source/data_loaders_on_cpu_with_jax.md diff --git a/docs/source/conf.py b/docs/source/conf.py index 2421f84..45b5040 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -68,8 +68,6 @@ 'data_loaders_on_cpu_with_jax.md', 'data_loaders_on_gpu_with_jax.md', 'data_loaders_for_multi_device_setups_with_jax.md', - 'data_loaders_on_gpu_with_jax.md', - 'data_loaders_for_multi_device_setups_with_jax.md', ] suppress_warnings = [ @@ -109,6 +107,4 @@ 'data_loaders_on_cpu_with_jax.ipynb', 'data_loaders_on_gpu_with_jax.ipynb', 'data_loaders_for_multi_device_setups_with_jax.ipynb', - 
'data_loaders_on_gpu_with_jax.ipynb', - 'data_loaders_for_multi_device_setups_with_jax.ipynb', ] diff --git a/docs/source/data_loaders_on_cpu_with_jax.ipynb b/docs/source/data_loaders_on_cpu_with_jax.ipynb deleted file mode 100644 index 34a8445..0000000 --- a/docs/source/data_loaders_on_cpu_with_jax.ipynb +++ /dev/null @@ -1,3576 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "PUFGZggH49zp" - }, - "source": [ - "# Introduction to Data Loaders on CPU with JAX" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3ia4PKEV5Dr8" - }, - "source": [ - "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jax-ml/jax-ai-stack/blob/main/docs/data_loaders_on_cpu_with_jax.ipynb)\n", - "\n", - "This tutorial explores different data loading strategies for using **JAX** on a single [**CPU**](https://jax.readthedocs.io/en/latest/glossary.html#term-CPU). While JAX doesn't include a built-in data loader, it seamlessly integrates with popular data loading libraries, including:\n", - "\n", - "- [**PyTorch DataLoader**](https://github.com/pytorch/data)\n", - "- [**TensorFlow Datasets (TFDS)**](https://github.com/tensorflow/datasets)\n", - "- [**Grain**](https://github.com/google/grain)\n", - "- [**Hugging Face**](https://huggingface.co/docs/datasets/en/use_with_jax#data-loading)\n", - "\n", - "In this tutorial, you'll learn how to efficiently load data using these libraries for a simple image classification task based on the MNIST dataset.\n", - "\n", - "Compared to GPU or multi-device setups, CPU-based data loading is straightforward as it avoids challenges like GPU memory management and data synchronization across devices. 
This makes it ideal for smaller-scale tasks or scenarios where data resides exclusively on the CPU.\n", - "\n", - "If you're looking for GPU-specific data loading advice, see [Data Loaders on GPU](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_on_gpu_with_jax.html).\n", - "\n", - "If you're looking for a multi-device data loading strategy, see [Data Loaders on Multi-Device Setups](https://jax-ai-stack.readthedocs.io/en/latest/data_loaders_for_multi_device_setups_with_jax.html)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pEsb135zE-Jo" - }, - "source": [ - "## Setting JAX to Use CPU Only\n", - "\n", - "First, you'll restrict JAX to use only the CPU, even if a GPU is available. This ensures consistency and allows you to focus on CPU-based data loading." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "vqP6xyObC0_9" - }, - "outputs": [], - "source": [ - "import os\n", - "os.environ['JAX_PLATFORM_NAME'] = 'cpu'" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-rsMgVtO6asW" - }, - "source": [ - "Import JAX API" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "tDJNQ6V-Dg5g" - }, - "outputs": [], - "source": [ - "import jax\n", - "import jax.numpy as jnp\n", - "from jax import random, grad, jit, vmap" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TsFdlkSZKp9S" - }, - "source": [ - "### CPU Setup Verification" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "N3sqvaF3KJw1", - "outputId": "449c83d9-d050-4b15-9a8d-f71e340501f2" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[CpuDevice(id=0)]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "jax.devices()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qyJ_WTghDnIc" - }, - "source": [ - "## Setting 
Hyperparameters and Initializing Parameters\n", - "\n", - "You'll define hyperparameters for your model and data loading, including layer sizes, learning rate, batch size, and the data directory. You'll also initialize the weights and biases for a fully-connected neural network." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "qLNOSloFDka_" - }, - "outputs": [], - "source": [ - "# A helper function to randomly initialize weights and biases\n", - "# for a dense neural network layer\n", - "def random_layer_params(m, n, key, scale=1e-2):\n", - " w_key, b_key = random.split(key)\n", - " return scale * random.normal(w_key, (n, m)), scale * random.normal(b_key, (n,))\n", - "\n", - "# Function to initialize network parameters for all layers based on defined sizes\n", - "def init_network_params(sizes, key):\n", - " keys = random.split(key, len(sizes))\n", - " return [random_layer_params(m, n, k) for m, n, k in zip(sizes[:-1], sizes[1:], keys)]\n", - "\n", - "layer_sizes = [784, 512, 512, 10] # Layers of the network\n", - "step_size = 0.01 # Learning rate for optimization\n", - "num_epochs = 8 # Number of training epochs\n", - "batch_size = 128 # Batch size for training\n", - "n_targets = 10 # Number of classes (digits 0-9)\n", - "num_pixels = 28 * 28 # Input size (MNIST images are 28x28 pixels)\n", - "data_dir = '/tmp/mnist_dataset' # Directory for storing the dataset\n", - "\n", - "# Initialize network parameters using the defined layer sizes and a random seed\n", - "params = init_network_params(layer_sizes, random.PRNGKey(0))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6Ci_CqW7q6XM" - }, - "source": [ - "## Model Prediction with Auto-Batching\n", - "\n", - "In this section, you'll define the `predict` function for your neural network. 
This function computes the output of the network for a single input image.\n", - "\n", - "To efficiently process multiple images simultaneously, you'll use [`vmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html#jax.vmap), which allows you to vectorize the `predict` function and apply it across a batch of inputs. This technique, called auto-batching, improves computational efficiency by leveraging hardware acceleration." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "bKIYPSkvD1QV" - }, - "outputs": [], - "source": [ - "from jax.scipy.special import logsumexp\n", - "\n", - "def relu(x):\n", - " return jnp.maximum(0, x)\n", - "\n", - "def predict(params, image):\n", - " # per-example prediction\n", - " activations = image\n", - " for w, b in params[:-1]:\n", - " outputs = jnp.dot(w, activations) + b\n", - " activations = relu(outputs)\n", - "\n", - " final_w, final_b = params[-1]\n", - " logits = jnp.dot(final_w, activations) + final_b\n", - " return logits - logsumexp(logits)\n", - "\n", - "# Make a batched version of the `predict` function\n", - "batched_predict = vmap(predict, in_axes=(None, 0))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "niTSr34_sDZi" - }, - "source": [ - "## Utility and Loss Functions\n", - "\n", - "You'll now define utility functions for:\n", - "\n", - "- One-hot encoding: Converts class indices to binary vectors.\n", - "- Accuracy calculation: Measures the performance of the model on the dataset.\n", - "- Loss computation: Calculates the difference between predictions and targets.\n", - "\n", - "To optimize performance:\n", - "\n", - "- [`grad`](https://jax.readthedocs.io/en/latest/_autosummary/jax.grad.html#jax.grad) is used to compute gradients of the loss function with respect to network parameters.\n", - "- [`jit`](https://jax.readthedocs.io/en/latest/_autosummary/jax.jit.html#jax.jit) compiles the update function, enabling faster execution by leveraging JAX's 
[XLA](https://openxla.org/xla) compilation." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "id": "sA0a06raEQfS" - }, - "outputs": [], - "source": [ - "import time\n", - "\n", - "def one_hot(x, k, dtype=jnp.float32):\n", - " \"\"\"Create a one-hot encoding of x of size k.\"\"\"\n", - " return jnp.array(x[:, None] == jnp.arange(k), dtype)\n", - "\n", - "def accuracy(params, images, targets):\n", - " \"\"\"Calculate the accuracy of predictions.\"\"\"\n", - " target_class = jnp.argmax(targets, axis=1)\n", - " predicted_class = jnp.argmax(batched_predict(params, images), axis=1)\n", - " return jnp.mean(predicted_class == target_class)\n", - "\n", - "def loss(params, images, targets):\n", - " \"\"\"Calculate the loss between predictions and targets.\"\"\"\n", - " preds = batched_predict(params, images)\n", - " return -jnp.mean(preds * targets)\n", - "\n", - "@jit\n", - "def update(params, x, y):\n", - " \"\"\"Update the network parameters using gradient descent.\"\"\"\n", - " grads = grad(loss)(params, x, y)\n", - " return [(w - step_size * dw, b - step_size * db)\n", - " for (w, b), (dw, db) in zip(params, grads)]\n", - "\n", - "def reshape_and_one_hot(x, y):\n", - " \"\"\"Reshape and one-hot encode the inputs.\"\"\"\n", - " x = jnp.reshape(x, (len(x), num_pixels))\n", - " y = one_hot(y, n_targets)\n", - " return x, y\n", - "\n", - "def train_model(num_epochs, params, training_generator, data_loader_type='streamed'):\n", - " \"\"\"Train the model for a given number of epochs.\"\"\"\n", - " for epoch in range(num_epochs):\n", - " start_time = time.time()\n", - " for x, y in training_generator() if data_loader_type == 'streamed' else training_generator:\n", - " x, y = reshape_and_one_hot(x, y)\n", - " params = update(params, x, y)\n", - "\n", - " print(f\"Epoch {epoch + 1} in {time.time() - start_time:.2f} sec: \"\n", - " f\"Train Accuracy: {accuracy(params, train_images, train_labels):.4f}, \"\n", - " f\"Test Accuracy: {accuracy(params, 
test_images, test_labels):.4f}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Hsionp5IYsQ9" - }, - "source": [ - "## Loading Data with PyTorch DataLoader\n", - "\n", - "This section shows how to load the MNIST dataset using PyTorch's DataLoader, convert the data to NumPy arrays, and apply transformations to flatten and cast images." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "jmsfrWrHxIhC", - "outputId": "33dfeada-a763-4d26-f778-a27966e34d55" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.1+cu121)\n", - "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.20.1+cu121)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.16.1)\n", - "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\n", - "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)\n", - "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\n", - "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2024.10.0)\n", - "Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)\n", - "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)\n", - "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision) (1.26.4)\n", - "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision) (11.0.0)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in 
/usr/local/lib/python3.10/dist-packages (from jinja2->torch) (3.0.2)\n" - ] - } - ], - "source": [ - "!pip install torch torchvision" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "kO5_WzwY59gE" - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "from jax.tree_util import tree_map\n", - "from torch.utils import data\n", - "from torchvision.datasets import MNIST" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "6f6qU8PCc143" - }, - "outputs": [], - "source": [ - "def numpy_collate(batch):\n", - " \"\"\"Convert a batch of PyTorch data to NumPy arrays.\"\"\"\n", - " return tree_map(np.asarray, data.default_collate(batch))\n", - "\n", - "class NumpyLoader(data.DataLoader):\n", - " \"\"\"Custom DataLoader to return NumPy arrays from a PyTorch Dataset.\"\"\"\n", - " def __init__(self, dataset, batch_size=1, shuffle=False, **kwargs):\n", - " super().__init__(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=numpy_collate, **kwargs)\n", - "\n", - "class FlattenAndCast(object):\n", - " \"\"\"Transform class to flatten and cast images to float32.\"\"\"\n", - " def __call__(self, pic):\n", - " return np.ravel(np.array(pic, dtype=jnp.float32))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mfSnfJND6I8G" - }, - "source": [ - "### Load Dataset with Transformations\n", - "\n", - "Standardize the data by flattening the images, casting them to `float32`, and ensuring consistent data types." 
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "Kxbl6bcx6crv", - "outputId": "372bbf4c-3ad5-4fd8-cc5d-27b50f5e4f38" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz\n", - "Failed to download (trying next):\n", - "HTTP Error 403: Forbidden\n", - "\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/train-images-idx3-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 9.91M/9.91M [00:00<00:00, 49.4MB/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Extracting /tmp/mnist_dataset/MNIST/raw/train-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", - "\n", - "Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz\n", - "Failed to download (trying next):\n", - "HTTP Error 403: Forbidden\n", - "\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/train-labels-idx1-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 28.9k/28.9k [00:00<00:00, 2.09MB/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Extracting /tmp/mnist_dataset/MNIST/raw/train-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", - "\n", - "Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Failed to download (trying 
next):\n", - "HTTP Error 403: Forbidden\n", - "\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/t10k-images-idx3-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 1.65M/1.65M [00:00<00:00, 13.3MB/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Extracting /tmp/mnist_dataset/MNIST/raw/t10k-images-idx3-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", - "\n", - "Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz\n", - "Failed to download (trying next):\n", - "HTTP Error 403: Forbidden\n", - "\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz\n", - "Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 4.54k/4.54k [00:00<00:00, 8.81MB/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Extracting /tmp/mnist_dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz to /tmp/mnist_dataset/MNIST/raw\n", - "\n" - ] - } - ], - "source": [ - "mnist_dataset = MNIST(data_dir, download=True, transform=FlattenAndCast())" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kbdsqvPZGrsa" - }, - "source": [ - "### Full Training Dataset for Accuracy Checks\n", - "\n", - "Convert the entire training dataset to JAX arrays." 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "c9ZCJq_rzPck" - }, - "outputs": [], - "source": [ - "train_images = jnp.array(mnist_dataset.data.numpy().reshape(len(mnist_dataset.data), -1), dtype=jnp.float32)\n", - "train_labels = one_hot(np.array(mnist_dataset.targets), n_targets)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WXUh0BwvG8Ko" - }, - "source": [ - "### Get Full Test Dataset\n", - "\n", - "Load and process the full test dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "id": "brlLG4SqGphm" - }, - "outputs": [], - "source": [ - "mnist_dataset_test = MNIST(data_dir, download=True, train=False)\n", - "test_images = jnp.array(mnist_dataset_test.data.numpy().reshape(len(mnist_dataset_test.data), -1), dtype=jnp.float32)\n", - "test_labels = one_hot(np.array(mnist_dataset_test.targets), n_targets)" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "Oz-UVnCxG5E8", - "outputId": "abbaa26d-491a-4e63-e8c9-d3c571f53a28" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train: (60000, 784) (60000, 10)\n", - "Test: (10000, 784) (10000, 10)\n" - ] - } - ], - "source": [ - "print('Train:', train_images.shape, train_labels.shape)\n", - "print('Test:', test_images.shape, test_labels.shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "m3zfxqnMiCbm" - }, - "source": [ - "### Training Data Generator\n", - "\n", - "Define a generator function using PyTorch's DataLoader for batch training. Setting `num_workers > 0` enables multi-process data loading, which can accelerate data loading for larger datasets or intensive preprocessing tasks. 
Experiment with different values to find the optimal setting for your hardware and workload.\n", - "\n", - "Note: When setting `num_workers > 0`, you may see the following `RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.` This warning can be safely ignored since data loaders do not use JAX within the forked processes." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "id": "B-fES82EiL6Z" - }, - "outputs": [], - "source": [ - "def pytorch_training_generator(mnist_dataset):\n", - " return NumpyLoader(mnist_dataset, batch_size=batch_size, num_workers=0)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Xzt2x9S1HC3T" - }, - "source": [ - "### Training Loop (PyTorch DataLoader)\n", - "\n", - "The training loop uses the PyTorch DataLoader to iterate through batches and update model parameters." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "vtUjHsh-rJs8", - "outputId": "4766333e-4366-493b-995a-102778d1345a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Epoch 1 in 28.93 sec: Train Accuracy: 0.9158, Test Accuracy: 0.9196\n", - "Epoch 2 in 8.33 sec: Train Accuracy: 0.9372, Test Accuracy: 0.9384\n", - "Epoch 3 in 6.99 sec: Train Accuracy: 0.9492, Test Accuracy: 0.9468\n", - "Epoch 4 in 7.01 sec: Train Accuracy: 0.9569, Test Accuracy: 0.9532\n", - "Epoch 5 in 8.17 sec: Train Accuracy: 0.9630, Test Accuracy: 0.9579\n", - "Epoch 6 in 8.27 sec: Train Accuracy: 0.9674, Test Accuracy: 0.9615\n", - "Epoch 7 in 8.32 sec: Train Accuracy: 0.9708, Test Accuracy: 0.9650\n", - "Epoch 8 in 8.07 sec: Train Accuracy: 0.9737, Test Accuracy: 0.9671\n" - ] - } - ], - "source": [ - "train_model(num_epochs, params, pytorch_training_generator(mnist_dataset), data_loader_type='iterable')" - ] - }, - { - 
"cell_type": "markdown", - "metadata": { - "id": "Nm45ZTo6yrf5" - }, - "source": [ - "## Loading Data with TensorFlow Datasets (TFDS)\n", - "\n", - "This section demonstrates how to load the MNIST dataset using TFDS, fetch the full dataset for evaluation, and define a training generator for batch processing. GPU usage is explicitly disabled for TensorFlow." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": { - "id": "sGaQAk1DHMUx" - }, - "outputs": [], - "source": [ - "import tensorflow_datasets as tfds\n", - "import tensorflow as tf\n", - "\n", - "# Ensuring CPU-Only Execution, disable any GPU usage(if applicable) for TF\n", - "tf.config.set_visible_devices([], device_type='GPU')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3xdQY7H6wr3n" - }, - "source": [ - "### Fetch Full Dataset for Evaluation\n", - "\n", - "Load the dataset with `tfds.load`, convert it to NumPy arrays, and process it for evaluation." - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 104, - "referenced_widgets": [ - "b8cdabf5c05848f38f03850cab08b56f", - "a8b76d5f93004c089676e5a2a9b3336c", - "119ac8428f9441e7a25eb0afef2fbb2a", - "76a9815e5c2b4764a13409cebaf66821", - "45ce8dd5c4b949afa957ec8ffb926060", - "05b7145fd62d4581b2123c7680f11cdd", - "b96267f014814ec5b96ad7e6165104b1", - "bce34bdbfbd64f1f8353a4e8515cee0b", - "93b8206f8c5841a692cdce985ae301d8", - "c95f592620c64da595cc787567b2c4db", - "8a97071f862c4ec3b4b4140d2e34eda2" - ] - }, - "id": "1hOamw_7C8Pb", - "outputId": "ca166490-22db-4732-b29f-866b7593e489" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /tmp/mnist_dataset/mnist/3.0.1...\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "b8cdabf5c05848f38f03850cab08b56f", - 
"version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Dl Completed...: 0%| | 0/5 [00:00=9.1.0 in /usr/local/lib/python3.10/dist-packages (from grain) (10.5.0)\n", - "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from grain) (1.26.4)\n", - "Requirement already satisfied: typing_extensions in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (4.12.2)\n", - "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (2024.10.0)\n", - "Requirement already satisfied: importlib_resources in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (6.4.5)\n", - "Requirement already satisfied: zipp in /usr/local/lib/python3.10/dist-packages (from etils[epath,epy]->grain) (3.21.0)\n", - "Downloading grain-0.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (418 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m419.0/419.0 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading jaxtyping-0.2.36-py3-none-any.whl (55 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.8/55.8 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hInstalling collected packages: jaxtyping, grain\n", - "Successfully installed grain-0.2.2 jaxtyping-0.2.36\n" - ] - } - ], - "source": [ - "!pip install grain" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "66bH3ZDJ7Iat" - }, - "source": [ - "Import Required Libraries (import MNIST dataset from torchvision)" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": { - "id": "mS62eVL9Ifmz" - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "import grain.python as pygrain\n", - "from torchvision.datasets import MNIST" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0h6mwVrspPA-" - }, 
- "source": [ - "### Define Dataset Class\n", - "\n", - "Create a custom dataset class to load MNIST data for Grain." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": { - "id": "bnrhac5Hh7y1" - }, - "outputs": [], - "source": [ - "class Dataset:\n", - " def __init__(self, data_dir, train=True):\n", - " self.data_dir = data_dir\n", - " self.train = train\n", - " self.load_data()\n", - "\n", - " def load_data(self):\n", - " self.dataset = MNIST(self.data_dir, download=True, train=self.train)\n", - "\n", - " def __len__(self):\n", - " return len(self.dataset)\n", - "\n", - " def __getitem__(self, index):\n", - " img, label = self.dataset[index]\n", - " return np.ravel(np.array(img, dtype=np.float32)), label" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "53mf8bWEsyTr" - }, - "source": [ - "### Initialize the Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": { - "id": "pN3oF7-ostGE" - }, - "outputs": [], - "source": [ - "mnist_dataset = Dataset(data_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GqD-ycgBuwv9" - }, - "source": [ - "### Get the full train and test dataset" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": { - "id": "f1VnTuX3u_kL" - }, - "outputs": [], - "source": [ - "# Convert training data to JAX arrays and encode labels as one-hot vectors\n", - "train_images = jnp.array([mnist_dataset[i][0] for i in range(len(mnist_dataset))], dtype=jnp.float32)\n", - "train_labels = one_hot(np.array([mnist_dataset[i][1] for i in range(len(mnist_dataset))]), n_targets)\n", - "\n", - "# Load test dataset and process it\n", - "mnist_dataset_test = MNIST(data_dir, download=True, train=False)\n", - "test_images = jnp.array([np.ravel(np.array(mnist_dataset_test[i][0], dtype=np.float32)) for i in range(len(mnist_dataset_test))], dtype=jnp.float32)\n", - "test_labels = one_hot(np.array([mnist_dataset_test[i][1] for i in 
range(len(mnist_dataset_test))]), n_targets)" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "a2NHlp9klrQL", - "outputId": "14be58c0-851e-4a44-dfcc-d02f0718dab5" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train: (60000, 784) (60000, 10)\n", - "Test: (10000, 784) (10000, 10)\n" - ] - } - ], - "source": [ - "print(\"Train:\", train_images.shape, train_labels.shape)\n", - "print(\"Test:\", test_images.shape, test_labels.shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fETnWRo2crhf" - }, - "source": [ - "### Initialize PyGrain DataLoader\n", - "\n", - "Set up a PyGrain DataLoader for sequential batch sampling." - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": { - "id": "9RuFTcsCs2Ac" - }, - "outputs": [], - "source": [ - "sampler = pygrain.SequentialSampler(\n", - " num_records=len(mnist_dataset),\n", - " shard_options=pygrain.NoSharding()) # Single-device, no sharding\n", - "\n", - "def pygrain_training_generator():\n", - " \"\"\"Grain DataLoader generator for training.\"\"\"\n", - " return pygrain.DataLoader(\n", - " data_source=mnist_dataset,\n", - " sampler=sampler,\n", - " operations=[pygrain.Batch(batch_size)],\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GvpJPHAbeuHW" - }, - "source": [ - "### Training Loop (Grain)\n", - "\n", - "Run the training loop using the Grain DataLoader." 
- ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "cjxJRtiTadEI", - "outputId": "3f624366-b683-4d20-9d0a-777d345b0e21" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Epoch 1 in 15.39 sec: Train Accuracy: 0.9158, Test Accuracy: 0.9196\n", - "Epoch 2 in 15.27 sec: Train Accuracy: 0.9372, Test Accuracy: 0.9384\n", - "Epoch 3 in 12.61 sec: Train Accuracy: 0.9492, Test Accuracy: 0.9468\n", - "Epoch 4 in 12.62 sec: Train Accuracy: 0.9569, Test Accuracy: 0.9532\n", - "Epoch 5 in 12.39 sec: Train Accuracy: 0.9630, Test Accuracy: 0.9579\n", - "Epoch 6 in 12.19 sec: Train Accuracy: 0.9674, Test Accuracy: 0.9615\n", - "Epoch 7 in 12.56 sec: Train Accuracy: 0.9708, Test Accuracy: 0.9650\n", - "Epoch 8 in 13.04 sec: Train Accuracy: 0.9737, Test Accuracy: 0.9671\n" - ] - } - ], - "source": [ - "train_model(num_epochs, params, pygrain_training_generator)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oixvOI816qUn" - }, - "source": [ - "## Loading Data with Hugging Face\n", - "\n", - "This section demonstrates loading MNIST data using the Hugging Face `datasets` library. You'll format the dataset for JAX compatibility, prepare flattened images and one-hot-encoded labels, and define a training generator." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o51P6lr86wz-" - }, - "source": [ - "Install the Hugging Face `datasets` library." 
- ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "19ipxPhI6oSN", - "outputId": "684e445f-d23e-4924-9e76-2c2c9359f0be" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collecting datasets\n", - " Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.16.1)\n", - "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.26.4)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)\n", - "Collecting dill<0.3.9,>=0.3.0 (from datasets)\n", - " Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)\n", - "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)\n", - "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.32.3)\n", - "Requirement already satisfied: tqdm>=4.66.3 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.6)\n", - "Collecting xxhash (from datasets)\n", - " Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n", - "Collecting multiprocess<0.70.17 (from datasets)\n", - " Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)\n", - "Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)\n", - " Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)\n", - "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.11.2)\n", - "Requirement already satisfied: huggingface-hub>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.26.2)\n", - "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages 
(from datasets) (24.2)\n", - "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.2)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.3)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.5.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.1.0)\n", - "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (0.2.0)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.17.2)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.0->datasets) (4.12.2)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.4.0)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.10)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2.2.3)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2024.8.30)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in 
/usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)\n", - "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", - "Downloading datasets-3.1.0-py3-none-any.whl (480 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (179 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.3/179.3 kB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading multiprocess-0.70.16-py310-none-any.whl (134 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m9.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hDownloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m15.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hInstalling collected packages: xxhash, fsspec, dill, multiprocess, datasets\n", - " Attempting uninstall: fsspec\n", - " Found existing installation: fsspec 2024.10.0\n", - " Uninstalling fsspec-2024.10.0:\n", - " Successfully uninstalled fsspec-2024.10.0\n", - 
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", - "gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.\u001b[0m\u001b[31m\n", - "\u001b[0mSuccessfully installed datasets-3.1.0 dill-0.3.8 fsspec-2024.9.0 multiprocess-0.70.16 xxhash-3.5.0\n" - ] - } - ], - "source": [ - "!pip install datasets" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "be0h_dZv0593" - }, - "source": [ - "Import Library" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": { - "id": "8v1N59p76zn0" - }, - "outputs": [], - "source": [ - "from datasets import load_dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8Gaj11tO7C86" - }, - "source": [ - "### Load and Format MNIST Dataset\n", - "\n", - "Load the MNIST dataset from Hugging Face and format it as `numpy` arrays for quick access or `jax` to get JAX arrays." 
- ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 301, - "referenced_widgets": [ - "32f6132a31aa4c508d3c3c5ef70348bb", - "d7c2ffa6b143463c91cbf8befca6ca01", - "fd964ecd3926419d92927c67f955d5d0", - "60feca3fde7c4447ad8393b0542eb999", - "3354a0baeca94d18bc6b2a8b8b465b58", - "a0d0d052772b46deac7657ad052991a4", - "fb34783b9cba462e9b690e0979c4b07a", - "8d8170c1ed99490589969cd753c40748", - "f1ecb6db00a54e088f1e09164222d637", - "3cf5dd8d29aa4619b39dc2542df7e42e", - "2e5d42ca710441b389895f2d3b611d0a", - "5d8202da24244dc896e9a8cba6a4ed4f", - "a6d64c953631412b8bd8f0ba53ae4d32", - "69240c5cbfbb4e91961f5b49812a26f0", - "865f38532b784a7c971f5d33b87b443e", - "ceb1c004191947cdaa10af9b9c03c80d", - "64c6041037914779b5e8e9cf5a80ad04", - "562fa6a0e7b846a180ac4b423c5511c5", - "b3b922288f9c4df2a4088279ff6d1531", - "75a1a8ffda554318890cf74c345ed9a9", - "3bae06cacf394a5998c2326199da94f5", - "ff6428a3daa5496c81d5e664aba01f97", - "1ba3f86870724f55b94a35cb6b4173af", - "b3e163fd8b8a4f289d5a25611cb66d23", - "abd2daba215e4f7c9ddabde04d6eb382", - "e22ee019049144d5aba573cdf4dbe4fc", - "6ac765dac67841a69218140785f024c6", - "7b057411a54e434fb74804b90daa8d44", - "563f71b3c67d47c3ab1100f5dc1b98f3", - "d81a657361ab4bba8bcc0cf309d2ff64", - "20316312ab88471ba90cbb954be3e964", - "698fda742f834473a23fb7e5e4cf239c", - "289b52c5a38146b8b467a5f4678f6271", - "d07c2f37cf914894b1551a8104e6cb70", - "5b55c73d551d483baaa6a1411c2597b1", - "2308f77723f54ac898588f48d1853b65", - "54d2589714d04b2e928b816258cb0df4", - "f84b795348c04c7a950165301a643671", - "bc853a4a8d3c4dbda23d183f0a3b4f27", - "1012ddc0343842d8b913a7d85df8ab8f", - "771a73a8f5084a57afc5654d72e022f0", - "311a43449f074841b6df4130b0871ac9", - "cd4d29cb01134469b52d6936c35eb943", - "013cf89ee6174d29bb3f4fdff7b36049", - "9237d877d84e4b3ab69698ecf56915bb", - "337ef4d37e6b4ff6bf6e8bd4ca93383f", - "b4096d3837b84ccdb8f1186435c87281", - "7259d3b7e11b4736b4d2aa8e9c55e994", - 
"1ad1f8e99a864fc4a2bc532d9a4ff110", - "b2b50451eabd40978ef46db5e7dd08c4", - "2dad5c5541e243128e23c3dd3e420ac2", - "a3de458b61e5493081d6bb9cf7e923db", - "37760f8a7b164e6f9c1a23d621e9fe6b", - "745a2aedcfab491fb9cffba19958b0c5", - "2f6c670640d048d2af453638cfde3a1e" - ] - }, - "id": "a22kTvgk6_fJ", - "outputId": "35fc38b9-a6ab-4b02-ffa4-ab27fac69df4" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n", - "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", - "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", - "You will be able to reuse this secret in all of your notebooks.\n", - "Please note that authentication is recommended but still optional to access public models or datasets.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "32f6132a31aa4c508d3c3c5ef70348bb", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "README.md: 0%| | 0.00/6.97k [00:00 0` enables multi-process data loading, which can accelerate data loading for larger datasets or intensive preprocessing tasks. Experiment with different values to find the optimal setting for your hardware and workload. - -Note: When setting `num_workers > 0`, you may see the following `RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.` This warning can be safely ignored since data loaders do not use JAX within the forked processes. 
- -```{code-cell} -:id: B-fES82EiL6Z - -def pytorch_training_generator(mnist_dataset): - return NumpyLoader(mnist_dataset, batch_size=batch_size, num_workers=0) -``` - -+++ {"id": "Xzt2x9S1HC3T"} - -### Training Loop (PyTorch DataLoader) - -The training loop uses the PyTorch DataLoader to iterate through batches and update model parameters. - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: vtUjHsh-rJs8 -outputId: 4766333e-4366-493b-995a-102778d1345a ---- -train_model(num_epochs, params, pytorch_training_generator(mnist_dataset), data_loader_type='iterable') -``` - -+++ {"id": "Nm45ZTo6yrf5"} - -## Loading Data with TensorFlow Datasets (TFDS) - -This section demonstrates how to load the MNIST dataset using TFDS, fetch the full dataset for evaluation, and define a training generator for batch processing. GPU usage is explicitly disabled for TensorFlow. - -```{code-cell} -:id: sGaQAk1DHMUx - -import tensorflow_datasets as tfds -import tensorflow as tf - -# Ensuring CPU-Only Execution, disable any GPU usage(if applicable) for TF -tf.config.set_visible_devices([], device_type='GPU') -``` - -+++ {"id": "3xdQY7H6wr3n"} - -### Fetch Full Dataset for Evaluation - -Load the dataset with `tfds.load`, convert it to NumPy arrays, and process it for evaluation. 
- -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ - height: 104 - referenced_widgets: [b8cdabf5c05848f38f03850cab08b56f, a8b76d5f93004c089676e5a2a9b3336c, - 119ac8428f9441e7a25eb0afef2fbb2a, 76a9815e5c2b4764a13409cebaf66821, 45ce8dd5c4b949afa957ec8ffb926060, - 05b7145fd62d4581b2123c7680f11cdd, b96267f014814ec5b96ad7e6165104b1, bce34bdbfbd64f1f8353a4e8515cee0b, - 93b8206f8c5841a692cdce985ae301d8, c95f592620c64da595cc787567b2c4db, 8a97071f862c4ec3b4b4140d2e34eda2] -id: 1hOamw_7C8Pb -outputId: ca166490-22db-4732-b29f-866b7593e489 ---- -# tfds.load returns tf.Tensors (or tf.data.Datasets if batch_size != -1) -mnist_data, info = tfds.load(name="mnist", batch_size=-1, data_dir=data_dir, with_info=True) -mnist_data = tfds.as_numpy(mnist_data) -train_data, test_data = mnist_data['train'], mnist_data['test'] - -# Full train set -train_images, train_labels = train_data['image'], train_data['label'] -train_images = jnp.reshape(train_images, (len(train_images), num_pixels)) -train_labels = one_hot(train_labels, n_targets) - -# Full test set -test_images, test_labels = test_data['image'], test_data['label'] -test_images = jnp.reshape(test_images, (len(test_images), num_pixels)) -test_labels = one_hot(test_labels, n_targets) -``` - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: Td3PiLdmEf7z -outputId: 96403b0f-6079-43ce-df16-d4583f09906b ---- -print('Train:', train_images.shape, train_labels.shape) -print('Test:', test_images.shape, test_labels.shape) -``` - -+++ {"id": "UWRSaalfdyDX"} - -### Define the Training Generator - -Create a generator function to yield batches of data for training. 
- -```{code-cell} -:id: vX59u8CqEf4J - -def training_generator(): - # as_supervised=True gives us the (image, label) as a tuple instead of a dict - ds = tfds.load(name='mnist', split='train', as_supervised=True, data_dir=data_dir) - # You can build up an arbitrary tf.data input pipeline - ds = ds.batch(batch_size).prefetch(1) - # tfds.dataset_as_numpy converts the tf.data.Dataset into an iterable of NumPy arrays - return tfds.as_numpy(ds) -``` - -+++ {"id": "EAWeUdnuFNBY"} - -### Training Loop (TFDS) - -Use the training generator in a custom training loop. - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: h2sO13XDGvq1 -outputId: a150246e-ceb5-46ac-db71-2a8177a9d04d ---- -train_model(num_epochs, params, training_generator) -``` - -+++ {"id": "-ryVkrAITS9Z"} - -## Loading Data with Grain - -This section demonstrates how to load MNIST data using Grain, a data-loading library. You'll define a custom dataset class for Grain and set up a Grain DataLoader for efficient training. - -+++ {"id": "waYhUMUGmhH-"} - -Install Grain - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: L78o7eeyGvn5 -outputId: 76d16565-0d9e-4f5f-c6b1-4cf4a683d0e7 ---- -!pip install grain -``` - -+++ {"id": "66bH3ZDJ7Iat"} - -Import Required Libraries (import MNIST dataset from torchvision) - -```{code-cell} -:id: mS62eVL9Ifmz - -import numpy as np -import grain.python as pygrain -from torchvision.datasets import MNIST -``` - -+++ {"id": "0h6mwVrspPA-"} - -### Define Dataset Class - -Create a custom dataset class to load MNIST data for Grain. 
- -```{code-cell} -:id: bnrhac5Hh7y1 - -class Dataset: - def __init__(self, data_dir, train=True): - self.data_dir = data_dir - self.train = train - self.load_data() - - def load_data(self): - self.dataset = MNIST(self.data_dir, download=True, train=self.train) - - def __len__(self): - return len(self.dataset) - - def __getitem__(self, index): - img, label = self.dataset[index] - return np.ravel(np.array(img, dtype=np.float32)), label -``` - -+++ {"id": "53mf8bWEsyTr"} - -### Initialize the Dataset - -```{code-cell} -:id: pN3oF7-ostGE - -mnist_dataset = Dataset(data_dir) -``` - -+++ {"id": "GqD-ycgBuwv9"} - -### Get the full train and test dataset - -```{code-cell} -:id: f1VnTuX3u_kL - -# Convert training data to JAX arrays and encode labels as one-hot vectors -train_images = jnp.array([mnist_dataset[i][0] for i in range(len(mnist_dataset))], dtype=jnp.float32) -train_labels = one_hot(np.array([mnist_dataset[i][1] for i in range(len(mnist_dataset))]), n_targets) - -# Load test dataset and process it -mnist_dataset_test = MNIST(data_dir, download=True, train=False) -test_images = jnp.array([np.ravel(np.array(mnist_dataset_test[i][0], dtype=np.float32)) for i in range(len(mnist_dataset_test))], dtype=jnp.float32) -test_labels = one_hot(np.array([mnist_dataset_test[i][1] for i in range(len(mnist_dataset_test))]), n_targets) -``` - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: a2NHlp9klrQL -outputId: 14be58c0-851e-4a44-dfcc-d02f0718dab5 ---- -print("Train:", train_images.shape, train_labels.shape) -print("Test:", test_images.shape, test_labels.shape) -``` - -+++ {"id": "fETnWRo2crhf"} - -### Initialize PyGrain DataLoader - -Set up a PyGrain DataLoader for sequential batch sampling. 
- -```{code-cell} -:id: 9RuFTcsCs2Ac - -sampler = pygrain.SequentialSampler( - num_records=len(mnist_dataset), - shard_options=pygrain.NoSharding()) # Single-device, no sharding - -def pygrain_training_generator(): - """Grain DataLoader generator for training.""" - return pygrain.DataLoader( - data_source=mnist_dataset, - sampler=sampler, - operations=[pygrain.Batch(batch_size)], - ) -``` - -+++ {"id": "GvpJPHAbeuHW"} - -### Training Loop (Grain) - -Run the training loop using the Grain DataLoader. - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: cjxJRtiTadEI -outputId: 3f624366-b683-4d20-9d0a-777d345b0e21 ---- -train_model(num_epochs, params, pygrain_training_generator) -``` - -+++ {"id": "oixvOI816qUn"} - -## Loading Data with Hugging Face - -This section demonstrates loading MNIST data using the Hugging Face `datasets` library. You'll format the dataset for JAX compatibility, prepare flattened images and one-hot-encoded labels, and define a training generator. - -+++ {"id": "o51P6lr86wz-"} - -Install the Hugging Face `datasets` library. - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: 19ipxPhI6oSN -outputId: 684e445f-d23e-4924-9e76-2c2c9359f0be ---- -!pip install datasets -``` - -+++ {"id": "be0h_dZv0593"} - -Import Library - -```{code-cell} -:id: 8v1N59p76zn0 - -from datasets import load_dataset -``` - -+++ {"id": "8Gaj11tO7C86"} - -### Load and Format MNIST Dataset - -Load the MNIST dataset from Hugging Face and format it as `numpy` arrays for quick access or `jax` to get JAX arrays. 
- -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ - height: 301 - referenced_widgets: [32f6132a31aa4c508d3c3c5ef70348bb, d7c2ffa6b143463c91cbf8befca6ca01, - fd964ecd3926419d92927c67f955d5d0, 60feca3fde7c4447ad8393b0542eb999, 3354a0baeca94d18bc6b2a8b8b465b58, - a0d0d052772b46deac7657ad052991a4, fb34783b9cba462e9b690e0979c4b07a, 8d8170c1ed99490589969cd753c40748, - f1ecb6db00a54e088f1e09164222d637, 3cf5dd8d29aa4619b39dc2542df7e42e, 2e5d42ca710441b389895f2d3b611d0a, - 5d8202da24244dc896e9a8cba6a4ed4f, a6d64c953631412b8bd8f0ba53ae4d32, 69240c5cbfbb4e91961f5b49812a26f0, - 865f38532b784a7c971f5d33b87b443e, ceb1c004191947cdaa10af9b9c03c80d, 64c6041037914779b5e8e9cf5a80ad04, - 562fa6a0e7b846a180ac4b423c5511c5, b3b922288f9c4df2a4088279ff6d1531, 75a1a8ffda554318890cf74c345ed9a9, - 3bae06cacf394a5998c2326199da94f5, ff6428a3daa5496c81d5e664aba01f97, 1ba3f86870724f55b94a35cb6b4173af, - b3e163fd8b8a4f289d5a25611cb66d23, abd2daba215e4f7c9ddabde04d6eb382, e22ee019049144d5aba573cdf4dbe4fc, - 6ac765dac67841a69218140785f024c6, 7b057411a54e434fb74804b90daa8d44, 563f71b3c67d47c3ab1100f5dc1b98f3, - d81a657361ab4bba8bcc0cf309d2ff64, 20316312ab88471ba90cbb954be3e964, 698fda742f834473a23fb7e5e4cf239c, - 289b52c5a38146b8b467a5f4678f6271, d07c2f37cf914894b1551a8104e6cb70, 5b55c73d551d483baaa6a1411c2597b1, - 2308f77723f54ac898588f48d1853b65, 54d2589714d04b2e928b816258cb0df4, f84b795348c04c7a950165301a643671, - bc853a4a8d3c4dbda23d183f0a3b4f27, 1012ddc0343842d8b913a7d85df8ab8f, 771a73a8f5084a57afc5654d72e022f0, - 311a43449f074841b6df4130b0871ac9, cd4d29cb01134469b52d6936c35eb943, 013cf89ee6174d29bb3f4fdff7b36049, - 9237d877d84e4b3ab69698ecf56915bb, 337ef4d37e6b4ff6bf6e8bd4ca93383f, b4096d3837b84ccdb8f1186435c87281, - 7259d3b7e11b4736b4d2aa8e9c55e994, 1ad1f8e99a864fc4a2bc532d9a4ff110, b2b50451eabd40978ef46db5e7dd08c4, - 2dad5c5541e243128e23c3dd3e420ac2, a3de458b61e5493081d6bb9cf7e923db, 37760f8a7b164e6f9c1a23d621e9fe6b, - 745a2aedcfab491fb9cffba19958b0c5, 
2f6c670640d048d2af453638cfde3a1e] -id: a22kTvgk6_fJ -outputId: 35fc38b9-a6ab-4b02-ffa4-ab27fac69df4 ---- -mnist_dataset = load_dataset("mnist").with_format("numpy") -``` - -+++ {"id": "IFjTyGxY19b0"} - -### Extract images and labels - -Get image shape and flatten for model input - -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: NHrKatD_7HbH -outputId: deec1739-2fc0-4e71-8567-f2e0c9db198b ---- -train_images = mnist_dataset["train"]["image"] -train_labels = mnist_dataset["train"]["label"] -test_images = mnist_dataset["test"]["image"] -test_labels = mnist_dataset["test"]["label"] - -# Flatten images and one-hot encode labels -image_shape = train_images.shape[1:] -num_features = image_shape[0] * image_shape[1] - -train_images = train_images.reshape(-1, num_features) -test_images = test_images.reshape(-1, num_features) - -train_labels = one_hot(train_labels, n_targets) -test_labels = one_hot(test_labels, n_targets) - -print('Train:', train_images.shape, train_labels.shape) -print('Test:', test_images.shape, test_labels.shape) -``` - -+++ {"id": "kk_4zJlz7T1E"} - -### Define Training Generator - -Set up a generator to yield batches of images and labels for training. - -```{code-cell} -:id: -zLJhogj7RL- - -def hf_training_generator(): - """Yield batches for training.""" - for batch in mnist_dataset["train"].iter(batch_size): - x, y = batch["image"], batch["label"] - yield x, y -``` - -+++ {"id": "HIsGfkLI7dvZ"} - -### Training Loop (Hugging Face Datasets) - -Run the training loop using the Hugging Face training generator. 
- -```{code-cell} ---- -colab: - base_uri: https://localhost:8080/ -id: RhloYGsw6nPf -outputId: d49c1cd2-a546-46a6-84fb-d9507c38f4ca ---- -train_model(num_epochs, params, hf_training_generator) -``` - -+++ {"id": "qXylIOwidWI3"} - -## Summary - -This notebook has introduced efficient strategies for data loading on a CPU with JAX, demonstrating how to integrate popular libraries like PyTorch DataLoader, TensorFlow Datasets, Grain, and Hugging Face Datasets. Each library offers distinct advantages, enabling you to streamline the data loading process for machine learning tasks. By understanding the strengths of these methods, you can select the approach that best suits your project's specific requirements. diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 6eb0fab..2fa663a 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -26,8 +26,6 @@ JAX_transformer_text_classification data_loaders_on_cpu_with_jax data_loaders_on_gpu_with_jax data_loaders_for_multi_device_setups_with_jax -data_loaders_on_gpu_with_jax -data_loaders_for_multi_device_setups_with_jax ``` Once you've gone through this content, you can refer to package-specific From 0de7c8fcad6a19f6ddc61ad863541e1f053c2369 Mon Sep 17 00:00:00 2001 From: selamw1 Date: Wed, 4 Dec 2024 17:41:35 -0800 Subject: [PATCH 14/14] new_line_added_at_the_end_of_tutorials --- docs/source/tutorials.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 2fa663a..071343b 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -32,4 +32,4 @@ Once you've gone through this content, you can refer to package-specific documentation for resources that go into more depth on various topics: - [JAX tutorials](https://jax.readthedocs.io/en/latest/tutorials.html) -- [FLAX user guides](https://flax.readthedocs.io/en/latest/guides/index.html) \ No newline at end of file +- [FLAX user 
guides](https://flax.readthedocs.io/en/latest/guides/index.html)
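
The TFDS, Grain, and Hugging Face cells in the tutorial touched by this patch each hand `train_model` a zero-argument function (`training_generator`, `pygrain_training_generator`, `hf_training_generator`) rather than an already-built iterator. The definition of `train_model` is not shown in this part of the patch, so the following is only a minimal sketch of that calling convention, assuming the function is invoked once per epoch to obtain a fresh pass over the data. The names `make_batch_generator` and `count_batches_per_epoch` are hypothetical and exist only for illustration; plain Python lists stand in for the image and label arrays.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Batch = Tuple[List[float], List[int]]


def make_batch_generator(
    images: Sequence[float],
    labels: Sequence[int],
    batch_size: int,
) -> Callable[[], Iterator[Batch]]:
    """Return a zero-argument callable that yields (images, labels) batches.

    Hypothetical helper: it mirrors the shape of the tutorial's generator
    functions, not their actual implementation.
    """

    def generator() -> Iterator[Batch]:
        # Walk the data in fixed-size steps; the last batch may be smaller.
        for start in range(0, len(images), batch_size):
            stop = start + batch_size
            yield list(images[start:stop]), list(labels[start:stop])

    return generator


def count_batches_per_epoch(
    num_epochs: int,
    data_generator: Callable[[], Iterator[Batch]],
) -> List[int]:
    """Stand-in for a training loop: call the generator once per epoch."""
    counts = []
    for _ in range(num_epochs):
        # Calling data_generator() again restarts iteration from the beginning.
        counts.append(sum(1 for _ in data_generator()))
    return counts
```

Passing the function instead of a single iterator matters because a Python generator is exhausted after one pass; calling the function again at the start of each epoch is what allows multi-epoch training over the same data.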