Documenting how to implement a DataLoader for image augmentation #141

144 changes: 144 additions & 0 deletions docs/src/index.md
copied. In fact, while `x` and `y` are materialized arrays,
all the rest are data views.


A common task when training convolutional neural networks on images
is to apply random augmentations to the training data. These augmentations are often
operations such as flipping the image or applying a Gaussian blur. This example shows
how to lazily apply such transformations at the time a batch is loaded, using
[`Augmentor.jl`](https://evizero.github.io/Augmentor.jl/stable/).
When training a model, one commonly iterates over mini-batches of data and applies the
augmentations batch-wise. Here we show how `MLUtils.jl` allows us to implement this using a
custom dataset.

First, we import the packages we are using. Besides `MLUtils`, we use `Random` for random
number generation, `Augmentor` for the augmentations, and `ImageCore` to convert numerical arrays
into images. For more details on working with images in the Julia ecosystem, see the
[JuliaImages documentation](https://juliaimages.org/stable/tutorials/quickstart/).

```julia
using MLUtils
using Random
using Augmentor
using ImageCore
```

The first step is to define a custom [type](https://docs.julialang.org/en/v1/manual/types/) that defines our dataset:

```julia
struct my_dset{T}
    data_arr::T  # numerical image data, e.g. a 4-dimensional array
    trf          # augmentation pipeline applied to each observation
end
```

The structure takes a type parameter `T`; for numerical image data this could be `Array{Float32, 4}`.
That is, we specify that the numerical base type is `Float32` and that the array has four dimensions,
corresponding to width, height, channels, and number of observations. The field `trf` stores the
transformation we will apply to the images. No type parameter is provided here, which keeps us
flexible with respect to the transformations we will apply.

The data we operate on is a 4-dimensional numerical array, that represents a large collection of color images:

```julia
num_samples = 100
num_channels = 3
width = height = 28
d = randn(Float32, width, height, num_channels, num_samples)
```

Now we can define a composition of transformations we wish to apply to the data. In this example we
compose a horizontal flip, a vertical flip, or no operation, followed by a Gaussian blur. A complete
list of the augmentations available in `Augmentor.jl` is provided [here](https://evizero.github.io/Augmentor.jl/stable/operations/).

```julia
pl = FlipX() * FlipY() * NoOp() |> GaussianBlur(3:2:5, 1f0:1f-1:2f0)
```
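Before wiring the pipeline into a dataset, we can try it on a single image. This is a minimal sketch with a hypothetical random test image; each call to `augment` samples a new random combination from the pipeline:

```julia
img = colorview(RGB, rand(Float32, 3, 28, 28))  # a random 28×28 RGB test image
img_aug = augment(img, pl)                      # apply one random draw of the pipeline
size(img_aug)                                   # flips and blur preserve the size: (28, 28)
```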

With the data and transformation in place, we can instantiate the dataset

```julia
ds = my_dset(d, pl)
```

To instantiate a `DataLoader` to iterate over this simple dataset, we need to implement custom
`numobs` and `getobs` methods:

```julia
function MLUtils.getobs(dset::my_dset, ix::Int)
obs = dset.data_arr[:, :, :, ix] # Fetch a single observation from the dataset
obs_c = colorview(RGB, permutedims(obs, (3, 1, 2))) # Convert it into an image so that the transformation can be applied to it
obs_trf = augment(obs_c, dset.trf) # Apply the augmentations
permutedims(channelview(obs_trf), (2, 3, 1)) # Convert the augmented observation into numerical data
end

MLUtils.numobs(data::my_dset) = size(data.data_arr)[end]
```

The `numobs` function returns the number of samples in the dataset, which is just the extent of the
last dimension of the data array field of `my_dset`. The `getobs` function takes the dataset and an
integer index as input and returns the augmented array. Internally, we first fetch a single observation
from the dataset. Then we convert it into an image, apply the augmentation, and convert the augmented
observation back into a numerical type.
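We can sanity-check these methods by fetching a single observation directly (the index `1` here is arbitrary); assuming the definitions above, the result has the same width × height × channels layout as one slice of the data array:

```julia
obs = getobs(ds, 1)  # fetch and augment the first observation
size(obs)            # (28, 28, 3): width, height, channels
```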

With these methods implemented, we can now construct a `DataLoader` and iterate over the dataset.
The augmentations will be applied lazily at the time an observation is accessed.

```julia
loader = DataLoader(ds, batchsize=-1)

for (ix, obs) ∈ enumerate(loader)
@show ix, size(obs)
end
```

Now we focus on batching. In practice we want to train on batches of multiple images.
`MLUtils.jl` provides `BatchView`, which allows fetching a batch of images at a time.
To make `BatchView` work on our dataset, it needs to implement the data container
interface as described in `ObsView`. In particular, we need to implement
`getobs` and `getobs!` methods that fetch multiple observations.

The difference between `getobs!` and `getobs` is that `getobs!` returns multiple
observations in a pre-allocated buffer. We can therefore implement `getobs!` first
and let `getobs` allocate a buffer and just call `getobs!`.

```julia
function MLUtils.getobs!(buffer, dset::my_dset, ix::AbstractVector)
batch = dset.data_arr[:, :, :, ix] # Load selected observations
batch_img = colorview(RGB, permutedims(batch, (3, 1, 2, 4))) # Convert to image
augmentbatch!(CPUThreads(), buffer, batch_img, dset.trf) # Augment entire batch
permutedims(channelview(buffer), (2, 3, 1, 4)) # Convert augmented batch to numerical type
end

function MLUtils.getobs(dset::my_dset, ix::AbstractVector)
    # Size of a batch in image layout: channels, width, height, and the number
    # of observations, which is given by the length of the index vector
    batch_dim = [size(dset.data_arr)[[3, 1, 2]]..., length(ix)]
    buffer = colorview(RGB, zeros(eltype(dset.data_arr), batch_dim...))
    MLUtils.getobs!(buffer, dset, ix)
end
```

`getobs!` takes as input the pre-allocated buffer, the dataset, and a vector of
indices that specifies the observations to fetch. The function copies the
specified observations into a batch array and converts it into an image type.
Then the entire batch is augmented and the result is stored in the pre-allocated
buffer. The first argument to `augmentbatch!`, `CPUThreads()`, allows individual
augmentations to be performed in parallel.
Finally, the result is converted back into a numerical array, which is returned by the function.


The `getobs` method is essentially a wrapper around `getobs!` that also allocates
the buffer. The number of observations requested in each iteration of the `DataLoader`
can vary when the total number of observations is not divisible by the batch size.
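For example, with the 100 observations above and a batch size of 27, the loader yields three full batches and a final batch of 19. We can also request a batch directly through `getobs`, which mirrors what the `DataLoader` does internally:

```julia
batch = getobs(ds, 1:27)  # fetch and augment the first 27 observations
size(batch)               # (28, 28, 3, 27): width, height, channels, observations
```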

With these methods implemented, we can now lazily apply random augmentations to each
batch of the dataset:

```julia
loader_batch = DataLoader(ds, batchsize=27, shuffle=true)
for (ix, bobs) ∈ enumerate(loader_batch)
@show ix, size(bobs)
end
```



## Related Packages

`MLUtils.jl` brings together functionality previously found in [LearnBase.jl](https://github.com/JuliaML/LearnBase.jl), [MLDataPattern.jl](https://github.com/JuliaML/MLDataPattern.jl) and [MLLabelUtils.jl](https://github.com/JuliaML/MLLabelUtils.jl). These packages are now discontinued.