|
 # For instance, in autoregressive models, we cannot interpolate between two images because of the lack of a latent representation.
 # We will explore and discuss these benefits and drawbacks alongside our implementation.
 #
-# Our implementation will focus on the [PixelCNN](https://arxiv.org/pdf/1606.05328.pdf) [2] model which has been discussed in detail in the lecture.
+# Our implementation will focus on the [PixelCNN](https://arxiv.org/abs/1606.05328) [2] model, which has been discussed in detail in the lecture.
 # Most current SOTA models use PixelCNN as their fundamental architecture,
 # and various additions have been proposed to improve the performance
-# (e.g. [PixelCNN++](https://arxiv.org/pdf/1701.05517.pdf) and [PixelSNAIL](http://proceedings.mlr.press/v80/chen18h/chen18h.pdf)).
+# (e.g. [PixelCNN++](https://arxiv.org/abs/1701.05517) and [PixelSNAIL](http://proceedings.mlr.press/v80/chen18h/chen18h.pdf)).
 # Hence, implementing PixelCNN is a good starting point for our short tutorial.
 #
 # First of all, we need to import our standard libraries. As in
|
@@ -173,7 +173,7 @@ def show_imgs(imgs):
|
 # If we now want to apply this to our convolutions, we need to ensure that the prediction of pixel 1
 # is not influenced by its own "true" input, nor by any pixels to its right or in lower rows.
 # In convolutions, this means that we want to set those entries of the weight matrix to zero that take pixels on the right and below into account.
-# As an example for a 5x5 kernel, see a mask below (figure credit - [Aaron van den Oord](https://arxiv.org/pdf/1606.05328.pdf)):
+# As an example for a 5x5 kernel, see a mask below (figure credit - [Aaron van den Oord](https://arxiv.org/abs/1606.05328)):
 #
 # <center width="100%" style="padding: 10px"><img src="masked_convolution.svg" width="150px"></center>
 #
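The notebook's actual masked convolution module appears further below; as a condensed sketch of the idea (class and argument names here are illustrative, not taken from the notebook), applying such a mask in PyTorch could look like this:

```python
import torch
import torch.nn as nn

# Build the 5x5 mask from the figure: keep all rows above the center,
# and the center row up to (and including) the center pixel.
kernel_size = 5
mask = torch.zeros(kernel_size, kernel_size)
mask[: kernel_size // 2, :] = 1
mask[kernel_size // 2, : kernel_size // 2 + 1] = 1
# For the very first layer, the center pixel itself must be masked out too:
# mask[kernel_size // 2, kernel_size // 2] = 0


class MaskedConvolution(nn.Module):
    """A 2D convolution whose weights are multiplied element-wise by a fixed mask."""

    def __init__(self, c_in, c_out, mask, **kwargs):
        super().__init__()
        self.register_buffer("mask", mask[None, None])  # broadcast over channel dims
        self.conv = nn.Conv2d(c_in, c_out, mask.shape, padding=mask.shape[0] // 2, **kwargs)

    def forward(self, x):
        self.conv.weight.data *= self.mask  # zero out masked connections before convolving
        return self.conv(x)
```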
|
@@ -216,10 +216,10 @@ def forward(self, x):
|
 #
 # To build our own autoregressive image model, we could simply stack a few masked convolutions on top of each other.
 # This was actually the case for the original PixelCNN model, discussed in the paper
-# [Pixel Recurrent Neural Networks](https://arxiv.org/pdf/1601.06759.pdf), but this leads to a considerable issue.
+# [Pixel Recurrent Neural Networks](https://arxiv.org/abs/1601.06759), but this leads to a considerable issue.
 # When sequentially applying a couple of masked convolutions, the receptive field of a pixel
 # turns out to have a "blind spot" on its upper right side, as shown in the figure below
-# (figure credit - [Aaron van den Oord et al.](https://arxiv.org/pdf/1606.05328.pdf)):
+# (figure credit - [Aaron van den Oord et al.](https://arxiv.org/abs/1606.05328)):
 #
 # <center width="100%" style="padding: 10px"><img src="pixelcnn_blind_spot.svg" width="275px"></center>
 #
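To see the blind spot appear in practice, one can stack a few masked convolutions with all-ones weights and read off the receptive field of a center pixel from the gradient (a small self-contained check, not part of the original notebook):

```python
import torch
import torch.nn as nn

k = 5
mask = torch.zeros(k, k)
mask[: k // 2, :] = 1       # all rows above the center
mask[k // 2, : k // 2] = 1  # pixels left of the center (center itself masked)

conv = nn.Conv2d(1, 1, k, padding=k // 2, bias=False)
with torch.no_grad():
    conv.weight.fill_(1.0)
    conv.weight.mul_(mask)

x = torch.randn(1, 1, 11, 11, requires_grad=True)
out = conv(conv(conv(x)))   # three masked convolutions in a row
out[0, 0, 5, 5].backward()  # gradient w.r.t. the input marks the receptive field
print((x.grad[0, 0] != 0).int())  # zeros in the upper-right region: the blind spot
```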
|
@@ -445,7 +445,7 @@ def show_center_recep_field(img, out):
|
 # For visualizing the receptive field, we assumed a very simplified stack of vertical and horizontal convolutions.
 # Obviously, there are more sophisticated ways of doing it, and PixelCNN uses gated convolutions for this.
 # Specifically, the Gated Convolution block in PixelCNN looks as follows
-# (figure credit - [Aaron van den Oord et al.](https://arxiv.org/pdf/1606.05328.pdf)):
+# (figure credit - [Aaron van den Oord et al.](https://arxiv.org/abs/1606.05328)):
 #
 # <center width="100%"><img src="PixelCNN_GatedConv.svg" width="700px" style="padding: 15px"/></center>
 #
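Abstracting away the vertical and horizontal stack plumbing, the core gating operation in the figure multiplies a tanh "value" branch with a sigmoid "gate" branch. Assuming a convolution that outputs twice the number of feature channels, it can be sketched as:

```python
import torch


def gated_activation(features: torch.Tensor) -> torch.Tensor:
    """Split the channels in half and compute tanh(value) * sigmoid(gate)."""
    val, gate = features.chunk(2, dim=1)
    return torch.tanh(val) * torch.sigmoid(gate)


# Example: a convolution with 2 * 32 output channels feeds the gate.
feats = torch.randn(1, 64, 28, 28)
print(gated_activation(feats).shape)  # torch.Size([1, 32, 28, 28])
```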
|
@@ -506,7 +506,7 @@ def forward(self, v_stack, h_stack):
|
 # The architecture consists of multiple stacked GatedMaskedConv blocks, where we add an additional dilation factor to a few convolutions.
 # This increases the receptive field of the model and allows it to take a larger context into account during generation.
 # As a reminder, dilation in a convolution looks as follows
-# (figure credit - [Vincent Dumoulin and Francesco Visin](https://arxiv.org/pdf/1603.07285.pdf)):
+# (figure credit - [Vincent Dumoulin and Francesco Visin](https://arxiv.org/abs/1603.07285)):
 #
 # <center width="100%"><img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/dilation.gif" width="250px"></center>
 #
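In PyTorch, dilation is a built-in argument of `nn.Conv2d`; for instance, a 3x3 kernel with dilation 2 covers a 5x5 input region with only 9 weights (a minimal illustration):

```python
import torch
import torch.nn as nn

# With padding equal to the dilation, the spatial resolution is preserved.
dilated_conv = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)
x = torch.randn(1, 16, 28, 28)
print(dilated_conv(x).shape)  # torch.Size([1, 16, 28, 28])
```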
|
@@ -655,7 +655,7 @@ def test_step(self, batch, batch_idx):
|
 # %% [markdown]
 # The visualization shows that for predicting any pixel, we can take almost half of the image into account.
 # However, keep in mind that this is the "theoretical" receptive field and not necessarily
-# the [effective receptive field](https://arxiv.org/pdf/1701.04128.pdf), which is usually much smaller.
+# the [effective receptive field](https://arxiv.org/abs/1701.04128), which is usually much smaller.
 # For a stronger model, we should therefore try to increase the receptive
 # field even further. In particular, for the pixel on the bottom right, the
 # very last pixel, we would be allowed to take into account the whole
|
@@ -869,7 +869,7 @@ def autocomplete_image(img):
|
 # Interestingly, the pixel values 64, 128 and 191 also stand out, which is likely due to the quantization used during the creation of the dataset.
 # For RGB images, we would also see two peaks around 0 and 255,
 # but the values in between would be much more frequent than in MNIST
-# (see Figure 1 in the [PixelCNN++](https://arxiv.org/pdf/1701.05517.pdf) for a visualization on CIFAR10).
+# (see Figure 1 in the [PixelCNN++ paper](https://arxiv.org/abs/1701.05517) for a visualization on CIFAR10).
 #
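As a quick sanity check of these peaks, one could count the raw pixel values over the MNIST training set directly (the dataset root path below is an assumption):

```python
import numpy as np
import torchvision

train_set = torchvision.datasets.MNIST(root="data", train=True, download=True)
values = np.asarray(train_set.data).reshape(-1)  # all pixel values in 0..255
counts = np.bincount(values, minlength=256)
print(counts[[0, 64, 128, 191, 255]])  # the peaks discussed above
```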
|
 # Next, we can visualize the distribution our model predicts (on average):

|
|