chapter 2: add new section on Environments
drupol committed Sep 19, 2024
1 parent 583529d commit 514962b
Showing 3 changed files with 132 additions and 0 deletions.
11 changes: 11 additions & 0 deletions resources/sourcecode/python.dockerfile
@@ -0,0 +1,11 @@
FROM buildpack-deps:bookworm
# ...
RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
        libbluetooth-dev \
        tk-dev \
        uuid-dev \
    ; \
    rm -rf /var/lib/apt/lists/*
# ...
107 changes: 107 additions & 0 deletions src/thesis/2-reproducibility.typ
@@ -1450,6 +1450,113 @@ and at any point in the past or future.
environments or machines.
]

=== Environments <ch2-environments>

Environments where a build or computational process occurs can be broadly
categorised into two types: hardware and software environments. While software
environments can be managed to a high degree of consistency, achieving
reproducibility across different hardware, particularly different #gls("CPU")
architectures #eg[`x86`, `ARM`], is essentially impossible. Tasks like
instruction execution, memory management, and floating-point calculations are
handled in distinct ways. Even small variations in these processes can lead to
differences in output. Consequently, even with identical software, builds on
different types of #gls("CPU") architectures will produce different results.
When something is said to be reproducible, it typically means reproducible
within the same #gls("CPU") architecture. Therefore, this section will focus
exclusively on the reproducibility challenges within software environments.
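
The sensitivity of floating-point results to the way operations are carried out
can be illustrated with a minimal sketch. It does not demonstrate a difference
between #gls("CPU") architectures by itself. It only shows that the same
operands, grouped differently, may round to different values, and grouping is
one of the things that compilers and hardware are free to vary. The variable
names are illustrative assumptions.

```python
# Floating-point addition is not associative: regrouping the same
# operands can change the rounded result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(repr(left_to_right))             # 0.6000000000000001
print(repr(right_to_left))             # 0.6
```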

A software environment is composed of the #gls("OS"), along with the set of
tools, libraries, and dependencies required to build or run a specific
application. Any change in these components can influence the outcome of a
software build or execution. For example, a minor update to a library could
alter the behaviour of the software, producing different outcomes across
executions, or, more importantly, affect its security level.
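
One way to make such changes visible is to reduce the description of an
environment to a single fingerprint. The following is a minimal sketch under
stated assumptions: the function `environment_fingerprint`, the component names,
and the version strings are all hypothetical, and a real environment contains
far more state than a list of name and version pairs.

```python
import hashlib
import json


def environment_fingerprint(components: dict[str, str]) -> str:
    """Return a SHA-256 digest over the sorted (name, version) pairs."""
    # Sorting makes the digest independent of the insertion order.
    canonical = json.dumps(sorted(components.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical environment description.
baseline = {"os": "debian-12", "libssl": "3.0.11", "python": "3.12.5"}
# Same components, listed in a different order.
reordered = dict(sorted(baseline.items(), reverse=True))
# Hypothetical minor update of a single library.
updated = dict(baseline, libssl="3.0.13")

same_env = environment_fingerprint(baseline) == environment_fingerprint(reordered)
changed = environment_fingerprint(baseline) != environment_fingerprint(updated)

print(same_env)  # True
print(changed)   # True
```

Comparing fingerprints before a build and before an execution gives a cheap
signal that a component has changed, although it says nothing about whether the
change affects the outcome.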

To enhance reproducibility, it is critical to ensure that the software
environment remains stable and unaltered during both the build and execution
phases. Unfortunately, conventional #glspl("OS"), such as Linux distributions,
Microsoft Windows, and macOS, are #emph[mutable] by default. This mutability is
primarily facilitated through package managers, which enable users to easily
modify their environments by installing or upgrading software packages. As a
result, uncontrolled changes to dependencies may lead to inconsistencies in
software behaviour or have an impact on the security level, undermining
reproducibility.

To mitigate these issues, #emph[immutable] environments have gained popularity.
Tools such as Docker provide mechanisms to encapsulate software and its
dependencies in containers, thus creating environments that remain unchanged
after creation. Once a container image is built, it can be shared and executed
on different systems and is expected to behave identically, provided that the
host environment is the same. This characteristic makes containers highly suitable for
distributing software.

Despite the advantages of immutability, it does not guarantee reproducibility by
default. For instance, container images hosted on platforms like Docker Hub
#cite(<dockerhub>,form:"normal"), including popular language interpreters
#eg[Python, Node, PHP], may not be reproducible due to non-deterministic
steps during the image creation. A specific example can be found in
#ref(<python-dockerfile>), which runs `apt-get update` at line 4 as part of the
image build process. Since `apt-get update` fetches the latest package lists at
build time, a later build may install different package versions. This makes it
practically impossible to reproduce the same image later, compromising Docker's
build-time reproducibility.

#figure(
  sourcefile(
    lang: "dockerfile",
    read("../../resources/sourcecode/python.dockerfile"),
  ),
  caption: [
    An excerpt of the Python Dockerfile
    #cite(<python-dockerfile-repository>,form:"normal") used to build the
    #emph[official] Python images.
  ],
) <python-dockerfile>
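
The effect of refreshing package lists at build time can be sketched with a toy
model. This is not how `apt` resolves packages. The function `resolve`, the two
index snapshots, and every version string are hypothetical and serve only to
show why an unpinned installation depends on the date of the build.

```python
# Toy model of package resolution; this is NOT how apt resolves packages.
# All version strings below are hypothetical.
INDEX_AT_FIRST_BUILD = {"tk-dev": "8.6.13", "uuid-dev": "2.38.1-5"}
INDEX_AT_LATER_BUILD = {"tk-dev": "8.6.13", "uuid-dev": "2.38.1-6"}


def resolve(requested, index, pins=None):
    """Map each requested package to its pinned version, else the index's latest."""
    pins = pins or {}
    return {name: pins.get(name, index[name]) for name in requested}


requested = ["tk-dev", "uuid-dev"]

# Unpinned: the result depends on when the index was fetched.
unpinned_first = resolve(requested, INDEX_AT_FIRST_BUILD)
unpinned_later = resolve(requested, INDEX_AT_LATER_BUILD)

# Pinned: explicit versions make the result independent of the index state.
pins = {"tk-dev": "8.6.13", "uuid-dev": "2.38.1-5"}
pinned_first = resolve(requested, INDEX_AT_FIRST_BUILD, pins)
pinned_later = resolve(requested, INDEX_AT_LATER_BUILD, pins)

print(unpinned_first == unpinned_later)  # False
print(pinned_first == pinned_later)      # True
```

In practice, pinning alone may not be sufficient, since a pinned version may no
longer be available from the package source at a later date.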

Docker images, once built, are immutable. While Docker does not guarantee
build-time reproducibility, it has the potential to ensure run-time
reproducibility, reflecting Docker's philosophy of
#emph["build once, use everywhere"]. This distinction between build-time
reproducibility (@def-reproducibility-build-time) and run-time reproducibility
(@def-reproducibility-run-time) is key. Docker does not ensure that an image
will always be built consistently, often because the base image (as specified
in the `FROM` directive of a `Dockerfile`) is referenced by a mutable tag whose
content may change over time, as seen in @python-dockerfile. Although building
a reproducible image with Docker is
technically possible, it would require additional effort, external tools, and a
more complex setup. Therefore, we assume that build-time reproducibility is not
guaranteed, but the immutability of the environment significantly enhances the
potential for reproducibility at run-time.
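
The distinction can be sketched by comparing the digests of artefacts produced
by repeated builds. This is a minimal illustration under stated assumptions, not
a description of how Docker computes image digests. The functions
`is_reproducible`, `build_with_timestamp`, and `build_deterministic` are
hypothetical, and the build time is passed in explicitly so that the example
itself remains deterministic.

```python
import hashlib
from typing import Callable


def digest(artifact: bytes) -> str:
    """Return the SHA-256 digest of a build artefact."""
    return hashlib.sha256(artifact).hexdigest()


def is_reproducible(build: Callable[[int], bytes], build_times: list[int]) -> bool:
    """Return True if building at each given time yields the same digest."""
    return len({digest(build(now)) for now in build_times}) == 1


def build_with_timestamp(now: int) -> bytes:
    # Hypothetical build that embeds the build time in its output.
    return b"payload built at " + str(now).encode("ascii")


def build_deterministic(now: int) -> bytes:
    # Hypothetical build whose output ignores the build time.
    return b"payload"


print(is_reproducible(build_with_timestamp, [1, 2]))  # False
print(is_reproducible(build_deterministic, [1, 2]))   # True

# Run-time view: an artefact that is built once and then distributed keeps
# the same digest wherever it is used, whichever build produced it.
image = build_with_timestamp(1)
print(digest(image) == digest(bytes(image)))  # True
```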

#info-box[
Docker is a platform for building, shipping, and running applications in
containers, with Docker Hub #cite(<dockerhub>,form:"normal") providing a large
repository of container images, which has significantly contributed to
Docker's popularity. Among these are "official" Docker images
#cite(<dockerofficialimages>,form:"normal"), which are curated and reviewed by
Docker Inc. These images offer standard environments for popular software and
adhere to some quality standards.

However, the term "official" can be misleading. One might assume that these
images are maintained by the original software's developers, but this is not
always the case. For example, the PHP Docker image is not maintained by the
core PHP development team. This means updates or fixes may not be as prompt or
specific as if the software’s developers maintained the image.

While Docker vets these images for quality, responsibility for the contents
rests with the maintainers. Users should be aware that official images are not
immune to security risks or outdated software, and reviewing the documentation
for issues is advisable.

In summary, "official" Docker images are trusted but may not be maintained by
the software’s creators. Developers should use them with care, especially in
production environments, and verify that the images meet their security and
functionality needs.
]

Package managers are a critical aspect of the reproducibility puzzle. Without
proper control over how dependencies are resolved and installed, achieving
consistent and reproducible builds becomes difficult.

=== Sources Of Non-Determinism

In this section we will explore the sources of non-determinism in software
14 changes: 14 additions & 0 deletions src/thesis/literature.bib
@@ -1048,3 +1048,17 @@ @article{8509170
keywords = {Microsoft Windows;History;Software development;Software maintenance;Software engineering},
doi = {10.1109/MAHC.2018.2877913}
}

@misc{python-dockerfile-repository,
title = {Python 3.12 Dockerfile},
author = {docker-library project1},
year = 2024,
url = {https://github.com/docker-library/python/blame/31bbb37b797bd5521d6622c6d54052d6d0ede585/3.12/bookworm/Dockerfile}
}

@misc{dockerofficialimages,
title = {What are official images},
author = {Docker Inc.},
year = 2024,
url = {https://github.com/docker-library/official-images/blob/6b4803e65a2c56f15b91f8a11bd90f0bcb756c1c/README.md#what-are-official-images},
}
