diff --git a/resources/sourcecode/python.dockerfile b/resources/sourcecode/python.dockerfile new file mode 100644 index 0000000..df83e9c --- /dev/null +++ b/resources/sourcecode/python.dockerfile @@ -0,0 +1,11 @@ +FROM buildpack-deps:bookworm +# ... +RUN set -eux; \ + apt-get update; \ + apt-get install -y --no-install-recommends \ + libbluetooth-dev \ + tk-dev \ + uuid-dev \ + ; \ + rm -rf /var/lib/apt/lists/* +# ... diff --git a/src/thesis/2-reproducibility.typ b/src/thesis/2-reproducibility.typ index 00381c1..43cc9c3 100644 --- a/src/thesis/2-reproducibility.typ +++ b/src/thesis/2-reproducibility.typ @@ -1472,6 +1472,116 @@ and at any point in the past or future​​​​. environments or machines. ] +=== Computational Environments + +Environments where a build or computational process occurs can be broadly +categorised into two types: hardware and software environments +#cite(,form:"normal", supplement: "p. 8, section 2.1"). While +software environments can be managed to a high degree of consistency, achieving +reproducibility across different hardware, particularly different #gls("CPU") +architectures #eg[`x86`, `ARM`], is essentially impossible. Tasks like +instruction execution, memory management, and floating-point calculations are +handled in distinct ways. Even small variations in these processes can lead to +differences in output. Consequently, even with identical software, builds on +different types of #gls("CPU") architectures will produce different results. +When something is said to be reproducible, it typically means reproducible +within the same #gls("CPU") architecture. Therefore, this section will focus +exclusively on the reproducibility challenges within software environments. + +A software environment is composed of the #gls("OS"), along with the set of +tools, libraries, and dependencies required to build or run a specific +application. Any change in these components can influence the outcome of a +software build or execution. 
For example, a minor update to a library could +potentially alter the behaviour of the software, producing different outcomes +across different executions or, more importantly, affect its security +level. + +To enhance reproducibility, it is critical to ensure that the software +environment remains stable and unaltered during both the build and execution +phases. Unfortunately, conventional #glspl("OS") such as Linux distributions, +Microsoft Windows, and macOS are #emph[mutable] by default. This mutability is +primarily facilitated through package managers, which enable users to easily +modify their environments by installing or upgrading software packages. As a +result, uncontrolled changes to dependencies may lead to inconsistencies in +software behaviour or affect the security level, undermining +reproducibility. + +To mitigate these issues, #emph[immutable] environments have gained popularity. +Tools such as Docker #cite(,form:"normal") provide mechanisms to +encapsulate software and its dependencies in containers, thus creating +environments that remain unchanged after creation. Once a container image is built, it +can be shared and executed across different systems with the guarantee that it +will function identically, given the same environment. This characteristic makes +containers highly suitable for distributing software. + +Despite its advantages, immutability does not guarantee reproducibility. +For instance, container images hosted on platforms like Docker Hub +#cite(,form:"normal"), including popular language interpreters +#eg[Python, Node.js, PHP], may not be reproducible due to non-deterministic +steps during image creation (at build time). A specific example can be found +in #ref(), which runs `apt-get update` at line 4 as part of +the image build process.
Since `apt-get update` downloads the package index that is current at build +time, rebuilding the image later may install different package versions, so the same +image cannot be rebuilt reliably, compromising Docker's build-time reproducibility. + +#figure( + sourcefile( + lang: "dockerfile", + read("../../resources/sourcecode/python.dockerfile"), + ), + caption: [ + An excerpt of the Python Dockerfile + #cite(,form:"normal") used to build the + #emph[official] Python images. + ], +) + +Docker images, once built, are immutable. While Docker does not guarantee +build-time reproducibility, it has the potential to ensure run-time +reproducibility, reflecting Docker's philosophy of +#emph["build once, use everywhere"]. This distinction between build-time +reproducibility (@def-reproducibility-build-time) and run-time reproducibility +(@def-reproducibility-run-time) is key. Docker does not ensure that an image +will always be built consistently, often because the base image tag +(declared in the `FROM` instruction of a `Dockerfile`) can point to different images over time, as seen in +@python-dockerfile. Although building a reproducible image with Docker is +technically possible, it would require additional effort, external tools, and a +more complex setup. Therefore, we assume that build-time reproducibility is not +guaranteed, but the immutability of the environment significantly enhances the +potential for reproducibility at run-time. + +#info-box(kind: "important")[ + Docker is a platform for building, shipping, and running applications in + containers, with Docker Hub #cite(,form:"normal") providing a large + repository of container images, which has significantly contributed to + Docker's popularity. Among these are the #emph[Docker "official" images] + #cite(,form:"normal"), which are curated and reviewed by + the Docker community. These images offer standard environments for popular + software and adhere to some quality standards. + + However, the term "official" can be misleading.
One might assume that these + images are maintained by the original software's developers, but this is not + always the case. For example, the PHP Docker image + #cite(,form:"normal") is not maintained by the core PHP + development team. This means updates or fixes may not be as prompt or + specific as they would be if the software's developers maintained the image. + + While Docker vets these images for quality, responsibility for the contents + rests with the maintainers. Users should be aware that official images are not + immune to security risks or outdated software, and reviewing the documentation + for known issues is advisable. + + In summary, Docker "official" images are trusted but may not be maintained by + the original software's developers. Users should use them with caution and + full awareness, particularly in production environments, and ensure that the + images meet their security and functionality requirements. +] + +Package managers are a critical piece of the reproducibility puzzle since they +manage the state of a computational environment. Without proper control over +how software and its dependencies are resolved and installed, achieving +consistent and reproducible builds becomes difficult.
+ === Sources Of Non-Determinism In this section we will explore the sources of non-determinism in software diff --git a/src/thesis/literature.bib b/src/thesis/literature.bib index 5e72c2a..3632619 100644 --- a/src/thesis/literature.bib +++ b/src/thesis/literature.bib @@ -1067,3 +1067,33 @@ @article{4785860 keywords = {Integrated circuits;Computers;Silicon;Films;Heating;Microwave amplifiers;Data mining}, doi = {10.1109/N-SSC.2006.4785860} } + +@misc{python-dockerfile-repository, + title = {Python 3.12 Dockerfile}, + author = {docker-library project}, + year = 2024, + url = {https://github.com/docker-library/python/blame/31bbb37b797bd5521d6622c6d54052d6d0ede585/3.12/bookworm/Dockerfile} +} + +@misc{dockerofficialimages, + title = {What are official images}, + author = {{Docker, Inc.}}, + year = 2024, + url = {https://github.com/docker-library/official-images/blob/6b4803e65a2c56f15b91f8a11bd90f0bcb756c1c/README.md#what-are-official-images}, +} + +@misc{dockerhubphpimage, + title = {Docker PHP images}, + author = {{Docker, Inc.}}, + year = 2013, + url = {https://hub.docker.com/_/php/} +} + +@article{strangfeld_2024, + author = {Strangfeld, Marvin}, + title = {{Reproducibility of Computational Environments for Software Development}}, + school = {RWTH Aachen University}, + year = 2024, + month = oct, + doi = {10.5281/zenodo.13843189}, +}