
Commit cf098dc

Python for data science 1e
1 parent 343108c commit cf098dc

26 files changed: +206 −92 lines

Dockerfile

Lines changed: 0 additions & 2 deletions
@@ -30,8 +30,6 @@ RUN mamba env create -f environment.yml
 # Make RUN commands use the new environment:
 SHELL ["conda", "run", "-n", "python4DS", "/bin/bash", "-c"]
 
-RUN pip install --pre -U seaborn
-
 RUN mamba list
 
 # Copy the current directory contents into the container at /app

boolean-data.ipynb

Lines changed: 1 addition & 1 deletion
@@ -715,7 +715,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.12"
+"version": "3.10.12"
 },
 "toc-showtags": true
 },

command-line.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ jupytext:
 extension: .md
 format_name: myst
 kernelspec:
-display_name: Python4DS
+display_name: py4ds2e
 language: python
 name: python3
 ---

data-import.ipynb

Lines changed: 2 additions & 4 deletions
@@ -94,9 +94,7 @@
 "```python\n",
 "import os\n",
 "\n",
-"# get current working directory (cwd)\n",
-"os.getcwd()\n",
-"\n",
+"os.getcwd() # get current working directory (cwd)\n",
 "```\n",
 "\n",
 "Say this comes back with 'python4DS', then your downloaded data should be in 'python4DS/data/students.csv'."
@@ -441,7 +439,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.12"
+"version": "3.10.12"
 },
 "toc-showtags": true
 },

data-tidy.ipynb

Lines changed: 1 addition & 1 deletion
@@ -366,7 +366,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.12"
+"version": "3.10.12"
 },
 "toc-showtags": true
 },

data-visualise.ipynb

Lines changed: 1 addition & 2 deletions
@@ -118,8 +118,7 @@
 "\n",
 "In this context, a variable refers to an attribute of all the penguins, and an observation refers to all the attributes of a single penguin.\n",
 "\n",
-"Type the name of the data frame in the interactive window and Python will print a preview of its contents.\n",
-"Note that it says `shape` on top of this preview: that's the shape of your data (344 rows, 8 columns)."
+"Type the name of the data frame in the interactive window and Python will print a preview of its contents."
 ]
 },
 {

dates-and-times.ipynb

Lines changed: 1 addition & 1 deletion
@@ -1093,7 +1093,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.12"
+"version": "3.10.12"
 },
 "toc-showtags": true
 },

functions.ipynb

Lines changed: 1 addition & 1 deletion
@@ -496,7 +496,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.12"
+"version": "3.10.12"
 },
 "toc-showtags": true
 },

introduction.ipynb

Lines changed: 4 additions & 4 deletions
@@ -120,15 +120,15 @@
 "\n",
 "Another possibility is that your big data problem is actually a large number of small data problems in disguise. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. This would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like [Hadoop](https://hadoop.apache.org/) or [Spark](https://spark.apache.org/)) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like **pyspark** to solve it for the full dataset.\n",
 "\n",
-"### R, Julia, and friends\n",
+"### Julia and R\n",
 "\n",
-"In this book, you won't learn anything about R, Julia, or any other programming language useful for data science. This isn't because we think these tools are bad. They're not! And in practice, most data science teams use a mix of languages. However, you may find it easier to learn one set of tools at a time. In this book you'll see what we think of as the three critical tools for data science:\n",
+"In this book, you won't learn anything about R or Julia, which are both sometimes used for data science. This isn't because we think these tools are bad. They're not! In this book you'll see what we think of as the three critical tools for data science:\n",
 "\n",
 "- Python\n",
 "- SQL\n",
 "- command line scripting\n",
 "\n",
-"This book predominantly uses Python, which is usually ranked as the first or second most popular programming language in the world and, just as importantly, it’s also one of the easiest to learn. It’s a general purpose language, which means it can perform a wide range of tasks. This combination of features is why people say Python has a low floor and a high ceiling. It’s also very versatile; the joke goes that Python is the 2nd best language at everything, and there’s some truth to that (although Python is 1st best at some tasks, like machine learning). But a language that covers such a lot of ground is also very useful; and Python is widely used across industry, academia, and the public sector, and is often taught in schools too.\n",
+"These are the three languages that will get you a job as a data scientist, and that's a very good reason to focus on them. We'll spend most of our time with Python, and for good reason. Python is usually ranked as the first or second most popular programming language in the world and, just as importantly, it’s also one of the easiest to learn. It’s a general purpose language, which means it can perform a wide range of tasks. This combination of features is why people say Python has a low floor and a high ceiling. It’s also very versatile; the joke goes that Python is the 2nd best language at everything, and there’s some truth to that (although Python is 1st best at some tasks, like machine learning). But a language that covers such a lot of ground is also very useful; and Python is widely used across industry, academia, and the public sector, and is often taught in schools too.\n",
 "\n",
 "We think Python is a great place to start your data science journey because it is the most popular tool for data science and programming more generally, with a large community behind it.\n",
 "\n",
@@ -191,7 +191,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.12"
+"version": "3.10.12"
 },
 "toc-showtags": true
 },

iteration.ipynb

Lines changed: 1 addition & 1 deletion
@@ -688,7 +688,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.12"
+"version": "3.10.12"
 },
 "toc-showtags": true
 },
