From f5e5e471c4211f0a2b4da9d7bad6509c1895e8e1 Mon Sep 17 00:00:00 2001
From: Neil Lawrence
Date: Fri, 24 Jul 2020 12:57:36 +0100
Subject: [PATCH 01/24] Update with venue.

---
 _dsa/ml-systems.md | 540 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 540 insertions(+)
 create mode 100644 _dsa/ml-systems.md

diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md
new file mode 100644
index 0000000..f32ca77
--- /dev/null
+++ b/_dsa/ml-systems.md
@@ -0,0 +1,540 @@
+---
+title: "Introduction to Machine Learning Systems"
+abstract: "This notebook introduces some of the challenges of building machine learning data systems. It will introduce you to concepts around joining of databases together. The storage and manipulation of data is at the core of machine learning systems and data science. The goal of this notebook is to introduce the reader to these concepts, not to authoritatively answer any questions about the state of Nigerian health facilities or Covid19, but it may give you ideas about how to try and do that in your own country."
+layout: talk
+author:
+- given: Eric
+  family: Meissner
+- given: Andrei
+  family: Paleyes
+- given: Neil D.
+  family: Lawrence
+date: 2020-07-24
+ipynb: true
+venue: Virtual DSA
+transition: None
+---
+
+\include{talk-macros.tex}
+
+\slides{\section{AI via ML Systems}
+
+\include{_ai/includes/supply-chain-system.md}
+\include{_ai/includes/aws-soa.md}
+\include{_ai/includes/dsa-systems.md}
+}
+
+
+\notes{\subsection{Question}
+
+In this notebook, we explore the question of health facility distribution in Nigeria, spatially, and in relation to population density.
+
+We answer and visualize the question "How does the number of health facilities per capita vary across Nigeria?"
+
+Rather than focussing purely on using tools like ```pandas``` to manipulate the data, our focus will be on introducing some concepts from databases.
+
+Machine learning can be summarized as
+$$
+\text{model} + \text{data} \xrightarrow{\text{compute}} \text{prediction}
+$$
+and many machine learning courses focus a lot on the model part. But to build a machine learning system in practice, a lot of work has to be put into the data part. This notebook gives some pointers on that work and how to think about your machine learning systems design.}
+
+\notes{
+\subsection{Datasets}
+
+In this notebook, we download 4 datasets:
+* Nigeria NMIS health facility data
+* Population data for Administrative Zone 1 (states) areas in Nigeria
+* Map boundaries for Nigerian states (for plotting and binning)
+* Covid cases across Nigeria (as of May 20, 2020)
+
+But joining these data sets together is just an example. As another example, you could think of [SafeBoda](https://safeboda.com/ng/), a ride-hailing app that's available in Lagos and Kampala. As well as looking at the health examples, try to imagine how SafeBoda may have had to design their systems to be scalable and reliable for storing and sharing data.}
+
+\notes{
+\subsection{Imports, Installs, and Downloads}
+
+First, we're going to install some python libraries for dealing with geospatial data. We're downloading [```geopandas```](https://geopandas.org) which will help us deal with 'shape files' that give the geographical layout of Nigeria. We'll also download [```pygeos```](https://pygeos.readthedocs.io/en/latest/), a library for dealing with points rapidly in python. And finally, to get a small database set up running quickly, we're installing [```csv-to-sqlite```](https://pypi.org/project/csv-to-sqlite/) which allows us to convert CSV data to a simple database.}
+
+\notes{
+\code{%pip install geopandas pygeos}
+}
+
+\notes{
+\subsection{Databases and Joins}
+
+The main idea we will be working with today is called the 'join'. A join does exactly what it sounds like: it combines two database tables.
+
+You have already started to look at data structures, in particular you have been learning about ```pandas``` which is a great way of storing and structuring your data set to make it easier to plot and manipulate your data.
+
+Pandas is great for the data scientist to analyze data because it makes many operations easier. But it is not so good for building the machine learning system. In a machine learning system, you may have to handle a lot of data. Suppose you start by building a system with only a few customers, perhaps an online taxi system (like SafeBoda) for Kampala. Maybe you will have 50 customers. Then maybe your system can be handled with some python scripts and pandas.
+
+\subsection{Scaling ML Systems}
+
+But what if you are successful? What if everyone in Kampala wants to use your system? There are 1.5 million people in Kampala and maybe 100,000 Boda Boda drivers.
+
+What if you are even more successful? What if everyone in Lagos wants to use your system? There are around 20 million people in Lagos ... and maybe as many Okada drivers as people in Kampala!
+
+We want to build safe and reliable machine learning systems. Building them from pandas and python is about as safe and reliable as [taking six children to school on a boda boda](https://www.monitor.co.ug/News/National/Boda-accidents-kill-10-city-UN-report-Kampala/688334-4324032-15oru2dz/index.html).
+
+To build a reliable system, we need to turn to *databases*. In this notebook [we'll be focussing on SQL databases](https://en.wikipedia.org/wiki/Join_(SQL)) and how you bring together different streams of data in a Machine Learning System.
+
+In a machine learning system, you will need to bring different data sets together. In database terminology this is known as a 'join'. You have two different data sets, and you want to join them together. Just like you can join two pieces of metal using a welder, or two pieces of wood with screws.
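
As a small illustration of joining two tables on a shared column, here is a minimal sketch using python's built-in ```sqlite3``` module and an in-memory database. The table names, people and schools are made up for this example and are not part of the notebook's data.

```python
import sqlite3

# Hypothetical toy tables: all names and rows are invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE homes (name TEXT, district TEXT)")
demo_conn.execute("CREATE TABLE schools (name TEXT, school TEXT)")
demo_conn.executemany(
    "INSERT INTO homes VALUES (?, ?)",
    [("Amina", "Ikeja"), ("Kato", "Kawempe"), ("Ngozi", "Surulere")],
)
demo_conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("Amina", "School A"), ("Kato", "School B")],
)

# Join the two tables on the column they share: the person's name.
joined_rows = demo_conn.execute("""
    SELECT h.name, h.district, s.school
    FROM homes h
    INNER JOIN schools s ON h.name = s.name
    ORDER BY h.name
""").fetchall()

for row in joined_rows:
    print(row)
demo_conn.close()
```

Only people who appear in both tables show up in the result, because this sketch uses an inner join.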
+
+But instead of using a welder or screws to join data, we join it using particular columns of the data. We can join data together using people's names. One database may contain where people live, another database may contain where they go to school. If we join these two databases we can have a database which shows where people live and where they go to school.
+
+In the notebook, we will join together some data about where the health centres are in Nigeria and where there have been cases of Covid19. There are other challenges in the ML System Design that are not going to be covered here. They include: how to update the data bases, and how to control access to the data bases from different users (boda boda drivers, riders, administrators etc). }
+
+
+\notes{
+\subsection{Hospital Data}
+
+The first and primary dataset we use is the NMIS health facility dataset, which contains data on the location, type, and staffing of health facilities across Nigeria. }
+
+\notes{
+\setupcode{import urllib.request
+import pandas as pd}
+
+\code{urllib.request.urlretrieve('https://energydata.info/dataset/f85d1796-e7f2-4630-be84-79420174e3bd/resource/6e640a13-cab4-457b-b9e6-0336051bac27/download/healthmopupandbaselinenmisfacility.csv', 'healthmopupandbaselinenmisfacility.csv')
+hospital_data = pd.read_csv('healthmopupandbaselinenmisfacility.csv')}
+}
+
+\notes{It's always a good idea to inspect your data once it's downloaded to check it contains what you expect. In ```pandas``` you can do this with the ```.head()``` method. That allows us to see the first few entries of the ```pandas``` data structure.}
+
+\notes{
+\code{hospital_data.head()}
+}
+
+\notes{We can also check in ```pandas``` what the different columns of the data frame are to see what it contains. }
+
+\notes{
+\code{hospital_data.columns}
+}
+
+\notes{We can immediately see that there are facility names, dates, and some characteristics of each health center such as number of doctors etc.
As well as all that, we have two fields, ```latitude``` and ```longitude``` that likely give us the hospital location. Let's plot them to have a look. }
+
+\notes{
+\setupcode{import matplotlib.pyplot as plt}
+
+\code{plt.plot(hospital_data.longitude, hospital_data.latitude,'ro', alpha=0.01)}
+}
+
+\notes{There we have the location of these different hospitals. We set alpha in the plot to 0.01 to make the dots transparent, so we can see the locations of each health center.}
+
+
+\notes{
+\subsection{Administrative Zone Geo Data}
+
+A very common operation is the need to map from locations in a country to the administrative regions. If we were building a ride sharing app, we might also want to map riders to locations in the city, so that we could know how many riders we had in different city areas.
+
+Administrative regions have various names like cities, counties, districts or states. These conversions for the administrative regions are important for getting the right information to the right people.
+
+Of course, if we had a knowledgeable Nigerian, we could ask her which state each of these health facilities is in. But given that we have the latitude and longitude, we should be able to find out automatically what the different states are.
+
+This is where "geo" data becomes important. We need to download a dataset that stores the location of the different states in Nigeria. These files are known as 'outline' files, because they draw the different states of different countries in outline.
+
+There are special databases for storing this type of information, the database we are using is in the ```gdb``` or GeoDataBase format. It comes in a zip file. Let's download the outline files for the Nigerian states.
They have been made available by the [Humanitarian Data Exchange](https://data.humdata.org/); you can also find state data for other countries on the same site.}
+
+\notes{
+\setupcode{import zipfile}
+
+\code{admin_zones_url = 'https://data.humdata.org/dataset/81ac1d38-f603-4a98-804d-325c658599a3/resource/0bc2f7bb-9ff6-40db-a569-1989b8ffd3bc/download/nga_admbnda_osgof_eha_itos.gdb.zip'
+_, msg = urllib.request.urlretrieve(admin_zones_url, 'nga_admbnda_osgof_eha_itos.gdb.zip')
+with zipfile.ZipFile('/content/nga_admbnda_osgof_eha_itos.gdb.zip', 'r') as zip_ref:
+    zip_ref.extractall('/content/nga_admbnda_osgof_eha_itos.gdb')}
+ }
+\notes{Now we have this data of the outlines of the different states in Nigeria.
+
+The next thing we need to know is how these health facilities map onto different states in Nigeria. Without "binning" facilities somehow, it's difficult to effectively see how they are distributed across the country.
+
+We do this by finding a "geo" dataset that contains the spatial outlay of Nigerian states by latitude/longitude coordinates. The dataset we use is of the "gdb" (GeoDataBase) type and comes as a zip file. We don't need to worry much about this datatype for this notebook, only noting that geopandas knows how to load in the dataset, and that it contains different "layers" for the same map. In this case, each layer is a different degree of granularity of the political boundaries: layer 0 is the whole country, layer 1 is by state, and layer 2 is by local government. We'll go with a state level view for simplicity, but as an exercise you can change it to layer 2 to view the distribution by local government.
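
To give a feel for what "binning" a facility into a state involves, here is a minimal pure-python sketch of a point-in-polygon test using ray casting. It is only an illustration of the idea: the rectangular "states" and the function names are hypothetical, real state outlines are far more complex, and this is not a description of how GeoPandas implements its spatial operations.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside a polygon given as (lon, lat) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point towards the east.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "states", invented for illustration only.
toy_states = {
    "West": [(0, 0), (5, 0), (5, 10), (0, 10)],
    "East": [(5, 0), (10, 0), (10, 10), (5, 10)],
}


def find_state(lon, lat, states):
    """Return the name of the first outline containing the point, or None."""
    for name, outline in states.items():
        if point_in_polygon(lon, lat, outline):
            return name
    return None


print(find_state(2, 3, toy_states))
print(find_state(7, 3, toy_states))
print(find_state(12, 3, toy_states))
```

A facility that falls outside every outline gets ```None```, which is the situation a left join represents with a missing value.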
+
+Once we have these ```MultiPolygon``` objects that define the boundaries of different states, we can perform a spatial join (sjoin) from the coordinates of individual health facilities (which we convert to the appropriate ```Point``` type when moving the health data to a GeoDataFrame).}
+
+\notes{\subsection{Joining a GeoDataFrame}
+
+The first database join we're going to do is a special one, it's a 'spatial join'. We're going to join together the locations of the hospitals with their states.
+
+This join is unusual because it requires some mathematics to get right. The outline files give us the borders of the different states in latitude and longitude, and the health facilities have given locations in the country.
+
+A spatial join involves finding out which state each health facility belongs to. Fortunately, the mathematics you need is already programmed for you in GeoPandas. That means all we need to do is convert our ```pandas``` dataframe of health facilities into a ```GeoDataFrame``` which allows us to do the spatial join. }
+
+\notes{
+\setupcode{import geopandas as gpd}
+\code{hosp_gdf = gpd.GeoDataFrame(
+    hospital_data, geometry=gpd.points_from_xy(hospital_data.longitude, hospital_data.latitude))
+hosp_gdf.crs = "EPSG:4326"}
+}
+
+\notes{There are some technical details here: the ```crs``` refers to the coordinate system in use by a particular GeoDataFrame. ```EPSG:4326``` is the standard coordinate system of latitude/longitude.}
+
+\notes{\subsection{Your First Join: Converting GPS Coordinates to States}
+
+Now we have the data in the ```GeoPandas``` format, we can start converting into states. We will use the [```fiona```](https://pypi.org/project/Fiona/) library for reading the right layers from the files.
Before we do the join, let's plot the location of health centers and states on the same map.}
+
+\notes{
+\setupcode{import fiona}
+
+\code{states_file = "/content/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/"
+
+# geopandas included map, filtered to just Nigeria
+world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
+world.crs = "EPSG:4326"
+nigeria = world[(world['name'] == 'Nigeria')]
+base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11))
+
+layers = fiona.listlayers(states_file)
+zones_gdf = gpd.read_file(states_file, layer=1)
+zones_gdf.crs = "EPSG:4326"
+zones_gdf = zones_gdf.set_index('admin1Name_en')
+zones_gdf.plot(ax=base, color='white', edgecolor='black')
+
+# We can now plot our ``GeoDataFrame``.
+hosp_gdf.plot(ax=base, color='b', alpha=0.02, )
+
+plt.show()}
+}
+
+\notes{\subsection{Performing the Spatial Join}
+
+We've now plotted the different health center locations across the states. You can clearly see that each of the dots falls within one of the states. To help the visualisation, we've made the dots somewhat transparent (we set the ```alpha``` in the plot). This means that we can see the regions where there are more health centers; you should be able to spot where the major cities in Nigeria are given the increased number of health centers in those regions.
+
+Of course, we can now see by eye which of the states each of the health centers belongs to. But we want the computer to do our join for us. ```GeoPandas``` provides us with the spatial join. Here we're going to do a [```left outer``` join](https://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join). }
+
+\notes{
+\setupcode{from geopandas.tools import sjoin}
+}
+
+\notes{We have two GeoPandas data frames, ```hosp_gdf``` and ```zones_gdf```.
Let's have a look at the columns they contain.}
+
+\notes{
+\code{hosp_gdf.columns}
+}
+
+\notes{We can see that this is the GeoDataFrame containing the information about the hospital. Now let's have a look at the ```zones_gdf``` data frame.}
+
+\notes{
+\code{zones_gdf.columns}
+}
+
+\notes{You can see that this data frame has a different set of columns. It has all the different administrative regions. But there is one column name that overlaps. We can find it by looking for the intersection between the two sets.}
+
+\notes{
+\code{set(hosp_gdf.columns).intersection(set(zones_gdf.columns))}
+}
+
+\notes{Here we've converted the lists of columns into python 'sets', and then looked for the intersection. The *join* will occur on the intersection between these columns. It will try and match the geometry of the hospitals (their location) to the geometry of the states (their outlines). This match is done in one line in GeoPandas.
+
+We're having to use GeoPandas because this join is a special one based on geographical locations. If the join was on customer name or some other discrete variable, we could do the join in pandas or directly in SQL. }
+
+\notes{
+\code{hosp_state_joined = sjoin(hosp_gdf, zones_gdf, how='left')}
+}
+
+\notes{The intersection of the two data frames indicates how the two data frames will be joined (if there's no intersection, they can't be joined). It's like indicating the two holes that would need to be bolted together on two pieces of metal. If the holes don't match, the join can't be done. There has to be an intersection.
+
+But what will the result look like? Well the join should be the 'union' of the two data frames. We can have a look at what the union should be by (again) converting the columns to sets.}
+
+\notes{
+\code{set(hosp_gdf.columns).union(set(zones_gdf.columns))}
+}
+
+\notes{That gives a list of all the columns (notice that 'geometry' only appears once).
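
The set operations above can be tried on their own with plain python lists. The column names below are a shortened, hypothetical stand-in for the real data frames, chosen only to show the behaviour.

```python
# Hypothetical, shortened column lists standing in for the real data frames.
hosp_columns = ["facility_name", "num_doctors", "latitude", "longitude", "geometry"]
zone_columns = ["admin1Pcode", "admin0Name_en", "geometry"]

# The intersection is the shared column the join can be made on.
shared = set(hosp_columns).intersection(zone_columns)
# The union is every column we expect in the joined result; shared names appear once.
combined = set(hosp_columns).union(zone_columns)

print(shared)
print(len(combined))
```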
+
+Let's check that's what the join command did, by looking at the columns of our new data frame, ```hosp_state_joined```. Notice also that there's a new column: ```index_right```. The two original data frames had separate indices. The ```index_right``` column represents the index from the ```zones_gdf```, which is the Nigerian state.}
+
+\notes{
+\code{set(hosp_state_joined.columns)}
+}
+
+\notes{Great! They are all there! We have completed our join. We had two separate data frames with information about states and information about hospitals. But by performing a left outer join, we now have a single data frame with all the information in the same place! Let's have a look at the first few entries in the new data frame.}
+
+\notes{
+\code{hosp_state_joined.head()}
+}
+
+\notes{\subsection{SQL Database}
+
+Our first join was a special one, because it involved spatial data. That meant using the special ```gdb``` format and the ```GeoPandas``` tool for manipulating that data. Next we'll save our updated data in a new file.
+
+To do this, we use a command line utility for creating SQLite databases. SQLite is a simple database that's useful for playing with database commands on your local machine. For a real system, you would need to set up a server to run the database. The server is a separate machine with the job of answering database queries. SQLite pretends to be a proper database, but doesn't require us to go to the extra work of setting up a server. Popular SQL server software includes [```MySQL```](https://www.mysql.com/) which is free or [Microsoft's SQL Server](https://www.microsoft.com/en-gb/sql-server/sql-server-2019).
+
+A typical machine learning installation might have you running a database from a cloud service (such as AWS, Azure or Google Cloud Platform). That cloud service would host the database for you and you would pay according to the number of queries made.
+
+Many start-up companies were formed on the back of a ```MySQL``` server hosted on top of AWS. You can [read how to do that here](https://aws.amazon.com/getting-started/hands-on/create-mysql-db/).
+
+If you were designing your own ride hailing app, or any other major commercial software you would want to investigate whether you would need to set up a central SQL server in one of these frameworks.
+
+Today though, we'll just stick to SQLite which gives you a sense of the database without the time and expense of setting it up on the cloud. As well as showing you the SQL commands (which is often what's used in a production ML system) we'll also give the equivalent ```pandas``` commands, which would often be what you would use when you're doing data analysis in ```python``` and ```Jupyter```.}
+
+\notes{\subsection{Create the SQLite Database}
+
+The beautiful thing about SQLite is that it allows us to play with SQL without going to the work of setting up a proper SQL server. Creating a data base in SQLite is as simple as writing a new file. To create the database, we'll first write our joined data to a CSV file, then we'll use a little utility to convert our hospital database into a SQLite database.
+}
+
+\notes{\code{hosp_state_joined.to_csv('facilities.csv')}}
+
+\notes{\code{%pip install csv-to-sqlite}}
+
+\notes{\code{!csv-to-sqlite -f facilities.csv -t full -o db.sqlite}}
+
+\notes{Rather than being installed on a separate server, SQLite simply stores the database locally in a file called ```db.sqlite```.
+
+In the database there can be several 'tables'. Each table can be thought of as like a separate dataframe. The table we've just saved is named 'facilities', after the CSV file it was created from.
+}
+
+\notes{\subsection{Accessing the SQL Database}
+
+Now that we have a SQL database, we can create a connection to it and query it using SQL commands. Let's try to simply select the data we wrote to it, to make sure it's the same.
+
+Start by making a connection to the database.
This will often be done via remote connections, but for this example we'll connect locally to the database using the filepath directly.}
+
+\notes{
+\setuphelpercode{import sqlite3}
+
+\helpercode{def create_connection(db_file):
+    """ create a database connection to the SQLite database
+        specified by the db_file
+    :param db_file: database file
+    :return: Connection object or None
+    """
+    conn = None
+    try:
+        conn = sqlite3.connect(db_file)
+    except sqlite3.Error as e:
+        print(e)
+
+    return conn}
+
+\code{conn = create_connection("db.sqlite")}
+}
+
+\notes{Now that we have a connection, we can write a command and pass it to the database.
+
+To access a data base, the first thing that is made is a connection. Then SQL is used to extract the information required. A typical SQL command is ```SELECT```. It allows us to extract rows from a given table. Combined with ```LIMIT```, it operates a bit like the ```.head()``` method in ```pandas```: it will return the first ```N``` rows (by default the ```.head()``` command returns the first 5 rows, but you can set ```n``` to whatever you like). Here we've included a default value of 5 to make it match the ```pandas``` command.
+
+The python library, ```sqlite3```, allows us to access the SQL database directly from python. We do this using an ```execute``` command on the connection.
+
+Typically, it's good software engineering practice to 'wrap' the database command in some python code. This allows the commands to be maintained. Below we wrap the SQL command
+
+```
+SELECT * FROM [table_name] LIMIT :N
+```
+in python code. This SQL command selects the first ```N``` entries from a given table called ```table_name```.
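
One caveat when wrapping SQL in python: SQLite lets you bind *values*, such as the row limit, as parameters, but not *identifiers* such as a table name, so the table name has to be placed into the query text. A minimal sketch of one way to guard that step is to check the name against the database's own list of tables first. The helper names ```table_exists``` and ```safe_select_top``` and the toy table are assumptions made for this illustration, not part of the notebook.

```python
import sqlite3


def table_exists(conn, table):
    """Return True if `table` is the name of a table in this SQLite database."""
    cur = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = :name",
        {"name": table},
    )
    return cur.fetchone() is not None


def safe_select_top(conn, table, n=5):
    """Hypothetical wrapper: validate the table name, bind the row limit."""
    if not table_exists(conn, table):
        raise ValueError(f"unknown table: {table!r}")
    # The table name is now known to exist; the limit is bound as a value.
    cur = conn.execute(f"SELECT * FROM [{table}] LIMIT :limit_num", {"limit_num": n})
    return cur.fetchall()


# Toy in-memory database for demonstration only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE facilities (name TEXT)")
demo_conn.executemany("INSERT INTO facilities VALUES (?)", [("A",), ("B",), ("C",)])

print(safe_select_top(demo_conn, "facilities", 2))
```

Asking for a table that does not exist raises a ```ValueError``` instead of sending an unchecked string to the database.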
+
+We can pass the ```table_name``` and number of rows, ```N```, to the python command.}
+
+\notes{
+\helpercode{def select_top(conn, table, n):
+    """
+    Query n first rows of the table
+    :param conn: the Connection object
+    :param table: The table to query
+    :param n: Number of rows to query
+    """
+    cur = conn.cursor()
+    cur.execute(f"SELECT * FROM [{table}] LIMIT :limitNum", {"limitNum": n})
+
+    rows = cur.fetchall()
+    return rows}
+}
+
+\notes{Let's have a go at calling the command to extract the first few facilities from our health center database. Let's try creating a function that does the same thing the pandas .head() method does so we can inspect our database.}
+
+\notes{
+\setupcode{def head(conn, table, n=5):
+    rows = select_top(conn, table, n)
+    for r in rows:
+        print(r)}
+
+\code{head(conn, 'facilities')}
+}
+
+\notes{Great! We now have the data base in SQLite, and some python functions that operate on the data base by wrapping SQL commands.
+
+We will return to the SQL command style after we download and add the other datasets to the database using a combination of ```pandas``` and the ```csv-to-sqlite``` utility.
+
+Our next task will be to introduce data on COVID19 so that we can join that to our other data sets.}
+
+\notes{\subsection{Covid Data}
+
+Now we have the health data, we're going to combine it with [data about COVID-19 cases in Nigeria over time](https://github.com/dsfsi/covid19africa). This data is kindly provided by Africa open COVID-19 data working group, which Elaine Nsoesie has been working with. The data is taken from Twitter, and only goes up until May 2020.
+
+They provide their data on GitHub. We can access the cases we're interested in from the following URL.
+
+For convenience, we'll load the data into pandas first, but our next step will be to create a new SQLite table containing the data.
Then we'll join that table to our existing tables.}
+
+\notes{
+\code{covid_data_url = 'https://raw.githubusercontent.com/dsfsi/covid19africa/master/data/line_lists/line-list-nigeria.csv'
+covid_data_csv = 'cases.csv'
+urllib.request.urlretrieve(covid_data_url, covid_data_csv)
+covid_data = pd.read_csv(covid_data_csv)
+}}
+
+\notes{As normal, we should inspect our data to check that it contains what we expect. }
+
+\notes{\code{covid_data.head()}}
+
+\notes{And we can get an idea of all the information in the data from looking at the columns.}
+
+\notes{\code{covid_data.columns}}
+
+\notes{Now we convert this CSV file we've downloaded into a new table in the database file. We can do this, again, with the csv-to-sqlite script.}
+
+\notes{\code{!csv-to-sqlite -f cases.csv -t full -o db.sqlite}}
+
+\notes{\subsection{Population Data}
+
+Now we have information about COVID cases, and we have information about how many health centers and how many doctors and nurses there are in each health center. But unless we understand how many people there are in each state, we cannot make decisions about where there may be problems with the disease.
+
+If we were running our ride hailing service, we would also need information about how many people there were in different areas, so we could understand what the *demand* for the boda boda rides might be.
+
+To access the number of people we can get population statistics from the [Humanitarian Data Exchange](https://data.humdata.org/).
+
+We also want to have population data for each state in Nigeria, so that we can see attributes like whether there are zones of high health facility density but low population density.}
+
+\notes{\code{pop_url = 'https://data.humdata.org/dataset/a7c3de5e-ff27-4746-99cd-05f2ad9b1066/resource/d9fc551a-b5e4-4bed-9d0d-b047b6961817/download/nga_pop_adm1_2016.csv'
+_, msg = urllib.request.urlretrieve(pop_url,'nga_pop_adm1_2016.csv')
+pop_data = pd.read_csv('nga_pop_adm1_2016.csv')}}
+
+\notes{\code{pop_data.head()}}
+
+\notes{To do joins with this data, we must first make sure that the columns have the right names. Each name should match the name of the corresponding column in our existing data. So we reset the column names, and the name of the index, as follows.}
+
+\notes{\code{pop_data.columns = ['admin1Name_en', 'admin1Pcode', 'admin0Name_en', 'admin0Pcode', 'population']
+pop_data = pop_data.set_index('admin1Name_en')}}
+
+\notes{When doing this for real world data, you should also make sure that the names used in the rows are the same across the different data bases. For example, has someone decided to use an abbreviation for 'Federal Capital Territory' and set it as 'FCT'? The computer won't understand these are the same states, and if you do a join with such data you can get duplicate entries or missing entries. This sort of thing happens a lot in real world data and takes a lot of time to sort out. Fortunately, in this case, the data is well curated and we don't have these problems.}
+
+\notes{\subsection{Save to database file}
+
+The next step is to add this new CSV file as an additional table in our SQLite database. This is done using the script as before.}
+
+\notes{\code{pop_data.to_csv('pop_data.csv')}}
+
+\notes{\code{!csv-to-sqlite -f pop_data.csv -t full -o db.sqlite}}
+
+\notes{\subsection{Computing per capita hospitals and COVID}
+
+The Minister of Health in Abuja may be interested in which states are most vulnerable to COVID19.
We now have all the information in our SQL data bases to compute what our health center provision is per capita, and what the COVID19 situation is.
+
+To do this, we will use the ```JOIN``` operation from SQL and introduce a new operation called ```GROUP BY```.}
+
+\notes{\subsection{Joining in Pandas}
+
+As before, these operations can be done in pandas or GeoPandas. Before we create the SQL commands, we'll show how you can do that in pandas.
+
+In pandas, the equivalent of a database table is a dataframe. So the JOIN operation takes two dataframes and joins them based on the key. The key is that special shared column between the two tables. The place where the 'holes align' so the two databases can be joined together.
+
+In GeoPandas we used a left outer join. In a left outer join you keep all rows from the left table, even if there is no match on the key (a full outer join keeps all rows from both tables). In an inner join, you only keep the rows if the two tables have a matching key.
+
+This is sometimes where problems can creep in. If in one table Abuja's state is encoded as 'FCT', and in another table it's encoded as 'Federal Capital Territory', they won't match and that data wouldn't appear in the joined table.
+
+In simple terms, a JOIN operation takes two tables (or dataframes) and combines them based on some key, in this case the index of the Pandas data frame which is the state name.}
+
+\notes{\code{pop_joined = zones_gdf.join(pop_data['population'], how='inner')}}
+
+\notes{\subsection{GroupBy in Pandas}
+
+Our COVID19 data is in the form of individual cases. But we are interested in total case counts for each state. There is a special data base operation known as ```GROUP BY``` for collecting information about the individual states. The type of information you might want could be a sum, the maximum value, an average, the minimum value. We can use a GroupBy operation in ```pandas``` and SQL to summarize the counts of covid cases in each state.
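
Both ideas can be tried on a toy database before touching the real data. The following is a minimal sketch using an in-memory SQLite database; the table layout and every number in it are made up for illustration and are not real case or population figures. It counts cases per state with ```GROUP BY```, then shows how a mismatched key ('FCT' against 'Federal Capital Territory') silently drops a state from an inner join while a left join keeps it with a missing value.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE cases (case_id INTEGER, state TEXT)")
demo_conn.execute("CREATE TABLE population (state TEXT, population INTEGER)")
# Made-up rows for illustration; not real case or population figures.
demo_conn.executemany(
    "INSERT INTO cases VALUES (?, ?)",
    [(1, "Lagos"), (2, "Lagos"), (3, "Kano"), (4, "FCT")],
)
demo_conn.executemany(
    "INSERT INTO population VALUES (?, ?)",
    [("Lagos", 100), ("Kano", 80), ("Federal Capital Territory", 30)],
)

# GROUP BY collapses the individual cases into one count per state.
counts = demo_conn.execute(
    "SELECT state, COUNT(*) FROM cases GROUP BY state ORDER BY state"
).fetchall()

join_template = """
    SELECT c.state, c.n, p.population
    FROM (SELECT state, COUNT(*) AS n FROM cases GROUP BY state) c
    {kind} JOIN population p ON c.state = p.state
    ORDER BY c.state
"""
# INNER JOIN keeps only states whose name matches in both tables.
inner_rows = demo_conn.execute(join_template.format(kind="INNER")).fetchall()
# LEFT JOIN keeps every state from the case counts, matched or not.
left_rows = demo_conn.execute(join_template.format(kind="LEFT")).fetchall()

print(counts)
print(inner_rows)
print(left_rows)
demo_conn.close()
```

Comparing the row counts of the inner and left joins is a quick way to spot keys that failed to match.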
+
+A ```GROUP BY``` operation groups rows with the same key (in this case 'province/state') into separate objects that we can operate on further, for example to count the rows in each group, or to sum or take the mean over the values in some column (imagine each case row had the age of the patient, and you were interested in the mean age of patients).}
+
+\notes{\code{covid_cases_by_state = covid_data.groupby(['province/state']).count()['case_id']}}
+
+\notes{The ```.groupby()``` method on the dataframe has now given us a new data series that contains the total number of covid cases in each state. We can examine it to check we have something sensible.}
+
+\notes{\code{covid_cases_by_state}}
+
+\notes{Now we have this new data series, it can be added to the pandas data frame as a new column.}
+
+\notes{\code{pop_joined['covid_cases_by_state'] = covid_cases_by_state}}
+
+\notes{The spatial join we did on the original data frame to obtain hosp_state_joined introduced a new column, index_right, which contains the state of each of the hospitals. Let's have a quick look at it below.}
+
+\notes{\code{hosp_state_joined['index_right']}}
+
+\notes{To count the hospitals in each of the states, we first create a grouped object where we've grouped on these states.}
+
+\notes{\code{grouped = hosp_state_joined.groupby('index_right')}}
+
+\notes{This python operation now goes through each of the groups and counts how many hospitals there are in each state. It stores the result in a dictionary. If you're new to Python, then to understand this code you need to understand what a 'dictionary comprehension' is. In this case the dictionary comprehension is being used to create a python dictionary of states and total hospital counts.
That's then being converted into a ```pandas``` Data Series and added to the ```pop_joined``` dataframe.}
+
+\notes{\code{counted_groups = {k: len(v) for k, v in grouped.groups.items()}
+pop_joined['hosp_state'] = pd.Series(counted_groups)}}
+
+\notes{For convenience, we can now add a new data series to the data frame that contains the per capita information about hospitals. That makes it easy to retrieve later.}
+
+\notes{\code{pop_joined['hosp_per_capita_10k'] = (pop_joined['hosp_state'] * 10000 )/ pop_joined['population']}}
+
+\notes{\subsection{SQL-style}
+
+That's the ```pandas``` approach to doing it. But ```pandas``` itself is inspired by database languages, in particular the SQL used by relational databases. To do these types of joins at scale, e.g. for our ride hailing app, we need to see how to do these joins in a database.
+
+As before, we'll wrap the underlying SQL commands with a convenient python command.
+
+What you see below gives the full SQL command. There is a [```SELECT``` command](https://www.w3schools.com/sql/sql_select.asp), which extracts ```FROM``` a particular table.
It then completes an [```INNER JOIN```](https://www.w3schools.com/sql/sql_join_inner.asp) using particular columns (```province/state``` and ```index_right```).} + +\notes{\helpercode{def join_counts(conn): + """ + Calculate counts of cases and facilities per state, join results + """ + cur = conn.cursor() + cur.execute(""" + SELECT ct.[province/state] as [state], ct.[case_count], ft.[facility_count] + FROM + (SELECT [province/state], COUNT(*) as [case_count] FROM [cases] GROUP BY [province/state]) ct + INNER JOIN + (SELECT [index_right], COUNT(*) as [facility_count] FROM [facilities] GROUP BY [index_right]) ft + ON + ct.[province/state] = ft.[index_right] + """) + + rows = cur.fetchall() + return rows}} + +\notes{Now that we've created our python wrapper, we can connect to the database and run our SQL command on it using the wrapper.} + +\notes{\code{conn = create_connection("db.sqlite")}} + +\notes{\code{state_cases_hosps = join_counts(conn)}} + +\notes{\code{for row in state_cases_hosps: + print("State {} \t\t Covid Cases {} \t\t Health Facilities {}".format(row[0], row[1], row[2]))}} + + +\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) +pop_joined.plot(ax=base, column='population', edgecolor='black', legend=True) +base.set_title("Population of Nigerian States")}} + +\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) +pop_joined.plot(ax=base, column='hosp_per_capita_10k', edgecolor='black', legend=True) +base.set_title("Hospitals Per Capita (10k) of Nigerian States")}} + +\notes{\subsection{Exercise} + +1. Add a new column to the dataframe for covid cases per 10,000 population, in the same way we computed health facilities per 10k capita. + +2. Add a new column for covid cases per health facility. + +Do this in both the SQL and the Pandas styles to get a feel for how they differ.} + +\notes{\code{# pop_joined['cases_per_capita_10k'] = ??? 
+# pop_joined['cases_per_facility'] = ???}} + +\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) +pop_joined.plot(ax=base, column='cases_per_capita_10k', edgecolor='black', legend=True) +base.set_title("Covid Cases Per Capita (10k) of Nigerian States")}} + +\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) +pop_joined.plot(ax=base, column='covid_cases_by_state', edgecolor='black', legend=True) +base.set_title("Covid Cases by State")}} + +\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) +pop_joined.plot(ax=base, column='cases_per_facility', edgecolor='black', legend=True) +base.set_title("Covid Cases per Health Facility")}} + +\thanks + + + +\references From 3766f82f1c2fb143d6ce0ee5c0ccfacf9f565884 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sat, 25 Jul 2020 12:09:26 +0100 Subject: [PATCH 02/24] Fix FCT's city. --- _dsa/ml-systems.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md index f32ca77..87c6f5d 100644 --- a/_dsa/ml-systems.md +++ b/_dsa/ml-systems.md @@ -5,10 +5,15 @@ layout: talk author: - given: Eric family: Meissner + url: https://www.linkedin.com/in/meissnereric/ + twitter: meissner_eric_7 - given: Andrei family: Paleyes + url: https://www.linkedin.com/in/andreipaleyes/ - given: Neil D. family: Lawrence + twitter: lawrennd + url: http://inverseprobability.com date: 2020-07-24 ipynb: true venue: Virtual DSA @@ -427,7 +432,7 @@ In pandas, the equivalent of a database table is a dataframe. So the JOIN operat In GeoPandas we used an outer join. In an outer join you keep all rows from both tables, even if there is no match on the key. In an inner join, you only keep the rows if the two tables have a matching key. -This is sometimes where problems can creep in. 
If in one table Lagos's state is encoded as 'FCT', and in another table it's encoded as 'Federal Capital Territory', they won't match and that data wouldn't appear in the joined table. +This is sometimes where problems can creep in. If in one table Abuja's state is encoded as 'FCT' or 'FCT-Abuja', and in another table it's encoded as 'Federal Capital Territory', they won't match and that data wouldn't appear in the joined table. In simple terms, a JOIN operation takes two tables (or dataframes) and combines them based on some key, in this case the index of the Pandas data frame which is the state name.} From bdde4a4f3e0c0def66ce77e2926dcd3c7fde340d Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sun, 23 Aug 2020 22:19:50 +0100 Subject: [PATCH 03/24] Update local changes. --- _dsa/ml-systems.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) mode change 100644 => 100755 _dsa/ml-systems.md diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md old mode 100644 new mode 100755 From 465b7f713d3c26be76f335a634bd7f7501852fc8 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sun, 6 Sep 2020 13:34:35 +0100 Subject: [PATCH 04/24] Update with typos Vikum spotted. --- _dsa/ml-systems.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md index 87c6f5d..092549a 100755 --- a/_dsa/ml-systems.md +++ b/_dsa/ml-systems.md @@ -47,7 +47,8 @@ and many machine learning courses focus a lot on the model part. 
But to build a \notes{ \subsection{Datasets} -In this notebook , we download 3 datasets: +In this notebook , we download 4 datasets: + * Nigeria NMIS health facility data * Population data for Administrative Zone 1 (states) areas in Nigeria * Map boundaries for Nigerian states (for plotting and binning) @@ -135,7 +136,7 @@ Administrative regions have various names like cities, counties, districts or st Of course, if we had a knowlegdeable Nigerian, we could ask her about what the right location for each of these health facilities is, which state is it in? But given that we have the latitude and longitude, we should be able to find out automatically what the different states are. -This is where "geo" data becomes important. We need to download a dataset that stores the location of the different states in Nigeria. These files are known as 'outline' files. Because the draw the different states of different countries in outline. +This is where "geo" data becomes important. We need to download a dataset that stores the location of the different states in Nigeria. These files are known as 'outline' files. Because they draw the different states of different countries in outline. There are special databases for storing this type of information, the database we are using is in the ```gdb``` or GeoDataBase format. It comes in a zip file. Let's download the outline files for the Nigerian states. They have been made available by the [Humanitarian Data Exchange](https://data.humdata.org/), you can also find other states data from the same site.} From 52f6acaa24fdddbb698a9e62ee8ada32a54afa6d Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sun, 25 Oct 2020 08:44:29 +0000 Subject: [PATCH 05/24] Start creating sub files for ML systems talk. 
--- _dsa/ml-systems.md | 338 +-------------------------------------------- 1 file changed, 6 insertions(+), 332 deletions(-) diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md index 092549a..d23a97c 100755 --- a/_dsa/ml-systems.md +++ b/_dsa/ml-systems.md @@ -29,338 +29,12 @@ transition: None \include{_ai/includes/dsa-systems.md} } - -\notes{\subsection{Question} - -In this notebook, we explore the question of health facility distribution in Nigeria, spatially, and in relation to population density. - -We answer and visualize the question "How does the number of health facilities per capita vary across Nigeria?" - -Rather than focussing purely on using tools like ```pandas``` to manipulate the data, our focus will be on introducing some concepts from databases. - -Machine learning can be summarized as -$$ -\text{model} + \text{data} \xrightarrow{\text{compute}} \text{prediction} -$$ -and many machine learning courses focus a lot on the model part. But to build a machine learning system in practice, a lot of work has to be put into the data part. This notebook gives some pointers on that work and how to think about your machine learning systems design.} - -\notes{ -\subsection{Datasets} - -In this notebook , we download 4 datasets: - -* Nigeria NMIS health facility data -* Population data for Administrative Zone 1 (states) areas in Nigeria -* Map boundaries for Nigerian states (for plotting and binning) -* Covid cases across Nigeria (as of May 20, 2020) - -But joining these data sets together is just an example. As another example, you could think of [SafeBoda](https://safeboda.com/ng/), a ride-hailing app that's available in Lagos and Kampala. As well as looking at the health examples, try to imagine how SafeBoda may have had to design their systems to be scalable and reliable for storing and sharing data.} - -\notes{ -\subsection{Imports, Installs, and Downloads} - -First, we're going to download some particular python libraries for dealing with geospatial data. 
We're dowloading [```geopandas```](https://geopandas.org) which will help us deal with 'shape files' that give the geographical lay out of Nigeria. We'll also download [```pygeos```](https://pygeos.readthedocs.io/en/latest/), a library for dealing with points rapidly in python. And finally, to get a small database set up running quickly, we're installing [```csv-to-sqlite```](https://pypi.org/project/csv-to-sqlite/) which allows us to convert CSV data to a simple database.} - -\notes{ -\code{%pip install geopandas pygeos} -} - -\notes{ -\subsection{Databases and Joins} - -The main idea we will be working with today is called the 'join'. A join does exactly what it sounds like, it combines two database tables. - -You have already started to look at data structures, in particular you have been learning about ```pandas``` which is a great way of storing and structuring your data set to make it easier to plot and manipulate your data. - -Pandas is great for the data scientist to analyze data because it makes many operations easier. But it is not so good for building the machine learning system. In a machine learning system, you may have to handle a lot of data. Even if you start with building a system where you only have a few customers, perhaps you build an online taxi system (like SafeBoda) for Kampala. Maybe you will have 50 customers. Then maybe your system can be handled with some python scripts and pandas. - -\subsection{Scaling ML Systems} - -But what if you are succesful? What if everyone in Kampala wants to use your system? There are 1.5 million people in Kampala and maybe 100,000 Boda Boda drivers. - -What if you are even more succesful? What if everyone in Lagos wants to use your system? There are around 20 million people in Lagos ... and maybe as many Okada drivers as people in Kampala! - -We want to build safe and reliable machine learning systems. 
Building them from pandas and python is about as safe and reliable as [taking six children to school on a boda boda](https://www.monitor.co.ug/News/National/Boda-accidents-kill-10-city-UN-report-Kampala/688334-4324032-15oru2dz/index.html). - -To build a reliable system, we need to turn to *databases*. In this notebook [we'll be focussing on SQL databases](https://en.wikipedia.org/wiki/Join_(SQL)) and how you bring together different streams of data in a Machine Learning System. - -In a machine learning system, you will need to bring different data sets together. In database terminology this is known as a 'join'. You have two different data sets, and you want to join them together. Just like you can join two pieces of metal using a welder, or two pieces of wood with screws. - -But instead of using a welder or screws to join data, we join it using particular columns of the data. We can join data together using people's names. One database may contain where people live, another database may contain where they go to school. If we join these two databases we can have a database which shows where people live and where they got to school. - -In the notebook, we will join together some data about where the health centres are in Nigeria and where the have been cases of Covid19. There are other challenges in the ML System Design that are not going to be covered here. They include: how to update the data bases, and how to control access to the data bases from different users (boda boda drivers, riders, administrators etc). } - - -\notes{ -\subsection{Hospital Data} - -The first and primary dataset we use is the NMIS health facility dataset, which contains data on the location, type, and staffing of health facilities across Nigeria. 
} - -\notes{ -\setupcode{import urllib.request -import pandas as pd} - -\code{urllib.request.urlretrieve('https://energydata.info/dataset/f85d1796-e7f2-4630-be84-79420174e3bd/resource/6e640a13-cab4-457b-b9e6-0336051bac27/download/healthmopupandbaselinenmisfacility.csv', 'healthmopupandbaselinenmisfacility.csv') -hospital_data = pd.read_csv('healthmopupandbaselinenmisfacility.csv')} -} - -\notes{It's always a good idea to inspect your data once it's downloaded to check it contains what you expect. In ```pandas``` you can do this with the ```.head()``` method. That allows us to see the first few entries of the ```pandas``` data structure.} - -\notes{ -\code{hospital_data.head()} -} - -\notes{We can also check in ```pandas``` what the different columns of the data frame are to see what it contains. } - -\notes{ -\code{hospital_data.columns} -} - -\notes{We can immiediately see that there are facility names, dates, and some characteristics of each health center such as number of doctors etc. As well as all that, we have two fields, ```latitude``` and ```longitude``` that likely give us the hospital locaiton. Let's plot them to have a look. } - -\notes{ -\setupcode{import matplotlib.pyplot as plt} - -\code{plt.plot(hospital_data.longitude, hospital_data.latitude,'ro', alpha=0.01)} -} - -\notes{There we have the location of these different hospitals. We set alpha in the plot to 0.01 to make the dots transparent, so we can see the locations of each health center.} - - -\notes{ -\subsection{Administrative Zone Geo Data} - -A very common operation is the need to map from locations in a country to the administrative regions. If we were building a ride sharing app, we might also want to map riders to locations in the city, so that we could know how many riders we had in different city areas. - -Administrative regions have various names like cities, counties, districts or states. 
These conversions for the administrative regions are important for getting the right information to the right people. - -Of course, if we had a knowlegdeable Nigerian, we could ask her about what the right location for each of these health facilities is, which state is it in? But given that we have the latitude and longitude, we should be able to find out automatically what the different states are. - -This is where "geo" data becomes important. We need to download a dataset that stores the location of the different states in Nigeria. These files are known as 'outline' files. Because they draw the different states of different countries in outline. - -There are special databases for storing this type of information, the database we are using is in the ```gdb``` or GeoDataBase format. It comes in a zip file. Let's download the outline files for the Nigerian states. They have been made available by the [Humanitarian Data Exchange](https://data.humdata.org/), you can also find other states data from the same site.} - -\notes{ -\setupcode{import zipfile} - -\code{admin_zones_url = 'https://data.humdata.org/dataset/81ac1d38-f603-4a98-804d-325c658599a3/resource/0bc2f7bb-9ff6-40db-a569-1989b8ffd3bc/download/nga_admbnda_osgof_eha_itos.gdb.zip' -_, msg = urllib.request.urlretrieve(admin_zones_url, 'nga_admbnda_osgof_eha_itos.gdb.zip') -with zipfile.ZipFile('/content/nga_admbnda_osgof_eha_itos.gdb.zip', 'r') as zip_ref: - zip_ref.extractall('/content/nga_admbnda_osgof_eha_itos.gdb')} - } -\notes{Now we have this data of the outlines of the different states in Nigeria. - -The next thing we need to know is how these health facilities map onto different states in Nigeria. Without "binning" facilities somehow, it's difficult to effectively see how they are distributed across the country. - -We do this by finding a "geo" dataset that contains the spatial outlay of Nigerian states by latitude/longitude coordinates. 
The dataset we use is of the "gdb" (GeoDataBase) type and comes as a zip file. We don't need to worry much about this datatype for this notebook, only noting that geopandas knows how to load in the dataset, and that it contains different "layers" for the same map. In this case, each layer is a different degree of granularity of the political boundaries, with layer 0 being the whole country, 1 is by state, or 2 is by local government. We'll go with a state level view for simplicity, but as an excercise you can change it to layer 2 to view the distribution by local government. - -Once we have these ```MultiPolygon``` objects that define the boundaries of different states, we can perform a spatial join (sjoin) from the coordinates of individual health facilities (which we already converted to the appropriate ```Point``` type when moving the health data to a GeoDataFrame.)} - -\notes{\subsection{Joining a GeoDataFrame} - -The first database join we're going to do is a special one, it's a 'spatial join'. We're going to join together the locations of the hospitals with their states. - -This join is unusual because it requires some mathematics to get right. The outline files give us the borders of the different states in latitude and longitude, the health facilities have given locations in the country. - -A spatial join involves finding out which state each health facility belongs to. Fortunately, the mathematics you need is already programmed for you in GeoPandas. That means all we need to do is convert our ```pandas``` dataframe of health facilities into a ```GeoDataFrame``` which allows us to do the spatial join. } - -\notes{ -\setupcode{import geopandas as gpd} -\code{hosp_gdf = gpd.GeoDataFrame( - hospital_data, geometry=gpd.points_from_xy(hospital_data.longitude, hospital_data.latitude)) -hosp_gdf.crs = "EPSG:4326"} -} - -\notes{There are some technial details here: the ```crs``` refers to the coordinate system in use by a particular GeoDataFrame. 
```EPSG:4326``` is the standard coordinate system of latitude/longitude.} - -\notes{\subsection{Your First Join: Converting GPS Coordinates to States} - -Now we have the data in the ```GeoPandas``` format, we can start converting into states. We will use the [```fiona```](https://pypi.org/project/Fiona/) library for reading the right layers from the files. Before we do the join, lets plot the location of health centers and states on the same map.} - -\notes{ -\setupcode{import fiona} - -\code{states_file = "/content/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/" - -# geopandas included map, filtered to just Nigeria -world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) -world.crs = "EPSG:4326" -nigeria = world[(world['name'] == 'Nigeria')] -base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) - -layers = fiona.listlayers(states_file) -zones_gdf = gpd.read_file(states_file, layer=1) -zones_gdf.crs = "EPSG:4326" -zones_gdf = zones_gdf.set_index('admin1Name_en') -zones_gdf.plot(ax=base, color='white', edgecolor='black') - -# We can now plot our ``GeoDataFrame``. -hosp_gdf.plot(ax=base, color='b', alpha=0.02, ) - -plt.show()} -} - -\notes{\subsection{Performing the Spatial Join} - -We've now plotted the different health center locations across the states. You can clearly see that each of the dots falls within a different state. For helping the visualisation, we've made the dots somewhat transparent (we set the ```alpha``` in the plot). This means that we can see the regions where there are more health centers, you should be able to spot where the major cities in Nigeria are given the increased number of health centers in those regions. - -Of course, we can now see by eye, which of the states each of the health centers belongs to. But we want the computer to do our join for us. ```GeoPandas``` provides us with the spatial join. 
Here we're going to do a [```left``` or ```outer``` join](https://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join). } - -\notes{ -\setupcode{from geopandas.tools import sjoin} -} - -\notes{We have two GeoPandas data frames, ```hosp_gdf``` and ```zones_gdf```. Let's have a look at the columns the contain.} - -\notes{ -\code{hosp_gdf.columns} -} - -\notes{We can see that this is the GeoDataFrame containing the information about the hospital. Now let's have a look at the ```zones_gdf``` data frame.} - -\notes{ -\code{zones_gdf.columns} -} - -\notes{You can see that this data frame has a different set of columns. It has all the different administrative regions. But there is one column name that overlaps. We can find it by looking for the intersection between the two sets.} - -\notes{ -\code{set(hosp_gdf.columns).intersection(set(zones_gdf.columns))} -} - -\notes{Here we've converted the lists of columns into python 'sets', and then looked for the intersection. The *join* will occur on the intersection between these columns. It will try and match the geometry of the hospitals (their location) to the geometry of the states (their outlines). This match is done in one line in GeoPandas. - -We're having to use GeoPandas because this join is a special one based on geographical locations, if the join was on customer name or some other discrete variable, we could do the join in pandas or directly in SQL. } - -\notes{ -\code{hosp_state_joined = sjoin(hosp_gdf, zones_gdf, how='left')} -} - -\notes{The intersection of the two data frames indicates how the two data frames will be joined (if there's no intersection, they can't be joined). It's like indicating the two holes that would need to be bolted together on two pieces of metal. If the holes don't match, the join can't be done. There has to be an intersection. - -But what will the result look like? Well the join should be the 'union' of the two data frames. 
We can have a look at what the union should be by (again) converting the columns to sets.} - -\notes{ -\code{set(hosp_gdf.columns).union(set(zones_gdf.columns))} -} - -\notes{That gives a list of all the columns (notice that 'geometry' only appears once). - -Let's check that's what the join command did, by looking at the columns of our new data frame, ```hosp_state_joined```. Notice also that there's a new column: ```index_right```. The two original data bases had separate indices. The ```index_right``` column represents the index from the ```zones_gdf```, which is the Nigerian state.} - -\notes{ -\code{set(hosp_state_joined.columns)} -} - -\notes{Great! They are all there! We have completed our join. We had two separate data frames with information about states and information about hospitals. But by performing an 'outer' or a 'left' join, we now have a single data frame with all the information in the same place! Let's have a look at the first frew entries in the new data frame.} - -\notes{ -\code{hosp_state_joined.head()} -} - -\notes{\subsection{SQL Database} - -Our first join was a special one, because it involved spatial data. That meant using the special ```gdb``` format and the ```GeoPandas``` tool for manipulating that data. But we've now saved our updated data in a new file. - -To do this, we use the command line utility that comes standard for SQLite database creation. SQLite is a simple database that's useful for playing with database commands on your local machine. For a real system, you would need to set up a server to run the database. The server is a separate machine with the job of answering database queries. SQLite pretends to be a proper database, but doesn't require us to go to the extra work of setting up a server. Popular SQL server software includes [```MySQL```](https://www.mysql.com/) which is free or [Microsoft's SQL Server](https://www.microsoft.com/en-gb/sql-server/sql-server-2019). 
- -A typical machine learning installation might have you running a database from a cloud service (such as AWS, Azure or Google Cloud Platform). That cloud service would host the database for you and you would pay according to the number of queries made. - -Many start-up companies were formed on the back of a ```MySQL``` server hosted on top of AWS. You can [read how to do that here](https://aws.amazon.com/getting-started/hands-on/create-mysql-db/). - -If you were designing your own ride hailing app, or any other major commercial software you would want to investigate whether you would need to set up a central SQL server in one of these frameworks. - -Today though, we'll just stick to SQLite which gives you a sense of the database without the time and expense of setting it up on the cloud. As well as showing you the SQL commands (which is often what's used in a production ML system) we'll also give the equivalent ```pandas``` commands, which would often be what you would use when you're doing data analysis in ```python``` and ```Jupyter```.} - -\notes{\subsection{Create the SQLite Database} - -The beautiful thing about SQLite is that it allows us to play with SQL without going to the work of setting up a proper SQL server. Creating a data base in SQLite is as simple as writing a new file. To create the database, we'll first write our joined data to a CSV file, then we'll use a little utility to convert our hospital database into a SQLite database. -} - -\notes{\code{hosp_state_joined.to_csv('facilities.csv')}} - -\notes{\code{%pip install csv-to-sqlite}} - -\notes{\code{!csv-to-sqlite -f facilities.csv -t full -o db.sqlite}} - -\notes{Rather than being installed on a separate server, SQLite simply stores the database locally in a file called ```db.sqlite```. - -In the database there can be several 'tables'. Each table can be thought of as like a separate dataframe. The table name we've just saved is 'hospitals_zones_joined'. 
-} - -\notes{\subsection{Accessing the SQL Database} - -Now that we have a SQL database, we can create a connection to it and query it using SQL commands. Let's try to simply select the data we wrote to it, to make sure its the same. - -Start by making a connection to the database. This will often be done via remote connections, but for this example we'll connect locally to the database using the filepath directly.} - -\notes{ -\setuphelpercode{import sqlite3} - -\helpercode{def create_connection(db_file): - """ create a database connection to the SQLite database - specified by the db_file - :param db_file: database file - :return: Connection object or None - """ - conn = None - try: - conn = sqlite3.connect(db_file) - except Error as e: - print(e) - - return conn} - -\code{conn = create_connection("db.sqlite")} -} - -\notes{Now that we have a connection, we can write a command and pass it to the database. - -To access a data base, the first thing that is made is a connection. Then SQL is used to extract the information required. A typical SQL command is ```SELECT```. It allows us to extract rows from a given table. It operates a bit like the ```.head()``` method in ```pandas```, it will return the first ```N``` rows (by default the ```.head()``` command returns the first 5 rows, but you can set ```n``` to whatever you like. Here we've included a default value of 5 to make it match the ```pandas``` command. - -The python library, ```sqlite3```, allows us to access the SQL database directly from python. We do this using an ```execute``` command on the connection. - -Typically, its good software engineering practice to 'wrap' the database command in some python code. This allows the commands to be maintained. Below we wrap the SQL command - -``` -SELECT * FROM [table_name] LIMIT : N -``` -in python code. This SQL command selects the first ```N``` entries from a given database called ```table_name```. 
- -We can pass the ```table_name``` and number of rows, ```N``` to the python command.} - -\notes{ -\helpercode{def select_top(conn, table, n): - """ - Query n first rows of the table - :param conn: the Connection object - :param table: The table to query - :param n: Number of rows to query - """ - cur = conn.cursor() - cur.execute(f"SELECT * FROM [{table}] LIMIT :limitNum", {"limitNum": n}) - - rows = cur.fetchall() - return rows} -} - -\notes{Let's have a go at calling the command to extract the first three facilities from our health center database. Let's try creating a function that does the same thing the pandas .head() method does so we can inspect our database.} - -\notes{ -\setupcode{def head(conn, table, n=5): - rows = select_top(conn, table, n) - for r in rows: - print(r)} - -\code{head(conn, 'facilities')} -} - -\notes{Great! We now have the data base in SQLite, and some python functions that operate on the data base by wrapping SQL commands. - -We will return to the SQL command style after download and add the other datasets to the database using a combination of ```pandas``` and the ```csv-to-sqlite``` utility. - -Our next task will be to introduce data on COVID19 so that we can join that to our other data sets.} - +\include{_systems/includes/nigeria-health-intro.md} +\include{_systems/includes/nigeria-nmis-installs.md} +\include{_systems/includes/databases-and-joins.md} +\include{_systems/includes/nigeria-nmis-data-systems.md} +\include{_systems/includes/nigeria-nmis-spatial-join.md} +\include{_systems/includes/nigeria-nmis-sqlite.md} \notes{\subsection{Covid Data} Now we have the health data, we're going to combine it with [data about COVID-19 cases in Nigeria over time](https://github.com/dsfsi/covid19africa). This data is kindly provided by Africa open COVID-19 data working group, which Elaine Nsoesie has been working with. The data is taken from Twitter, and only goes up until May 2020. 
From 18d0fdd990722f74ecb65a436e2a71ca3e7d49b6 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sun, 25 Oct 2020 08:58:55 +0000 Subject: [PATCH 06/24] Start creating sub files for ML systems talk. --- _dsa/ml-systems.md | 182 +-------------------------------------------- 1 file changed, 3 insertions(+), 179 deletions(-) diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md index d23a97c..51c86b9 100755 --- a/_dsa/ml-systems.md +++ b/_dsa/ml-systems.md @@ -29,192 +29,16 @@ transition: None \include{_ai/includes/dsa-systems.md} } +\notes{ \include{_systems/includes/nigeria-health-intro.md} \include{_systems/includes/nigeria-nmis-installs.md} \include{_systems/includes/databases-and-joins.md} \include{_systems/includes/nigeria-nmis-data-systems.md} \include{_systems/includes/nigeria-nmis-spatial-join.md} \include{_systems/includes/nigeria-nmis-sqlite.md} -\notes{\subsection{Covid Data} - -Now we have the health data, we're going to combine it with [data about COVID-19 cases in Nigeria over time](https://github.com/dsfsi/covid19africa). This data is kindly provided by Africa open COVID-19 data working group, which Elaine Nsoesie has been working with. The data is taken from Twitter, and only goes up until May 2020. - -They provide their data in github. We can access the cases we're interested in from the following URL. - -For convenience, we'll load the data into pandas first, but our next step will be to create a new SQLite table containing the data. Then we'll join that table to our existing tables.} - -\notes{ -\code{covid_data_url = 'https://raw.githubusercontent.com/dsfsi/covid19africa/master/data/line_lists/line-list-nigeria.csv' -covid_data_csv = 'cases.csv' -urllib.request.urlretrieve(covid_data_url, covid_data_csv) -covid_data = pd.read_csv(covid_data_csv) -}} - -\notes{As normal, we should inspect our data to check that it contains what we expect. 
} - -\notes{\code{covid_data.head()}} - -\notes{And we can get an idea of all the information in the data from looking at the columns.} - -\notes{\code{covid_data.columns}} - -\notes{Now we convert this CSV file we've downloaded into a new table in the database file. We can do this, again, with the csv-to-sqlite script.} - -\notes{\code{!csv-to-sqlite -f cases.csv -t full -o db.sqlite}} - -\notes{\subsection{Population Data} - -Now we have information about COVID cases, and we have information about how many health centers and how many doctors and nurses there are in each health center. But unless we understand how many people there are in each state, then we cannot make decisions about where they may be problems with the disease. - -If we were running our ride hailing service, we would also need information about how many people there were in different areas, so we could understand what the *demand* for the boda boda rides might be. - -To access the number of people we can get population statistics from the [Humanitarian Data Exchange](https://data.humdata.org/). - -We also want to have population data for each state in Nigeria, so that we can see attributes like whether there are zones of high health facility density but low population density.} - -\notes{\code{pop_url = 'https://data.humdata.org/dataset/a7c3de5e-ff27-4746-99cd-05f2ad9b1066/resource/d9fc551a-b5e4-4bed-9d0d-b047b6961817/download/nga_pop_adm1_2016.csv' -_, msg = urllib.request.urlretrieve(pop_url,'nga_pop_adm1_2016.csv') -pop_data = pd.read_csv('nga_pop_adm1_2016.csv')}} - -\notes{\code{pop_data.head()}} - -\notes{To do joins with this data, we must first make sure that the columns have the right names. The name should match the same name of the column in our existing data. 
So we reset the column names, and the name of the index, as follows.} - -\notes{\code{pop_data.columns = ['admin1Name_en', 'admin1Pcode', 'admin0Name_en', 'admin0Pcode', 'population'] -pop_data = pop_data.set_index('admin1Name_en')}} - -\notes{When doing this for real world data, you should also make sure that the names used in the rows are the same across the different data bases. For example, has someone decided to use an abbreviation for 'Federal Capital Territory' and set it as 'FCT'. The computer won't understand these are the same states, and if you do a join with such data you can get duplicate entries or missing entries. This sort of thing happens a lot in real world data and takes a lot of time to sort out. Fortunately, in this case, the data is well curated and we don't have these problems.} - -\notes{\subsection{Save to database file} - -The next step is to add this new CSV file as an additional table in our SQLite database. This is done using the script as before.} - -\notes{\code{pop_data.to_csv('pop_data.csv')}} - -\notes{\code{!csv-to-sqlite -f pop_data.csv -t full -o db.sqlite}} - -\notes{\subsection{Computing per capita hospitals and COVID} - -The Minister of Health in Abuja may be interested in which states are most vulnerable to COVID19. We now have all the information in our SQL data bases to compute what our health center provision is per capita, and what the COVID19 situation is. - -To do this, we will use the ```JOIN``` operation from SQL and introduce a new operation called ```GROUPBY```.} - -\notes{#### Joining in Pandas - -As before, these operations can be done in pandas or GeoPandas. Before we create the SQL commands, we'll show how you can do that in pandas. - -In pandas, the equivalent of a database table is a dataframe. So the JOIN operation takes two dataframes and joins them based on the key. The key is that special shared column between the two tables. The place where the 'holes align' so the two databases can be joined together. 
- -In GeoPandas we used an outer join. In an outer join you keep all rows from both tables, even if there is no match on the key. In an inner join, you only keep the rows if the two tables have a matching key. - -This is sometimes where problems can creep in. If in one table Abuja's state is encoded as 'FCT' or 'FCT-Abuja', and in another table it's encoded as 'Federal Capital Territory', they won't match and that data wouldn't appear in the joined table. - -In simple terms, a JOIN operation takes two tables (or dataframes) and combines them based on some key, in this case the index of the Pandas data frame which is the state name.} - -\notes{\code{pop_joined = zones_gdf.join(pop_data['population'], how='inner')}} - -\notes{\subsection{GroupBy in Pandas} - -Our COVID19 data is in the form of individual cases. But we are interested in total case counts for each state. There is a special data base operation known as ```GROUP BY``` for collecting information about the individual states. The type of information you might want could be a sum, the maximum value, an average, the minimum value. We can use a GroupBy operation in ```pandas``` and SQL to summarize the counts of covid cases in each state. - -A ```GROUPBY``` operation groups rows with the same key (in this case 'province/state') into separate objects, that we can operate on further such as to count the rows in each group, or to sum or take the mean over the values in some column (imagine each case row had the age of the patient, and you were interested in the mean age of patients.)} - -\notes{\code{covid_cases_by_state = covid_data.groupby(['province/state']).count()['case_id']}} - -\notes{The ```.groupby()``` method on the dataframe has now given us a new data series that contains the total number of covid cases in each state. 
We can examine it to check we have something sensible.} - -\notes{\code{covid_cases_by_state}} - -\notes{Now we have this new data series, it can be added to the pandas data frame as a new column.} - -\notes{\code{pop_joined['covid_cases_by_state'] = covid_cases_by_state}} - -\notes{The spatial join we did on the original data frame to obtain hosp_state_joined introduced a new column, index_right which contains the state of each of the hospitals. Let's have a quick look at it below.} - -\notes{\code{hosp_state_joined['index_right']} - -\notes{To count the hospitals in each of the states, we first create a grouped series where we've grouped on these states.} - -\notes{\code{grouped = hosp_state_joined.groupby('index_right')}} - -\notes{This python operation now goes through each of the groups and counts how many hospitals there are in each state. It stores the result in a dictionary. If you're new to Python, then to understand this code you need to understand what a 'dictionary comprehension' is. In this case the dictionary comprehension is being used to create a python dictionary of states and total hospital counts. That's then being converted into a ```pandas``` Data Series and added to the ```pop_joined``` dataframe.} - -\notes{\code{counted_groups = {k: len(v) for k, v in grouped.groups.items()} -pop_joined['hosp_state'] = pd.Series(counted_groups)}} - -\notes{For convenience, we can now add a new data series to the data frame that contains the per capita information about hospitals. that makes it easy to retrieve later.} - -\notes{\code{pop_joined['hosp_per_capita_10k'] = (pop_joined['hosp_state'] * 10000 )/ pop_joined['population']}} - -\notes{\subsection{SQL-style} - -That's the ```pandas``` approach to doing it. But ```pandas``` itself is inspired by database language, in particular relational databases such as SQL. To do these types of joins at scale, e.g. for our ride hailing app, we need to see how to do these joins in a database. 
- -As before, we'll wrap the underlying SQL commands with a convenient python command. - -What you see below gives the full SQL command. There is a [```SELECT``` command](https://www.w3schools.com/sql/sql_select.asp), which extracts ```FROM``` a particular table. It then completes an [```INNER JOIN```](https://www.w3schools.com/sql/sql_join_inner.asp) using particular columns (```provice/state``` and ```index_right```)} - -\helpercode{def join_counts(conn): - """ - Calculate counts of cases and facilities per state, join results - """ - cur = conn.cursor() - cur.execute(""" - SELECT ct.[province/state] as [state], ct.[case_count], ft.[facility_count] - FROM - (SELECT [province/state], COUNT(*) as [case_count] FROM [cases] GROUP BY [province/state]) ct - INNER JOIN - (SELECT [index_right], COUNT(*) as [facility_count] FROM [facilities] GROUP BY [index_right]) ft - ON - ct.[province/state] = ft.[index_right] - """) - - rows = cur.fetchall() - return rows}} - -\notes{Now we've created our python wrapper, we can connect to the data base and run our SQL command on the database using the wrapper.} - -\notes{\code{conn = create_connection("db.sqlite")}} - -\notes{\code{state_cases_hosps = join_counts(conn)}} - -\notes{\code{for row in state_cases_hosps: - print("State {} \t\t Covid Cases {} \t\t Health Facilities {}".format(row[0], row[1], row[2]))}} - - -\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) -pop_joined.plot(ax=base, column='population', edgecolor='black', legend=True) -base.set_title("Population of Nigerian States")}} - -\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) -pop_joined.plot(ax=base, column='hosp_per_capita_10k', edgecolor='black', legend=True) -base.set_title("Hospitals Per Capita (10k) of Nigerian States")}} - -\notes{\subsection{Exercise} - -1. 
Add a new column the dataframe for covid cases per 10,000 population, in the same way we computed health facilities per 10k capita. - -2. Add a new column for covid cases per health facility. - -Do this in both the SQL and the Pandas styles to get a feel for how they differ.} - -\notes{\code{# pop_joined['cases_per_capita_10k'] = ??? -# pop_joined['cases_per_facility'] = ???} - -\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) -pop_joined.plot(ax=base, column='cases_per_capita_10k', edgecolor='black', legend=True) -base.set_title("Covid Cases Per Capita (10k) of Nigerian States")}} - -\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) -pop_joined.plot(ax=base, column='covid_cases_by_state', edgecolor='black', legend=True) -base.set_title("Covid Cases by State")}} - -\notes{\code{base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11)) -pop_joined.plot(ax=base, column='cases_per_facility', edgecolor='black', legend=True) -base.set_title("Covid Cases per Health Facility")}} +\include{_systems/includes/nigeria-nmis-covid-join.md} +} \thanks - - \references From f34593cd50b2c269eecac96328842614d125b366 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sun, 1 Nov 2020 13:19:58 +0000 Subject: [PATCH 07/24] Update local file. 
--- _dsa/bayesian-methods-abuja.md | 48 +++++++++++++ _dsa/gaussian-processes.md | 28 ++++++++ _dsa/ml-systems.md | 2 +- _dsa/probabilistic-machine-learning.md | 52 ++++++++++++++ _dsa/what-is-machine-learning.md | 97 ++++++++++++++++++++++++++ 5 files changed, 226 insertions(+), 1 deletion(-) create mode 100755 _dsa/bayesian-methods-abuja.md create mode 100755 _dsa/gaussian-processes.md create mode 100755 _dsa/probabilistic-machine-learning.md create mode 100755 _dsa/what-is-machine-learning.md diff --git a/_dsa/bayesian-methods-abuja.md b/_dsa/bayesian-methods-abuja.md new file mode 100755 index 0000000..4526536 --- /dev/null +++ b/_dsa/bayesian-methods-abuja.md @@ -0,0 +1,48 @@ +--- +session: 3 +title: "Bayesian Methods" +subtitle: Probabilistic Machine Learning +abstract: > + In this session we review the *probabilistic* approach to machine + learning. We start with a review of probability, and introduce the + concepts of probabilistic modelling. We then apply the approach in + practice to Naive Bayesian classification. + + In this session we review the probabilistic formulation of a + classification model, reviewing initially maximum likelihood and + the naive Bayes model. +author: +- family: Lawrence + given: Neil D. 
+ gscholar: r3SJcvoAAAAJ + institute: Amazon Cambridge and University of Sheffield + twitter: lawrennd + url: http://inverseprobability.com +date: 2018-11-14 +venue: DSA, Abuja +transition: None +--- + +\include{talk-macros.tex} + +\include{_ml/includes/what-is-ml.md} +\include{_ml/includes/nigerian-nmis-data.md} +\include{_ml/includes/probability-intro.md} +\include{_ml/includes/probabilistic-modelling.md} + +\include{_ml/includes/graphical-models.md} +\include{_ml/includes/classification-intro.md} +\include{_ml/includes/classification-examples.md} +\include{_ml/includes/bayesian-reminder.md} +\include{_ml/includes/bernoulli-distribution.md} +\include{_ml/includes/bernoulli-maximum-likelihood.md} +\include{_ml/includes/bayes-rule-reminder.md} +\include{_ml/includes/naive-bayes.md} + +\subsection{Other Reading} + +* Chapter 5 of @Rogers:book11 up to pg 179 (Section 5.1, and 5.2 up to 5.2.2). + +\references + +\thanks diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md new file mode 100755 index 0000000..45b08e6 --- /dev/null +++ b/_dsa/gaussian-processes.md @@ -0,0 +1,28 @@ +--- +session: 4 +title: Gaussian Processes +abstract: > + Classical machine learning and statistical approaches to learning, such as neural networks and linear regression, assume a parametric form for functions. Gaussian process models are an alternative approach that assumes a probabilistic prior over functions. This brings benefits, in that uncertainty of function estimation is sustained throughout inference, and some challenges: algorithms for fitting Gaussian processes tend to be more complex than parametric models. + + In this sessions I will introduce Gaussian processes and explain why sustaining uncertainty is important. 
+date: 2020-11-14 +venue: Virtual Data Science Nigeria +time: "15:00 (West Africa Standard Time)" +transition: None +--- + +\include{talk-macros.tex} +\include{_gp/includes/gp-book.md} +\include{_gp/includes/what-is-a-gp.md} + +\include{_gp/includes/gp-summer-school.md} +\include{_gp/includes/gpy-software.md} +\include{_gp/includes/other-software.md} + + +\thanks + +\references + + + diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md index 51c86b9..bebe7a5 100755 --- a/_dsa/ml-systems.md +++ b/_dsa/ml-systems.md @@ -1,7 +1,7 @@ --- +session: 2 title: "Introduction to Machine Learning Systems" abstract: "This notebook introduces some of the challenges of building machine learning data systems. It will introduce you to concepts around joining of databases together. The storage and manipulation of data is at the core of machine learning systems and data science. The goal of this notebook is to introduce the reader to these concepts, not to authoritatively answer any questions about the state of Nigerian health facilities or Covid19, but it may give you ideas about how to try and do that in your own country." -layout: talk author: - given: Eric family: Meissner diff --git a/_dsa/probabilistic-machine-learning.md b/_dsa/probabilistic-machine-learning.md new file mode 100755 index 0000000..8c80d10 --- /dev/null +++ b/_dsa/probabilistic-machine-learning.md @@ -0,0 +1,52 @@ +--- +session: 3 +title: "Probabilistic Machine Learning" +abstract: > + In this session we review the *probabilistic* approach to machine + learning. We start with a review of probability, and introduce the + concepts of probabilistic modelling. We then apply the approach in + practice to Naive Bayesian classification. + + In this session we review the Bayesian formalism in the context of + linear models, reviewing initially maximum likelihood and + introducing basis functions as a way of driving non-linearity in the + model. +ipynb: True +reveal: True +author: +- family: Lawrence + given: Neil D. 
+ gscholar: r3SJcvoAAAAJ + institute: Amazon Cambridge and University of Sheffield + twitter: lawrennd + url: http://inverseprobability.com +date: 2018-11-16 +venue: DSA, Abuja +transition: None +--- + +%%%%%%%%%%%% LOCAL DATA %%%%%%%%%%%%%%%%%%%% +https://www.kaggle.com/alaowerre/nigeria-nmis-health-facility-data +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\include{talk-macros.tex} + +\include{_ml/includes/what-is-ml.md} +\include{_ml/includes/probability-intro.md} +\include{_ml/includes/probabilistic-modelling.md} + +\include{_ml/includes/graphical-models.md} +\include{_ml/includes/classification-intro.md} +\include{_ml/includes/classification-examples.md} +\include{_ml/includes/bayesian-reminder.md} +\include{_ml/includes/bernoulli-distribution.md} +\include{_ml/includes/bernoulli-maximum-likelihood.md} +\include{_ml/includes/bayes-rule-reminder.md} +\include{_ml/includes/naive-bayes.md} + +### Other Reading + +* Chapter 5 of @Rogers:book11 up to pg 179 (Section 5.1, and 5.2 up to 5.2.2). + +### References diff --git a/_dsa/what-is-machine-learning.md b/_dsa/what-is-machine-learning.md new file mode 100755 index 0000000..9d5eb49 --- /dev/null +++ b/_dsa/what-is-machine-learning.md @@ -0,0 +1,97 @@ +--- +session: 1 +title: What is Machine Learning? +venue: Data Science Africa Summer School, Addis Ababa, Ethiopia +author: +- given: Neil D. + family: Lawrence + url: http://inverseprobability.com + institute: Amazon Cambridge and University of Sheffield + twitter: lawrennd + gscholar: r3SJcvoAAAAJ + orchid: +abstract: > + In this talk we will introduce the fundamental ideas in machine learning. We'll develop our exposition around the ideas of prediction function and the objective function. We don't so much focus on the derivation of particular algorithms, but more the general principles involved to give an idea of the machine learning *landscape*. 
+date: 2019-06-03 +categories: +- notes +geometry: ["a4paper", "margin=2cm"] +papersize: a4paper +transition: None +--- + +\include{../talk-macros.gpp} + +\section{Introduction} + +\include{_data-science/includes/data-science-africa.md} +\include{_health/includes/malaria-gp.md} + +\subsection{Machine Learning} +\notes{This talk is a general introduction to machine learning, we will highlight the technical challenges and the current solutions. We will give an overview of what is machine learning and why it is important.} + +\subsection{Rise of Machine Learning} +\slides{ +* Driven by data and computation +* Fundamentally dependent on models +}\notes{Machine learning is the combination of data and models, through computation, to make predictions.} +$$ +\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction} +$$ + +\subsection{Data Revolution} + +\notes{Machine learning has risen in prominence due to the rise in data availability, and its interconnection with computers. The high bandwidth connection between data and computer leads to a new interaction between us and data via the computer. 
It is that channel that is being mediated by machine learning techniques.} +\figure{\includediagram{\diagramsDir/data-science/new-flow-of-information}{60%}}{Large amounts of data and high interconnection bandwidth mean that we receive much of our information about the world around us through computers.}{data-science-information-flow} + +\include{_supply-chain/includes/supply-chain-africa.md} +\include{_ml/includes/process-emulation.md} + +\newslide{Kapchorwa District} + +\figure{\includediagramclass{\diagramsDir/health/Kapchorwa_District_in_Uganda}{50%}}{The Kapchorwa District, home district of Stephen Kiprotich.}{kapchorwa-district-in-uganda} + +\notes{Stephen Kiprotich, the 2012 gold medal winner from the London Olympics, comes from Kapchorwa district, in eastern Uganda, near the border with Kenya.} + +\include{_ml/includes/olympic-marathon-polynomial.md} +\include{../_ml/includes/what-does-machine-learning-do.md} + +\include{_ml/includes/what-is-ml-2.md} +\include{_ai/includes/ai-vs-data-science-2.md} +\include{_ml/includes/neural-networks.md} + +\subsection{Machine Learning} +\slides{ +1. observe a system in practice +2. emulate its behavior with mathematics. + +* Design challenge: where to put mathematical function. +* Where it's placed leads to different ML domains. +}\notes{The key idea in machine learning is to observe the system in practice, and then emulate its behavior with mathematics. That leads to a design challenge as to where to place the mathematical function. The placement of the mathematical function leads to the different domains of machine learning.} + +\newslide{Types of Machine Learning} + +1. Supervised learning +2. Unsupervised learning +3. Reinforcement learning + +\newslide{Types of Machine Learning} +\slides{ +1. Supervised learning +2. Unsupervised learning +3. 
Reinforcement learning +} +\include{_ml/includes/supervised-learning.md} + +\notes{ +\include{_ml/includes/unsupervised-learning.md} +\include{_ml/includes/reinforcement-learning.md} + +\notes{We have introduced a range of machine learning approaches by focusing on their use of mathematical functions to replace manually coded systems of rules. The important characteristic of machine learning is that the form of these functions, as dictated by their parameters, is determined by acquiring data from the real world.} + + +\include{_ml/includes/deployment.md}} + +\thanks + +\references From 7f9500a1d2a3771ded9828019aec52bf466ab09b Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sun, 1 Nov 2020 18:06:32 +0000 Subject: [PATCH 08/24] Update local file. --- _dsa/bayesian-methods-abuja.md | 5 +++++ _dsa/gaussian-processes.md | 31 +++++++++++++++++++++++++++---- 2 files changed, 32 insertions(+), 4 deletions(-) diff --git a/_dsa/bayesian-methods-abuja.md b/_dsa/bayesian-methods-abuja.md index 4526536..c445542 100755 --- a/_dsa/bayesian-methods-abuja.md +++ b/_dsa/bayesian-methods-abuja.md @@ -18,6 +18,11 @@ author: institute: Amazon Cambridge and University of Sheffield twitter: lawrennd url: http://inverseprobability.com +- family: Koyejo + given: Oluwasanmi + institute: Google and University of Illinois + url: https://sanmi.cs.illinois.edu/ + gscholar: EaaOeJwAAAAJ date: 2018-11-14 venue: DSA, Abuja transition: None diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index 45b08e6..8e13f67 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -5,19 +5,42 @@ abstract: > Classical machine learning and statistical approaches to learning, such as neural networks and linear regression, assume a parametric form for functions. Gaussian process models are an alternative approach that assumes a probabilistic prior over functions. 
This brings benefits, in that uncertainty of function estimation is sustained throughout inference, and some challenges: algorithms for fitting Gaussian processes tend to be more complex than parametric models. In this sessions I will introduce Gaussian processes and explain why sustaining uncertainty is important. -date: 2020-11-14 +date: 2020-11-13 venue: Virtual Data Science Nigeria time: "15:00 (West Africa Standard Time)" transition: None --- \include{talk-macros.tex} +\include{_mlai/includes/mlai-notebook-setup.md} \include{_gp/includes/gp-book.md} -\include{_gp/includes/what-is-a-gp.md} + +\include{_gp/includes/gp-intro-lectures.md} -\include{_gp/includes/gp-summer-school.md} +\include{_ml/includes/basis-functions-intro.md} + +\include{_gp/includes/gp-from-basis-functions.md} + +\include{_gp/includes/non-degenerate-gps.md} +\include{_gp/includes/gp-function-space.md} + +\include{_gp/includes/gptwopointpred.md} + +\include{_gp/includes/gp-covariance-function-importance.md} +\include{_gp/includes/gp-numerics-and-optimization.md} + +\include{_gp/includes/gp-optimize.md} +\include{_kern/includes/eq-covariance.md} +\include{_health/includes/malaria-gp.md} \include{_gp/includes/gpy-software.md} -\include{_gp/includes/other-software.md} +\include{_gp/includes/gpy-tutorial.md} +\include{_gp/includes/nigeria-covid-gp.md} + + +\subsection{Review} + +\include{_gp/includes/gp-summer-school.md} +\include{_gp/includes/other-gp-software.md} \thanks From eaf9406ab7d6b5e401571a7e5351ce8d335ea10b Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Mon, 2 Nov 2020 15:16:55 +0000 Subject: [PATCH 09/24] Add svg of determinant. 
--- _dsa/gaussian-processes.md | 1 + 1 file changed, 1 insertion(+) diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index 8e13f67..d18ffb3 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -42,6 +42,7 @@ transition: None \include{_gp/includes/gp-summer-school.md} \include{_gp/includes/other-gp-software.md} +\reading \thanks From a682c945668e6a819700917416eff980bb0fdce7 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Mon, 2 Nov 2020 16:16:24 +0000 Subject: [PATCH 10/24] Add svg of determinant. --- _dsa/gaussian-processes.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index d18ffb3..4d2008a 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -16,6 +16,9 @@ transition: None \include{_gp/includes/gp-book.md} \include{_gp/includes/gp-intro-lectures.md} +\include{_ml/includes/univariate-gaussian-properties.md} +\include{_ml/includes/two-d-gaussian.md} +\include{_ml/includes/multivariate-gaussian-properties.md} \include{_ml/includes/basis-functions-intro.md} From f0043a80ee082deb4c38d66812bc5453b00ca459 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Mon, 2 Nov 2020 22:44:29 +0000 Subject: [PATCH 11/24] Update with alignment symbols. 
--- _dsa/gaussian-processes.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index 4d2008a..6f4543e 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -21,6 +21,11 @@ transition: None \include{_ml/includes/multivariate-gaussian-properties.md} \include{_ml/includes/basis-functions-intro.md} +\include{_ml/includes/relu-basis.md} + +\include{_ml/includes/linear-model-overview.md} + +\include{_ml/includes/radial-basis.md} \include{_gp/includes/gp-from-basis-functions.md} From 8e049b66f5b0e34da6edd2d5153c235b1a36bb64 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sat, 7 Nov 2020 23:32:08 +0000 Subject: [PATCH 12/24] Split up GPSS lectures into sessions. --- _dsa/gaussian-processes.md | 1 + 1 file changed, 1 insertion(+) diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index 6f4543e..412f19b 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -14,6 +14,7 @@ transition: None \include{talk-macros.tex} \include{_mlai/includes/mlai-notebook-setup.md} \include{_gp/includes/gp-book.md} +\include{_ml/includes/first-course-book.md} \include{_gp/includes/gp-intro-lectures.md} \include{_ml/includes/univariate-gaussian-properties.md} From ae0f5f255a1a9746f07d6b9a293d83ebd64b5730 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Fri, 13 Nov 2020 08:24:24 +0000 Subject: [PATCH 13/24] Update tti-explorer with youtube link. 
--- _dsa/gaussian-processes.md | 30 +++++++++++++++++++++++++++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index 412f19b..4586b43 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -16,15 +16,39 @@ transition: None \include{_gp/includes/gp-book.md} \include{_ml/includes/first-course-book.md} -\include{_gp/includes/gp-intro-lectures.md} \include{_ml/includes/univariate-gaussian-properties.md} \include{_ml/includes/two-d-gaussian.md} \include{_ml/includes/multivariate-gaussian-properties.md} -\include{_ml/includes/basis-functions-intro.md} +\include{_ml/includes/basis-functions-nn.md} \include{_ml/includes/relu-basis.md} -\include{_ml/includes/linear-model-overview.md} +\include{_gp/includes/gp-intro-lectures.md} + +\ifndef{gpIntroLectures} +\define{gpIntroLectures} + +\editme + +\subsection{Gaussian Processes} +\slides{ +* Basis function models give non-linear predictions. +* Need to choose number and location of basis functions. +* Gaussian processes is a general framework (basis functions special case) +* Within the framework you can consider models with infinite basis functions. +} +\notes{Models where we model the entire joint distribution of our training data, $p(\dataVector, \inputMatrix)$, are sometimes described as *generative models*, because we can use sampling to generate data sets that represent all our assumptions. However, as we discussed in the sessions on \refnotes{logistic regression}{logistic-regression} and \refnotes{naive Bayes}{naive-bayes}, this can be a bad idea, because if our assumptions are wrong then we can make poor predictions. We can try to make more complex assumptions about data to alleviate the problem, but then this typically leads to challenges for tractable application of the sum and product rules of probability that are needed to compute the relevant marginal and conditional densities.
If we know the form of the question we wish to answer then we typically try and represent that directly, through $p(\dataVector|\inputMatrix)$. In practice, we also have been making assumptions of conditional independence given the model parameters,} +$$ +p(\dataVector|\inputMatrix, \mappingVector) = +\prod_{i=1}^{\numData} p(\dataScalar_i | \inputVector_i, \mappingVector) +$$ +\notes{Gaussian processes are *not* normally considered to be *generative models*, but we will be much more interested in the principles of conditioning in Gaussian processes because we will use conditioning to make predictions between our test and training data. We will avoid the data conditional independence assumption in favour of a richer assumption about the data: in a Gaussian process we assume data is *jointly Gaussian* with a particular mean and covariance,} +$$ +\dataVector|\inputMatrix \sim \gaussianSamp{\mathbf{m}(\inputMatrix)}{\kernelMatrix(\inputMatrix)}, +$$ +\notes{where the conditioning is on the inputs $\inputMatrix$ which are used for computing the mean and covariance. For this reason they are known as mean and covariance functions.} + +\endif \include{_ml/includes/radial-basis.md} From c31830c998cab68d54725c6f89d32c8cdb4a6486 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sat, 14 Nov 2020 18:09:30 +0000 Subject: [PATCH 14/24] Update with plots of nigerian data.
--- _dsa/gaussian-processes.md | 30 +++++++++++++++++------------- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index 4586b43..c75c6c3 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -13,22 +13,27 @@ transition: None \include{talk-macros.tex} \include{_mlai/includes/mlai-notebook-setup.md} + \include{_gp/includes/gp-book.md} \include{_ml/includes/first-course-book.md} + +\include{_health/includes/malaria-gp.md} +\include{_ml/includes/what-is-ml.md} +\include{_ml/includes/overdetermined-inaugural.md} \include{_ml/includes/univariate-gaussian-properties.md} -\include{_ml/includes/two-d-gaussian.md} -\include{_ml/includes/multivariate-gaussian-properties.md} -\include{_ml/includes/basis-functions-nn.md} -\include{_ml/includes/relu-basis.md} -\include{_gp/includes/gp-intro-lectures.md} +\include{_ml/includes/multivariate-gaussian-properties.md} +\notes{\include{_ml/includes/linear-regression-log-likelihood.md} +\include{_ml/includes/olympic-data-linear-regression.md} +\include{_ml/includes/linear-regression-direct-solution.md}} -\ifndef{gpIntroLectures} -\define{gpIntroLectures} +\include{_ml/includes/underdetermined-system.md} +\include{_ml/includes/two-d-gaussian.md} -\editme +\include{_ml/includes/basis-functions-nn.md} +\include{_ml/includes/relu-basis.md} \subsection{Gaussian Processes} \slides{ @@ -48,7 +53,9 @@ $$ $$ \notes{where the conditioning is on the inputs $\inputMatrix$ which are used for computing the mean and covariance. 
For this reason they are known as mean and covariance functions.} -\endif + + +\include{_ml/includes/linear-model-overview.md} \include{_ml/includes/radial-basis.md} @@ -64,15 +71,12 @@ $$ \include{_gp/includes/gp-optimize.md} \include{_kern/includes/eq-covariance.md} -\include{_health/includes/malaria-gp.md} +\include{_gp/includes/gp-summer-school.md} \include{_gp/includes/gpy-software.md} \include{_gp/includes/gpy-tutorial.md} -\include{_gp/includes/nigeria-covid-gp.md} - \subsection{Review} -\include{_gp/includes/gp-summer-school.md} \include{_gp/includes/other-gp-software.md} \reading From 54527f595bd2fac94b8601dce7108dd79e89673d Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Fri, 22 Jan 2021 09:45:50 +0000 Subject: [PATCH 15/24] Add local config.yml files for each directory --- _dsa/_config.yml | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) create mode 100755 _dsa/_config.yml diff --git a/_dsa/_config.yml b/_dsa/_config.yml new file mode 100755 index 0000000..ca67505 --- /dev/null +++ b/_dsa/_config.yml @@ -0,0 +1,16 @@ +author: +- given: Neil D. + family: Lawrence + institution: University of Cambridge + gscholar: r3SJcvoAAAAJ + twitter: lawrennd + orcid: 0000-0001-9258-1030 + url: http://inverseprobability.com +layout: lecture +venue: Virtual (Zoom) +ipynb: True +postdir: ../../../mlatcl/dsa/_lectures/ +slidedir: ../../../mlatcl/dsa/slides/ +notedir: ../../../mlatcl/dsa/_notes/ +notebookdir: ../../../mlatcl/dsa/_notebooks/ +transition: None From 99254c11de5c10a4f38524522d57d2da119ea84c Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Mon, 25 Jan 2021 12:01:22 +0000 Subject: [PATCH 16/24] Separate out parts of linear regression optimisaton. 
--- _dsa/gaussian-processes.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index c75c6c3..64b5c21 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -27,7 +27,13 @@ transition: None \include{_ml/includes/multivariate-gaussian-properties.md} \notes{\include{_ml/includes/linear-regression-log-likelihood.md} \include{_ml/includes/olympic-data-linear-regression.md} +\include{_ml/includes/linear-regression-multivariate-log-likelihood.md} +\define{designVector}{\basisVector} +\define{designVariable}{Phi} +\define{designMatrix}{\basisMatrix} \include{_ml/includes/linear-regression-direct-solution.md}} +\include{_ml/includes/linear-regression-objective-optimisation.md} +\include{_ml/includes/movie-body-count-linear-regression.md} \include{_ml/includes/underdetermined-system.md} \include{_ml/includes/two-d-gaussian.md} From 9baa9e4d4522dc847f444fff9d10f226b717124c Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Tue, 26 Jan 2021 07:41:23 +0000 Subject: [PATCH 17/24] Update name of nigeria nmis data. 
--- _dsa/bayesian-methods-abuja.md | 2 +- _dsa/what-is-machine-learning-ashesi.md | 106 ++++++++++++++++++++++++ 2 files changed, 107 insertions(+), 1 deletion(-) create mode 100755 _dsa/what-is-machine-learning-ashesi.md diff --git a/_dsa/bayesian-methods-abuja.md b/_dsa/bayesian-methods-abuja.md index c445542..1e4cbe3 100755 --- a/_dsa/bayesian-methods-abuja.md +++ b/_dsa/bayesian-methods-abuja.md @@ -31,7 +31,7 @@ transition: None \include{talk-macros.tex} \include{_ml/includes/what-is-ml.md} -\include{_ml/includes/nigerian-nmis-data.md} +\include{_ml/includes/nigeria-nmis-data.md} \include{_ml/includes/probability-intro.md} \include{_ml/includes/probabilistic-modelling.md} diff --git a/_dsa/what-is-machine-learning-ashesi.md b/_dsa/what-is-machine-learning-ashesi.md new file mode 100755 index 0000000..be0f574 --- /dev/null +++ b/_dsa/what-is-machine-learning-ashesi.md @@ -0,0 +1,106 @@ +--- +layout: slides +title: What is Machine Learning? +venue: Data Science Africa Summer School, Ashesi, Ghana +author: +- given: Neil D. + family: Lawrence + url: http://inverseprobability.com + institute: University of Cambridge + twitter: lawrennd + gscholar: r3SJcvoAAAAJ + orchid: +abstract: > + In this talk we will introduce the fundamental ideas in machine learning. We'll develop our exposition around the ideas of prediction function and the objective function. We don't so much focus on the derivation of particular algorithms, but more the general principles involved to give an idea of the machine learning *landscape*. +date: 2019-10-21 +categories: +- notes +layout: talk +geometry: ["a4paper", "margin=2cm"] +papersize: a4paper +transition: None +--- + +\include{../talk-macros.tex} + +\section{Introduction} + +\include{_data-science/includes/data-science-africa.md} +\include{_health/includes/malaria-gp.md} + +\subsection{Machine Learning} +\notes{This talk is a general introduction to machine learning, we will highlight the technical challenges and the current solutions. 
We will give an overview of what is machine learning and why it is important.} + +\subsection{Rise of Machine Learning} +\slides{ +* Driven by data and computation +* Fundamentally dependent on models +}\notes{Machine learning is the combination of data and models, through computation, to make predictions.} +$$ +\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction} +$$ + +\subsection{Data Revolution} + +\notes{Machine learning has risen in prominence due to the rise in data availability, and its interconnection with computers. The high bandwidth connection between data and computer leads to a new interaction between us and data via the computer. It is that channel that is being mediated by machine learning techniques.} +\figure{\includediagram{\diagramsDir/data-science/new-flow-of-information}{60%}}{Large amounts of data and high interconnection bandwidth mean that we receive much of our information about the world around us through computers.}{data-science-information-flow} + +\include{_supply-chain/includes/supply-chain-africa.md} +\include{_ml/includes/process-emulation.md} +\include{_ml/includes/nigeria-nmis-data.md} +\include{_ml/includes/what-does-machine-learning-do.md} +\include{_ml/includes/what-is-ml-2.md} +\include{_ai/includes/ai-vs-data-science-2.md} +\include{_ml/includes/neural-networks.md} + +\subsection{Machine Learning} +\slides{ +1. observe a system in practice +2. emulate its behavior with mathematics. + +* Design challenge: where to put mathematical function. +* Where it's placed leads to different ML domains. +}\notes{The key idea in machine learning is to observe the system in practice, and then emulate its behavior with mathematics. That leads to a design challenge as to where to place the mathematical function. The placement of the mathematical function leads to the different domains of machine learning.} + +\newslide{Types of Machine Learning} + +1. Supervised learning +2. Unsupervised learning +3. 
Reinforcement learning + +\newslide{Types of Machine Learning} +\slides{ +1. Supervised learning +2. Unsupervised learning +3. Reinforcement learning +} + + +\include{_ml/includes/supervised-learning-intro.md} + +\include{_ml/includes/classification-intro.md} +\include{_ml/includes/classification-examples.md} +\include{_ml/includes/the-perceptron.md} +\notes{\include{_ml/includes/logistic-regression.md} +\include{_ml/includes/nigeria-nmis-data-logistic.md}} +\include{_ml/includes/regression-intro.md} +\include{_ml/includes/regression-examples.md} +\include{_ml/includes/olympic-marathon-polynomial.md} + +\include{_ml/includes/supervised-learning-challenges.md} + + +\notes{ +\include{_ml/includes/unsupervised-learning.md} +\include{_ml/includes/reinforcement-learning.md} + +\notes{We have introduced a range of machine learning approaches by focusing on their use of mathematical functions to replace manually coded systems of rules. The important characteristic of machine learning is that the form of these functions, as dictated by their parameters, is determined by acquiring data from the real world.} + + +\include{_ml/includes/deployment.md}} + +\reading + +\thanks + +\references From 470cea370d8695b3292357590da3522458edec94 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sat, 28 Aug 2021 18:58:38 +0100 Subject: [PATCH 18/24] Make directory names consistently plural --- _dsa/_config.yml | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/_dsa/_config.yml b/_dsa/_config.yml index ca67505..c1cf0be 100755 --- a/_dsa/_config.yml +++ b/_dsa/_config.yml @@ -9,8 +9,16 @@ author: layout: lecture venue: Virtual (Zoom) ipynb: True -postdir: ../../../mlatcl/dsa/_lectures/ -slidedir: ../../../mlatcl/dsa/slides/ -notedir: ../../../mlatcl/dsa/_notes/ -notebookdir: ../../../mlatcl/dsa/_notebooks/ +postsdir: ../../../mlatcl/dsa/_lectures/ +slidesdir: ../../../mlatcl/dsa/slides/ +notesdir: ../../../mlatcl/dsa/_notes/ +notebooksdir: 
../../../mlatcl/dsa/_notebooks/ +writediagramsdir: . +diagramsdir: ./slides/diagrams/ transition: None +ghub: +- organization: lawrennd + repository: talks + branch: gh-pages + directory: _dsa + \ No newline at end of file From 009418acc5bd61baad443212a66c5b35692acce7 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Fri, 3 Sep 2021 18:21:55 +0100 Subject: [PATCH 19/24] Update by removing images --- _dsa/ml-systems.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md index bebe7a5..83d09cd 100755 --- a/_dsa/ml-systems.md +++ b/_dsa/ml-systems.md @@ -35,7 +35,8 @@ transition: None \include{_systems/includes/databases-and-joins.md} \include{_systems/includes/nigeria-nmis-data-systems.md} \include{_systems/includes/nigeria-nmis-spatial-join.md} -\include{_systems/includes/nigeria-nmis-sqlite.md} +\define{databaseType}{sqlite} +\include{_systems/includes/nigeria-nmis-sql.md} \include{_systems/includes/nigeria-nmis-covid-join.md} } From fdd789e132835e6ce0eae1592cccd3707f1abb2f Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sun, 12 Sep 2021 22:15:12 +0100 Subject: [PATCH 20/24] Update access talk. --- _dsa/ml-systems-kimberley.md | 43 ++++++++++++++++++++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 _dsa/ml-systems-kimberley.md diff --git a/_dsa/ml-systems-kimberley.md b/_dsa/ml-systems-kimberley.md new file mode 100644 index 0000000..bb268e3 --- /dev/null +++ b/_dsa/ml-systems-kimberley.md @@ -0,0 +1,43 @@ +--- +title: "Introduction to Machine Learning Systems" +abstract: "This session introduces some of the challenges of building machine learning data systems. It will introduce you to concepts around joining of databases together. The storage and manipulation of data is at the core of machine learning systems and data science. 
The goal of this notebook is to introduce the reader to these concepts, not to authoritatively answer any questions about the state of Nigerian health facilities or Covid19, but it may give you ideas about how to try and do that in your own country." +author: +- given: Eric + family: Meissner + url: https://www.linkedin.com/in/meissnereric/ + twitter: meissner_eric_7 +- given: Andrei + family: Paleyes + url: https://www.linkedin.com/in/andreipaleyes/ +- given: Neil D. + family: Lawrence + twitter: lawrennd + url: http://inverseprobability.com +date: 2021-10-06 +ipynb: true +venue: Virtual DSA, Kimberley +transition: None +--- + + +\slides{\section{AI via ML Systems} + +\include{_ai/includes/supply-chain-system.md} +\include{_ai/includes/aws-soa.md} +\include{_ai/includes/dsa-systems.md} +} + +\notes{ +\include{_systems/includes/nigeria-health-intro.md} +\include{_systems/includes/nigeria-nmis-installs.md} +\include{_systems/includes/databases-and-joins.md} +\include{_systems/includes/nigeria-nmis-data-systems.md} +\include{_systems/includes/nigeria-nmis-spatial-join.md} +\define{databaseType}{sqlite} +\include{_systems/includes/nigeria-nmis-sql.md} +\include{_systems/includes/nigeria-nmis-covid-join.md} +} + +\thanks + +\references From 575fe39a6a2d2448a6c610dd04ce00a633e3df84 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Wed, 6 Oct 2021 07:55:32 +0100 Subject: [PATCH 21/24] Update kimberly talk --- _dsa/_config.yml | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/_dsa/_config.yml b/_dsa/_config.yml index c1cf0be..529d800 100755 --- a/_dsa/_config.yml +++ b/_dsa/_config.yml @@ -9,16 +9,19 @@ author: layout: lecture venue: Virtual (Zoom) ipynb: True +talkcss: https://inverseprobability.com/assets/css/talks.css postsdir: ../../../mlatcl/dsa/_lectures/ slidesdir: ../../../mlatcl/dsa/slides/ notesdir: ../../../mlatcl/dsa/_notes/ notebooksdir: ../../../mlatcl/dsa/_notebooks/ writediagramsdir: . 
diagramsdir: ./slides/diagrams/ +baseurl: "dsa/" # the subpath of your site, e.g. /blog/ +url: "https://mlatcl.github.io/" # the base hostname & protocol for your site transition: None ghub: - organization: lawrennd repository: talks branch: gh-pages directory: _dsa - \ No newline at end of file + From beffe0df70d476e1078727ae7ba7a74e2dffa5b8 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Tue, 25 Jan 2022 09:28:09 +0000 Subject: [PATCH 22/24] Update for deepnn lecture --- _dsa/gaussian-processes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index 64b5c21..76c7338 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -26,7 +26,7 @@ transition: None \include{_ml/includes/multivariate-gaussian-properties.md} \notes{\include{_ml/includes/linear-regression-log-likelihood.md} -\include{_ml/includes/olympic-data-linear-regression.md} +\include{_ml/includes/olympic-marathon-linear-regression.md} \include{_ml/includes/linear-regression-multivariate-log-likelihood.md} \define{designVector}{\basisVector} \define{designVariable}{Phi} \define{designMatrix}{\basisMatrix} From d3f671d7f617152b8cbf2d6473a4fbcc6d2226f9 Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Sun, 27 Feb 2022 08:00:04 +0000 Subject: [PATCH 23/24] Remove talk-macros loads. 
--- _dsa/bayesian-methods-abuja.md | 30 +++++------ _dsa/gaussian-processes.md | 66 ++++++++++++------------- _dsa/ml-systems-kimberley.md | 20 ++++---- _dsa/ml-systems.md | 22 ++++----- _dsa/probabilistic-machine-learning.md | 28 +++++------ _dsa/what-is-machine-learning-ashesi.md | 46 ++++++++--------- _dsa/what-is-machine-learning.md | 28 +++++------ 7 files changed, 120 insertions(+), 120 deletions(-) diff --git a/_dsa/bayesian-methods-abuja.md b/_dsa/bayesian-methods-abuja.md index 1e4cbe3..e4a311b 100755 --- a/_dsa/bayesian-methods-abuja.md +++ b/_dsa/bayesian-methods-abuja.md @@ -28,21 +28,21 @@ venue: DSA, Abuja transition: None --- -\include{talk-macros.tex} - -\include{_ml/includes/what-is-ml.md} -\include{_ml/includes/nigeria-nmis-data.md} -\include{_ml/includes/probability-intro.md} -\include{_ml/includes/probabilistic-modelling.md} - -\include{_ml/includes/graphical-models.md} -\include{_ml/includes/classification-intro.md} -\include{_ml/includes/classification-examples.md} -\include{_ml/includes/bayesian-reminder.md} -\include{_ml/includes/bernoulli-distribution.md} -\include{_ml/includes/bernoulli-maximum-likelihood.md} -\include{_ml/includes/bayes-rule-reminder.md} -\include{_ml/includes/naive-bayes.md} +talk-macros.gpp}lk-macros.tex} + +talk-macros.gpp}l/includes/what-is-ml.md} +talk-macros.gpp}l/includes/nigeria-nmis-data.md} +talk-macros.gpp}l/includes/probability-intro.md} +talk-macros.gpp}l/includes/probabilistic-modelling.md} + +talk-macros.gpp}l/includes/graphical-models.md} +talk-macros.gpp}l/includes/classification-intro.md} +talk-macros.gpp}l/includes/classification-examples.md} +talk-macros.gpp}l/includes/bayesian-reminder.md} +talk-macros.gpp}l/includes/bernoulli-distribution.md} +talk-macros.gpp}l/includes/bernoulli-maximum-likelihood.md} +talk-macros.gpp}l/includes/bayes-rule-reminder.md} +talk-macros.gpp}l/includes/naive-bayes.md} \subsection{Other Reading} diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index 
76c7338..d46a71f 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -11,35 +11,35 @@ time: "15:00 (West Africa Standard Time)" transition: None --- -\include{talk-macros.tex} -\include{_mlai/includes/mlai-notebook-setup.md} +talk-macros.gpp}lk-macros.tex} +talk-macros.gpp}lai/includes/mlai-notebook-setup.md} -\include{_gp/includes/gp-book.md} -\include{_ml/includes/first-course-book.md} +talk-macros.gpp}p/includes/gp-book.md} +talk-macros.gpp}l/includes/first-course-book.md} -\include{_health/includes/malaria-gp.md} -\include{_ml/includes/what-is-ml.md} -\include{_ml/includes/overdetermined-inaugural.md} -\include{_ml/includes/univariate-gaussian-properties.md} +talk-macros.gpp}ealth/includes/malaria-gp.md} +talk-macros.gpp}l/includes/what-is-ml.md} +talk-macros.gpp}l/includes/overdetermined-inaugural.md} +talk-macros.gpp}l/includes/univariate-gaussian-properties.md} -\include{_ml/includes/multivariate-gaussian-properties.md} -\notes{\include{_ml/includes/linear-regression-log-likelihood.md} -\include{_ml/includes/olympic-marathon-linear-regression.md} -\include{_ml/includes/linear-regression-multivariate-log-likelihood.md} +talk-macros.gpp}l/includes/multivariate-gaussian-properties.md} +\notes{talk-macros.gpp}l/includes/linear-regression-log-likelihood.md} +talk-macros.gpp}l/includes/olympic-marathon-linear-regression.md} +talk-macros.gpp}l/includes/linear-regression-multivariate-log-likelihood.md} \define{designVector}{\basisVector} \define{designVariable}{Phi} \define{designMatrix}{\basisMatrix} -\include{_ml/includes/linear-regression-direct-solution.md}} -\include{_ml/includes/linear-regression-objective-optimisation.md} -\include{_ml/includes/movie-body-count-linear-regression.md} +talk-macros.gpp}l/includes/linear-regression-direct-solution.md}} +talk-macros.gpp}l/includes/linear-regression-objective-optimisation.md} +talk-macros.gpp}l/includes/movie-body-count-linear-regression.md} -\include{_ml/includes/underdetermined-system.md} 
-\include{_ml/includes/two-d-gaussian.md} +talk-macros.gpp}l/includes/underdetermined-system.md} +talk-macros.gpp}l/includes/two-d-gaussian.md} -\include{_ml/includes/basis-functions-nn.md} -\include{_ml/includes/relu-basis.md} +talk-macros.gpp}l/includes/basis-functions-nn.md} +talk-macros.gpp}l/includes/relu-basis.md} \subsection{Gaussian Processes} \slides{ @@ -61,29 +61,29 @@ $$ -\include{_ml/includes/linear-model-overview.md} +talk-macros.gpp}l/includes/linear-model-overview.md} -\include{_ml/includes/radial-basis.md} +talk-macros.gpp}l/includes/radial-basis.md} -\include{_gp/includes/gp-from-basis-functions.md} +talk-macros.gpp}p/includes/gp-from-basis-functions.md} -\include{_gp/includes/non-degenerate-gps.md} -\include{_gp/includes/gp-function-space.md} +talk-macros.gpp}p/includes/non-degenerate-gps.md} +talk-macros.gpp}p/includes/gp-function-space.md} -\include{_gp/includes/gptwopointpred.md} +talk-macros.gpp}p/includes/gptwopointpred.md} -\include{_gp/includes/gp-covariance-function-importance.md} -\include{_gp/includes/gp-numerics-and-optimization.md} +talk-macros.gpp}p/includes/gp-covariance-function-importance.md} +talk-macros.gpp}p/includes/gp-numerics-and-optimization.md} -\include{_gp/includes/gp-optimize.md} -\include{_kern/includes/eq-covariance.md} -\include{_gp/includes/gp-summer-school.md} -\include{_gp/includes/gpy-software.md} -\include{_gp/includes/gpy-tutorial.md} +talk-macros.gpp}p/includes/gp-optimize.md} +talk-macros.gpp}ern/includes/eq-covariance.md} +talk-macros.gpp}p/includes/gp-summer-school.md} +talk-macros.gpp}p/includes/gpy-software.md} +talk-macros.gpp}p/includes/gpy-tutorial.md} \subsection{Review} -\include{_gp/includes/other-gp-software.md} +talk-macros.gpp}p/includes/other-gp-software.md} \reading diff --git a/_dsa/ml-systems-kimberley.md b/_dsa/ml-systems-kimberley.md index bb268e3..75f5037 100644 --- a/_dsa/ml-systems-kimberley.md +++ b/_dsa/ml-systems-kimberley.md @@ -22,20 +22,20 @@ transition: None \slides{\section{AI 
via ML Systems} -\include{_ai/includes/supply-chain-system.md} -\include{_ai/includes/aws-soa.md} -\include{_ai/includes/dsa-systems.md} +talk-macros.gpp}i/includes/supply-chain-system.md} +talk-macros.gpp}i/includes/aws-soa.md} +talk-macros.gpp}i/includes/dsa-systems.md} } \notes{ -\include{_systems/includes/nigeria-health-intro.md} -\include{_systems/includes/nigeria-nmis-installs.md} -\include{_systems/includes/databases-and-joins.md} -\include{_systems/includes/nigeria-nmis-data-systems.md} -\include{_systems/includes/nigeria-nmis-spatial-join.md} +talk-macros.gpp}ystems/includes/nigeria-health-intro.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-installs.md} +talk-macros.gpp}ystems/includes/databases-and-joins.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-data-systems.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-spatial-join.md} \define{databaseType}{sqlite} -\include{_systems/includes/nigeria-nmis-sql.md} -\include{_systems/includes/nigeria-nmis-covid-join.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-sql.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-covid-join.md} } \thanks diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md index 83d09cd..055e2cf 100755 --- a/_dsa/ml-systems.md +++ b/_dsa/ml-systems.md @@ -20,24 +20,24 @@ venue: Virtual DSA transition: None --- -\include{talk-macros.tex} +talk-macros.gpp}lk-macros.tex} \slides{\section{AI via ML Systems} -\include{_ai/includes/supply-chain-system.md} -\include{_ai/includes/aws-soa.md} -\include{_ai/includes/dsa-systems.md} +talk-macros.gpp}i/includes/supply-chain-system.md} +talk-macros.gpp}i/includes/aws-soa.md} +talk-macros.gpp}i/includes/dsa-systems.md} } \notes{ -\include{_systems/includes/nigeria-health-intro.md} -\include{_systems/includes/nigeria-nmis-installs.md} -\include{_systems/includes/databases-and-joins.md} -\include{_systems/includes/nigeria-nmis-data-systems.md} -\include{_systems/includes/nigeria-nmis-spatial-join.md} 
+talk-macros.gpp}ystems/includes/nigeria-health-intro.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-installs.md} +talk-macros.gpp}ystems/includes/databases-and-joins.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-data-systems.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-spatial-join.md} \define{databaseType}{sqlite} -\include{_systems/includes/nigeria-nmis-sql.md} -\include{_systems/includes/nigeria-nmis-covid-join.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-sql.md} +talk-macros.gpp}ystems/includes/nigeria-nmis-covid-join.md} } \thanks diff --git a/_dsa/probabilistic-machine-learning.md b/_dsa/probabilistic-machine-learning.md index 8c80d10..744af19 100755 --- a/_dsa/probabilistic-machine-learning.md +++ b/_dsa/probabilistic-machine-learning.md @@ -30,20 +30,20 @@ https://www.kaggle.com/alaowerre/nigeria-nmis-health-facility-data %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\include{talk-macros.tex} - -\include{_ml/includes/what-is-ml.md} -\include{_ml/includes/probability-intro.md} -\include{_ml/includes/probabilistic-modelling.md} - -\include{_ml/includes/graphical-models.md} -\include{_ml/includes/classification-intro.md} -\include{_ml/includes/classification-examples.md} -\include{_ml/includes/bayesian-reminder.md} -\include{_ml/includes/bernoulli-distribution.md} -\include{_ml/includes/bernoulli-maximum-likelihood.md} -\include{_ml/includes/bayes-rule-reminder.md} -\include{_ml/includes/naive-bayes.md} +talk-macros.gpp}lk-macros.tex} + +talk-macros.gpp}l/includes/what-is-ml.md} +talk-macros.gpp}l/includes/probability-intro.md} +talk-macros.gpp}l/includes/probabilistic-modelling.md} + +talk-macros.gpp}l/includes/graphical-models.md} +talk-macros.gpp}l/includes/classification-intro.md} +talk-macros.gpp}l/includes/classification-examples.md} +talk-macros.gpp}l/includes/bayesian-reminder.md} +talk-macros.gpp}l/includes/bernoulli-distribution.md} +talk-macros.gpp}l/includes/bernoulli-maximum-likelihood.md} 
+talk-macros.gpp}l/includes/bayes-rule-reminder.md} +talk-macros.gpp}l/includes/naive-bayes.md} ### Other Reading diff --git a/_dsa/what-is-machine-learning-ashesi.md b/_dsa/what-is-machine-learning-ashesi.md index be0f574..d3ecf0f 100755 --- a/_dsa/what-is-machine-learning-ashesi.md +++ b/_dsa/what-is-machine-learning-ashesi.md @@ -21,12 +21,12 @@ papersize: a4paper transition: None --- -\include{../talk-macros.tex} +talk-macros.gpp}/talk-macros.tex} \section{Introduction} -\include{_data-science/includes/data-science-africa.md} -\include{_health/includes/malaria-gp.md} +talk-macros.gpp}ata-science/includes/data-science-africa.md} +talk-macros.gpp}ealth/includes/malaria-gp.md} \subsection{Machine Learning} \notes{This talk is a general introduction to machine learning, we will highlight the technical challenges and the current solutions. We will give an overview of what is machine learning and why it is important.} @@ -45,13 +45,13 @@ $$ \notes{Machine learning has risen in prominence due to the rise in data availability, and its interconnection with computers. The high bandwidth connection between data and computer leads to a new interaction between us and data via the computer. 
It is that channel that is being mediated by machine learning techniques.} \figure{\includediagram{\diagramsDir/data-science/new-flow-of-information}{60%}}{Large amounts of data and high interconnection bandwidth mean that we receive much of our information about the world around us through computers.}{data-science-information-flow} -\include{_supply-chain/includes/supply-chain-africa.md} -\include{_ml/includes/process-emulation.md} -\include{_ml/includes/nigeria-nmis-data.md} -\include{_ml/includes/what-does-machine-learning-do.md} -\include{_ml/includes/what-is-ml-2.md} -\include{_ai/includes/ai-vs-data-science-2.md} -\include{_ml/includes/neural-networks.md} +talk-macros.gpp}upply-chain/includes/supply-chain-africa.md} +talk-macros.gpp}l/includes/process-emulation.md} +talk-macros.gpp}l/includes/nigeria-nmis-data.md} +talk-macros.gpp}l/includes/what-does-machine-learning-do.md} +talk-macros.gpp}l/includes/what-is-ml-2.md} +talk-macros.gpp}i/includes/ai-vs-data-science-2.md} +talk-macros.gpp}l/includes/neural-networks.md} \subsection{Machine Learning} \slides{ @@ -76,28 +76,28 @@ $$ } -\include{_ml/includes/supervised-learning-intro.md} +talk-macros.gpp}l/includes/supervised-learning-intro.md} -\include{_ml/includes/classification-intro.md} -\include{_ml/includes/classification-examples.md} -\include{_ml/includes/the-perceptron.md} -\notes{\include{_ml/includes/logistic-regression.md} -\include{_ml/includes/nigeria-nmis-data-logistic.md}} -\include{_ml/includes/regression-intro.md} -\include{_ml/includes/regression-examples.md} -\include{_ml/includes/olympic-marathon-polynomial.md} +talk-macros.gpp}l/includes/classification-intro.md} +talk-macros.gpp}l/includes/classification-examples.md} +talk-macros.gpp}l/includes/the-perceptron.md} +\notes{talk-macros.gpp}l/includes/logistic-regression.md} +talk-macros.gpp}l/includes/nigeria-nmis-data-logistic.md}} +talk-macros.gpp}l/includes/regression-intro.md} +talk-macros.gpp}l/includes/regression-examples.md} 
+talk-macros.gpp}l/includes/olympic-marathon-polynomial.md} -\include{_ml/includes/supervised-learning-challenges.md} +talk-macros.gpp}l/includes/supervised-learning-challenges.md} \notes{ -\include{_ml/includes/unsupervised-learning.md} -\include{_ml/includes/reinforcement-learning.md} +talk-macros.gpp}l/includes/unsupervised-learning.md} +talk-macros.gpp}l/includes/reinforcement-learning.md} \notes{We have introduced a range of machine learning approaches by focusing on their use of mathematical functions to replace manually coded systems of rules. The important characteristic of machine learning is that the form of these functions, as dictated by their parameters, is determined by acquiring data from the real world.} -\include{_ml/includes/deployment.md}} +talk-macros.gpp}l/includes/deployment.md}} \reading diff --git a/_dsa/what-is-machine-learning.md b/_dsa/what-is-machine-learning.md index 9d5eb49..9880c26 100755 --- a/_dsa/what-is-machine-learning.md +++ b/_dsa/what-is-machine-learning.md @@ -20,12 +20,12 @@ papersize: a4paper transition: None --- -\include{../talk-macros.gpp} +talk-macros.gpp}/talk-macros.gpp} \section{Introduction} -\include{_data-science/includes/data-science-africa.md} -\include{_health/includes/malaria-gp.md} +talk-macros.gpp}ata-science/includes/data-science-africa.md} +talk-macros.gpp}ealth/includes/malaria-gp.md} \subsection{Machine Learning} \notes{This talk is a general introduction to machine learning, we will highlight the technical challenges and the current solutions. We will give an overview of what is machine learning and why it is important.} @@ -44,8 +44,8 @@ $$ \notes{Machine learning has risen in prominence due to the rise in data availability, and its interconnection with computers. The high bandwidth connection between data and computer leads to a new interaction between us and data via the computer. 
It is that channel that is being mediated by machine learning techniques.} \figure{\includediagram{\diagramsDir/data-science/new-flow-of-information}{60%}}{Large amounts of data and high interconnection bandwidth mean that we receive much of our information about the world around us through computers.}{data-science-information-flow} -\include{_supply-chain/includes/supply-chain-africa.md} -\include{_ml/includes/process-emulation.md} +talk-macros.gpp}upply-chain/includes/supply-chain-africa.md} +talk-macros.gpp}l/includes/process-emulation.md} \newslide{Kapchorwa District} @@ -53,12 +53,12 @@ $$ \notes{Stephen Kiprotich, the 2012 gold medal winner from the London Olympics, comes from Kapchorwa district, in eastern Uganda, near the border with Kenya.} -\include{_ml/includes/olympic-marathon-polynomial.md} -\include{../_ml/includes/what-does-machine-learning-do.md} +talk-macros.gpp}l/includes/olympic-marathon-polynomial.md} +talk-macros.gpp}/_ml/includes/what-does-machine-learning-do.md} -\include{_ml/includes/what-is-ml-2.md} -\include{_ai/includes/ai-vs-data-science-2.md} -\include{_ml/includes/neural-networks.md} +talk-macros.gpp}l/includes/what-is-ml-2.md} +talk-macros.gpp}i/includes/ai-vs-data-science-2.md} +talk-macros.gpp}l/includes/neural-networks.md} \subsection{Machine Learning} \slides{ @@ -81,16 +81,16 @@ $$ 2. Unsupervised learning 3. Reinforcement learning } -\include{_ml/includes/supervised-learning.md} +talk-macros.gpp}l/includes/supervised-learning.md} \notes{ -\include{_ml/includes/unsupervised-learning.md} -\include{_ml/includes/reinforcement-learning.md} +talk-macros.gpp}l/includes/unsupervised-learning.md} +talk-macros.gpp}l/includes/reinforcement-learning.md} \notes{We have introduced a range of machine learning approaches by focusing on their use of mathematical functions to replace manually coded systems of rules. 
The important characteristic of machine learning is that the form of these functions, as dictated by their parameters, is determined by acquiring data from the real world.} -\include{_ml/includes/deployment.md}} +talk-macros.gpp}l/includes/deployment.md}} \thanks From 8adaee1356261c472e4693d9f7f4dd31e587584b Mon Sep 17 00:00:00 2001 From: Neil Lawrence Date: Fri, 11 Mar 2022 10:52:42 +0000 Subject: [PATCH 24/24] fix conflict --- _dsa/bayesian-methods-abuja.md | 30 +++++------ _dsa/gaussian-processes.md | 66 ++++++++++++------------- _dsa/ml-systems-kimberley.md | 20 ++++---- _dsa/ml-systems.md | 22 ++++----- _dsa/probabilistic-machine-learning.md | 28 +++++------ _dsa/what-is-machine-learning-ashesi.md | 46 ++++++++--------- _dsa/what-is-machine-learning.md | 28 +++++------ 7 files changed, 120 insertions(+), 120 deletions(-) diff --git a/_dsa/bayesian-methods-abuja.md b/_dsa/bayesian-methods-abuja.md index e4a311b..1e4cbe3 100755 --- a/_dsa/bayesian-methods-abuja.md +++ b/_dsa/bayesian-methods-abuja.md @@ -28,21 +28,21 @@ venue: DSA, Abuja transition: None --- -talk-macros.gpp}lk-macros.tex} - -talk-macros.gpp}l/includes/what-is-ml.md} -talk-macros.gpp}l/includes/nigeria-nmis-data.md} -talk-macros.gpp}l/includes/probability-intro.md} -talk-macros.gpp}l/includes/probabilistic-modelling.md} - -talk-macros.gpp}l/includes/graphical-models.md} -talk-macros.gpp}l/includes/classification-intro.md} -talk-macros.gpp}l/includes/classification-examples.md} -talk-macros.gpp}l/includes/bayesian-reminder.md} -talk-macros.gpp}l/includes/bernoulli-distribution.md} -talk-macros.gpp}l/includes/bernoulli-maximum-likelihood.md} -talk-macros.gpp}l/includes/bayes-rule-reminder.md} -talk-macros.gpp}l/includes/naive-bayes.md} +\include{talk-macros.tex} + +\include{_ml/includes/what-is-ml.md} +\include{_ml/includes/nigeria-nmis-data.md} +\include{_ml/includes/probability-intro.md} +\include{_ml/includes/probabilistic-modelling.md} + +\include{_ml/includes/graphical-models.md} 
+\include{_ml/includes/classification-intro.md} +\include{_ml/includes/classification-examples.md} +\include{_ml/includes/bayesian-reminder.md} +\include{_ml/includes/bernoulli-distribution.md} +\include{_ml/includes/bernoulli-maximum-likelihood.md} +\include{_ml/includes/bayes-rule-reminder.md} +\include{_ml/includes/naive-bayes.md} \subsection{Other Reading} diff --git a/_dsa/gaussian-processes.md b/_dsa/gaussian-processes.md index d46a71f..76c7338 100755 --- a/_dsa/gaussian-processes.md +++ b/_dsa/gaussian-processes.md @@ -11,35 +11,35 @@ time: "15:00 (West Africa Standard Time)" transition: None --- -talk-macros.gpp}lk-macros.tex} -talk-macros.gpp}lai/includes/mlai-notebook-setup.md} +\include{talk-macros.tex} +\include{_mlai/includes/mlai-notebook-setup.md} -talk-macros.gpp}p/includes/gp-book.md} -talk-macros.gpp}l/includes/first-course-book.md} +\include{_gp/includes/gp-book.md} +\include{_ml/includes/first-course-book.md} -talk-macros.gpp}ealth/includes/malaria-gp.md} -talk-macros.gpp}l/includes/what-is-ml.md} -talk-macros.gpp}l/includes/overdetermined-inaugural.md} -talk-macros.gpp}l/includes/univariate-gaussian-properties.md} +\include{_health/includes/malaria-gp.md} +\include{_ml/includes/what-is-ml.md} +\include{_ml/includes/overdetermined-inaugural.md} +\include{_ml/includes/univariate-gaussian-properties.md} -talk-macros.gpp}l/includes/multivariate-gaussian-properties.md} -\notes{talk-macros.gpp}l/includes/linear-regression-log-likelihood.md} -talk-macros.gpp}l/includes/olympic-marathon-linear-regression.md} -talk-macros.gpp}l/includes/linear-regression-multivariate-log-likelihood.md} +\include{_ml/includes/multivariate-gaussian-properties.md} +\notes{\include{_ml/includes/linear-regression-log-likelihood.md} +\include{_ml/includes/olympic-marathon-linear-regression.md} +\include{_ml/includes/linear-regression-multivariate-log-likelihood.md} \define{designVector}{\basisVector} \define{designVariable}{Phi} \define{designMatrix}{\basisMatrix} 
-talk-macros.gpp}l/includes/linear-regression-direct-solution.md}} -talk-macros.gpp}l/includes/linear-regression-objective-optimisation.md} -talk-macros.gpp}l/includes/movie-body-count-linear-regression.md} +\include{_ml/includes/linear-regression-direct-solution.md}} +\include{_ml/includes/linear-regression-objective-optimisation.md} +\include{_ml/includes/movie-body-count-linear-regression.md} -talk-macros.gpp}l/includes/underdetermined-system.md} -talk-macros.gpp}l/includes/two-d-gaussian.md} +\include{_ml/includes/underdetermined-system.md} +\include{_ml/includes/two-d-gaussian.md} -talk-macros.gpp}l/includes/basis-functions-nn.md} -talk-macros.gpp}l/includes/relu-basis.md} +\include{_ml/includes/basis-functions-nn.md} +\include{_ml/includes/relu-basis.md} \subsection{Gaussian Processes} \slides{ @@ -61,29 +61,29 @@ $$ -talk-macros.gpp}l/includes/linear-model-overview.md} +\include{_ml/includes/linear-model-overview.md} -talk-macros.gpp}l/includes/radial-basis.md} +\include{_ml/includes/radial-basis.md} -talk-macros.gpp}p/includes/gp-from-basis-functions.md} +\include{_gp/includes/gp-from-basis-functions.md} -talk-macros.gpp}p/includes/non-degenerate-gps.md} -talk-macros.gpp}p/includes/gp-function-space.md} +\include{_gp/includes/non-degenerate-gps.md} +\include{_gp/includes/gp-function-space.md} -talk-macros.gpp}p/includes/gptwopointpred.md} +\include{_gp/includes/gptwopointpred.md} -talk-macros.gpp}p/includes/gp-covariance-function-importance.md} -talk-macros.gpp}p/includes/gp-numerics-and-optimization.md} +\include{_gp/includes/gp-covariance-function-importance.md} +\include{_gp/includes/gp-numerics-and-optimization.md} -talk-macros.gpp}p/includes/gp-optimize.md} -talk-macros.gpp}ern/includes/eq-covariance.md} -talk-macros.gpp}p/includes/gp-summer-school.md} -talk-macros.gpp}p/includes/gpy-software.md} -talk-macros.gpp}p/includes/gpy-tutorial.md} +\include{_gp/includes/gp-optimize.md} +\include{_kern/includes/eq-covariance.md} 
+\include{_gp/includes/gp-summer-school.md} +\include{_gp/includes/gpy-software.md} +\include{_gp/includes/gpy-tutorial.md} \subsection{Review} -talk-macros.gpp}p/includes/other-gp-software.md} +\include{_gp/includes/other-gp-software.md} \reading diff --git a/_dsa/ml-systems-kimberley.md b/_dsa/ml-systems-kimberley.md index 75f5037..bb268e3 100644 --- a/_dsa/ml-systems-kimberley.md +++ b/_dsa/ml-systems-kimberley.md @@ -22,20 +22,20 @@ transition: None \slides{\section{AI via ML Systems} -talk-macros.gpp}i/includes/supply-chain-system.md} -talk-macros.gpp}i/includes/aws-soa.md} -talk-macros.gpp}i/includes/dsa-systems.md} +\include{_ai/includes/supply-chain-system.md} +\include{_ai/includes/aws-soa.md} +\include{_ai/includes/dsa-systems.md} } \notes{ -talk-macros.gpp}ystems/includes/nigeria-health-intro.md} -talk-macros.gpp}ystems/includes/nigeria-nmis-installs.md} -talk-macros.gpp}ystems/includes/databases-and-joins.md} -talk-macros.gpp}ystems/includes/nigeria-nmis-data-systems.md} -talk-macros.gpp}ystems/includes/nigeria-nmis-spatial-join.md} +\include{_systems/includes/nigeria-health-intro.md} +\include{_systems/includes/nigeria-nmis-installs.md} +\include{_systems/includes/databases-and-joins.md} +\include{_systems/includes/nigeria-nmis-data-systems.md} +\include{_systems/includes/nigeria-nmis-spatial-join.md} \define{databaseType}{sqlite} -talk-macros.gpp}ystems/includes/nigeria-nmis-sql.md} -talk-macros.gpp}ystems/includes/nigeria-nmis-covid-join.md} +\include{_systems/includes/nigeria-nmis-sql.md} +\include{_systems/includes/nigeria-nmis-covid-join.md} } \thanks diff --git a/_dsa/ml-systems.md b/_dsa/ml-systems.md index 055e2cf..83d09cd 100755 --- a/_dsa/ml-systems.md +++ b/_dsa/ml-systems.md @@ -20,24 +20,24 @@ venue: Virtual DSA transition: None --- -talk-macros.gpp}lk-macros.tex} +\include{talk-macros.tex} \slides{\section{AI via ML Systems} -talk-macros.gpp}i/includes/supply-chain-system.md} -talk-macros.gpp}i/includes/aws-soa.md} 
-talk-macros.gpp}i/includes/dsa-systems.md} +\include{_ai/includes/supply-chain-system.md} +\include{_ai/includes/aws-soa.md} +\include{_ai/includes/dsa-systems.md} } \notes{ -talk-macros.gpp}ystems/includes/nigeria-health-intro.md} -talk-macros.gpp}ystems/includes/nigeria-nmis-installs.md} -talk-macros.gpp}ystems/includes/databases-and-joins.md} -talk-macros.gpp}ystems/includes/nigeria-nmis-data-systems.md} -talk-macros.gpp}ystems/includes/nigeria-nmis-spatial-join.md} +\include{_systems/includes/nigeria-health-intro.md} +\include{_systems/includes/nigeria-nmis-installs.md} +\include{_systems/includes/databases-and-joins.md} +\include{_systems/includes/nigeria-nmis-data-systems.md} +\include{_systems/includes/nigeria-nmis-spatial-join.md} \define{databaseType}{sqlite} -talk-macros.gpp}ystems/includes/nigeria-nmis-sql.md} -talk-macros.gpp}ystems/includes/nigeria-nmis-covid-join.md} +\include{_systems/includes/nigeria-nmis-sql.md} +\include{_systems/includes/nigeria-nmis-covid-join.md} } \thanks diff --git a/_dsa/probabilistic-machine-learning.md b/_dsa/probabilistic-machine-learning.md index 744af19..8c80d10 100755 --- a/_dsa/probabilistic-machine-learning.md +++ b/_dsa/probabilistic-machine-learning.md @@ -30,20 +30,20 @@ https://www.kaggle.com/alaowerre/nigeria-nmis-health-facility-data %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -talk-macros.gpp}lk-macros.tex} - -talk-macros.gpp}l/includes/what-is-ml.md} -talk-macros.gpp}l/includes/probability-intro.md} -talk-macros.gpp}l/includes/probabilistic-modelling.md} - -talk-macros.gpp}l/includes/graphical-models.md} -talk-macros.gpp}l/includes/classification-intro.md} -talk-macros.gpp}l/includes/classification-examples.md} -talk-macros.gpp}l/includes/bayesian-reminder.md} -talk-macros.gpp}l/includes/bernoulli-distribution.md} -talk-macros.gpp}l/includes/bernoulli-maximum-likelihood.md} -talk-macros.gpp}l/includes/bayes-rule-reminder.md} -talk-macros.gpp}l/includes/naive-bayes.md} +\include{talk-macros.tex} + 
+\include{_ml/includes/what-is-ml.md} +\include{_ml/includes/probability-intro.md} +\include{_ml/includes/probabilistic-modelling.md} + +\include{_ml/includes/graphical-models.md} +\include{_ml/includes/classification-intro.md} +\include{_ml/includes/classification-examples.md} +\include{_ml/includes/bayesian-reminder.md} +\include{_ml/includes/bernoulli-distribution.md} +\include{_ml/includes/bernoulli-maximum-likelihood.md} +\include{_ml/includes/bayes-rule-reminder.md} +\include{_ml/includes/naive-bayes.md} ### Other Reading diff --git a/_dsa/what-is-machine-learning-ashesi.md b/_dsa/what-is-machine-learning-ashesi.md index d3ecf0f..be0f574 100755 --- a/_dsa/what-is-machine-learning-ashesi.md +++ b/_dsa/what-is-machine-learning-ashesi.md @@ -21,12 +21,12 @@ papersize: a4paper transition: None --- -talk-macros.gpp}/talk-macros.tex} +\include{../talk-macros.tex} \section{Introduction} -talk-macros.gpp}ata-science/includes/data-science-africa.md} -talk-macros.gpp}ealth/includes/malaria-gp.md} +\include{_data-science/includes/data-science-africa.md} +\include{_health/includes/malaria-gp.md} \subsection{Machine Learning} \notes{This talk is a general introduction to machine learning, we will highlight the technical challenges and the current solutions. We will give an overview of what is machine learning and why it is important.} @@ -45,13 +45,13 @@ $$ \notes{Machine learning has risen in prominence due to the rise in data availability, and its interconnection with computers. The high bandwidth connection between data and computer leads to a new interaction between us and data via the computer. 
It is that channel that is being mediated by machine learning techniques.} \figure{\includediagram{\diagramsDir/data-science/new-flow-of-information}{60%}}{Large amounts of data and high interconnection bandwidth mean that we receive much of our information about the world around us through computers.}{data-science-information-flow} -talk-macros.gpp}upply-chain/includes/supply-chain-africa.md} -talk-macros.gpp}l/includes/process-emulation.md} -talk-macros.gpp}l/includes/nigeria-nmis-data.md} -talk-macros.gpp}l/includes/what-does-machine-learning-do.md} -talk-macros.gpp}l/includes/what-is-ml-2.md} -talk-macros.gpp}i/includes/ai-vs-data-science-2.md} -talk-macros.gpp}l/includes/neural-networks.md} +\include{_supply-chain/includes/supply-chain-africa.md} +\include{_ml/includes/process-emulation.md} +\include{_ml/includes/nigeria-nmis-data.md} +\include{_ml/includes/what-does-machine-learning-do.md} +\include{_ml/includes/what-is-ml-2.md} +\include{_ai/includes/ai-vs-data-science-2.md} +\include{_ml/includes/neural-networks.md} \subsection{Machine Learning} \slides{ @@ -76,28 +76,28 @@ talk-macros.gpp}l/includes/neural-networks.md} } -talk-macros.gpp}l/includes/supervised-learning-intro.md} +\include{_ml/includes/supervised-learning-intro.md} -talk-macros.gpp}l/includes/classification-intro.md} -talk-macros.gpp}l/includes/classification-examples.md} -talk-macros.gpp}l/includes/the-perceptron.md} -\notes{talk-macros.gpp}l/includes/logistic-regression.md} -talk-macros.gpp}l/includes/nigeria-nmis-data-logistic.md}} -talk-macros.gpp}l/includes/regression-intro.md} -talk-macros.gpp}l/includes/regression-examples.md} -talk-macros.gpp}l/includes/olympic-marathon-polynomial.md} +\include{_ml/includes/classification-intro.md} +\include{_ml/includes/classification-examples.md} +\include{_ml/includes/the-perceptron.md} +\notes{\include{_ml/includes/logistic-regression.md} +\include{_ml/includes/nigeria-nmis-data-logistic.md}} +\include{_ml/includes/regression-intro.md} 
+\include{_ml/includes/regression-examples.md} +\include{_ml/includes/olympic-marathon-polynomial.md} -talk-macros.gpp}l/includes/supervised-learning-challenges.md} +\include{_ml/includes/supervised-learning-challenges.md} \notes{ -talk-macros.gpp}l/includes/unsupervised-learning.md} -talk-macros.gpp}l/includes/reinforcement-learning.md} +\include{_ml/includes/unsupervised-learning.md} +\include{_ml/includes/reinforcement-learning.md} \notes{We have introduced a range of machine learning approaches by focusing on their use of mathematical functions to replace manually coded systems of rules. The important characteristic of machine learning is that the form of these functions, as dictated by their parameters, is determined by acquiring data from the real world.} -talk-macros.gpp}l/includes/deployment.md}} +\include{_ml/includes/deployment.md}} \reading diff --git a/_dsa/what-is-machine-learning.md b/_dsa/what-is-machine-learning.md index 9880c26..9d5eb49 100755 --- a/_dsa/what-is-machine-learning.md +++ b/_dsa/what-is-machine-learning.md @@ -20,12 +20,12 @@ papersize: a4paper transition: None --- -talk-macros.gpp}/talk-macros.gpp} +\include{../talk-macros.gpp} \section{Introduction} -talk-macros.gpp}ata-science/includes/data-science-africa.md} -talk-macros.gpp}ealth/includes/malaria-gp.md} +\include{_data-science/includes/data-science-africa.md} +\include{_health/includes/malaria-gp.md} \subsection{Machine Learning} \notes{This talk is a general introduction to machine learning, we will highlight the technical challenges and the current solutions. We will give an overview of what is machine learning and why it is important.} @@ -44,8 +44,8 @@ $$ \notes{Machine learning has risen in prominence due to the rise in data availability, and its interconnection with computers. The high bandwidth connection between data and computer leads to a new interaction between us and data via the computer. 
It is that channel that is being mediated by machine learning techniques.} \figure{\includediagram{\diagramsDir/data-science/new-flow-of-information}{60%}}{Large amounts of data and high interconnection bandwidth mean that we receive much of our information about the world around us through computers.}{data-science-information-flow} -talk-macros.gpp}upply-chain/includes/supply-chain-africa.md} -talk-macros.gpp}l/includes/process-emulation.md} +\include{_supply-chain/includes/supply-chain-africa.md} +\include{_ml/includes/process-emulation.md} \newslide{Kapchorwa District} @@ -53,12 +53,12 @@ talk-macros.gpp}l/includes/process-emulation.md} \notes{Stephen Kiprotich, the 2012 gold medal winner from the London Olympics, comes from Kapchorwa district, in eastern Uganda, near the border with Kenya.} -talk-macros.gpp}l/includes/olympic-marathon-polynomial.md} -talk-macros.gpp}/_ml/includes/what-does-machine-learning-do.md} +\include{_ml/includes/olympic-marathon-polynomial.md} +\include{../_ml/includes/what-does-machine-learning-do.md} -talk-macros.gpp}l/includes/what-is-ml-2.md} -talk-macros.gpp}i/includes/ai-vs-data-science-2.md} -talk-macros.gpp}l/includes/neural-networks.md} +\include{_ml/includes/what-is-ml-2.md} +\include{_ai/includes/ai-vs-data-science-2.md} +\include{_ml/includes/neural-networks.md} \subsection{Machine Learning} \slides{ @@ -81,16 +81,16 @@ talk-macros.gpp}l/includes/neural-networks.md} 2. Unsupervised learning 3. Reinforcement learning } -talk-macros.gpp}l/includes/supervised-learning.md} +\include{_ml/includes/supervised-learning.md} \notes{ -talk-macros.gpp}l/includes/unsupervised-learning.md} -talk-macros.gpp}l/includes/reinforcement-learning.md} +\include{_ml/includes/unsupervised-learning.md} +\include{_ml/includes/reinforcement-learning.md} \notes{We have introduced a range of machine learning approaches by focusing on their use of mathematical functions to replace manually coded systems of rules. 
The important characteristic of machine learning is that the form of these functions, as dictated by their parameters, is determined by acquiring data from the real world.} -talk-macros.gpp}l/includes/deployment.md}} +\include{_ml/includes/deployment.md}} \thanks