(only slides / videos i.e. no code)
- familiarity with scope of most common ML applications in business or science
- understand difference between memorization and generalization
- familiarity with main machine learning concepts and vocabulary
Why and when? Example applications.
- Mention the iris example; pitch it as historical, but also as a
botanical and agricultural problem.
The benefit of this example is that it forces people to think about
measurement, and also that it has one class that is easy to separate
using only one feature
- Aurelie: real irises for the video? artificial flowers ordered (thanks Anne)
- The "adult" dataset
- Maybe look at it with Excel, to be in an environment familiar to people
- Mention the importance of data visualization: intuitions about the data can be very helpful
Descriptive vs predictive analysis
- Generalization (Out of sample properties)
- An example of where it makes a difference: if the data has redundant variables, such as expressing the education level as the name of the degree or the corresponding number of years of education
Learning from data vs expertly engineered decision rules
- On the iris example, show that cutting on one specific feature separates one class well
- How do we automate this? How do we achieve this on more complex data such as the census dataset?
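The two bullets above can be sketched in a few lines. This is a minimal illustration, not the course's notebook: the threshold of 2.5 cm is a hand-picked assumption made by looking at the data, and the depth-1 decision tree is just one possible way to let the data choose the cut automatically.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
petal_length = X[:, 2]        # third column: petal length (cm)
is_setosa = (y == 0).astype(int)  # class 0 is setosa

# Expert rule: a hand-picked threshold (an assumption made for illustration)
threshold = 2.5
predicted_setosa = (petal_length < threshold).astype(int)
rule_accuracy = (predicted_setosa == is_setosa).mean()
print(f"Hand-written rule accuracy on setosa vs rest: {rule_accuracy:.2f}")

# Learned rule: a depth-1 tree picks one feature and one threshold from the data
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, is_setosa)
learned_feature = iris.feature_names[stump.tree_.feature[0]]
learned_threshold = stump.tree_.threshold[0]
print(f"Learned rule: {learned_feature} <= {learned_threshold:.2f}")
```

On richer data such as the census dataset, no single hand-written cut is obvious, which is where learning the rule from data pays off.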
Generalization vs memorization: the need for a train / test split
- The nearest neighbors example to illustrate this
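One possible version of the nearest neighbors illustration is sketched below. The dataset is synthetic and the amount of label noise (`flip_y=0.2`) is an assumption chosen to make the gap visible; with a single neighbor, the model simply looks up each training point, so its training score says nothing about new data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 20% of labels flipped at random (illustrative assumption)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# With n_neighbors=1 each training point is its own nearest neighbor:
# the model memorizes the training set, noise included.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"Accuracy on the training set: {train_accuracy:.2f}")
print(f"Accuracy on the held-out test set: {test_accuracy:.2f}")
```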
Supervised vs Unsupervised
- Formalize supervised learning (define "X" and "y")
- Introduce unsupervised learning, for instance dimensionality reduction (and go back to the example of redundant variables: if we have many of these, we should be able to reduce the problem without even looking at y)
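The redundant-variable argument can be checked on a toy table. The sketch below uses made-up columns (`education_years`, `education_months`, `hours_per_week` are hypothetical stand-ins, not the adult dataset) and PCA as one example of dimensionality reduction. Note that no target `y` is ever used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical columns: the second is an exact re-expression of the first
education_years = rng.integers(8, 21, size=n_samples).astype(float)
education_months = 12 * education_years
hours_per_week = rng.normal(40, 10, size=n_samples)

# X is the data matrix: one row per sample, one column per feature
X = np.column_stack([education_years, education_months, hours_per_week])

# PCA only looks at X: this is unsupervised
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
explained = pca.explained_variance_ratio_

print("Share of variance per component:", np.round(explained, 3))
# The last component carries essentially no variance: three columns,
# but only two directions of real variation.
```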
Regression vs Classification
- In the adult data: it would make more sense to do a continuous prediction
- In the iris example, it is naturally a classification problem
Features and samples
- The data matrix
- Build the data matrix of Iris
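A short sketch of what building the Iris data matrix could look like with scikit-learn's bundled copy of the dataset (using `load_iris(as_frame=True)` is a choice made here for convenience, not necessarily what the slides will show):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X = iris.data    # data matrix: one row per flower (sample), one column per measurement (feature)
y = iris.target  # one label per flower

n_samples, n_features = X.shape
print(f"{n_samples} samples, {n_features} features")
print(X.head())
print(y.head())
```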
A few words about the style and scope of this course: it is centered around code, though we strive to keep it simple
Given a case study (e.g. pricing apartments based on a real estate website database) and a sample toy dataset: say whether it’s an application of supervised vs unsupervised, classification vs regression, what are the features, what is the target variable, what is a record.
Propose a hand-engineered decision rule that can be used as a baseline
Propose a quantitative evaluation of the success of this decision rule.
- load tabular data with pandas
- visualize marginal distribution with histograms
- visualize pairwise interactions with scatter plots
- identify outliers and the dynamic range of each column
Defining a predictive task that relates to the business or scientific case
Pandas read_csv
Simple exploratory data analysis with pandas and matplotlib
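The objectives of this part could be exercised along the following lines. To keep the sketch standalone, the CSV is held in memory (the Iris table written to a buffer) rather than read from a file on disk. The 1.5 × IQR rule for flagging outliers is one common convention assumed here, not something the outline prescribes.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop it in a notebook
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for a CSV file on disk: write the Iris table to an in-memory buffer
csv_buffer = io.StringIO()
load_iris(as_frame=True).frame.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)

# Load tabular data with pandas
df = pd.read_csv(csv_buffer)
print(df.dtypes)

# Dynamic range of each column: min and max rows of the summary
summary = df.describe()
print(summary.loc[["min", "max"]])

# Marginal distributions: one histogram per column
hist_axes = df.hist(figsize=(8, 6), bins=20)

# Pairwise interactions: scatter plots colored by the target
features = df.drop(columns="target")
scatter_axes = pd.plotting.scatter_matrix(features, c=df["target"], figsize=(8, 8))

# Candidate outliers per column, using the 1.5 * IQR convention (an assumption)
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)).sum()
print(outlier_counts)
```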
- Know the difference between a numerical and a categorical variable
- use a scaler
- convert category labels to dummy variables
- combine feature preprocessing and model with pipeline
- evaluate generalization of model with cross-validation
Prepare a train / test split
Basic model on numerical features only
Basic processing: missing values and scaling
Use a pipeline to evaluate model with cross-validation with and without scaling
Handling categorical variables with one-hot encoding
Use the column transformer to build pipeline with heterogeneous dtype
Model fitting and performance evaluation with cross-validation
- Gael thinks that we could use a video here for cross-validation (in particular, the "plot_cv_indices" in the notebook gets a bit in the way of being accessible and didactic)
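The steps listed above could fit together roughly as follows. The table is synthetic: the column names (`age`, `hours_per_week`, `workclass`) merely imitate the adult dataset and the target is generated for the example. The median imputation and logistic regression are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, adult-like table (all values are made up for illustration)
rng = np.random.default_rng(0)
n = 300
age = rng.integers(18, 70, size=n).astype(float)
hours = rng.normal(40, 10, size=n)
y = (age / 10 + hours / 10 + rng.normal(0, 1, size=n) > 8.5).astype(int)

df = pd.DataFrame({
    "age": age,
    "hours_per_week": hours,
    "workclass": rng.choice(["private", "public", "self-employed"], size=n),
})
df.loc[::25, "age"] = np.nan  # inject a few missing values

numerical_columns = ["age", "hours_per_week"]
categorical_columns = ["workclass"]

# Numerical columns: impute then scale. Categorical columns: one-hot encode.
preprocessor = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     numerical_columns),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
])

# Preprocessing and model in one pipeline, so each CV fold refits both
model = make_pipeline(preprocessor, LogisticRegression())
scores = cross_val_score(model, df, y, cv=5)
print(f"Accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Keeping the preprocessing inside the pipeline means the imputer, scaler and encoder are fitted on the training folds only.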
- Learn not to blindly trust the default parameters of scikit-learn estimators
Parameter tuning with Grid and Random hyperparameter search
Nested cross-validation
Confirmation of performance with final test set
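A compact sketch of the three steps above, on synthetic data. The grid over `C` and the choice of logistic regression are assumptions for the example; `RandomizedSearchCV` could replace `GridSearchCV` with the same overall structure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)

# Set aside a final test set that is not touched during tuning
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}  # illustrative grid
search = GridSearchCV(pipeline, param_grid, cv=5)

# Nested cross-validation: the outer loop evaluates the whole tuning procedure
nested_scores = cross_val_score(search, X_dev, y_dev, cv=5)
print(f"Nested CV accuracy: {nested_scores.mean():.2f} +/- {nested_scores.std():.2f}")

# Tune on all development data, then confirm once on the final test set
search.fit(X_dev, y_dev)
test_score = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Final test accuracy: {test_score:.2f}")
```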
- Understand decision rules for a few important algorithms
- Know how to diagnose model generalization errors (overfitting especially)
- How to use variable selection and regularization to fight overfitting
- Feature engineering to limit underfitting
Olivier: overfitting/underfitting, validation curves, learning curves, regularisation with linear models
- Video about overfitting?
- Notebook
- Slides
- Reviews
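For the overfitting/underfitting material, validation and learning curves might be computed as in the sketch below. The decision tree regressor, the `max_depth` range and the synthetic data are assumptions picked because depth gives an easy-to-read complexity axis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

# Validation curve: score as a function of model complexity
depths = np.arange(1, 11)
train_scores, valid_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for depth, tr, va in zip(depths, mean_train, mean_valid):
    print(f"max_depth={depth:2d}  train R2={tr:.2f}  validation R2={va:.2f}")

# Learning curve: score as a function of the number of training samples
train_sizes, lc_train, lc_valid = learning_curve(
    DecisionTreeRegressor(max_depth=3, random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
)
for size, tr, va in zip(train_sizes, lc_train.mean(axis=1), lc_valid.mean(axis=1)):
    print(f"n_train={size:3d}  train R2={tr:.2f}  validation R2={va:.2f}")
```

A training score that keeps rising with depth while the validation score stalls or drops is the overfitting signature to look for when reading such curves.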
- Notebook
- Slides
- Reviews
- Notebook
- Slides
- Reviews
Need to add regression plots.
Logistic regression, linear regression, classification vs regression, multi-class, linear separability. Pros and cons
L1 and L2 penalty for linear models
Learning curves and validation curves (video: how to read curves)
- Notebook
- Slides
- Reviews
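The practical difference between the L1 and L2 penalties listed above can be shown with `Lasso` and `Ridge`. These are regression models chosen here for simplicity; the data and the value `alpha=1.0` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 30 features, of which only 5 actually influence the target
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"Coefficients set exactly to zero with L1: {n_zero_lasso} / 30")
print(f"Coefficients set exactly to zero with L2: {n_zero_ridge} / 30")
```

The L1 penalty tends to zero out uninformative coefficients, which acts as a form of feature selection, whereas the L2 penalty only shrinks them.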
Binning / Polynomial feature extraction / Nystroem method
Feature selection to combat overfitting and speed-up models
Show a catastrophic example where feature selection is done on the whole dataset rather than only on the training set
- Notebook
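One way to stage the catastrophic feature selection example: pure-noise features and random labels, so any score clearly above chance can only come from leakage. The sizes (100 samples, 5000 features, 20 selected) are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # pure noise
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Wrong: select features using ALL samples, then cross-validate.
# The selection step has already seen the labels of the validation folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5)

# Right: put the selection inside the pipeline so it is refit on each training fold
pipeline = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
honest_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"Selection on the whole dataset: {leaky_scores.mean():.2f}")
print(f"Selection inside cross-validation: {honest_scores.mean():.2f}")
```

Since the labels are random, the honest estimate should hover around chance level, while the leaky one looks deceptively good.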
Failure mode: cardinality bias in the feature importances of an overfitting random forest
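A possible demonstration of this failure mode: append two pure-noise columns to a synthetic dataset, one continuous (many distinct values) and one binary, and compare impurity-based importances with permutation importances computed on held-out data. The data and column names are hypothetical; permutation importance is used here as one common alternative, not necessarily the one the notebook will adopt.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_info, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
random_continuous = rng.normal(size=(500, 1))                     # noise, high cardinality
random_binary = rng.integers(0, 2, size=(500, 1)).astype(float)   # noise, low cardinality
X = np.hstack([X_info, random_continuous, random_binary])
names = ["informative_0", "informative_1", "informative_2",
         "random_continuous", "random_binary"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown trees: the forest can fit the training set almost perfectly
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances are computed from the training data
impurity = forest.feature_importances_

# Permutation importances measured on held-out data
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for name, imp, p in zip(names, impurity, perm.importances_mean):
    print(f"{name:18s} impurity={imp:.3f}  permutation (test)={p:.3f}")
```

Both noise columns are equally useless, yet the continuous one offers many more candidate split points, so the impurity-based measure tends to credit it with more importance.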
Gael thinks that explaining the difference between conditional and marginal interpretation is important.
Stability of hyperparameters during cross-validation