This repository is forked and adapted from a four-part workshop developed by AashitaK at Harvey Mudd College.
The workshop series focuses on the practical aspects of machine learning. We will be working in Python and using real-world datasets from Kaggle, the machine learning platform most suited for the “learn-by-doing” philosophy. The series is aimed at complete beginners to machine learning who are familiar with Python, but it is also designed adaptively so that you will be challenged even if you have some familiarity with machine learning tools.
Session 1: Setup: Python and GitHub
- Installing software: Python, Jupyter, Git
- Navigation and other basic commands in the terminal
- Working with the p-ai-org GitHub
- Python:
  - variables
  - data structures (numbers, lists, strings)
  - control flow (if, while, for)
  - functions
  - importing packages
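
As a rough preview of the Python topics listed for Session 1, here is a minimal sketch that touches each one. The variable names, the scores, and the helper functions `mean` and `label` are made up for illustration and are not taken from the workshop materials.

```python
import math  # importing a package from the standard library

# Variables and basic data structures
topic = "machine learning"      # string
scores = [72, 88, 95, 61]       # list of numbers (hypothetical values)


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = 0
    for v in values:            # for loop over a list
        total += v
    return total / len(values)


def label(score):
    """Map a score to a text label using if/elif/else."""
    if score >= 90:
        return "high"
    elif score >= 70:
        return "medium"
    else:
        return "low"


average = mean(scores)
labels = [label(s) for s in scores]

# while loop: count down from the number of scores to zero
remaining = len(scores)
while remaining > 0:
    remaining -= 1

print(topic, average, labels, math.sqrt(16))
```

Running the file with `python` should print the topic, the average score, the list of labels, and the square root computed by `math.sqrt`.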
Session 2: Data analysis: numpy and pandas (tentative topic list! probably to be reduced)
- Pandas dataframes as the data structure for datasets
- Converting CSV files to dataframes
- Slicing and indexing dataframes using conditionals as well as the iloc and loc indexers
- Statistical summary and exploration of dataframes
- Detecting and filling missing values in the dataframes
- Regular expressions for data extraction
- Feature engineering such as creating new features
- Basic statistical plots using matplotlib and seaborn
- Correlation among features
- Basic operations such as dropping rows/columns, setting index, replacing values of a column using a dictionary, etc.
- Split-apply-combine operations by grouping rows of a dataframe
- Encoding categorical variables
- Concatenating and merging dataframes
- More operations such as sorting the rows, creating a dataframe from scratch, etc.
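
Several of the Session 2 topics can be sketched in a few lines of pandas. The example below is only an illustration: the CSV content, column names, and values are invented stand-ins for a real Kaggle dataset, and it assumes pandas is installed.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real dataset file
csv_text = """name,city,age,fare
Ann,Paris,34,10.0
Bob,Lyon,,7.5
Cat,Paris,29,12.5
Dan,Lyon,41,
"""

# Converting CSV data to a dataframe
df = pd.read_csv(io.StringIO(csv_text))

# Detecting and filling missing values
missing_per_column = df.isna().sum()
df["age"] = df["age"].fillna(df["age"].median())
df["fare"] = df["fare"].fillna(df["fare"].mean())

# Slicing and indexing with a conditional, iloc, and loc
over_30 = df[df["age"] > 30]
first_row = df.iloc[0]
paris_names = df.loc[df["city"] == "Paris", "name"]

# Split-apply-combine: mean fare per city
mean_fare_by_city = df.groupby("city")["fare"].mean()

# Encoding a categorical variable as indicator columns
encoded = pd.get_dummies(df, columns=["city"])

# Statistical summary of the numeric columns
print(df.describe())
print(mean_fare_by_city)
```

With a real dataset, `pd.read_csv` would typically be given a file path instead of an in-memory string.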
Session 3: Model Building, Tuning and Validation using Scikit-learn (tentative topic list! definitely to be reduced)
- Overfitting and underfitting of models
- Regression algorithms
  - Linear Regression
  - Polynomial Regression
  - Ridge Regression
  - Lasso Regression
- Model Validation
- Tuning regularization parameter
- Evaluation metrics for regression - R-squared and Root Mean-Squared Error (RMSE)
- Normalization and scaling of features
- Classification algorithms
  - Logistic Regression
  - Decision Trees
  - k-Nearest Neighbors
  - Support Vector Machines
  - Random Forests
- Evaluation metrics for classification
  - Classification accuracy
  - Confusion matrix
  - Decision Threshold
  - Precision and Recall
  - F1 score
  - Area Under ROC curve
- Dimensionality reduction (Optional)
  - Principal Component Analysis (PCA)
- k-fold Cross-validation
- Maximum Voting Classifiers
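
To give a flavor of how a few Session 3 topics fit together, the sketch below scales features, tunes the Ridge regularization parameter with 5-fold cross-validation, and reports R-squared and RMSE on a held-out test set. It uses synthetic data from `make_regression` rather than a workshop dataset, and the grid of `alpha` values is an arbitrary choice for illustration. It assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (a stand-in for a real dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out a test set so the final evaluation uses unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features, then fit Ridge regression
pipeline = make_pipeline(StandardScaler(), Ridge())

# Candidate regularization strengths (arbitrary illustrative grid)
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation on the training set to pick alpha
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test set
y_pred = search.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
best_alpha = search.best_params_["ridge__alpha"]

print("best alpha:", best_alpha)
print("R-squared:", r2)
print("RMSE:", rmse)
```

Putting the scaler inside the pipeline means it is refit on each training fold during cross-validation, so no information from the validation fold leaks into the scaling step.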
Andrew Chen, Alex Ker, Corrine Donnay, Chanha Kim, Hannah Zucker
Instructor: Aashita Kesarwani
TAs: Rex Asabor, Ben Langton and Qualan Woodard