layout | title | description |
---|---|---|
default | Lectures and Class Material | Links to the pre-recorded lectures and material |
Lecture material: Link to the parent GitHub Repository.
Back to home QLS612 website
Slack workspace QLS612 slack
Instructor: JB Poline
Outline
With this lecture, you will get a general introduction to reproducible - or irreproducible - life sciences. Specifically, you will
- learn what is meant by reproducibility of research results in the life sciences
- understand the main causes of irreproducible results
- learn the possible collective and individual actions for curbing irreproducibility
Material: GitHub Link
Pre-recorded lecture video: YouTube Link
Slides: Slides
Lecture Resources
- Canonical paper: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript
Questions you will be able to answer after taking this module:
- Is the term “replicability” generally applied to obtaining the same results with another (new) dataset?
- Is the root cause of irreproducibility the publication incentive?
- What do we call obtaining a similar result with the same methodology or pipeline but different data?
Instructor: Jacob Sanz-Robinson
Outline
To follow most of the other modules you will need a basic understanding of the command line. In this module we'll take a look at the Bourne Again SHell (Bash), the default command-line shell on most Linux systems. You will learn how to:
- move around on your computer with the command line, create and open directories and files
- find things with the command line (files and programs, PATH variables)
- run useful command line programs and find help (find, grep, ls, and man / documentation)
Materials:
Pre-recorded lecture video: YouTube Link
Slides: Slides
Questions you will be able to answer after taking this module:
- What is a command line shell?
- How would you copy thousands of files with file names starting with "my_good_file..." to a different directory on your computer?
- Among thousands of files and directories you know there is one where you wrote down "location of my thesis backup". How do you find this file?
- What is an environment variable and how can you change it?
Instructor: Jacob Sanz-Robinson and Michelle Wang
Outline
- This lecture is designed to get students up and running with Python. It is expected that Python 3 (preferably 3.7 or later) is installed, and that the students have some basic previous experience in a scripting language.
- It will guide students through the fundamental syntax, concepts, and data structures required to code in Python 3.
- Topics include: Running your code, commenting, variables, arithmetic, logic, strings, lists, tuples, dictionaries, functions, libraries, if statements, loops, exceptions, and classes.
Material: GitHub Link
Pre-recorded lecture video: YouTube Link
Questions you will be able to answer after taking this module:
(1) How does the use of a ‘break’ statement alter the flow of a loop in Python?
(2) What happens if you attempt to append new elements to a Tuple?
(3) Without running the code on your machine, what is the printed output when the following code is run?
my_dictionary = {"a" : 1, "b" : {"c" : {"d" : [4,5,6,4]}}, "c" : [1,2,3]}
x = my_dictionary["b"]["c"]["d"].append(my_dictionary["c"][-3])
print(my_dictionary.values())
- a) [1, {'c': {'d': [4, 5, 6, 4]}}, [1, 2, 3]]
- b) [1, {'c': {'d': [4, 5, 6, 4, 1]}}, [1, 2, 3]]
- c) [1, [4,5,6,4,1], [1,2,3]]
- d) [1, [4,5,6,4], [1,2,3]]
(4) Without running the code on your machine, which string is returned by my_function when called with the specified parameters?
def my_function(x, y, z):
    result = ""
    if len(z) <= 6 and len(z) > 2:
        result = z[-2] + y
    else:
        result = x + y
    return x + x + result

my_function("111", "abc", "0100")
- a) ‘1111110abc’
- b) ‘0abc111111’
- c) ‘111111bca0’
- d) ‘1111111110’
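Questions (1) and (2) can be explored with a short experiment. The following is a minimal sketch, not taken from the lecture material: the function names `first_negative` and `try_append` are hypothetical and serve only to show how `break` leaves a loop early and how a tuple reacts to an append attempt.

```python
# Illustrative sketch; the function names are hypothetical.

def first_negative(values):
    """Return the first negative number in values, or None if there is none."""
    found = None
    for value in values:
        if value < 0:
            found = value
            break  # leave the loop immediately; later items are never visited
    return found


def try_append(container, item):
    """Try to append item and report whether the container supports it."""
    try:
        container.append(item)
        return True
    except AttributeError:
        # Tuples are immutable and define no append method.
        return False


print(first_negative([3, 1, -4, -1, 5]))  # -4
print(try_append([1, 2], 3))              # True
print(try_append((1, 2), 3))              # False
```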
Instructor: Tristan Glatard
Outline
This lecture will introduce NumPy, Pandas, and SciPy, three of the main libraries in the scientific Python ecosystem. At the end of the lecture, participants will be able to:
- Manipulate arrays of numbers with NumPy
- Manipulate data frames with Pandas
- Apply numerical methods from the scientific Python ecosystem
Materials: GitHub Link
Lecture Resources
- A Visual Intro to NumPy and Data Representation by Jay Alammar, up to "Transposing and Reshaping".
- Pandas DataFrame introduction
- Pandas read-write tutorial
- SciPy introduction
- SciPy IO tutorial
Questions you will be able to answer after taking this module:
(1) NumPy's main data structure is a Python list
- True
- False
(2) Pandas's main data structure is a 2D table
- True
- False
(3) A Pandas Series is a one-dimensional array
- True
- False
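One way to check your answers to the questions above is to inspect the objects directly. This is a minimal sketch with made-up data (the variable names `scores`, `table`, and `column` are hypothetical), assuming NumPy and Pandas are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the lecture material.
scores = np.array([[1.0, 2.0], [3.0, 4.0]])       # NumPy n-dimensional array
table = pd.DataFrame(scores, columns=["a", "b"])  # labelled 2D table
column = table["a"]                               # a single column is a Series

# Print each object's type name and its number of dimensions.
print(type(scores).__name__, scores.ndim)
print(type(table).__name__, table.ndim)
print(type(column).__name__, column.ndim)
```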
Instructor: Kendra Oudyk
Outline
Git and GitHub are key tools for doing version control in both academia and industry. These tools can help students do more efficient, open, and reproducible research. Further, knowing these tools can help prepare students for careers in academia and industry. In this lecture, students will learn
- What is version control and why has it become so important in science and industry;
- How to track and share their own work using Git and GitHub; and
- How to collaborate and contribute to open projects using Git and GitHub.
Materials: GitHub Link
Pre-recorded lecture video: YouTube Link
Slides: Slides
Questions you will be able to answer after taking this module:
- In a ________ version control system, individuals have the entire repository and its history in their local repository.
- a) Centralized
- b) Distributed
- What is the basic workflow for tracking a change and sharing it on GitHub?
- a) git commit, git add, git push
- b) git pull, git add, git push
- c) git add, git commit, git push
- How do you start a parallel line of development, in order to do nonlinear version control?
- a) make a tag
- b) start a new branch
- c) create a remote repository
- How do you make a copy of another GitHub repo on your GitHub account?
- a) git clone <repo address>
- b) go to the repo's GitHub page and click "fork"
- c) go to the repo's GitHub page and open an issue to ask for a copy
- d) go to the repo's GitHub page and do a pull request
Instructor: Nadia Blostein
This module is designed to introduce students to data preprocessing (i.e., preparation) in Python. Data preprocessing is a critical prerequisite to any data analysis or machine learning application. Students will be preprocessing .csv and .png data from the following repository and the session will cover the topics below:
Outline
- Load and examine your data
- Data reformatting
- Data filtering
- Data transforms
- Data visualization
- Examining and manipulating 2D images with scikit-image and SciPy
Materials: GitHub Link
Pre-recorded lecture video: YouTube Link
Lecture Resources
- One-hot encoding
- 10 Python image manipulation tools
- 6 Different Ways to Compensate for Missing Values In a Dataset
- Imputation of mixed data with multilevel singular value decomposition
- Understanding the Difference Between Normalization vs. Standardization
Questions you will be able to answer after taking this module:
- What is a problem that can arise when you one-hot encode a feature with a lot of categories?
- What Python library can you use to generate histograms?
- If you are using a Gaussian filter to blur an image, which of the following sigma values will blur your image the most: 0.1, 2, 4, 5, 6?
- Which Python package is faster for matrix computations: Pandas or NumPy?
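To make the one-hot encoding question concrete, here is a minimal sketch on made-up data. The column name `site` and its values are hypothetical, and `pandas.get_dummies` is only one of several ways to one-hot encode.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature with four distinct values.
df = pd.DataFrame({"site": ["A", "B", "C", "D", "A", "B"]})

# get_dummies creates one indicator column per distinct category.
encoded = pd.get_dummies(df, columns=["site"])

print(list(encoded.columns))
print(encoded.shape)
```

The number of output columns equals the number of distinct categories, which is worth keeping in mind for features with many levels.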
Instructor: Nikhil Bhagwat
Outline
- Define machine-learning nomenclature
- Describe basics of the “learning” process
- Explain model design choices and performance trade-offs
- Introduce model selection and validation frameworks
- Explain model performance metrics
Materials: GitHub Link
Pre-recorded lecture video: YouTube Link
Slides: Slides
Lecture Resources
Questions you will be able to answer after taking this module:
- Model training - what is under/over-fitting?
- Model selection - what is (nested) cross-validation?
- Model evaluation - what are type-1 and type-2 errors?
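For the model evaluation question, the two error types can be counted by hand for a binary classifier. This is a minimal sketch with invented labels, where 1 is the positive class; the helper `error_counts` is hypothetical and not part of the lecture code.

```python
def error_counts(y_true, y_pred):
    """Count type-1 (false positive) and type-2 (false negative) errors."""
    # Type-1 error: the true label is negative but the prediction is positive.
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Type-2 error: the true label is positive but the prediction is negative.
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_pos, false_neg


# Invented labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 1]

type1, type2 = error_counts(y_true, y_pred)
print(type1, type2)  # 2 1
```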
Instructor: Jérôme Dockès
Outline
- Learn how to properly select a machine-learning model, set hyperparameters, and evaluate prediction performance.
- Understand the challenges of learning from high-dimensional data and learn about tools to mitigate the issue.
Materials: GitHub Link
Pre-recorded lecture video: YouTube Link
Questions you will be able to answer after taking this module:
- I am predicting continuous cognitive scores of 1,000 participants using 20,000 brain imaging features. I use least-squares regression. What is regularization and why do I need it?
- I decide to use ridge regression (l2 regularization). How can I set the regularization hyperparameter?
- I also add a dimensionality reduction step to my model: PCA. I do 5-fold cross-validation, and I perform a full grid-search, using 3 folds for the inner validation loop. I use a grid of 3 options for the number of PCA components and 6 options for the ridge hyperparameter. How many times (at least) will I need to fit a PCA?
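The setup described in the last question (PCA followed by ridge regression, tuned by an inner grid search inside an outer cross-validation loop) can be sketched with scikit-learn. This is an illustration under assumptions rather than the lecture's own code: the synthetic data, its much smaller size, and the specific grid values are all made up.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Small synthetic stand-in for "few participants, many features" (hypothetical sizes).
X, y = make_regression(n_samples=60, n_features=200, noise=1.0, random_state=0)

# Dimensionality reduction followed by L2-regularized regression.
model = Pipeline([("pca", PCA()), ("ridge", Ridge())])

# 3 options for the number of PCA components and 6 for the ridge penalty
# (the values themselves are assumptions chosen for illustration).
param_grid = {
    "pca__n_components": [2, 5, 10],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
}

# Inner loop: a 3-fold grid search picks hyperparameters on each training split.
inner_search = GridSearchCV(model, param_grid, cv=3)

# Outer loop: 5-fold cross-validation scores the whole tuning procedure.
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(len(outer_scores))  # 5
```

Nesting the search inside the outer loop keeps hyperparameter selection from seeing the data used to report performance.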
Instructor: Jonathan Armoza
Outline
- This module will teach students fundamental concepts of data visualization and familiarize them with several graphing libraries in Python (Matplotlib, Seaborn, Plot.ly, Bokeh) with the goals of using visualizations as a tool to understand data and creating graphics for multiple science contexts.
- It will guide students through the process of familiarizing themselves with graphing libraries, and choosing plots that display the information accurately and clearly.
- It will provide students with a perspective on best practices for visualization design.
Materials: GitHub Link
Pre-recorded lecture video: YouTube Link
Slides: Slides
Questions you will be able to answer after taking this module:
- Which plot types are best to visualize scalar, categorical, or distributional data? How does the answer to that question change if the data relationship being plotted is univariate vs multivariate?
- What are a few best practices for visualization design that balance clarity and consideration for the audience?
- Why would I choose to generate static visualizations vs interactive ones?
- Which Python graphing libraries are most efficient to do so? And what are some of the capabilities of each?
- Is a data visualization an objective research output?
Instructor: Sebastian Urchs
Outline
- Learn why containerization and virtualization are important for research projects.
- Have an overview of different solutions to create isolated environments.
- Get some basic hands-on experience with Python virtual environments and Docker.
Materials: GitHub Link
Pre-recorded lecture video: YouTube Link
Slides: Slides
Lecture Resources
- The Visual Display of Quantitative Information by Edward R. Tufte
- Gapminder
- Lev Manovich
Questions you will be able to answer after taking this module: (to check your understanding of the pre-recorded materials)
- When working with the file system inside a Docker container, which statements are true?
- I cannot see files on the host system from inside the container
- files written into the container file system are lost with the container
- I can mount paths on the host system into the container to expose their contents to it
- What is an advantage of Docker over a Virtual Machine?
- a Docker container can run any operating system, independently of the host operating system
- Docker is a good choice for shared systems because of its high level of security
- Docker containers are easier to specify, build, and manage and have better sharing infrastructure
- What is the difference between a Docker container and a Docker image?
- A Docker container is a registry service to store and share Docker images
- A Docker image is a read-only snapshot and a Docker container is a running instance of it
- A Docker container is a read-only snapshot that can be easily shared (e.g. on Dockerhub) and from it, many live Docker images can be spawned
- What is an advantage conda has over pip for Python environments?
- conda is usually prepackaged with Python, so you don't have to install anything
- conda has more Python packages than pip because of the Anaconda distribution
- conda can resolve non-Python dependencies and can also create virtual environments
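As a small taste of the Python virtual environments mentioned in the outline, the standard-library `venv` module can create an isolated environment programmatically. This is a minimal sketch, not the workflow shown in the lecture: the directory name `demo-env` is hypothetical, and the environment is created in a temporary directory that is deleted afterwards.

```python
import tempfile
import venv
from pathlib import Path

# Create a throwaway environment in a temporary directory (hypothetical location).
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo-env"
    # with_pip=False keeps the sketch fast and offline; real projects usually want pip.
    venv.EnvBuilder(with_pip=False).create(env_dir)
    # Every environment created by venv has a pyvenv.cfg file at its root.
    config_exists = (env_dir / "pyvenv.cfg").exists()

print(config_exists)
```

In day-to-day use the same thing is usually done from the command line with `python -m venv <directory>`.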
Instructor: Brent McPherson
Outline
- Learn the key facts about High Performance Computing (HPC) and Cloud computing
- Understand the advantages and the constraints of HPC
- Learn the key concepts and practical bash commands to get started on the Compute Canada HPC
Materials: GitHub Link
Pre-recorded lecture video: YouTube Link
Slides: Slides
Questions you will be able to answer after taking this module:
- Choose the area that Advanced Research Computing traditionally does not include
- a) HPC/Clusters
- b) Research Data Management
- c) Cloud Computing
- d) Video Games
- Choose all components that are part of an HPC Compute Node
- a) Processor/Core
- b) Display/Monitor
- c) Memory
- d) Mouse
- e) Local Disk
- Choose all ways to access an HPC Cluster
- a) Secure shell to a Login Node
- b) Secure shell to a Compute Node
- c) Secure transfer to a Data Transfer Node