Skip to content

Commit

Permalink
source commit: d30f2ed
Browse files Browse the repository at this point in the history
  • Loading branch information
actions-user committed Jan 24, 2024
0 parents commit 685829a
Show file tree
Hide file tree
Showing 79 changed files with 9,682 additions and 0 deletions.
474 changes: 474 additions & 0 deletions 1-introduction.md

Large diffs are not rendered by default.

785 changes: 785 additions & 0 deletions 2-keras.md

Large diffs are not rendered by default.

995 changes: 995 additions & 0 deletions 3-monitor-the-model.md

Large diffs are not rendered by default.

799 changes: 799 additions & 0 deletions 4-advanced-layer-types.md

Large diffs are not rendered by default.

161 changes: 161 additions & 0 deletions 5-outlook.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,161 @@
---
title: "Outlook"
teaching: 15
exercises: 15
---

::: questions
- "How does what I learned in this course translate to real-world problems?"
- "How do I organise a deep learning project?"
- "What are next steps to take after this course?"
:::

::: objectives
- "Understand that what we learned in this course can be applied to real-world problems"
- "Use best practices for organising a deep learning project"
- "Identify next steps to take after this course"
:::

You have come to the end of this course.
In this episode we will look back at what we have learned so far, how to apply that to real-world problems, and identify
next steps to take to start applying deep learning in your own projects.

## Real-world application
To introduce the core concepts of deep learning we have used quite simple machine learning problems.
But how does what we learned so far apply to real-world applications?

To illustrate that what we learned is actually the basis of succesful applications in research,
we will have a look at an example from the field of cheminformatics.

We will have a look at [this notebook](https://github.com/matchms/ms2deepscore/blob/0.4.0/notebooks/MS2DeepScore_tutorial.ipynb).
It is part of the codebase for [this paper](https://doi.org/10.1186/s13321-021-00558-4).

In short, the deep learning problem is that of finding out how similar two molecules are in terms of their molecular properties,
based on their mass spectrum.
You can compare this to comparing two pictures of animals, and predicting how similar they are.
A siamese neural network is used to solve the problem.
In a siamese neural network you have two input vectors, let's say two images of animals or two mass spectra.
They pass through a base network. Instead of outputting a class or number with one or a few output neurons, the output layer
of the base network is a whole vector of for example 100 neurons. After passing through the base network, you end up with two of these
vectors representing the two inputs. The goal of the base network is to output a meaningful representation of the input (this is called an embedding).
The next step is to compute the cosine similarity between these two output vectors,
cosine similarity is a measure for how similar two vectors are to each other, ranging from 0 (completely different) to 1 (identical).
This cosine similarity is compared to the actual similarity between the two inputs and this error is used to update the weights in the network.

Don't worry if you do not fully understand the deep learning problem and the approach that is taken here.
We just want you to appreciate that you already learned enough to be able to do this yourself in your own domain.

::: instructor
You don't have to use this project as an example.
It works best to use a suitable deep learning project that you know well and are passionate about.
:::
::: challenge
## Exercise: A real-world deep learning application

1. Looking at the 'Model training' section of the notebook, what do you recognize from what you learned in this course?
2. Can you identify the different steps of the deep learning workflow in this notebook?
3. (Optional): Try to understand the neural network architecture from the first figure of [the paper](https://doi.org/10.1186/s13321-021-00558-4).
a. Why are there 10.000 neurons in the input layer?
b. What do you think would happen if you would decrease the size of spectral embedding layer drastically, to for example 5 neurons?

:::: solution
## Solution
1. The model summary for the Siamese model is more complex than what we have seen so far,
but it is basically a repetition of Dense, BatchNorm, and Dropout layers.
The syntax for training and evaluating the model is the same as what we learned in this course.
EarlyStopping as well as the Adam optimizer is used.
2. The different steps are not as clearly defined as in this course, but you should be able to identify '3: Data preparation',
'4: Choose a pretrained model or start building architecture from scratch', '5: Choose a loss function and optimizer', '6: Train the model',
'7: Make predictions' (which is called 'Model inference' in this notebook), and '10: Save model'.
3. (optional)
a. Because the shape of the input is 10.000. More specifically, the spectrum is binned into a size 10.000 vector,
apparently this is a good size to represent the mass spectrum.
b. This would force the neural network to have a representation of the mass spectrum in only 5 numbers.
This representation would probably be more generic, but might fail to capture all the characteristics found in the spectrum.
This would likely result in underfitting.
::::
:::

Hopefully you can appreciate that what you learned in this course, can be applied to real-world problems as well.

::: callout
## Extensive data preparation
You might have noticed that the data preparation for this example is much more extensive than what we have done so far
in this course. This is quite common for applied deep learning projects. It is said that 90% of the time in a
deep learning problem is spent on data preparation, and only 10% on modeling!
:::

::: discussion
## Discussion: Large Language Models and prompt engineering
Large Language Models (LLMs) are deep learning models that are able to perform general-purpose language generation.
They are trained on large amounts of texts, such all pages of Wikipedia.
In recent years the quality of LLMs language understanding and generation has increased tremendously, and since the launch of generative chatbot ChatGPT in 2022 the power of LLMs is now appreciated by the general public.

It is becoming more and more feasible to unleash this power in scientific research. For example, the authors of [Zheng et al. (2023)](https://doi.org/10.1021/jacs.3c05819) guided ChatGPT in the automation of extracting chemical information from a large amount of research articles. The authors did not implement a deep learning model themselves, but instead they designed the right input for ChatGPT (called a 'prompt') that would produce optimal outputs. This is called prompt engineering. A highly simplified example of such a prompt would be: "Given compounds X and Y and context Z, what are the chemical details of the reaction?"

Developments in LLM research are moving fast, at the end of 2023 the newest ChatGPT version [could take images and sound as input](https://openai.com/blog/chatgpt-can-now-see-hear-and-speak).
In theory, this means that you can solve the Cifar-10 image classification problem from the previous episode by prompt engineering, with prompts similar to "Which out of these categories: [LIST OF CATEGORIES] is depicted in the image".

**Discuss the following statement with your neighbors:**

_In a few years most machine learning problems in scientific research can be solved with prompt engineering._
:::

## Organising deep learning projects
As you might have noticed already in this course, deep learning projects can quickly become messy.
Here follow some best practices for keeping your projects organized:

### 1. Organise experiments in notebooks
Jupyter notebooks are a useful tool for doing deep learning experiments.
You can very easily modify your code bit by bit, and interactively look at the results.
In addition you can explain why you are doing things in markdown cells.
- As a rule of thumb do one approach or experiment in one notebook.
- Give consistent and meaningful names to notebooks, such as: `01-all-cities-simple-cnn.ipynb`
- Add a rationale on top and a conclusion on the bottom of each notebook

[_Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks_](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007007) provides further advice on how to maximise the usefulness and reproducibility of experiments captured in a notebook.

### 2. Use Python modules
Code that is repeatedly used should live in a Python module and not be copied to multiple notebooks.
You can import functions and classes from the module(s) in the notebooks.
This way you can remove a lot of code definition from your notebooks and have a focus on the actual experiment.

### 3. Keep track of your results in a central place
Always evaluate your experiments in the same way, on the exact same test set.
Document the results of your experiments in a consistent and meaningful way.
You can use a simple spreadsheet such as this:

| MODEL NAME | MODEL DESCRIPTION | RMSE | TESTSET NAME | GITHUB COMMIT | COMMENTS |
|-------------------------|--------------------------------------------|------|---------------|---------------|----------|
| weather_prediction_v1.0 | Basel features only, 10 years. nn: 100-50 | 3.21 | 10_years_v1.0 | ed28d85 | |
| weather_prediction_v1.1 | all features, 10 years. nn: 100-50 | 3.35 | 10_years_v1.0 | 4427b78 | |

You could also use a tool such as [Weights and Biases](https://wandb.ai/site) for this.

::: callout
## Cookiecutter data science
If you want to get more pointers for organising deep learning, or data science projects in general,
we recommend [Cookiecutter data science](https://drivendata.github.io/cookiecutter-data-science/).
It is a template for initiating an organized data science project folder structure
that you can adapt to your own needs.
:::
## Next steps
You now understand the basic principles of deep learning and are able to implement your own deep learning pipelines in Python.
But there is still so much to learn and do!

Here are some suggestions for next steps you can take in your endeavor to become a deep learning expert:

* Learn more by going through a few of [the learning resources we have compiled for you](learners/reference.md#external-references)
* Apply what you have learned to your own projects. Use the deep learning workflow to structure your work.
Start as simple as possible, and incrementally increase the complexity of your approach.
* Compete in a [Kaggle competition](https://www.kaggle.com/competitions) to practice what you have learned.
* Get access to a GPU. Your deep learning experiments will progress much quicker if you have to wait for your network to train
in a few seconds instead of hours (which is the order of magnitude of speedup you can expect from training on a GPU instead of CPU).
Tensorflow/Keras will automatically detect and use a GPU if it is available on your system without any code changes.
A simple and quick way to get access to a GPU is to use [Google Colab](https://colab.google/)

::: keypoints
- "Although the data preparation and model architectures are somewhat more complex,
what we have learned in this course can directly be applied to real-world problems"
- "Use what you have learned in this course as a basis for your own learning trajectory in the world of deep learning"
:::
13 changes: 13 additions & 0 deletions CODE_OF_CONDUCT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
---
title: "Contributor Code of Conduct"
---

As contributors and maintainers of this project,
we pledge to follow the [The Carpentries Code of Conduct][coc].

Instances of abusive, harassing, or otherwise unacceptable behavior
may be reported by following our [reporting guidelines][coc-reporting].


[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html
[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
79 changes: 79 additions & 0 deletions LICENSE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
---
title: "Licenses"
---

## Instructional Material

All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry)
instructional material is made available under the [Creative Commons
Attribution license][cc-by-human]. The following is a human-readable summary of
(and not a substitute for) the [full legal text of the CC BY 4.0
license][cc-by-legal].

You are free:

- to **Share**---copy and redistribute the material in any medium or format
- to **Adapt**---remix, transform, and build upon the material

for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license
terms.

Under the following terms:

- **Attribution**---You must give appropriate credit (mentioning that your work
is derived from work that is Copyright (c) The Carpentries and, where
practical, linking to <https://carpentries.org/>), provide a [link to the
license][cc-by-human], and indicate if changes were made. You may do so in
any reasonable manner, but not in any way that suggests the licensor endorses
you or your use.

- **No additional restrictions**---You may not apply legal terms or
technological measures that legally restrict others from doing anything the
license permits. With the understanding that:

Notices:

* You do not have to comply with the license for elements of the material in
the public domain or where your use is permitted by an applicable exception
or limitation.
* No warranties are given. The license may not give you all of the permissions
necessary for your intended use. For example, other rights such as publicity,
privacy, or moral rights may limit how you use the material.

## Software

Except where otherwise noted, the example programs and other software provided
by The Carpentries are made available under the [OSI][osi]-approved [MIT
license][mit-license].

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Trademark

"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library
Carpentry" and their respective logos are registered trademarks of [Community
Initiatives][ci].

[cc-by-human]: https://creativecommons.org/licenses/by/4.0/
[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode
[mit-license]: https://opensource.org/licenses/mit-license.html
[ci]: https://communityin.org/
[osi]: https://opensource.org
Loading

0 comments on commit 685829a

Please sign in to comment.