CIFAR10 not having a license #445

Closed
psteinb opened this issue Feb 27, 2024 · 22 comments · Fixed by #448
Labels
Carpentries Lab (Needs to be fixed for Carpentries Lab), high priority (Need to be addressed ASAP), lesson-dev-sprint

Comments

@psteinb
Collaborator

psteinb commented Feb 27, 2024

As just discussed here, it might be worth replacing CIFAR-10 with another dataset.

@psteinb
Collaborator Author

psteinb commented Feb 27, 2024

I'd propose fashionMNIST. This should be as simple as

from tensorflow import keras
(train_images, train_labels), (test_images, test_labels) = keras.datasets.fashion_mnist.load_data()

@tobyhodges
Member

Thanks so much for the suggestion, @psteinb. As I mentioned on the review thread, I plan to spend some time trying to replace CIFAR-10 in the episode soon, hopefully next week. Your proposal seems like a great place to start. I will report back here on my progress...

@svenvanderburg
Collaborator

+1 for fashionMNIST. @tobyhodges let me know if you don't have time, then I will pick this up!

@svenvanderburg
Collaborator

@tobyhodges I would suggest first testing things out in one or more Jupyter notebooks and having someone review them, before changing the actual lesson material.

@psteinb
Collaborator Author

psteinb commented Feb 28, 2024

Re: fashionMNIST, while this is a nice dataset (easy to communicate about), there are two differences from CIFAR10:

  • different image shape: 28x28 instead of 32x32
  • fashionMNIST is greyscale instead of RGB

We should keep this in mind as it will affect teaching.
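
To make the teaching impact concrete, here is a minimal sketch (plain Python, not taken from the lesson) of how the two differences change the size of each input and the number of weights in a first convolutional layer. The 3x3 kernel and the 50 filters are assumed values chosen purely for illustration.

```python
def values_per_image(height, width, channels):
    """Number of pixel values a network receives for one image."""
    return height * width * channels


def conv2d_param_count(kernel_size, in_channels, filters):
    """Trainable parameters of a 2D convolution with a square kernel and biases."""
    weights = kernel_size * kernel_size * in_channels * filters
    biases = filters
    return weights + biases


# Image shapes as (height, width, channels).
cifar10_shape = (32, 32, 3)        # RGB
fashion_mnist_shape = (28, 28, 1)  # greyscale, channel axis made explicit

# Hypothetical first layer: 50 filters with a 3x3 kernel (illustrative values).
for name, shape in [("CIFAR-10", cifar10_shape), ("fashionMNIST", fashion_mnist_shape)]:
    n_values = values_per_image(*shape)
    n_params = conv2d_param_count(kernel_size=3, in_channels=shape[2], filters=50)
    print(f"{name}: {n_values} values per image, {n_params} first-layer parameters")
```

Only the first layer depends on the number of channels, so the rest of an architecture could in principle stay the same when switching datasets.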

@tobyhodges
Member

I tried to work through the episode using fashion-MNIST as suggested by @psteinb. You can see my process and results in https://github.com/carpentries-incubator/deep-learning-intro/blob/testing_fashion_MNIST/fashion_MNIST.ipynb

A summary of my observations:

  1. Please check my working. I might have implemented the convolutional layers incorrectly, in which case my results can be ignored! I needed to adjust the code to account for the single channel of the images, and I am not certain I did that correctly.
  2. The differences between the results of the different kinds of NN do not appear to be as stark when applied to this dataset. As such I think there are a few statements, plus possibly the final challenge where we vary the dropout rate, that might no longer hold/be interesting (again, assuming my implementation was correct!)
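
For reference, one common way to account for the single channel is to add an explicit channel axis, so that each image has shape (28, 28, 1) rather than (28, 28). The sketch below uses a zero-filled NumPy array as a stand-in for the real images. It is not the code from the linked notebook, and the variable names are hypothetical.

```python
import numpy as np

# Stand-in for greyscale images loaded as (n_images, height, width).
train_images = np.zeros((5, 28, 28), dtype=np.uint8)

# Convolutional layers usually expect (n_images, height, width, channels),
# so append a channel axis of length 1.
train_images_with_channel = train_images[..., np.newaxis]

# The matching per-image input shape for the model would then be (28, 28, 1).
input_shape = train_images_with_channel.shape[1:]
print(train_images.shape, "->", train_images_with_channel.shape)
print("input shape:", input_shape)
```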

@svenvanderburg
Collaborator

@tobyhodges thank you for working on this! This looks great actually 👍 Seems like you are having fun with keras :)

  1. That looks good
  2. Yes, you are right that the differences are not very strong. It seems like the first simple model is actually a really good model that is hard to improve. The val accuracy of 0.856 of the first model is actually not beaten in the rest of the episode. I will see if I can tweak things a little bit to make this fit better in the storyline. Or we can try other datasets, for example the original MNIST (but this is an even simpler problem, so I don't think it will help) or https://data.caltech.edu/records/mzrjq-6wc02 (seems to have a CC-BY 4.0 license but I suspect that the images are crawled).

@svenvanderburg
Collaborator

OK, seems like the fashionMNIST dataset is a bit of a weird ML problem. It is very hard to overfit, and regularization only results in worse performance in the end ;) I tried some approaches to force it to overfit (bigger models, smaller dataset) but without good results.
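
One way to put a number on "hard to overfit" is to compare training and validation accuracy at the end of training. The sketch below assumes a dictionary shaped like the `history` attribute that Keras returns from `model.fit` when accuracy is tracked. The values are made up for illustration and are not results from this thread.

```python
def generalization_gap(history):
    """Final training accuracy minus final validation accuracy.

    A large positive gap suggests overfitting; a gap near zero suggests
    the model generalises about as well as it fits the training data.
    """
    return history["accuracy"][-1] - history["val_accuracy"][-1]


# Made-up per-epoch accuracies, for illustration only.
barely_overfitting = {"accuracy": [0.80, 0.86, 0.88], "val_accuracy": [0.79, 0.85, 0.86]}
clearly_overfitting = {"accuracy": [0.60, 0.85, 0.99], "val_accuracy": [0.55, 0.60, 0.58]}

for name, history in [("barely", barely_overfitting), ("clearly", clearly_overfitting)]:
    print(f"{name} overfitting: gap = {generalization_gap(history):.2f}")
```

Regularization such as dropout can only help when the gap is large to begin with, which may be why it made things worse here.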

I am looking into other datasets, but oh my god there are so many CC-BY licensed datasets that are actually just crawled from the web.

@tobyhodges
Member

> there are so many CC-BY licensed datasets that are actually just crawled from the web

I think we should put a callout into the episode that addresses this TBH

@svenvanderburg
Collaborator

I am now looking into the Dollar Street dataset. It is from Gapminder, so it fits nicely with the Carpentries philosophy!

@colinsauze
Member

The dollar street dataset is 101GB! Is there a subset of it available?

@svenvanderburg
Collaborator

@colinsauze I haven't found a subset, but I just downloaded it and if it works well with the lesson I will make a subset available with low-res pictures.

@tobyhodges
Member

If the license permits, we can publish a subset on FigShare, Zenodo, or similar.

@svenvanderburg
Collaborator

svenvanderburg commented Mar 18, 2024

Dollar street dataset

Check out my notebook using the dollar street dataset for episode 4.

Results

  • Simple CNN, val accuracy: 0.26
  • Simple CNN with dropout, val accuracy: 0.33 and overfitting is reduced a bit
  • Pretrained SOTA CNN, val accuracy: 0.67 and barely overfitting
  • Dense neural network, val accuracy: 0.17
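
The reported numbers can be put side by side with a few lines of Python. This is only arithmetic on the validation accuracies listed above; the dictionary keys are shorthand labels, not names from the notebook.

```python
# Validation accuracies as reported in the list above.
val_accuracy = {
    "dense": 0.17,
    "simple_cnn": 0.26,
    "simple_cnn_dropout": 0.33,
    "pretrained_cnn": 0.67,
}

# Rank the models from worst to best.
ranking = sorted(val_accuracy, key=val_accuracy.get)

# Improvement of the pretrained model relative to the CNN with dropout.
pretrained_vs_dropout = val_accuracy["pretrained_cnn"] / val_accuracy["simple_cnn_dropout"]

print("ranking:", ranking)
print(f"pretrained vs. dropout CNN: {pretrained_vs_dropout:.2f}x")
```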

Conclusion

This dataset really allows us to demonstrate all our points:

  • CNNs work better on image data than dense networks
  • Dropout reduces overfitting
  • Pretrained models with a large, established CNN architecture work really well on image data

Next steps

  • I think with a few small adaptations to the story in episode 4 we can use this dataset
  • The episode will not end very satisfactorily: even with dropout we only get around 30% accuracy. This would be a nice bridge to episode 5 on transfer learning, where we show that a pretrained neural network can more than double the accuracy in this case.
  • I will upload the data to Zenodo or Figshare in a format that loads as NumPy arrays of train and val images and labels (or should the images be stored as JPEG?).
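
As a sketch of the NumPy option, the arrays could be bundled into a single compressed `.npz` file and loaded back with one call. The file name, array names, image size and label values below are all hypothetical; this is not the layout of the published dataset.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays; the real data would be the resized images and their labels.
train_images = np.zeros((8, 64, 64, 3), dtype=np.uint8)
train_labels = np.arange(8) % 4
val_images = np.zeros((2, 64, 64, 3), dtype=np.uint8)
val_labels = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dollar_street_subset.npz")  # hypothetical name

    # Store all four arrays in one compressed archive.
    np.savez_compressed(
        path,
        train_images=train_images,
        train_labels=train_labels,
        val_images=val_images,
        val_labels=val_labels,
    )

    # Loading returns a dict-like object keyed by the names used above.
    with np.load(path) as data:
        loaded_train_images = data["train_images"]
        loaded_val_labels = data["val_labels"]

print(loaded_train_images.shape, loaded_val_labels)
```

With this layout learners would need only NumPy to get started, whereas storing JPEG files would bring image decoding back into the episode.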

@colinsauze @psteinb @dsmits @tobyhodges what do you think? I think we should choose between FashionMNIST and dollar street

@psteinb
Collaborator Author

psteinb commented Mar 18, 2024 via email

@tobyhodges
Member

This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it.

One note: I saw you used PIL for loading and re-sizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python?

@tobyhodges
Member

Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review?

@svenvanderburg
Collaborator

> This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it.
>
> One note: I saw you used PIL for loading and re-sizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python?

The data preprocessing will not be done in the course. The starting point of the episode will be to load the data in its preprocessed form, so that we can focus on deep learning instead of image data wrangling.

@svenvanderburg
Collaborator

> Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review?

I want to get this through, it has been hanging for so long now. So I will draft something today. But I will organise a new sprint soon to pick up the remaining maintenance issues. I hope that's OK?

@svenvanderburg
Collaborator

I'm working on it here: #448
The data is located here: https://zenodo.org/records/10837090

I hope to continue this next week. If anyone wants to pick this up in the meantime, you are welcome! (Or start on the transfer learning episode, for example, which would be really nice to add now.)

@svenvanderburg
Collaborator

Argh... I have very little time for this now. I plan to pick this up again on the 15th and 16th of April.

@tobyhodges
Member

If you are happy for me to commit to your branch, @svenvanderburg, I can try to step in and make some further changes?


4 participants