CIFAR10 not having a license #445

psteinb · 2024-02-27T07:41:39Z

As just discussed here, it might be worth considering replacing CIFAR-10 with some other dataset.

psteinb · 2024-02-27T07:42:26Z

I'd propose fashionMNIST. This should be as simple as

keras.datasets.fashion_mnist.load_data()

tobyhodges · 2024-02-27T12:10:28Z

Thanks so much for the suggestion, @psteinb. As I mentioned on the review thread, I plan to spend some time trying to replace CIFAR-10 in the episode soon, hopefully next week. Your proposal seems like a great place to start. I will report back here on my progress...

svenvanderburg · 2024-02-28T06:56:23Z

+1 for fashionMNIST. @tobyhodges, let me know if you don't have time, then I will pick this up!

svenvanderburg · 2024-02-28T07:15:26Z

@tobyhodges I would suggest first testing things out in one or more Jupyter notebooks and having someone review them, before changing the actual lesson material.

psteinb · 2024-02-28T10:41:58Z

Re: fashionMNIST, while this is a nice dataset (easy to communicate about), there are two differences to CIFAR10:

different image shape: 28x28 instead of 32x32
fashionMNIST is greyscale instead of RGB

We should keep this in mind as it will affect teaching.
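To make the two differences concrete, here is a minimal sketch of how greyscale images could be brought into the same channels-last layout as RGB images before being passed to a convolutional layer. The arrays are random stand-ins with the shapes of the two datasets (not real images), and the helper name `to_channels_last` is hypothetical.

```python
import numpy as np

# Stand-in arrays with the shapes of the two datasets (random data, not real images).
rng = np.random.default_rng(0)
fashion_batch = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)   # greyscale, no channel axis
cifar_batch = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)  # RGB, 3 channels


def to_channels_last(images):
    """Scale pixel values to [0, 1] and ensure a trailing channel axis exists."""
    images = images.astype("float32") / 255.0
    if images.ndim == 3:  # (n, height, width): add a single greyscale channel
        images = images[..., np.newaxis]
    return images


fashion_ready = to_channels_last(fashion_batch)
cifar_ready = to_channels_last(cifar_batch)

print(fashion_ready.shape)  # (8, 28, 28, 1)
print(cifar_ready.shape)    # (8, 32, 32, 3)
```

With this layout, the input shape of the first layer would change from (32, 32, 3) to (28, 28, 1), which is the kind of adjustment the lesson text would need to explain.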

tobyhodges · 2024-03-11T10:43:38Z

I tried to work through the episode using fashion-MNIST as suggested by @psteinb. You can see my process and results in https://github.com/carpentries-incubator/deep-learning-intro/blob/testing_fashion_MNIST/fashion_MNIST.ipynb

A summary of my observations:

Please check my working. I might have implemented the convolutional layers incorrectly, in which case my results can be ignored! I needed to adjust the code to account for the single channel of the images, and I am not certain I did that correctly.
The differences between the results of the different kinds of NN do not appear to be as stark when applied to this dataset. As such I think there are a few statements, plus possibly the final challenge where we vary the dropout rate, that might no longer hold/be interesting (again, assuming my implementation was correct!)

svenvanderburg · 2024-03-11T12:01:51Z

@tobyhodges thank you for working on this! This looks great actually 👍 Seems like you are having fun with keras :)

That looks good
Yes, you are right that the differences are not very strong. It seems like the first simple model is actually a really good model that is hard to improve. The val accuracy of 0.856 of the first model is actually not beaten in the rest of the episode. I will see if I can tweak things a little bit to make this fit better in the storyline. Or we can try other datasets, for example the original MNIST (but this is an even simpler problem, so I don't think it will help) or https://data.caltech.edu/records/mzrjq-6wc02 (seems to have a CC-BY 4.0 license but I suspect that the images are crawled).

svenvanderburg · 2024-03-11T12:52:52Z

OK, seems like the fashionMNIST dataset is a bit of a weird ML problem. It is very hard to overfit, and regularization only results in worse performance in the end ;) I tried some approaches to force it into overfitting (bigger models, smaller dataset), but without good results.

I am looking into other datasets, but oh my god there are so many CC-BY licensed datasets that are actually just crawled from the web.

tobyhodges · 2024-03-11T12:53:52Z

> there are so many CC-BY licensed datasets that are actually just crawled from the web

I think we should put a callout into the episode that addresses this TBH

svenvanderburg · 2024-03-11T14:14:18Z

I am now looking into the Dollar Street dataset. It is from Gapminder, so it fits nicely with the Carpentries philosophy!

colinsauze · 2024-03-11T14:31:50Z

The dollar street dataset is 101GB! Is there a subset of it available?

svenvanderburg · 2024-03-11T14:50:14Z

@colinsauze I haven't found a subset, but I just downloaded it and if it works well with the lesson I will make a subset available with low-res pictures.

tobyhodges · 2024-03-11T14:52:53Z

If the license permits, we can publish a subset on FigShare, Zenodo, or similar.

svenvanderburg · 2024-03-18T13:09:08Z

Dollar street dataset

Check out my notebook using the Dollar Street dataset for episode 4.

Results

Simple CNN, val accuracy: 0.26
Simple CNN with dropout, val accuracy: 0.33 and overfitting is reduced a bit
Pretrained SOTA CNN, val accuracy: 0.67 and barely overfitting
Dense neural network, val accuracy: 0.17
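As a quick sanity check on how these numbers relate to each other, the snippet below compares each validation accuracy with that of the simple CNN with dropout. The accuracies are the ones reported above; the dictionary keys are hypothetical labels chosen for this sketch.

```python
# Validation accuracies as reported in the results above.
val_accuracy = {
    "dense": 0.17,
    "simple_cnn": 0.26,
    "simple_cnn_dropout": 0.33,
    "pretrained_cnn": 0.67,
}

# Use the simple CNN with dropout as the reference point.
baseline = val_accuracy["simple_cnn_dropout"]
for name, accuracy in val_accuracy.items():
    print(f"{name:>20}: {accuracy:.2f} ({accuracy / baseline:.2f}x the dropout CNN)")

# Ratio between the pretrained model and the dropout CNN.
pretrained_gain = val_accuracy["pretrained_cnn"] / baseline
print(f"Pretrained vs. dropout CNN: {pretrained_gain:.2f}x")
```

The pretrained model comes out at slightly more than twice the accuracy of the dropout CNN.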

Conclusion

This dataset really allows us to demonstrate all our points:

CNNs work better on image data than dense networks
Dropout reduces overfitting
Pretrained models with a large, established CNN architecture work really well on image data

Next steps

I think with a few small adaptations to the story in episode 4 we can use this dataset.
The episode will not end very satisfactorily: even with dropout we only get about 30% accuracy. This would be a nice bridge to episode 5 on transfer learning, where we show that a pretrained neural network can more than double the accuracy in this case.
I will upload the data to Zenodo or Figshare in a format that loads directly as NumPy arrays of train and validation images and labels (or should we store the images as JPEG?).
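One possible shape for such a NumPy-based format is a single compressed `.npz` archive holding all four arrays. The sketch below is only an illustration under assumptions: the file name, array names, image size (64x64 RGB) and number of classes are placeholders, and the data is random rather than taken from the actual dataset.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: the sizes and label range are assumptions for this sketch.
train_images = rng.integers(0, 256, size=(10, 64, 64, 3), dtype=np.uint8)
train_labels = rng.integers(0, 10, size=10)
val_images = rng.integers(0, 256, size=(4, 64, 64, 3), dtype=np.uint8)
val_labels = rng.integers(0, 10, size=4)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; one archive keeps images and labels together.
    archive_path = Path(tmp_dir) / "dollar_street_subset.npz"
    np.savez_compressed(
        archive_path,
        train_images=train_images,
        train_labels=train_labels,
        val_images=val_images,
        val_labels=val_labels,
    )

    # Learners would start from here: load the preprocessed arrays by name.
    with np.load(archive_path) as archive:
        loaded = {name: archive[name] for name in archive.files}

print(loaded["train_images"].shape)  # (10, 64, 64, 3)
print(loaded["val_labels"].shape)    # (4,)
```

Compared with a folder of JPEG files, an archive like this needs no image-decoding step in the lesson, at the cost of being less convenient to browse by eye.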

@colinsauze @psteinb @dsmits @tobyhodges what do you think? I think we should choose between FashionMNIST and dollar street

psteinb · 2024-03-18T13:25:39Z

Use the Dollar Street dataset. Looking back at this conversation, this choice would not require too much adaptation of the text. Best, P

tobyhodges · 2024-03-18T13:51:32Z

This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it.

One note: I saw you used PIL for loading and re-sizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python?

tobyhodges · 2024-03-18T13:52:45Z

Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review?

svenvanderburg · 2024-03-19T07:49:55Z

> This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it.
>
> One note: I saw you used PIL for loading and re-sizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python?

The data preprocessing will not be done in the course. The starting point of the episode will be to load the data in its preprocessed form, so that we can focus on deep learning instead of image data wrangling.

svenvanderburg · 2024-03-19T07:51:44Z

> Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review?

I want to get this through, it has been hanging for so long now. So I will draft something today. But I will organise a new sprint soon to pick up the remaining maintenance issues. I hope that's OK?

svenvanderburg · 2024-03-19T16:03:00Z

I'm working on it here: #448
The data is located here: https://zenodo.org/records/10837090

I hope to continue this next week. If anyone wants to pick this up in the meantime, you are welcome! (Or start on the transfer learning episode, for example, which would be really nice to add now.)

svenvanderburg · 2024-04-02T12:28:40Z

Argh... I have very little time for this now. I plan to pick this up again on the 15th and 16th of April.

tobyhodges · 2024-04-02T12:31:48Z

If you are happy for me to commit to your branch, @svenvanderburg, I can try to step in and make some further changes?

svenvanderburg mentioned this issue Feb 28, 2024

[Review]: Introduction to deep learning carpentries-lab/reviews#25

Open

5 tasks

svenvanderburg mentioned this issue Mar 19, 2024

Use dollar street dataset #448

Merged

svenvanderburg added the Carpentries Lab, high priority, and lesson-dev-sprint labels Apr 23, 2024

svenvanderburg added this to deep-learning-intro planning & tracking Apr 23, 2024

svenvanderburg moved this to In Progress in deep-learning-intro planning & tracking Apr 23, 2024

svenvanderburg moved this from In Progress to To Review in deep-learning-intro planning & tracking Apr 23, 2024

carschno mentioned this issue Apr 29, 2024

Add section on hyperparameter tuning #437

Merged

svenvanderburg closed this as completed in #448 Apr 29, 2024

github-project-automation bot moved this from To Review to Done in deep-learning-intro planning & tracking Apr 29, 2024
