CIFAR10 not having a license #445
Comments
I'd propose fashionMNIST. This should be as simple as keras.datasets.fashion_mnist.load_data() |
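A minimal sketch of what that one-liner could look like in practice. `keras.datasets.fashion_mnist.load_data()` is the call named in the comment above; the helper names `load_fashion_mnist` and `describe_splits`, and the optional `loader` argument, are illustrative assumptions added here so that the shapes can be inspected without a download. They are not part of the lesson material.

```python
def load_fashion_mnist(loader=None):
    """Return ((train_images, train_labels), (test_images, test_labels)).

    `loader` is a hypothetical hook for this sketch: pass any zero-argument
    callable that returns the same nested tuple, e.g. for offline use.
    """
    if loader is None:
        # Imported lazily so the sketch can be read without TensorFlow installed.
        from tensorflow import keras

        # Downloads the dataset on first use and caches it locally.
        loader = keras.datasets.fashion_mnist.load_data
    return loader()


def describe_splits(splits):
    """Collect the array shapes of both splits into a dictionary."""
    (train_images, train_labels), (test_images, test_labels) = splits
    return {
        "train_images": tuple(train_images.shape),
        "train_labels": tuple(train_labels.shape),
        "test_images": tuple(test_images.shape),
        "test_labels": tuple(test_labels.shape),
    }
```

Calling `describe_splits(load_fashion_mnist())` with the default loader needs TensorFlow and network access for the first download. The images are 28x28 grayscale, unlike the 32x32 colour images in CIFAR-10, so any model input shape in the episode would need adjusting.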
Thanks so much for the suggestion, @psteinb. As I mentioned on the review thread, I plan to spend some time trying to replace CIFAR-10 in the episode soon, hopefully next week. Your proposal seems like a great place to start. I will report back here on my progress... |
+1 for fashionMNIST. @tobyhodges let me know if you don't have time, then I will pick this up! |
@tobyhodges I would suggest first testing things out in one or more Jupyter notebooks and having someone review them, before changing the actual lesson material. |
Re: fashionMNIST, while this is a nice dataset (easy to communicate about), there are two differences to CIFAR10:
|
I tried to work through the episode using fashion-MNIST as suggested by @psteinb. You can see my process and results in https://github.com/carpentries-incubator/deep-learning-intro/blob/testing_fashion_MNIST/fashion_MNIST.ipynb A summary of my observations:
|
@tobyhodges thank you for working on this! This looks great actually 👍 Seems like you are having fun with keras :)
|
OK, it seems the fashionMNIST dataset is a bit of a weird ML problem. It is very hard to overfit, and regularization only results in worse performance in the end ;) I tried some approaches to force it into overfitting (bigger models, smaller dataset) but got no good results. I am looking into other datasets, but oh my god, there are so many CC-BY licensed datasets that are actually just crawled from the web. |
I think we should put a callout into the episode that addresses this TBH |
I am now looking into the dollar street dataset, it is from gapminder so fits nicely in the carpentries philosophy! |
The dollar street dataset is 101GB! Is there a subset of it available? |
@colinsauze I haven't found a subset, but I just downloaded it and if it works well with the lesson I will make a subset available with low-res pictures. |
If the license permits, we can publish a subset on FigShare, Zenodo, or similar. |
Dollar street dataset
Check out my notebook using the dollar street dataset for episode 4.
Results
Conclusion
This dataset really allows us to demonstrate all our points:
Next steps
@colinsauze @psteinb @dsmits @tobyhodges what do you think? I think we should choose between FashionMNIST and dollar street |
Use the dollar street sign dataset. Looking back at this conversation,
this choice would not require too much adaptation with respect to text.
Best,
P
|
This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it. One note: I saw you used PIL for loading and resizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python. |
Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review? |
The data preprocessing will not be done in the course. The starting point of the episode will be to load the data in its preprocessed form, so that we focus on deep learning instead of image data wrangling. |
I want to get this through, it has been hanging for so long now. So I will draft something today. But I will organise a new sprint soon to pick up the remaining maintenance issues. I hope that's OK? |
I'm working on it here: #448. I hope to continue this next week. If anyone wants to pick this up in the meantime, you are welcome! (Or start on the transfer learning episode, for example, which would be really nice to add now.) |
Argh... I have very little time for this now. I plan to pick this up again on the 15th and 16th of April. |
If you are happy for me to commit to your branch, @svenvanderburg, I can try to step in and make some further changes? |
As just discussed here, it might be worth considering replacing CIFAR-10 with some other dataset.