
remove epochs and only use batches #689

Open · 5 tasks
tharvik opened this issue Jun 28, 2024 · 8 comments
Labels
discojs (Related to Disco.js), feature (New feature or request)

Comments

@tharvik
Collaborator

tharvik commented Jun 28, 2024

after discussion, it looks like epochs are not really needed and we can directly use batches, going from "round -> epoch -> batch" to "round -> batch". that would give more direct control on

  • rework datasets to generate samples (ie batches) of the loaded data (see the sketch after this list)
    • randomized over the whole loaded data so as to avoid skewing training
  • remove the limitation in gpt-tfjs of running at most five batches
    • superseded by the number of batches per round
  • remove TrainingInformation.epochs & EpochLogs
  • in Task, use rounds as the top-level count of runs, then batchesPerRound (renamed from roundDuration)
  • flatten generators from {Trainer,Model}.fit
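
a minimal sketch of the reworked dataset shape, assuming the loaded data is a plain array; `sampleBatches`, `batchSize` and `batchesPerRound` are illustrative names, not actual discojs API:

```ts
// Hypothetical shape of the "round -> batch" proposal: a dataset is consumed
// as an async generator of fixed-size batches, drawn at random from the
// loaded data to avoid skewing training. Not actual discojs code.
async function* sampleBatches<T>(
  data: T[],
  batchSize: number,
  batchesPerRound: number,
): AsyncGenerator<T[]> {
  for (let i = 0; i < batchesPerRound; i++)
    // each element is drawn uniformly at random over the whole loaded data
    yield Array.from(
      { length: batchSize },
      () => data[Math.floor(Math.random() * data.length)],
    );
}
```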
@JulienVig
Collaborator

Would it be possible to still support the concept of epochs somehow?
If I'm going to train a model on a dataset I will think in terms of epochs rather than batches (or rounds) for sure, so I would find it confusing and limiting not to be able to know how the number of batches I choose translates into epochs.
What about allowing the user to specify either batches or epochs? (annoying from an implementation standpoint but could be nice as UX?)

@tharvik
Collaborator Author

tharvik commented Jul 1, 2024

Would it be possible to still support the concept of epochs somehow? If I'm going to train a model on a dataset I will think in terms of epochs rather than batches (or rounds) for sure, so I would find it confusing and limiting

even though "epoch" is used throughout libraries, I don't think it is really important for training a model.
from a network perspective, we only need the clients to train for a certain amount of time on their data, not a specific number of epochs (nor batches, but that's for another time).
I have the feeling that I'm missing some deeper ML knowledge here, why do you find it limiting? does the model need to know that it has now seen "all the dataset" (which is the meaning of epoch for me)?

not be able to know how the number of batches I choose translates into epochs. What about allowing the user to specify either batches or epochs? (annoying from an implementation standpoint but could be nice as UX?)

this changes the concept of batch quite fundamentally: it would now be a fixed-size random extract of a dataset. I'll use sample from now on as I find it clearer.
there is not really a translation of samples to epochs, as it is random now. to have a probable (>=50%) epoch of the dataset, one could use

const sampleCount = epochCount * dataset.size / sampleSize

this way, we can also avoid having both implementations in discojs and only have to do the computation outside of discojs.
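
for concreteness, plugging made-up numbers into that formula (10,000 lines, batches of 32, a target of 2 epochs):

```ts
// made-up numbers, just to show the formula at work
const epochCount = 2;
const datasetSize = 10_000; // dataset.size
const sampleSize = 32;
const sampleCount = (epochCount * datasetSize) / sampleSize; // 625 samples
```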

@martinjaggi
Member

martinjaggi commented Jul 1, 2024 via email

@JulienVig
Collaborator

the user can specify their round duration either in epochs (but that should allow fractional values such as 0.2) or in batches=steps.

Yes! That's exactly what I meant

we only need the clients to train for a certain amount of time on their data

As a user, how I would choose what is a "certain amount of time" depends on the concept of epochs. Ideally I have a sizeable and manageable amount of data and I will want to train for exactly one epoch: I take advantage of all the data available and the model sees each data point only once, so less overfitting.

If I can only choose a number of batches (= samples), I will not know if the number of batches I choose represents more or less than one pass over the dataset.

In practice there's usually not enough data and I will want to do multiple passes, or I may have too much data and then I would like to do a fraction of an epoch (in which case specifying a number of samples would be useful).

Essentially, when I think about how much data I want the model to see, I reason in terms of number of passes over the dataset (= epochs) and not in terms of samples (= batches). That may be very personal, and that's why I think being able to choose would be nice.

@tharvik
Collaborator Author

tharvik commented Jul 2, 2024

okay, so we need support for both a partial dataset (sample-based) and a full dataset (one epoch). so when someone asks to train for

  • 1.2 epochs, that would mean one iteration over the whole dataset (each line once) plus a sampling of 20% of the dataset
  • 0.5 epochs, only a sampling of 50% of the dataset
  • 3 epochs, only three full iterations over the whole dataset

that does require that we change discojs itself, as we will in fact have two types, PartialDataset and FullDataset, both implementing Dataset (a batch generator). in the end, the training will only be on batches, so we will drop the explicit epoch layer and chain the various Dataset implementations (rough sketch below).

is that what you had in mind?
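
something like this, as a rough sketch (names mirror the proposal but none of this is actual discojs code; FullDataset assumes the data is already shuffled, see the shuffling discussion below):

```ts
interface Dataset<T> {
  batches(batchSize: number): AsyncGenerator<T[]>;
}

// one pass over every element, each line exactly once (a single epoch);
// assumes the data is already shuffled
class FullDataset<T> implements Dataset<T> {
  constructor(private readonly data: T[]) {}

  async *batches(batchSize: number): AsyncGenerator<T[]> {
    for (let i = 0; i < this.data.length; i += batchSize)
      yield this.data.slice(i, i + batchSize);
  }
}

// a random sampling of a fraction of the dataset (elements may repeat)
class PartialDataset<T> implements Dataset<T> {
  constructor(
    private readonly data: T[],
    private readonly fraction: number,
  ) {}

  async *batches(batchSize: number): AsyncGenerator<T[]> {
    const count = Math.floor((this.data.length * this.fraction) / batchSize);
    for (let i = 0; i < count; i++)
      yield Array.from(
        { length: batchSize },
        () => this.data[Math.floor(Math.random() * this.data.length)],
      );
  }
}

// "1.2 epochs" then chains one FullDataset with a PartialDataset of 20%
function forEpochs<T>(data: T[], epochs: number): Dataset<T>[] {
  const parts: Dataset<T>[] = [];
  for (let i = 0; i < Math.floor(epochs); i++)
    parts.push(new FullDataset(data));
  const fraction = epochs % 1;
  if (fraction > 0) parts.push(new PartialDataset(data, fraction));
  return parts;
}
```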

@JulienVig
Collaborator

Yes! I expect that most cases would be either a fraction less than one or an integer number of epochs though.

@martinjaggi
Member

martinjaggi commented Jul 2, 2024

just a comment on random sampling: either it should be done in both cases (full epochs and fractional ones), or not at all. in the latter case this means that we'd assume the dataset is shuffled already (if that's an assumption, it would be good to state it in the readmes and code). btw if it's shuffled, you don't need sampling but can just go with the first 20% of that ordered dataset.

so maybe it's easiest to do dataset shuffling in the preprocessing, or then not do any sampling/shuffling ever?

in terms of terminology, i'd say batch size is clearer than sample/sample size (more robust in meaning in all scenarios)

@tharvik
Collaborator Author

tharvik commented Jul 2, 2024

just a comment on random sampling: either it should be done in both cases (full epochs and fractional ones), or not at all.

in my understanding, sampling can potentially return a previously seen element in the same iteration (it might even return the same element twice in a single batch, with very low probability). so that's incompatible with a full epoch (all lines once).
I contrast that with shuffling, which can be applied to a full epoch and returns every element of the dataset once, but in a random order.
now with these definitions out of the way, I agree that every full dataset should be shuffled. and as partial datasets sample their elements, they're already random. does that make sense? (toy illustration below)
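
to make the distinction concrete, a toy version of both (not discojs code):

```ts
// sampling: draws with replacement, so the same element can show up again
// within one iteration, or even twice in one batch
function sample<T>(data: T[], n: number): T[] {
  return Array.from(
    { length: n },
    () => data[Math.floor(Math.random() * data.length)],
  );
}

// shuffling: a Fisher-Yates permutation, every element exactly once but in
// random order; note the full copy, which is the memory cost mentioned below
function shuffle<T>(data: T[]): T[] {
  const out = [...data];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

sample([1, 2, 3, 4], 4); // e.g. [3, 3, 1, 4]: repeats possible
shuffle([1, 2, 3, 4]); // e.g. [2, 4, 1, 3]: each element exactly once
```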

btw if it's shuffled, you don't need sampling but can just go with the first 20% of that ordered dataset.
so maybe it's easiest to do dataset shuffling in the preprocessing, or then not do any sampling/shuffling ever?

that means that the model will always train on the same part of the dataset, is that an issue?

FWIW, whole-dataset shuffling is a bit costly memory-wise, as we have to keep track of the remaining elements.

in terms of terminology, i'd say batch size is more clear than sample/sample size (more robust in meaning in all scenarios)

yep, I agree, batch makes more sense now.

tharvik added the feature (New feature or request) and discojs (Related to Disco.js) labels on Aug 26, 2024