Discussion: ingest! and update! #13

ablaom opened this issue Jan 25, 2023 · 3 comments
ablaom commented Jan 25, 2023

Comment of @jeremiedb, copied from #10:

I think I remain a little confused as to the extent to which these terms can translate unambiguously to the variety of algos and their implementations.

For a GBT / EvoTree:

fit: preprocess X / Y, create a cache, then apply grow_evotree! for some iterations
update!: essentially grow_evotree!; that is, add a tree to the model, assuming no change to the data (uses the cache). Some hyper-params may have changed (learning rate, regularization, ...) but not all (nbins couldn't, as it would require an expensive re-creation of the cache)
ingest!: continue training using new data. This is not a functionality supported in the current implementation. Is there an actual use case for which it would be expected to be supported for GBT?
Is the intent of ingest! to act as support for online learning? As I understand it, the intent of ingest! is that the effect of fit(x1, y1, m); ingest!(x2, y2, m) should be equivalent to fit(x3, y3), where x3 is the concatenation of x1 and x2. If such is the case, then I guess some extra information needs to be captured during fit for, say, a linear model to exhibit such behaviour, i.e. for fit + ingest to equal fit on the concatenated data.
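As a minimal sketch of that equivalence (the names `fit` and `ingest!`, and the struct, are illustrative, not the actual interface): a least-squares linear model can cache the sufficient statistics XᵀX and Xᵀy during `fit`, so that accumulating them in `ingest!` is exactly equivalent to fitting on the vertically concatenated data.

```julia
# Hypothetical illustration: a least-squares model whose `fit` caches the
# sufficient statistics XᵀX and Xᵀy. `ingest!` then accumulates the
# statistics for new rows, so fit + ingest! == fit on concatenated data.
mutable struct LinearModel
    XtX::Matrix{Float64}
    Xty::Vector{Float64}
    coefs::Vector{Float64}
end

function fit(X::Matrix{Float64}, y::Vector{Float64})
    XtX, Xty = X'X, X'y
    LinearModel(XtX, Xty, XtX \ Xty)
end

function ingest!(model::LinearModel, X::Matrix{Float64}, y::Vector{Float64})
    model.XtX .+= X'X          # accumulate sufficient statistics
    model.Xty .+= X'y
    model.coefs = model.XtX \ model.Xty
    model
end
```

Here the "extra information captured during fit" is just the two cached matrices; no original observations need to be retained.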

Is it assumed that for ingest! the new data keeps the same features, or could it be a subset / superset? The latter could be relevant in situations where one uses the initial model as an offset model, over which training could be performed, potentially on additional features, though I don't think such a mechanism would be the appropriate way to achieve this rather than explicit model stacking.

For neural nets, where a model is fed data through a DataLoader, I'm not too clear which of update! and ingest! best applies. Is each batch within an epoch to be considered new data? Or would ingest! only be used when a new DataLoader is built on new data?

I think the reason I find the update / ingest distinction unclear is that the difference in implications between the two verbs may have more to do with algorithm implementations, and whether they involve preprocessing / caching, than with truly distinct, generally applicable verbs.

For example, a GBT with the exact method (one which does not require data preprocessing) could be implemented using a stream / online approach. Each iteration could be fed with either entirely new data (having the same features) or just another subsampling of the original data. The situation is similar for neural nets, where I don't see a fundamental distinction between a batch from a fixed dataset and a batch coming from an entirely new one. And in all cases, I think there are some parameters that can be changed through both update and ingest, like learning rates and regularization, and others that can't, like the number of features or the size of hidden layers.

Perhaps this has already been done, but I'm wondering whether a clarification of the scope of algos / use cases supported by the framework would help. By that I mean making explicit the implications (is there any overhead, and in what circumstances) for a variety of algo families, notably:

Linear models
Neural Nets
Gradient boosted trees
Algos requiring a cache / initialization vs those that don't

Given the broadly different crowds that may feel concerned by the framework, it also comes with very different perspectives on the "natural" way of doing things and what appears to be a reasonable compromise (for instance, performance overhead is a big deal in my prod-oriented usage, but isn't for many research / educational ones).

ablaom commented Jan 26, 2023

Related discussion: JuliaAI/MLJ.jl#60

ablaom commented Sep 20, 2024

Thanks @jeremiedb for the thoughtful commentary.

I suppose that in any kind of "update", we provide one or more of the following:

  1. Changes to hyper-parameters (e.g., an increase in the iteration parameter)
  2. New training observations
  3. New training features

Any model can support 1, by simply retraining, from scratch, on the original data (provided as a fallback).

I cannot conceive of a need for any model to support 2 and 3 in the same update step.

edit I guess support for the remaining possibilities (2 only, 3 only, 1 + 2, or 1 + 3) can probably be optional.

Yes, increasing an iteration parameter (case 1) may be equivalent to providing previous observations as "new" ones (case 2). However, this fact does not render the distinction between 1 and 2 useless more generally.

My main use case for 2 is online learning. My main use case for 3 is linear models (see this comment).

I think the important issue you raise is what promises of behaviour should we make for the different update cases, where implemented. How about:

  • 1 only: (edited again) the update should be equivalent to training from scratch (on the original data) with the new hyperparameters

  • 2 only or 3 only: update should be equivalent to retraining with the original data concatenated with new data

Beyond this, I can't really think of a well-defined generic requirement, and so would leave that up to the implementation to explain in documentation.
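To make the "1 only" promise concrete, here is an illustrative sketch (the names and struct are hypothetical, not the actual interface): a deterministic gradient-descent fit in which `update!` with an increased iteration count warm-starts from the cached state, yet produces exactly the result of `fit` from scratch with the new iteration count.

```julia
# Hypothetical illustration of the "1 only" contract: bumping the
# iteration hyper-parameter in `update!` performs only the additional
# iterations, but the result equals a from-scratch `fit` with the new
# iteration count, because the procedure is deterministic.
mutable struct GDModel
    w::Float64
    n_iters::Int
    data::Vector{Float64}   # cached training data
end

# one gradient step on the mean-squared-error objective, fixed step size
step(w, data) = w - 0.1 * 2 * (w - sum(data) / length(data))

function fit(data::Vector{Float64}; n_iters::Int=10)
    w = 0.0
    for _ in 1:n_iters
        w = step(w, data)
    end
    GDModel(w, n_iters, data)
end

function update!(model::GDModel; n_iters::Int)
    for _ in (model.n_iters + 1):n_iters   # only the additional iterations
        model.w = step(model.w, model.data)
    end
    model.n_iters = n_iters
    model
end
```

Note the equivalence holds here only because the step size stays fixed; changing it in `update!` would break the from-scratch equivalence, which is the wrinkle addressed in the next comment.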

@jeremiedb What say you?

ablaom commented Oct 6, 2024

Okay, 1 as above doesn't allow for adding iterations while also applying a new learning rate to the new iterations. Perhaps:

1 only: For a single (or no) hyperparameter replacement, the update should be equivalent to training from scratch (on the original data) with the updated hyperparameter.
