This repository has been archived by the owner on May 23, 2022. It is now read-only.

New new interface #53

Closed
wants to merge 21 commits

Conversation

@darsnack
Member

commented Oct 20, 2021

Based on comments, the proposed interface here is basically the same as before with one small exception: authors of new types must specify the obsdim keyword. There is no longer any default rerouting that drops the keyword.

I also made a couple changes to reduce the complexity of the type hierarchy and added some defaults. Now, if a type doesn't define getobs(x, idx), it defaults to getindex(x, idx) (we can remove this if people don't like it). The only abstract types are now AbstractDataContainer and AbstractDataIterator. At minimum, a type must define getobs(x, idx) and nobs(x) to buy into the interface. If it also subtypes AbstractDataContainer, then getindex and iterate are defined for that type based on getobs. This should reduce the overhead / boilerplate of adopting this interface by default. I also added AbstractDataIterator though it doesn't serve a purpose just yet. This is mainly forward-looking to address things like: lorenzoh/DataLoaders.jl#26
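
To make the shape of the proposal concrete, here is a rough sketch of what opting in might look like for a hypothetical wrapper type (the type, its field, and the exact fallback behaviour are illustrative assumptions, not code from this PR):

using LearnBase

# hypothetical container storing observations as the columns of a matrix
struct ColumnData <: LearnBase.AbstractDataContainer
    features::Matrix{Float64}
end

# the two methods required to buy into the interface
LearnBase.nobs(d::ColumnData) = size(d.features, 2)
LearnBase.getobs(d::ColumnData, idx) = d.features[:, idx]

d = ColumnData(rand(3, 10))
d[1]    # getindex (and iterate) come for free via the AbstractDataContainer fallbacks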


I should also note that @racinmat has taken over the corresponding MLDataPattern.jl and MLLabelUtils.jl PRs to see this work through. I haven't had the bandwidth to work on it, so I appreciate their help finishing the job.

@darsnack
Member Author

I personally think the MLLabelUtils.jl interface needs a lot of cleaning, but for now it is copied over as-is, and we can save that for later work.

@CarloLucibello
Member

This PR needs a rebase.

Review thread on src/observation.jl (outdated, resolved).
@CarloLucibello
Member

I'm not sure I understand the keyword-arg problem this PR is trying to fix. I'm pretty sure, though, that I very much favour the clarity of getobs(a, 2, obsdim=1) compared to getobs(a, 2, 1). Regarding the two points in the OP:

  1. Is having to use an internal "shadow method" so bad?
  2. I don't understand this

@darsnack
Member Author

Shadowing is not bad, but I fear a situation where someone overrides _getobs instead of extending it. The alternative here is a runtime check over the various possible types of obsdim (instead of dispatch), which strikes me as un-Julian.

Since we had the routing

getobs(x, i; obsdim) = getobs(x, i)

If a type Foo does not define getobs at all, the error will be "keyword argument undefined" instead of a MethodError.

The alternative is to remove the default rerouting and force implementers to define getobs(x, i; obsdim) even if their type has no concept of an observation dimension.
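
A small sketch of the failure mode described above (illustrative only, not code from this PR):

getobs(x, i; obsdim) = getobs(x, i)   # only the generic routing method exists

struct Foo end
getobs(Foo(), 1)
# throws UndefKeywordError ("keyword argument `obsdim` not assigned")
# rather than a MethodError pointing at the missing getobs(::Foo, ::Int)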

@CarloLucibello
Member

Shadowing is not bad, but I fear a situation where someone overrides _getobs instead of extending it.

But we do want people to be able to have their own internal implementation:

module MyModule
import LearnBase

LearnBase.getobs(d::MyDataset, i; obsdim) = _getobs(d, i, obsdim)

# Some internal function of MyModule
_getobs(d, i, obsdim) = ...

end

I'm not sure I understand the concern.

Since we had the routing

getobs(x, i; obsdim) = getobs(x, i)

If a type Foo does not define getobs at all, the error will be "keyword argument undefined" instead of a MethodError.

I see: since getobs(x, i; obsdim) and getobs(x, i) are not two different methods, this is circular unless it dispatches to a specialized implementation.

The alternative is removing the default rerouting and to force implementers to define getobs(x, i; obsdim) even if their type does not have an observation dimension as a concept.

Yes, I think we should remove the routing and tell implementers to implement

getobs(x::MyType, i; obsdims=nothing) = ...

even if obsdims won't make sense in their case. The alternative is to be more lax and allow for the implementation

getobs(x::MyType, i) = ...

but then consumers of datasets, such as DataLoaders, cannot rely on the presence of obsdims. For DataLoaders this is fine (e.g. I didn't use it in FluxML/Flux.jl#1683), but maybe a lot of code in MLDataPattern will have to be changed?

@darsnack
Member Author

Okay, I have rebased and swapped back to the old interface. Array and tuple code has been migrated back over. I opted to use the implementations in MLDataPattern.jl for consistency, but we can drop the error checking if we want. We also need to decide whether or not to move the tst_container.jl tests over from MLDataPattern.jl wholesale.

@CarloLucibello
Member

Something that we could put in the pipeline for the next breaking release (not necessarily in this PR) is to stop pirating StatsBase.nobs and define our own LearnBase.numobs. I think it is not worth trying to adopt the interface of the Stats ecosystem; too much coordination work is needed for very little gain (see JuliaStats/StatsAPI.jl#3).

@darsnack
Member Author

Do we want every method to have to handle the nothing case? If default_obsdim is part of the interface, then there shouldn't be a reason for a user to explicitly pass in nothing.
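
For context, a sketch of the kind of default-based design this refers to (the concrete defaults below are illustrative assumptions, not this PR's implementation):

default_obsdim(x) = nothing                   # generic fallback: no observation dimension
default_obsdim(x::AbstractArray) = ndims(x)   # e.g. treat the last dimension as observations

# methods can then rely on the default instead of users passing `nothing` explicitly
getobs(A::AbstractMatrix, i; obsdim = default_obsdim(A)) =
    obsdim == 1 ? A[i, :] : A[:, i]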

Two review threads on src/labels.jl (outdated, resolved).
@nalimilan

Sorry to hijack this thread, but have you considered using the Tables.jl interface to represent data? It's very flexible and can be used not only with tabular types such as DataFrame, but also to wrap matrices (via Tables.table), vectors of named tuples, named tuples of vectors... More than 1,000 packages already implement that interface so you get interoperability for free, without defining any particular API.
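
For readers unfamiliar with Tables.jl, a tiny sketch of the matrix-wrapping case mentioned above (assuming the Tables package is installed):

using Tables

X = rand(5, 3)            # 5 observations (rows) × 3 variables (columns)
t = Tables.table(X)       # wrap the matrix as a table
Tables.istable(t)         # true
first(Tables.rows(t))     # the first observation, as a row object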

@ToucheSir

@nalimilan I think integration with Tables.jl would be nice, but it only handles a subset of the data needed for ML use cases. Higher-dimensional arrays, images, graphs, audio/waveforms, and more don't really fit a tabular interface.

@darsnack
Member Author

Note that the interface we are "refreshing" is itself an established interface (though maybe not as popular as Tables.jl).

@nalimilan

OK. I just hoped we could find consistent definitions for table-like cases across JuliaML and JuliaStats. In particular, it isn't great that Tables.jl and JuliaStats packages use rows as observations, while JuliaML packages use columns as observations by default.

@ToucheSir

I'm not familiar with how JuliaStats packages handle array inputs; do you mean they always slice/index the first dimension?

Either way, that shouldn't affect integration with Tables.jl because we've managed to successfully integrate both interfaces before. The code in FastAI could use some updating to work with Tables.[dict](row|column)table, for example, but that's an implementation concern rather than a fundamental interface disconnect.

RE coordination, I mentioned in JuliaML/MLUtils.jl#2 (comment) that it would be great to have folks from each org talking again. I'm not sure how these ecosystems became so siloed in the first place, but it benefits nobody to keep things that way. So if you're game, we could look into setting something up in the new year :)

@darsnack
Member Author

Either way, that shouldn't affect integration with Tables.jl because we've managed to successfully integrate both interfaces before.

Worth noting that this code should eventually be "standardized" in MLDatasets.jl as per JuliaML/MLDatasets.jl#73.

@nalimilan

I'm not familiar with how JuliaStats packages handle array inputs; do you mean they always slice/index the first dimension?

In general, when a matrix is passed, observations are in rows and variables in columns: e.g. cor and cov in Statistics and StatsBase compute the correlation between pairs of columns by default, lm/glm/fit in GLM expect the independent variables as a matrix with variables as columns, and when the @formula syntax from StatsModels is used instead, a Tables.jl object with observations as rows and variables as columns is expected.
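
A quick illustration of that convention (a sketch using the standard-library Statistics):

using Statistics

X = rand(100, 3)   # 100 observations as rows, 3 variables as columns
C = cor(X)         # pairwise correlations between the columns
size(C)            # (3, 3)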

Either way, that shouldn't affect integration with Tables.jl because we've managed to successfully integrate both interfaces before. The code in FastAI could use some updating to work with Tables.[dict](row|column)table, for example, but that's an implementation concern rather than a fundamental interface disconnect.

Cool!

RE coordination, I mentioned in JuliaML/MLUtils.jl#2 (comment) that it would be great to have folks from each org talking again. I'm not sure how these ecosystems became so siloed in the first place, but it benefits nobody to keep things that way. So if you're game, we could look into setting something up in the new year :)

Sure, let's do that. Some of the discussion could continue at JuliaStats/StatsAPI.jl#3, but Slack might be a good venue for less formal exchanges.

@devmotion

I was notified about this discussion, but since I'm not involved in this package at all, feel free to ignore my comment 🙃

I think it is quite annoying in general to have to propagate and support obsdim or dim keyword arguments in many code bases, and as a user it is also annoying to have to specify it multiple times in a script or function. I much prefer it when the dimension of the observations is specified by the inputs: regardless of how many functions I call with my data, the obsdim should always be the same, and conceptually it is a property of the data, not something that could or should be tunable in the function (in contrast to, e.g., the parameters of an algorithm). Usually this means that commonly ordered data can be viewed as an AbstractVector{T}, where T is the type of an individual observation. This concept is used and explained in JuliaGaussianProcesses: https://juliagaussianprocesses.github.io/KernelFunctions.jl/dev/design/. There's also a PR to Julia that adds EachCol and EachRow types for matrices with column and row vectors as observations, constructed by eachcol and eachrow (in fact, it even supports more general slices): JuliaLang/julia#32310. Of course, the same approach would also work for datasets of graphs and other non-array observations.
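
A small sketch of the "vector of observations" viewpoint described above (the shapes and values are purely illustrative):

X = rand(3, 100)      # feature dimension × observations, a common ML layout
obs = eachcol(X)      # each element is one observation (a 3-element column view)
length(obs)           # 100
first(obs)            # the first observation

# data laid out with observations as rows would use eachrow instead,
# so no obsdim keyword is ever needed
first(eachrow(permutedims(X)))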

For convenience, one could still support obsdim in user-facing functions (even though I usually prefer APIs that are simple, with only one recommended way of doing something), but in my experience, at least internally, it is much more convenient if one does not have to work with obsdim (keyword) arguments but only with vectors or collections of data.

An additional advantage of using vectors, and e.g. EachRow and EachCol instead of matrices, is that one does not have to make any opinionated choices about whether observations should be rows or columns by default. People tend to have strong opinions regarding this question, depending on their background, and there seem to be different conventions e.g. in stats and ML packages. However, one can avoid this discussion and default choices completely if users have to use eachrow and eachcol.

@ToucheSir

You're not the only one; see the prototyping work going on at JuliaML/MLUtils.jl#1.

@darsnack
Member Author

Closing in favor of #55.

@darsnack closed this Dec 29, 2021