This repository has been archived by the owner on May 23, 2022. It is now read-only.

New new interface #53

Closed
wants to merge 21 commits

Conversation

@darsnack
Member

commented Oct 20, 2021

Based on comments, the proposed interface here is basically the same as before with one small exception: authors of new types must specify the obsdim keyword. There is no longer any default rerouting that drops the keyword.

I also made a couple changes to reduce the complexity of the type hierarchy and added some defaults. Now, if a type doesn't define getobs(x, idx), it defaults to getindex(x, idx) (we can remove this if people don't like it). The only abstract types are now AbstractDataContainer and AbstractDataIterator. At minimum, a type must define getobs(x, idx) and nobs(x) to buy into the interface. If it also subtypes AbstractDataContainer, then getindex and iterate are defined for that type based on getobs. This should reduce the overhead / boilerplate of adopting this interface by default. I also added AbstractDataIterator though it doesn't serve a purpose just yet. This is mainly forward-looking to address things like: lorenzoh/DataLoaders.jl#26
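
To make the shape of the proposal concrete, here is a rough sketch of what opting in might look like for a hypothetical wrapper type (the type, its field, and the exact fallback behaviour are illustrative assumptions, not code from this PR):

using LearnBase

# hypothetical container storing observations as the columns of a matrix
struct ColumnData <: LearnBase.AbstractDataContainer
    features::Matrix{Float64}
end

# the two methods required to buy into the interface
LearnBase.nobs(d::ColumnData) = size(d.features, 2)
LearnBase.getobs(d::ColumnData, idx) = d.features[:, idx]

d = ColumnData(rand(3, 10))
d[1]    # getindex (and iterate) come for free via the AbstractDataContainer fallbacks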


I should also note that @racinmat has taken over the corresponding MLDataPattern.jl and MLLabelUtils.jl PRs to see this work through. I haven't had the bandwidth to work on it, so I appreciate their help finishing the job.

@darsnack
Member Author

I personally think the MLLabelUtils.jl interface needs a lot of cleaning, but for now it is copied over as-is, and we can save that for later work.

@CarloLucibello
Member

This PR needs a rebase.

Review thread on src/observation.jl (outdated, resolved).
@CarloLucibello
Member

I'm not sure I understand the keyword-arg problem this PR is trying to fix. I'm pretty sure, though, that I very much favour the clarity of getobs(a, 2, obsdim=1) compared to getobs(a, 2, 1). Regarding the two points in the OP:

  1. Is having to use an internal "shadow method" so bad?
  2. I don't understand this

@darsnack
Member Author

Shadowing is not bad, but I fear a situation where someone overrides _getobs instead of extending it. The alternative here is a runtime check over the various possible types of obsdim (instead of dispatch), which strikes me as un-Julian.

Since we had the routing

getobs(x, i; obsdim) = getobs(x, i)

If a type Foo does not define getobs at all, the error will be "keyword argument undefined" instead of a MethodError.

The alternative is to remove the default rerouting and force implementers to define getobs(x, i; obsdim) even if their type has no concept of an observation dimension.
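
A small sketch of the failure mode described above (illustrative only, not code from this PR):

getobs(x, i; obsdim) = getobs(x, i)   # only the generic routing method exists

struct Foo end
getobs(Foo(), 1)
# throws UndefKeywordError ("keyword argument `obsdim` not assigned")
# rather than a MethodError pointing at the missing getobs(::Foo, ::Int)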

@CarloLucibello
Member

Shadowing is not bad, but I fear a situation where someone overrides _getobs instead of extending it.

But we do want people to be able to have their own internal implementation:

module MyModule
import LearnBase

LearnBase.getobs(d::MyDataset, i; obsdim) = _getobs(d, i, obsdim)

# Some internal function of MyModule
_getobs(d, i, obsdim) = ...

end

I'm not sure I understand the concern.

Since we had the routing

getobs(x, i; obsdim) = getobs(x, i)

If a type Foo does not define getobs at all, the error will be "keyword argument undefined" instead of a MethodError.

I see: since getobs(x, i; obsdim) and getobs(x, i) are not two different methods, this is circular unless it dispatches to a specialized implementation.

The alternative is removing the default rerouting and to force implementers to define getobs(x, i; obsdim) even if their type does not have an observation dimension as a concept.

Yes, I think we should remove the routing and tell implementers to implement

getobs(x::MyType, i; obsdims=nothing) = ...

even if obsdims won't make sense in their case. The alternative is to be more lax and allow for the implementation

getobs(x::MyType, i) = ...

but then consumers of datasets, such as DataLoaders, cannot rely on the presence of obsdims. For DataLoaders this is fine (e.g. I didn't use it in FluxML/Flux.jl#1683), but maybe a lot of code in MLDataPattern will have to be changed?

@darsnack
Member Author

Okay, I have rebased and swapped back to the old interface. Array and tuple code has been migrated back over. I opted to use the implementations in MLDataPattern.jl for consistency, but we can drop the error checking if we want. We also need to decide whether or not to move the tst_container.jl tests over from MLDataPattern.jl wholesale.

@CarloLucibello
Member

Something that we could put in the pipeline for the next breaking release (not necessarily in this PR) is to stop pirating StatsBase.nobs and define our own LearnBase.numobs. I think it is not worth trying to adopt the interface of the Stats ecosystem; too much coordination work is needed for very little gain (see JuliaStats/StatsAPI.jl#3).

@darsnack
Member Author

Do we want every method to have to handle the nothing case? If default_obsdim is part of the interface, then there shouldn't be a reason for a user to explicitly pass in nothing.
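
For context, a sketch of the kind of default-based design this refers to (the concrete defaults below are illustrative assumptions, not this PR's implementation):

default_obsdim(x) = nothing                   # generic fallback: no observation dimension
default_obsdim(x::AbstractArray) = ndims(x)   # e.g. treat the last dimension as observations

# methods can then rely on the default instead of users passing `nothing` explicitly
getobs(A::AbstractMatrix, i; obsdim = default_obsdim(A)) =
    obsdim == 1 ? A[i, :] : A[:, i]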

Two review threads on src/labels.jl (outdated, resolved).
@nalimilan

Sorry to hijack this thread, but have you considered using the Tables.jl interface to represent data? It's very flexible and can be used not only with tabular types such as DataFrame, but also to wrap matrices (via Tables.table), vectors of named tuples, named tuples of vectors... More than 1,000 packages already implement that interface so you get interoperability for free, without defining any particular API.
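
For readers unfamiliar with Tables.jl, a tiny sketch of the matrix-wrapping case mentioned above (assuming the Tables package is installed):

using Tables

X = rand(5, 3)            # 5 observations (rows) × 3 variables (columns)
t = Tables.table(X)       # wrap the matrix as a table
Tables.istable(t)         # true
first(Tables.rows(t))     # the first observation, as a row object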

@ToucheSir

@nalimilan I think integration with Tables.jl would be nice, but it only handles a subset of the data needed for ML use cases. Higher-dimensional arrays, images, graphs, audio/waveforms, and more don't really fit a tabular interface.

@darsnack
Member Author

Note that the interface we are "refreshing" is itself an established interface (though maybe not as popular as Tables.jl).

@nalimilan

OK. I just hoped we could find consistent definitions for table-like cases across JuliaML and JuliaStats. In particular, it isn't great that Tables.jl and JuliaStats packages use rows as observations, while JuliaML packages use columns as observations by default.

@ToucheSir

I'm not familiar with how JuliaStats packages handle array inputs; do you mean they always slice/index the first dimension?

Either way, that shouldn't affect integration with Tables.jl because we've managed to successfully integrate both interfaces before. The code in FastAI could use some updating to work with Tables.[dict](row|column)table, for example, but that's an implementation concern rather than a fundamental interface disconnect.

RE coordination, I mentioned in JuliaML/MLUtils.jl#2 (comment) that it would be great to have folks from each org talking again. I'm not sure how these ecosystems became so siloed in the first place, but it benefits nobody to keep things that way. So if you're game, we could look into setting something up in the new year :)

@darsnack
Member Author

Either way, that shouldn't affect integration with Tables.jl because we've managed to successfully integrate both interfaces before.

Worth noting that this code should eventually be "standardized" in MLDatasets.jl as per JuliaML/MLDatasets.jl#73.

@nalimilan

I'm not familiar with how JuliaStats packages handle array inputs; do you mean they always slice/index the first dimension?

In general, when a matrix is passed, observations are in rows and variables in columns: e.g. cor and cov in Statistics and StatsBase compute the correlation between pairs of columns by default, lm/glm/fit in GLM expect the independent variables as a matrix with variables as columns, and when the @formula syntax from StatsModels is used instead, a Tables.jl object with observations as rows and variables as columns is expected.
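
A quick illustration of that convention (a sketch using the standard-library Statistics):

using Statistics

X = rand(100, 3)   # 100 observations as rows, 3 variables as columns
C = cor(X)         # pairwise correlations between the columns
size(C)            # (3, 3)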

Either way, that shouldn't affect integration with Tables.jl because we've managed to successfully integrate both interfaces before. The code in FastAI could use some updating to work with Tables.[dict](row|column)table, for example, but that's an implementation concern rather than a fundamental interface disconnect.

Cool!

RE coordination, I mentioned in JuliaML/MLUtils.jl#2 (comment) that it would be great to have folks from each org talking again. I'm not sure how these ecosystems became so siloed in the first place, but it benefits nobody to keep things that way. So if you're game, we could look into setting something up in the new year :)

Sure, let's do that. Some of the discussion could continue at JuliaStats/StatsAPI.jl#3, but Slack might be a good venue for less formal exchanges.

@devmotion

I was notified about this discussion, but since I'm not involved in this package at all, feel free to ignore my comment 🙃

I think it is quite annoying in general to have to propagate and support obsdim or dim keyword arguments in many code bases, and as a user it is also annoying to have to specify it multiple times in a script or function. I much prefer it when the dimension of the observations is specified by the inputs: regardless of how many functions I call with my data, the obsdim should always be the same, and conceptually it is a property of the data, not something that could or should be tunable in the function (in contrast to, e.g., the parameters of an algorithm). Usually this means that commonly ordered data can be viewed as an AbstractVector{T}, where T is the type of an individual observation. This concept is used and explained in JuliaGaussianProcesses: https://juliagaussianprocesses.github.io/KernelFunctions.jl/dev/design/. There's also a PR to Julia that adds EachCol and EachRow types for matrices with column and row vectors as observations, constructed by eachcol and eachrow (in fact, it even supports more general slices): JuliaLang/julia#32310. Of course, the same approach would also work for datasets of graphs and other non-array observations.
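
A small sketch of the "vector of observations" viewpoint described above (the shapes and values are purely illustrative):

X = rand(3, 100)      # feature dimension × observations, a common ML layout
obs = eachcol(X)      # each element is one observation (a 3-element column view)
length(obs)           # 100
first(obs)            # the first observation

# data laid out with observations as rows would use eachrow instead,
# so no obsdim keyword is ever needed
first(eachrow(permutedims(X)))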

For convenience, one could still support obsdim in user-facing functions (even though I usually prefer APIs that are simple, with only one recommended way of doing something), but in my experience, at least internally, it is much more convenient if one does not have to work with obsdim (keyword) arguments but only with vectors or collections of data.

An additional advantage of using vectors, and e.g. EachRow and EachCol instead of matrices, is that one does not have to make any opinionated choices about whether observations should be rows or columns by default. People tend to have strong opinions regarding this question, depending on their background, and there seem to be different conventions e.g. in stats and ML packages. However, one can avoid this discussion and default choices completely if users have to use eachrow and eachcol.

@ToucheSir

You're not the only one; see the prototyping work going on at JuliaML/MLUtils.jl#1.

@darsnack
Member Author

Closing in favor of #55.

@darsnack closed this Dec 29, 2021