
metadata method #22

Closed
Tokazama opened this issue May 23, 2020 · 70 comments · Fixed by #48

Comments

@Tokazama

It is often the case that one wants to attach metadata of some sort to an array/graph/etc. How do people feel about adding something basic like metadata(x) = nothing that can then be extended by other packages?
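The proposal above can be sketched in a few lines. This is a hypothetical illustration (the names are assumed, not actual DataAPI.jl code): a generic fallback that returns `nothing`, which a downstream package extends for its own types.

```julia
# Generic fallback, as proposed: no metadata unless a package opts in.
metadata(x) = nothing

# A downstream package could then extend it for its own types:
struct MetaVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]
metadata(v::MetaVector) = v.meta

v = MetaVector([1, 2, 3], Dict{Symbol,Any}(:unit => "kg"))
metadata(v)[:unit]   # "kg"
metadata([1, 2, 3])  # nothing: plain vectors carry no metadata
```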

@bkamins
Member

bkamins commented May 23, 2020

@pdeffebach had some good design ideas about it in DataFrames.jl in the past.
Now, finally after 0.21.0 release, we are planning to add this functionality to DataFrames.jl.

As this is raised on a higher level let me give the API I envision for DataFrames.jl for now:

  • metadata(::DataFrame) returns a Union{Nothing, Dict{Symbol,Any}} that, if filled, gives DataFrame-level metadata (this can be arbitrary metadata). The restriction would be that symbols whose names start with DF_ would be reserved for internal use of DataFrames.jl (as a convention)
  • metadata(obj::OTHER_TYPES_DEFINED_BY_DATAFRAMES) = metadata(parent(obj))
  • metadata(::DataFrame, ::ColumnIndex) returns a String (nothing by default) - which would indicate just a verbose name of the column, with the default being just the column name
  • metadata(obj::OTHER_TYPES_DEFINED_BY_DATAFRAMES, ::ColumnIndex) similar to the above if the column is present in the other type.
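The four methods above could be sketched roughly as follows. `DataFrame` and `SubDataFrame` here are toy stand-ins, not the real DataFrames.jl types, and the method bodies are only illustrative:

```julia
# Toy stand-ins for the types named in the proposal.
struct DataFrame
    columns::Dict{Symbol,Vector}
    meta::Union{Nothing,Dict{Symbol,Any}}
end
struct SubDataFrame            # an "other type defined by DataFrames"
    parent::DataFrame
end
Base.parent(s::SubDataFrame) = s.parent

metadata(df::DataFrame) = df.meta                    # table-level metadata
metadata(obj::SubDataFrame) = metadata(parent(obj))  # delegate to parent
# Column-level: the default verbose name is just the column name itself.
metadata(df::DataFrame, col::Symbol) = String(col)
metadata(obj::SubDataFrame, col::Symbol) = metadata(parent(obj), col)

df = DataFrame(Dict{Symbol,Vector}(:a => [1, 2]),
               Dict{Symbol,Any}(:source => "survey"))
metadata(df)[:source]           # "survey"
metadata(SubDataFrame(df), :a)  # "a"
```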

If we agree to this design then I can implement it. The key challenge is the rules of propagation of metadata, but this is not a DataAPI.jl-related thing so I leave this discussion for later.

CC @pdeffebach, @nalimilan

@nalimilan
Member

See JuliaData/DataFrames.jl#1458 for the last attempt at implementing this in DataFrames. Two points:

  • I think we need something more general than just having a custom label/verbose name for columns. For example it could be useful to store units, information about measurement, etc. Label can just be a standard field among others.
  • We also need an API to set metadata and to retrieve the list of fields that have been set.

In general, a choice has to be made between 1) a single function in the API that returns a metadata dict, which would have to implement specific methods (notably getindex, setindex! and keys); or 2) several functions in the API that allow doing these operations directly. See the table in my first comment at JuliaData/DataFrames.jl#1458. I think returning an object is simpler since it allows reusing the standard dict API.
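The two styles can be contrasted in a short sketch (all names here are hypothetical, and the global store is only for illustration):

```julia
# Style 1: a single function returning a dict-like object; callers then
# use the standard Dict API (getindex, setindex!, keys) on the result.
const STORE1 = IdDict{Any,Dict{Symbol,Any}}()
function metadata(x)
    haskey(STORE1, x) ? STORE1[x] : (STORE1[x] = Dict{Symbol,Any}())
end

# Style 2: dedicated API functions for each operation.
getmeta(x, key) = metadata(x)[key]
setmeta!(x, key, value) = (metadata(x)[key] = value; x)
metakeys(x) = keys(metadata(x))

v = [1, 2, 3]
metadata(v)[:label] = "counts"  # style 1: plain Dict operations
getmeta(v, :label)              # style 2 wraps the same operations
```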

@Tokazama
Author

Tokazama commented May 23, 2020

Metadata could technically be stored at any level of something like a table. For example, each column could be a MetadataArray (i.e. from MetadataArrays.jl) and the table itself could have metadata. I worry that if we started trying to design this around column based indexing it would needlessly complicate and potentially limit its wider usability. Even the definition of what "metadata" is to different people is likely to vary so I'm not sure we should even guarantee it returns a certain type.

@bkamins
Member

bkamins commented May 23, 2020

Initially I wanted to write that metadata(::DataFrame, ::ColumnIndex) could also return a Dict{Symbol, Any} - for me that would be OK. In this case there should also be some namespace of reserved key names for internal use.

So personally I would prefer the "single function that returns a metadata dict" approach and later the user can just work on the Dict.

Ah - and now I see we could support metadata(::DataFrame, ::ColumnIndex) that would return a NamedTuple of dictionaries associated with columns.

I agree with @Tokazama that different people will want different things from metadata therefore I believe the API we provide should be maximally simple and flexible.
Therefore I would prefer to think that metadata is just a Dict and there is one global dict for a data frame as a whole and then each column can have column specific dicts. Then the rest - how to work with it - would be delegated to a decision of the user.

@Tokazama
Author

Allowing the user to decide what to do with whatever metadata returns also provides the freedom to further specialize on this later. For example you could always do something like colmeta(df, col) = metadata(df)[col] and then you wouldn't have to worry about reserving key names.

Would a simple PR to DataAPI.jl on this be a good next step right now?

@bkamins
Member

bkamins commented May 23, 2020

One will probably need to reserve key names anyway. In particular I do not think that having metadata(df)[col] return the metadata for column col is a good API (if we allowed this then there would be no way to specify global metadata for the table as a whole).

I think this is such a major thing that we should wait for other JuliaData members to comment before moving forward.

@nalimilan
Member

Maybe we can say that metadata(tbl) and metadata(tbl, col) have to return objects implementing the AbstractDict API, giving respectively the table-wise and column-wise metadata? That should be flexible enough for all implementations.
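A toy implementation of this proposal might look like the following (the `ToyTable` type and field names are hypothetical; only the two-method surface matters):

```julia
# Both metadata(tbl) and metadata(tbl, col) return AbstractDict objects.
struct ToyTable
    columns::Dict{Symbol,Vector}
    tablemeta::Dict{Symbol,Any}             # table-wise metadata
    colmeta::Dict{Symbol,Dict{Symbol,Any}}  # column-wise metadata
end
metadata(t::ToyTable) = t.tablemeta
function metadata(t::ToyTable, col::Symbol)
    haskey(t.columns, col) || throw(ArgumentError("no column $col"))
    get!(t.colmeta, col, Dict{Symbol,Any}())  # lazily create per-column dict
end

t = ToyTable(Dict{Symbol,Vector}(:income => [1.0, 2.0]),
             Dict{Symbol,Any}(:source => "census"),
             Dict{Symbol,Dict{Symbol,Any}}())
metadata(t, :income)[:label] = "Personal Income"
metadata(t)[:source]          # table-wise: "census"
metadata(t, :income)[:label]  # column-wise: "Personal Income"
```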

(In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column as JuliaData/DataFrames.jl#1458 does but that can be exposed to users via a lazy AbstractDict object per column. What could be useful to provide in addition is a way to access these vectors for convenience/efficiency.)

@bkamins
Member

bkamins commented May 23, 2020

have to return objects implementing the AbstractDict API

Agreed

In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column

We can discuss what is best in the PR for DataFrames.jl when it is done (essentially we have two options: dict of vectors or vector of dicts).

@bkamins
Member

bkamins commented May 24, 2020

As I have been thinking about this issue and #1458, I came to the conclusion that we should go back to fundamentals. The core issue is:

It is often the case that one wants to attach metadata of some sort to an array/graph/etc.

What I mean is that while we all seem to agree that adding metadata to tables is needed, I would first discuss what kind of metadata we really think people would store in practice. This is a relevant question, as I think we should not create functionality that would later be very rarely used. Conversely, if we know exactly what we actually want to use, we can design an API that supports the required use-cases cleanly.

My two concerns are:

  • persistence: most storage formats will not allow saving and loading this metadata; which means that, at least in my understanding, the use cases where people will use metadata will be situations of non-persistent metadata (i.e. something you attach to your table temporarily for programming convenience)
  • performance: we do not want to kill the performance of basic operations on tables because the "table processing engine" would constantly check if metadata needs updating, or because the cost of updating the metadata would be large, or because the memory footprint of allowing metadata storage would be non-negligible (ideally, if there is no metadata then performance should not be affected)

So now let me go back to the starting question - what metadata do we see actually being used (this is not a comprehensive list - please comment on what you think would really be used, not just potentially used):

  • metadata for handling how data frame is shown (things like overriding show defaults, maybe custom decimal delimiter, maybe - if in the future we add integration with PrettyTables.jl, some settings of all the options that package provides)
  • custom column labels (possibly shown when printing a data frame) - this is tempting, but has a problem with "persistence" - i.e. most formats will not store this information, so my question is: how in practice do we see people using this functionality?
  • custom row labels - the same situation
  • setting flags that some columns should be treated in a special way, e.g. for geospatial or time series analysis of data frames - this is tempting, but I am not 100% convinced it is a good idea, as it will be hard to ensure the metadata is consistent with the parent table (e.g. GeoPandas uses this strategy, and it has a problem with ensuring this integrity)

@Tokazama
Author

In addition to what you've mentioned here are some types of metadata that I think would be useful for me personally to be able to store:

  • Source and collection information: I often have many tables that have some acquisition metadata pertinent to them that is not row or column specific but describes important aspects of all the data in one table.
  • Column tracking: When performing semi-automated feature creation I like to keep track of certain operations/parameters/weights that resulted in the formation of a column of measures.
  • Attaching metadata to a column that changes how it dispatches later on

it will be hard to ensure the metadata is consistent with a parent table (e.g. GeoPandas uses this strategy, and it has a problem with ensuring this integrity)

I think it depends on how much you care to take ownership of handling all metadata. I would prefer that handling metadata be given a minimal interface. It could potentially have methods for things like joins, so that something like join_metadata would also join dictionaries but could be taken advantage of for custom metadata types.

I also think that I/O on metadata should be entirely dependent on the package supporting I/O. There aren't many file types equipped to flexibly handle metadata and it seems like the best thing for DataAPI.jl is to just make it simple to extract metadata.

@pdeffebach

I agree with all that has been said here. As the author of one of the previous attempts, I think that metadata is important. People coming from R and Python often don't fully appreciate how useful metadata is for Stata users, and how its absence has hurt the adoption of R in applied economics, especially for household surveys.

  • custom column labels (possibly shown when printing a data frame) - this is tempting, but has a problem with "persistence" - i.e. most formats will not store this information, so my question is - how in practice we see that people will use this functionality?

My use of metadata in Stata was twofold

  1. Pretty printing of columns and keeping track of data. For example, the table below would not have been possible to make programmatically without extensive use of column labels. I couldn't imagine trying to write this in R because column metadata in R doesn't persist after joins.

[Screenshot from 2020-05-24 11-05-32]

  2. Keeping track of the data-cleaning process. In the above table, the variable "Standardized income index" is composed of the 3 variables below it. The note for that variable tells us as much, and was automatically generated. If you were to type
note list standardized_index

you would see a note that said something along the lines of

A standardized index of 3 variables: net_earnings, consumption, durable_assets

Stata also has metadata about a table, which is often used to denote a source or author. I never used that feature.

With regards to IO, I don't see a huge problem with saving a data frame to two CSVs and providing a convenience method for adding metadata to a DataFrame when the metadata is stored as a table. Maybe it's a bit heavy-handed, but it's robust.

@Tokazama
Author

I think these are all great use cases that I've wanted at some point. As someone who deals with lots of different types of metadata, I'd really like to emphasize that less is more as this is implemented. It's easy to get stuck in the weeds on every little implementation detail because of the combination of situations that arise from row-specific, column-specific, and general table metadata, and all the different types of metadata.

This is loosely the kind of structure I'm considering using...

struct Table{T<:AbstractVector,M<:Union{Nothing,AbstractDict{Symbol,Any}}}
    data::Vector{T}          # column vectors
    index::Dict{Symbol,Int}  # column name => position
    meta::M                  # nothing, or a dict of table-level metadata
end

metadata(x::Table) = getfield(x, :meta)

Users don't have to ever worry about metadata unless they decide they want it and developers can create whatever type of fancy metadata that changes dispatch as long as it is a subtype of AbstractDict{Symbol,Any}.

I would think column-specific metadata would be easiest if implemented as a column vector with metadata, so I could just do metadata(table.column_name). Otherwise concatenating/merging/joining columns becomes the responsibility of DataAPI.jl. It seems simpler for this to be implemented at the level of something like vcat(::MetadataVector, ::MetadataVector).
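The vcat idea could be sketched like this (a hypothetical `MetadataVector`, not the real MetadataArrays.jl type; the merge policy shown is just one possible choice):

```julia
struct MetadataVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end
Base.size(v::MetadataVector) = size(v.data)
Base.getindex(v::MetadataVector, i::Int) = v.data[i]

# vcat concatenates the data and merges the metadata, with the
# right-hand side winning on key conflicts (an arbitrary policy).
function Base.vcat(a::MetadataVector, b::MetadataVector)
    MetadataVector(vcat(a.data, b.data), merge(a.meta, b.meta))
end

a = MetadataVector([1, 2], Dict{Symbol,Any}(:unit => "kg"))
b = MetadataVector([3], Dict{Symbol,Any}(:source => "lab"))
c = vcat(a, b)
c.meta  # Dict(:unit => "kg", :source => "lab")
```

The point is that DataAPI.jl never has to know the policy; the vector type owns it.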

@nalimilan
Member

I agree with most of what has been said. Just one point:

I would think column-specific metadata would be easiest if implemented as a column vector with metadata, so I could just do metadata(table.column_name). Otherwise concatenating/merging/joining columns becomes the responsibility of DataAPI.jl. It seems simpler for this to be implemented at the level of something like vcat(::MetadataVector, ::MetadataVector).

I'm afraid this wouldn't be workable, as it would require users to deal with another new kind of vector just to store metadata. That would force recompiling all functions for that type, and it wouldn't be easy to deal with e.g. CategoricalArray. It would also make things appear more complex for users who load files with metadata (e.g. Stata files with column labels), while one of the strengths of DataFrames is that they just wrap standard arrays.

We can say the table type is responsible for preserving metadata across concatenations/joins. DataAPI itself doesn't have to know anything about that.

@pdeffebach

With regards to spatial data, which is a natural use case of this, is there anyone in the Julia Data community who has a really detailed knowledge of R's sf package?

It's the best thing ever, being able to use all of dplyr while also maintaining spatial metadata and using spatial joins etc. is incredible.

Perhaps someone who has worked on that project could provide some insights.

@bkamins
Member

bkamins commented May 24, 2020

The project is going to be done during JSoC this year. One of the reasons I am pressing to decide on metadata now is to have clear guidance on how this extra package should integrate with DataFrames.jl.

@quinnj
Member

quinnj commented Jun 1, 2020

I'm a little slow/late to the discussion here, but have thought a bit about this. I agree with the idea that this is a way that Julia/DataFrames can really stand apart/improve on the situation from R/pandas; having useful metadata integrated w/ a DataFrame could be really powerful when used in the right contexts.

That said, I worry about some of the suggestions around metadata use because they start to become so fundamental or logic-driven. IMO, if some kind of data starts to become so critical that we're changing how things are computed etc., then it probably deserves a more structured solution than just a metadata entry in a DataFrame.

IMO, metadata should be primarily "descriptive" about the object: give context, explain values and the cardinality thereof; tweaking printing/showing seems fine to me. I just worry about packages starting to abuse metadata when they should really be creating a new AbstractArray type or something (I mean, you could imagine someone trying to implement CategoricalArrays by just using metadata).

My other thought is that while I agree that DataFrames can do a tight integration w/ metadata, I do think we should allow/encourage metadata to be attached/used generically on objects, including columns. There are going to be a lot of cases across the ecosystem where you're not dealing w/ a DataFrame, and it will be useful to support metadata in a variety of ways on columns, rows, etc. But yes, DataFrames can choose how it approaches its use/integration w/ metadata, either at the table level or the column level.

@bkamins
Member

bkamins commented Jun 1, 2020

I have discussed:

I just worry about packages starting to abuse metadata when they should really be creating a new AbstractArray type or something

with @visr in the context of geospatial data (temporal data is the same, I think), and we came to the same conclusion. The logic in packages using tables should primarily be based either on the type or on a trait of a column (a trait is probably preferable, as Julia currently does not allow multiple inheritance), but not on metadata attached to it.


So given this - are there any more comments on how the reference API should look?

@quinnj
Member

quinnj commented Jun 1, 2020

So I'm not sure what exactly the proposed API is? Is it just that metadata(x) returns Union{Nothing, AbstractDict}? Here are a couple of thoughts/ideas:

  • I'm not sure we should require AbstractDict specifically vs. "an object that supports AbstractDict methods" (or as I like to call it, AbstractDictLike); namely, it'd probably be nice to allow a NamedTuple to be returned from metadata, which isn't an AbstractDict but does support the interface; it'd be good to be very clear about what exactly is required of the returned object
  • If we're thinking of requiring Union{Nothing, AbstractDict-LIKE}, I wonder if we should just require returning an "AbstractDictLike" and return an empty one by default; we could then provide convenience get/put methods. Alternatively, we could not require a specific object type to be returned and just have the interface be metadata(x) and metadata!(x, meta). I kind of like the idea of requiring AbstractDictLike and returning a NamedTuple() by default
  • In terms of implementation, I've been looking a lot at how @doc is implemented in Base and I think it could make a lot of sense to do something similar for metadata; that is, instead of modifying DataFrame to have a metadata field, there'd be a global (or per-module) metadata IdDict that could store metadata per object. That would allow attaching metadata to all kinds of objects w/o needing wrappers. I think it also helps reinforce the idea that it's metadata, i.e. somewhat detached from the object and not to be relied upon too much for program logic. Along this vein, it could make sense to make an entire Metadata.jl package that copied the Base.Docs implementation and could be used by packages everywhere. It would be pretty lightweight, but could provide a lot of flexibility and a clean, standard API that other packages can integrate with. If we want to go that route, we probably don't need a definition in DataAPI.jl
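A minimal sketch of the Base.Docs-style idea (all names here are hypothetical): a module-level IdDict keyed on object identity, so metadata can be attached to any object without a wrapper type or an extra field.

```julia
const METADATA = IdDict{Any,Dict{Symbol,Any}}()

function metadata!(x, key::Symbol, value)
    d = haskey(METADATA, x) ? METADATA[x] : (METADATA[x] = Dict{Symbol,Any}())
    d[key] = value
    return x
end
metadata(x) = get(METADATA, x, nothing)

v = [1.0, 2.0, 3.0]
metadata!(v, :label, "Personal Income")
metadata(v)[:label]  # metadata follows the object's identity...
metadata(copy(v))    # ...but a copy is a new object, so this is nothing
```

Note the last line: this implementation choice is exactly what makes metadata propagate when an object is passed around but not when it is copied, which is the trade-off discussed in the following comments.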

@bkamins
Member

bkamins commented Jun 1, 2020

Along this vein, it could make sense to make an entire Metadata.jl package that copied the Base.Docs implementation.

Actually I would prefer this idea as it would be much more composable. The consequence for DataFrames.jl users would be:

  1. metadata will not propagate when the object is copied.
  2. it will still propagate when the object is just passed, not copied.

So: df.col will keep the metadata of col, and similarly df.col = x will make col have the metadata of x.

@nalimilan
Member

Ah yes that's interesting. Indeed it's quite convenient in R to be able to attach metadata to any object, and yet in Julia we don't want to have to wrap any object in a special type just to add metadata.

Though losing the metadata on copy would be annoying. That could easily be fixed in DataFrames by ensuring we copy/re-add the metadata when copying the columns (this would be needed for important cases like getindex, but also select without transformations). But it's not easy to fix when the user calls copy on an arbitrary object: doing this may be too costly for small objects, and for large ones it would require support from all packages (including Base...). Maybe it's not the end of the world, though, if one has to call copywithmetadata(x) when needed?

(Otherwise, returning an empty NamedTuple by default (instead of nothing) sounds fine. We really need traits in Base!)
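The copywithmetadata idea mentioned above (a hypothetical name from the comment, not an existing function) could be sketched like this, assuming metadata lives in a global IdDict as discussed:

```julia
const META = IdDict{Any,Dict{Symbol,Any}}()
setmeta!(x, k::Symbol, v) = (get!(META, x, Dict{Symbol,Any}())[k] = v; x)
getmeta(x) = get(META, x, nothing)

# Copy the object AND explicitly carry its metadata over to the copy.
function copywithmetadata(x)
    y = copy(x)
    m = getmeta(x)
    m === nothing || (META[y] = copy(m))
    return y
end

v = setmeta!([1, 2], :unit, "kg")
getmeta(copy(v))                     # plain copy loses metadata: nothing
getmeta(copywithmetadata(v))[:unit]  # "kg"
```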

@bkamins
Member

bkamins commented Jun 1, 2020

Though losing the metadata on copy would be annoying.

Personally I would feel safer if we worked this way. I would prefer to have a function that copies metadata explicitly, which can be called if someone needs it.

@Tokazama
Author

Tokazama commented Jun 1, 2020

There are a lot of packages that use the term "metadata" (e.g, ImageMetadata.jl, MetadataArrays.jl, MetaGraph.jl, FieldMetadata.jl, FieldProperties.jl, etc.). I don't think an interface like Base.Docs is flexible enough to fit many of the potential uses of metadata.

@quinnj
Member

quinnj commented Jun 1, 2020

@Tokazama can you explain a little more why you think the Base.Docs approach wouldn't be flexible enough? In terms of approach, it's more of an implementation detail: the user interface would still be metadata(x), it would just retrieve the metadata from a per-module store instead of retrieving it from the object itself.

@Tokazama
Author

Tokazama commented Jun 1, 2020

It wouldn't carry any type information so if someone did use something like a NamedTuple it wouldn't really help any.

@quinnj
Member

quinnj commented Jun 1, 2020

Sorry, I'm still not following the concern. Why/where would type information be important? The discussion has revolved around metadata(x) returning any kind of object that implements the AbstractDict interface, so in practice, you would use metadata like:

meta = metadata(x)

# see metadata keys
keys(meta)

# iterate over metadata key-value pairs
for (k, v) in meta

end

# check if a specific metadata key is present
haskey(meta, :specific_key)

So depending on whether metadata(x) returned a Dict, or NamedTuple, you would have different implementations of these methods, but the interface is still the same.

We should probably require that the object returned be AbstractDictLike{Symbol, Any}, i.e. require that metadata keys be Symbols; does that sound reasonable or too restrictive?
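Both a Dict and a NamedTuple can satisfy the interface sketched above. One wrinkle worth noting: iterating key-value pairs of a NamedTuple requires `pairs(meta)`, since plain iteration yields only the values.

```julia
# One generic function working against the shared interface.
describe(meta) = (collect(keys(meta)),
                  haskey(meta, :unit),
                  [k => v for (k, v) in pairs(meta)])

d  = Dict(:unit => "kg", :label => "weight")
nt = (unit = "kg", label = "weight")

describe(d)   # works via the AbstractDict interface
describe(nt)  # same code, via the NamedTuple interface
```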

@bkamins
Member

bkamins commented Jun 1, 2020

Actually I prefer metadata to be flexible and type unstable.

Apart from convenience, it is a clear signal for developers not to use metadata to encode program logic - Julia provides other means to do this efficiently.

Metadata, as I think about it now (my opinions evolve based on the comments we get here, as the design is not an easy decision), should be for lightweight things like descriptive strings, or maybe some hints about how output should be formatted (as working with IOContext is hard for most users, and in some cases it is not flexible enough - e.g. you cannot replace stdout with a custom IOContext AFAIK).

@Tokazama
Author

Tokazama commented Jun 1, 2020

Actually I prefer metadata to be flexible and type unstable.

I'm not against this being the case for specific implementations like what might be done in DataFrames but I don't think it should be the only option.

@pdeffebach

Apart from convenience it is a clear signal for the developers not to use metadata to encode program logic - Julia provides other means to to this efficiently.

I agree. You don't want too many interfaces relying on specifically named metadata fields to create unnecessarily complicated features.

So: df.col will keep the metadata of col and similarly df.col = x will make col to have the metadata of x.

I don't fully understand this. IMO metadata should be attached to a data frame and df.col should always return a vector without anything else attached to them.

I agree with @nalimilan; it would be annoying to have metadata disappear with copying. Consider something as simple as

df.income = clean_vec(df.income)
  1. clean_vec takes in a Vector{Float64} and for whatever reason has a concrete type signature. So no metadata is added.
  2. If I have a label for df.income, say "Personal Income", I don't want this to disappear. Keeping track of these operations would get tiresome.

This sort of global Dict that contains metadata is basically how R's metadata system works, and I've always found it very useless, partly because the metadata disappears.

@quinnj
Member

quinnj commented Jun 2, 2020

@pdeffebach, I think there's a lot more flexibility in the Base.Docs system than just thinking of it as a "global Dict". With Julia's rich type system, macros, etc., I think we could easily accommodate scenarios where you want to attach metadata not to the Vector itself, but to a named column of the DataFrame, which would "stick" beyond transformations. And as has been mentioned, there are a number of scenarios where you don't want metadata to stick around too much, if you're creating new objects and such.

As I've played around with ideas/implementations, I just don't see a realistic way to make a system general enough to be widely used that relies on either wrapper objects or metadata fields. It just doesn't scale. The doc system, however, is extremely rich and accomplishes its goal very well, IMO: attaching extra information to types, variables, fields, etc.
Part of my experience/opinion here is coming from thinking through the entire data ecosystem, not just DataFrames. While I think DF is one of the primary targets for a metadata system, I also want to ensure that other table types, formats, and objects can also take advantage of a metadata system to enhance objects.

@nalimilan
Member

I agree with @nalimilan, it would be annoying to have metadata disappear with copying, consider something as simple as

df.income = clean_vec(df.income)
1. `clean_vec` takes in a `Vector{Float64}` and for whatever reason has a concrete type signature. So no meta-data is added.

2. If I have a label for `df.income`, say `"Personal Income"`, I don't want this to disappear. Keeping track of these operations would get tiresome.

@pdeffebach What kind of operations would be performed within clean_vec? Apart from copy, which you could replace with (say) copywithmetadata, I'm not sure many operations should/could preserve it. In general, I don't see how we could design a system in which both 1) metadata is preserved on copy and 2) metadata can be added to any object (e.g. Vector). What we can do, though, is have DataFrames operations copy metadata automatically where it makes sense -- but that doesn't include custom functions, since we have no way of knowing whether you are just cleaning the income value or creating a completely new thing.

Or maybe in your example you meant that assigning a new vector to an existing column via df.income = v should preserve the metadata of the column? That makes some sense but could be problematic if you really want to replace the column (and you may even not know some previous column existed with that name).

@nalimilan
Member

metadata(tbl, col::Union{Integer, Symbol}) - here the question is whether AbstractString should also be allowed? I think it is non-problematic to have this method in general, as until we support it we could just return nothing.

@bkamins That's a minor point I'd say. We should be consistent across Tables.jl, so better discuss getcolumn, etc. at the same time, and separately from this issue which is already complex enough.

An alternative proposal would be to completely skip the second point, and instead of metadata(tbl, col) attach column-level metadata to vectors themselves and require using metadata(Tables.getcolumn(tbl, col)).

Unless metadata(::T, ::Symbol) is specifically defined for T, the default is to just grab all the metadata and, if it is an AbstractDict, use getindex, and otherwise use getproperty (as defined here).

@Tokazama Note that in my proposal Tables.metadata would be a different (and unexported) function from Metadata.metadata. Tables.metadata(tbl, col) would retrieve metadata for column col, while metadata(tbl)[key] would access table-level attribute key. Otherwise there could be conflicts e.g. if a column is called name and you want to store a table-level attribute called name.

@nalimilan, what's the advantage of having a Tables.jl-level API for metadata as opposed to just pointing people to something like Metadata.jl?

@quinnj The reason is that we need to define an API to access column-level metadata. I agree something like Metadata.jl is enough if we decide that column-level metadata should be attached to vector objects themselves rather than stored in the table (my second proposal). But I think @pdeffebach had arguments against it.
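The fallback @Tokazama describes (getindex for AbstractDicts, getproperty otherwise) could be sketched as follows; all names here are hypothetical, and the two carrier types are toys just to exercise both branches:

```julia
# Dispatch on the metadata carrier to pick the access style.
_getmeta(meta::AbstractDict, col::Symbol) = meta[col]
_getmeta(meta, col::Symbol) = getproperty(meta, col)
metadata(x, col::Symbol) = _getmeta(metadata(x), col)

struct DictBacked;  meta::Dict{Symbol,Any}; end
struct TupleBacked; meta::NamedTuple;       end
metadata(x::DictBacked)  = x.meta
metadata(x::TupleBacked) = x.meta

metadata(DictBacked(Dict{Symbol,Any}(:a => "col a")), :a)  # via getindex
metadata(TupleBacked((a = "col a",)), :a)                  # via getproperty
```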

@pdeffebach

Having metadata be persistent across joins and reassignment is a crucial feature.

If there were a Tables.jl-level API, we could make assurances about the persistence of metadata. @quinnj, if you aren't too familiar with Stata, this is basically the model for the behavior I would like metadata to have. In Stata, it's all about persistence.

My argument against having metadata attached to vector objects is that

  1. We don't want df.a to give you some special vector type that has metadata attached to it
  2. We copy in a lot of places, i.e. after every select and transform. So the notion of metadata being attached to a particular object gets complicated.

I'm going to cc @matthieugomez here, since he is someone familiar with Stata who has probably also thought about this in Julia.

@Tokazama
Author

I don't love the use of macros because they're not really doing anything?

Similar to @doc, they point to the module where @attach_metadata was called. If you have a type that will always store metadata in the same module, you could hard-code that in and use metadata.

@nalimilan
Member

nalimilan commented Nov 16, 2020

1. We don't want `df.a` to give you some special vector type that has metadata attached to it

@pdeffebach With @attach_metadata it wouldn't be a special type, just a plain Vector with metadata stored in a global dict.

2. We copy in a lot of places, i.e. after every `select` and `transform`. So the notion of metadata being attached to a particular object gets complicated.

We could copy metadata in select and transform. Overall, I think the question of persistence should be addressed by particular implementations (e.g. DataFrames). Tables.jl doesn't care about that; it just has to allow you to pass metadata along with tables.

@pdeffebach

pdeffebach commented Nov 16, 2020

Let's say I have

df = DataFrame(a = [1, 2], b = [3, 4])
metadata!(df, :a => "A", :b => "B")
@pipe df |>
    transform(_, [:a, :b] => ByRow(+) => :c) |>
    select(_, :b, :c)

with the metadata in a global Dict, how would this work? I'm confused by what the keys and values are.

What happens when the columns are copied inside the transform? You could imagine this global dict getting very, very large if we have thousands of columns and a lot of transform calls in a pipe.

@bkamins
Member

bkamins commented Nov 16, 2020

Yes, GC of this dict is an issue I think.

@nalimilan
Member

Yes, using a global dict will certainly be slower and less memory-efficient than storing column-level metadata in the table (especially since in that case we can store metadata using a vector with one entry per column, and use the data frame index to map names to positions, as in JuliaData/DataFrames.jl#1458). But I wonder whether it really matters in practice: if you need to copy the column vector anyway, copying the metadata should be cheap in comparison.

Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object is destroyed? @Tokazama Have you considered that?

@Tokazama
Copy link
Author

Tokazama commented Nov 17, 2020

Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object is destroyed?

I don't think that's possible without type piracy. There is no unique type provided by Metadata.jl that wraps an instance when metadata is attached globally via @attach_metadata(x, meta). The most useful part of using global metadata is that it has absolutely no effect on dispatch, so it can't possibly slow down your code, increase latency through codegen, etc. This also means you can't propagate metadata by dispatching on a type that binds metadata to your table. If that's what you want, use attach_metadata, so that metadata is directly bound to your table with a wrapper that stores metadata in its structure.

If you want to do what @quinnj is suggesting, something like this should work...

function Metadata.global_metadata(tbl::MyTableType, column_name, module_name)
    return Metadata.global_metadata(getproperty(tbl, column_name), module_name)
end

...redirecting @metadata(tbl, column_name) to the relevant column.

Similarly, you can do this if you expect your columns to wrap metadata.

Metadata.metadata(tbl::MyTableType, column_name) = metadata(getproperty(tbl, column_name))

Handling persistence of metadata without a wrapper type (i.e. global metadata) would just require actively using @share_metadata/@copy_metadata. You might want this to be optional (e.g., some_method(args...; share_metadata)) so that you don't hurt performance by searching for metadata on every instance in every method.
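For contrast, the wrapper-based attach_metadata approach can be sketched as follows. This is a hypothetical minimal type, not the actual Metadata.jl implementation:

```julia
# Metadata lives in the struct, so it travels with the value and
# propagation can be handled by dispatching on the wrapper type.
struct MetaWrapped{T,M}
    parent::T
    metadata::M
end

metadata(x::MetaWrapped) = x.metadata
Base.parent(x::MetaWrapped) = x.parent
```

The trade-off is exactly the one discussed above: the wrapper participates in dispatch, so metadata propagation can be automatic, but every annotated value now has a new type.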

@nalimilan
Copy link
Member

I don't think that's possible without type piracy. There is no unique type provided by "Metadata" that wraps an instance that is attached to global metadata when using @attach_metadata(x, meta). The most useful part of using global metadata is that it has absolutely no effect on dispatch so it can't possibly slow down your code, increase latency through codegen, etc..

Adding a finalizer doesn't require having a special type AFAICT. You can just call finalizer(f, obj).

The concern about performance is that in DataFrames transform copies all columns, so if we want to preserve metadata we would also have to attach it to the new vectors, adding them to the global dict. If we don't remove metadata attached to objects that have been destroyed by the GC from the dict, it will grow indefinitely, which can be a problem.
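A minimal sketch of this idea, assuming a global store keyed by object identity (none of these names are Metadata.jl API). Base's WeakKeyDict attaches finalizers to its keys internally, so entries vanish once the key object is garbage-collected, which addresses the unbounded-growth concern directly:

```julia
# Hypothetical global store: entries are removed automatically when the
# annotated object is collected, because WeakKeyDict holds keys weakly
# and registers a finalizer on each one.
const GLOBAL_METADATA = WeakKeyDict{Any,Dict{Symbol,Any}}()

function attach_global_metadata!(obj, meta::Dict{Symbol,Any})
    GLOBAL_METADATA[obj] = meta  # `obj` must be mutable (e.g. a Vector)
    return obj
end

# returns `nothing` when no metadata was ever attached
global_metadata(obj) = get(GLOBAL_METADATA, obj, nothing)
```

With this approach the dict never grows past the set of live annotated objects, at the cost of finalizer overhead on every attach, and with the caveat (raised below) that keys must be mutable.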

@Tokazama
Copy link
Author

Tokazama commented Nov 17, 2020

Adding a finalizer doesn't require having a special type AFAICT. You can just call finalizer(f, obj).

Well, that's extremely good to know. I'm trying to add that now but it has the caveat that it only works with mutable structs. Any suggestions on getting this to work with something like DataFrame?

@pdeffebach
Copy link

pdeffebach commented Nov 17, 2020

I think this discussion has surpassed my technical knowledge, but as the co-author of JuliaData/DataFrames.jl#1458 (Milan made the design), I like its implementation. It's super transparent, I could make PRs to it, and users can understand it with a conceptual model. It's just a bit scary to me to have metadata implemented by a global dict that is invisible to the user, but that could just be my lack of technical knowledge.

@bkamins
Copy link
Member

bkamins commented Nov 17, 2020

I do not think in DataFrames.jl we have to use a default metadata mechanism - we can do whatever we like. That is why we are discussing it in DataAPI.jl, as I would like first to agree on the API for getting metadata and, if possible, for setting metadata (but this is less crucial, as I believe different table types might provide custom mechanisms for setting metadata).

I think it is important to keep the API and the implementation separate, as otherwise we might run into problems in the future that might be hard to envision currently. Metadata.jl is very nice but it should be an opt-in I think, i.e. if some table type likes Metadata.jl it can start depending on it; but it should not be enforced.

@nalimilan
Copy link
Member

Yes, keeping API and implementation separate is usually a good thing. But a difficulty here is that if we don't add Tables.metadata(tbl, col) to the API, then the only way to support column-level metadata is to attach it to the column vectors themselves. And that's only possible if we either 1) use a global dict like Metadata.jl, or 2) wrap vectors in a custom type (which is a no-go IMO). Only Tables.metadata(tbl, col) allows storing per-column metadata in the DataFrame itself (with the drawback that it cannot be retrieved if you only have the vector; not sure whether that's a problem).

@bkamins
Copy link
Member

bkamins commented Nov 18, 2020

I think having Tables.metadata(tbl, col) is not a problem.

The default implementations could be:

Tables.metadata(tbl, col) = Tables.metadata(Tables.getcolumn(tbl, col))
Tables.metadata(tbl) = nothing

which would also cover the case of a default table-level metadata.

Now Tables.metadata(tbl, col) can be left as is, or if a table type has some way of keeping metadata for columns on a table level then simply Tables.metadata(tbl, col) can have a special method added.

In particular:

Tables.metadata(tbl::AbstractDataFrame, col) = # some custom implementation
Tables.metadata(tbl::AbstractDataFrame) = # some custom implementation

can use a completely different code path.

The only problem to solve is, if both the vector and the table define metadata for a column, which should take precedence; but this should be solved at the AbstractDataFrame implementation level.
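As a self-contained illustration of that precedence question, here is one possible rule, sketched with a toy table type. None of this is DataFrames.jl API; `ToyTable` and both accessors are hypothetical:

```julia
struct ToyTable
    cols::Dict{Symbol,Vector}
    colmeta::Dict{Symbol,Any}  # table-level, per-column metadata
end

# stand-in for metadata attached to the vector itself (e.g. via Metadata.jl)
vector_metadata(v) = nothing

function column_metadata(tbl::ToyTable, col::Symbol)
    haskey(tbl.colmeta, col) && return tbl.colmeta[col]  # table level takes precedence
    return vector_metadata(tbl.cols[col])                # fall back to the vector
end
```

Here the table-level entry wins and the vector is only consulted as a fallback; the opposite rule is just as easy to implement, which is why this can be left to each AbstractDataFrame implementation.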

@pdeffebach
Copy link

In my opinion, metadata only makes sense at the level of a table. Arrays should not have metadata themselves, i.e.

x = df.x

x should have no metadata attached to it, since any metadata can only be understood in the context of the table from which it came. Since x now lives on its own, separated from the DataFrame, it's not worth having any metadata attached to it.

@bkamins
Copy link
Member

bkamins commented Nov 18, 2020

In my opinion, metadata only makes sense at the level of a table.

I agree; that is also my use case. However, we should design a flexible system that fits different use cases. I can imagine that people might want to attach metadata to anything in general (this is what Metadata.jl provides now).

Note that in order to have column-level metadata you would have to opt in to it (normal Vectors do not have metadata). So why disallow it if someone wants to do it?

Recently we had a similar discussion about whether an AbstractMatrix is or is not a table. We decided to go for a flexible design allowing custom matrix types to have a different table representation than the default one (and I am OK with this, although I have used DataFrames.jl for years and always converted matrices to tables in a way that preserved shape).

@Tokazama
Copy link
Author

We decided to go for a flexible design allowing custom matrix types to have a different table representation than the default one

This is definitely the way to go. I'm currently using this for graphs, tables, and arrays. Performance and storage needs are different for each of these, but it's nice to be able to use a predictable interface for accomplishing this.

For example, you could do this in DataFrames.jl

struct DataFrameColumnMetadata{T<:AbstractDataFrame} <: AbstractDict{Symbol,Any}
    tbl::T
end

function metadata(x::DataFrameColumnMetadata, k)
    c = getcolumn(x.tbl, k)
    if has_metadata(c)
        return metadata(c)
    else
        # indicates the metadata was not found without throwing an error or interfering
        # with metadata that may use `nothing` or `missing` as a meaningful value
        return Metadata.no_metadata
    end
end


function metadata(tbl::AbstractDataFrame)
    if has_metadata(tbl)
        return attached_metadata(tbl)  # hypothetical accessor for the table's own metadata
    else
        return DataFrameColumnMetadata(tbl)
    end
end

@pdeffebach
Copy link

But now every call to copy, join, select, etc. needs to look up a global dictionary about metadata, right? It's hard to imagine this scaling well.

@Tokazama
Copy link
Author

needs to look up a global dictionary about metadata

There's a lot of flexibility here, so this doesn't need to be decided now.

struct DataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
end

Metadata.metadata(df::DataFrame) = Metadata.global_metadata(df, Main)


struct MetaDataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
    metadata::Dict{Symbol,Any}
end

Metadata.metadata(df::MetaDataFrame) = getfield(df, :metadata)

@nalimilan
Copy link
Member

@Tokazama I don't understand how your last proposal stores column-level metadata. That's the main decision to make when designing a general API, I think. Attaching metadata to the data frame itself is quite easy (either using Metadata.jl or a custom field in the struct).

@Tokazama
Copy link
Author

It wasn't intended to illustrate any more than that you could store metadata in an instance or in global metadata. In reality, you would want to ensure that the keys in the metadata correspond to columns (e.g., k in metadata(tbl, k) corresponds to a column).

@nalimilan
Copy link
Member

@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (Tokazama/Metadata.jl@e88941c). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above.

@Tokazama
Copy link
Author

Tokazama commented May 8, 2022

I can add it back in. I'm still ironing out some details before releasing the next version. The new ability to set variables in modules makes it easier to do this sort of thing without macros.

@nalimilan nalimilan linked a pull request May 23, 2022 that will close this issue
@nalimilan nalimilan mentioned this issue May 25, 2022
@Tokazama
Copy link
Author

We can usually associate metadata with the data itself, its values, its indices/axes, or its dimensions.
In terms of propagating metadata, I think we've mainly discussed copy, share, and drop as options.
The first two options only work if the metadata refers to the data as a whole, or after some method that copies the data in its entirety.
Indexing, dropping dimensions, permutation, and reduction all change how indices/axes metadata should be propagated (often also dimensional metadata).
This isn't even addressing how two datasets' metadata interact (e.g., cat, merge).
However, I think these provide enough context for most situations that some form of the following would be useful

  • is_selfmeta(m): is m metadata that corresponds to the entirety of its corresponding data?
  • is_axesmeta(m): is m metadata that corresponds to the axes of its corresponding data?
  • is_valmeta(m): is m metadata that corresponds to the values of its corresponding data?
  • is_dimmeta(m): is m metadata that corresponds to the dimensions of its corresponding data?
  • should_copy_meta(m): should m be copied on propagation?
  • should_share_meta(m): should m be shared on propagation?
  • should_drop_meta(m): should m be dropped?

This can make the ever branching set of possibilities with metadata far more manageable.

function index_metadata(m, inds...)
    if should_drop_meta(m)
        return nothing
    else
        f = should_copy_meta(m) ? copy : identity
        if is_axesmeta(m)
            f(map(getindex, m, inds))
        elseif is_dimmeta(m)
            f(dropinds(m, inds))  # `dropinds` is a sketch: drop metadata for indexed-out dimensions
        elseif is_valmeta(m)
            f(m[inds...])
        else
            f(m)
        end
    end
end

There are certainly plenty of details that remain to make this into a robust generic interface, but I thought it might at least provide some helpful thoughts on how to proceed.

@Tokazama
Copy link
Author

Tokazama commented Jul 1, 2022

@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (Tokazama/Metadata.jl@e88941c). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above.

I pulled out the globally stored metadata stuff into a new package and I'm registering it now JuliaRegistries/General#63519.
