Exclude certain kinds of dictionary tables from being interpreted as tables #189

Merged (5 commits) on Oct 12, 2022
20 changes: 13 additions & 7 deletions docs/src/index.md
@@ -134,12 +134,21 @@ Type `T` | `scitype(x)` for `x::T` | package/module required
`AbstractArray{<:Gray,2}` | `GrayImage{W,H}` where `(W, H) = size(x)` | ColorTypes.jl
`AbstractArray{<:AbstractRGB,2}` | `ColorImage{W,H}` where `(W, H) = size(x)` | ColorTypes.jl
`PersistenceDiagram` | `PersistenceDiagram` | PersistenceDiagramsBase
any table type `T` supported by Tables.jl | `Table{K}` where `K=Union{column_scitypes...}` | Tables.jl
(*) any table type `T` | `Table{K}` where `K=Union{column_scitypes...}` | Tables.jl
† `CorpusLoaders.TaggedWord` | `Annotated{Textual}` | CorpusLoaders.jl
† `CorpusLoaders.Document{AbstractVector{Q}}` | `Annotated{AbstractVector{Scitype(Q)}}` | CorpusLoaders.jl
† `AbstractDict{<:AbstractString,<:Integer}` | `Multiset{Textual}` |
† `AbstractDict{<:TaggedWord,<:Integer}` | `Multiset{Annotated{Textual}}` | CorpusLoaders.jl

(*) More precisely, any object `X` for which `Tables.istable(X) == true` will have
`scitype(X) = Table{K}`, where `K` is the union of the column scitypes, with the following
exceptions: abstract dictionaries with `AbstractString` keys, and abstract vectors of
abstract dictionaries with `AbstractString` keys, are not considered tables by
ScientificTypes.jl. Prior to Tables.jl 1.8, one had `Tables.istable(X) == false` for these
objects, but in releases 1.8 and 1.10 this behaviour changed. These changes were breaking
for ScientificTypes.jl, which has accordingly enforced the old behaviour, as far as
`scitype` is concerned.
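
The exceptions described in (*) can be illustrated with a short sketch. It assumes ScientificTypes.jl is installed at a version that includes this change; the inputs mirror those used in the tests added to `test/convention/scitype.jl`, and the variable names `nt`, `d` and `rows` are purely illustrative.

```julia
using ScientificTypes

# A named tuple of equal-length vectors is a table, so its scitype is a `Table`:
nt = (a = [1, 2, 3], b = [4, 5, 6])
scitype(nt) <: Table      # true

# A dictionary with `AbstractString` keys is excluded, regardless of what
# `Tables.istable` reports for it in the installed Tables.jl release:
d = Dict("a" => [1, 2, 3], "b" => [4, 5, 6])
scitype(d) <: Table       # false

# Likewise for a vector of dictionaries with `AbstractString` keys:
rows = [Dict("a" => 1), Dict("a" => 2)]
scitype(rows) <: Table    # false
```

Code that dispatches on `Table` therefore will not treat such dictionaries as tabular input.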

† *Experimental* and subject to change in new minor or patch release

Here `nlevels(x) = length(levels(x.pool))`.
@@ -149,11 +158,8 @@ Here `nlevels(x) = length(levels(x.pool))`.
- We regard the built-in Julia types `Missing` and `Nothing` as scientific types.
- `Finite{N}`, `Multiclass{N}` and `OrderedFactor{N}` are all parameterized by the number of levels `N`. We export the alias `Binary = Finite{2}`.
- `Image{W,H}`, `GrayImage{W,H}` and `ColorImage{W,H}` are all parameterized by the image width and height dimensions, `(W, H)`.
- `Sampleable{K}` andb
`Density{K} <: Sampleable{K}` are parameterized by the sample space scitype.
- On objects for which the default convention has nothing to say, the
`scitype` function returns `Unknown`.

- `Sampleable{K}` and `Density{K} <: Sampleable{K}` are parameterized by the sample space scitype.
- On objects for which the default convention has nothing to say, the `scitype` function returns `Unknown`.

### Special note on binary data

Expand Down Expand Up @@ -319,7 +325,7 @@ schema(data)
scitype(data)
```

Similarly, any table implementing the Tables interface has scitype
Similarly, any table (see (*) above for the definition) has scitype
`Table{K}`, where `K` is the union of the scitypes of its columns.

Table scitypes are useful for dispatch and type checks, as shown here,
19 changes: 17 additions & 2 deletions src/ScientificTypes.jl
@@ -1,4 +1,4 @@
module ScientificTypes
module ScientificTypes

# Dependencies
using Reexport
@@ -27,11 +27,26 @@ const SCHEMA_SPECIALIZATION_THRESHOLD = Tables.SCHEMA_SPECIALIZATION_THRESHOLD

#---------------------------------------------------------------------------------------
# Define convention

struct DefaultConvention <: Convention end
const CONV = DefaultConvention()

# -------------------------------------------------------------
# vtrait function, returns either `Val{:table}()` or `Val{:other}()`
vtrait(X) = Val{ifelse(Tables.istable(X), :table, :other)}()

# To address https://github.com/JuliaData/Tables.jl/issues/306:
const DictColumnsWithStringKeys = AbstractDict{K, V} where {
K <: AbstractString,
V <: AbstractVector
}
const DictRowsWithStringKeys = AbstractVector{T} where {
T <: AbstractDict{<:AbstractString}
}
_istable(::DictColumnsWithStringKeys) = false
_istable(::DictRowsWithStringKeys) = false
_istable(something_else) = Tables.istable(something_else)

vtrait(X) = Val{ifelse(_istable(X), :table, :other)}()

# -------------------------------------------------------------
# Includes
71 changes: 53 additions & 18 deletions src/convention/scitype.jl
@@ -1,36 +1,71 @@
# -----------------------------------------------------------------------------------------
# This file includes the single argument definition of `scitype` method and corresponding
# convenience functions. It also includes definitions for `scitype/Scitype` of different
# This file includes the single argument definition of `scitype` method and corresponding
# convenience functions. It also includes definitions for `scitype/Scitype` of different
# objects with respect to the `DefaultConvention`.
# -----------------------------------------------------------------------------------------
#
#-----------------------------------------------------------------------------------------
"""
The scientific type (interpretation) of `X`, distinct from its
machine type.
scitype(X)

The scientific type (interpretation) of `X`, as distinct from its machine type. Atomic
scientific types (`Continuous`, `Multiclass`, etc) are mostly abstract types defined in
the package ScientificTypesBase.jl. Scientific types do not ordinarily have instances.

### Examples
```
julia> scitype(3.14)
Continuous

julia> scitype([1, 2, missing])
AbstractVector{Union{Missing, Count}}
AbstractVector{Union{Missing, Count}}

julia> scitype((5, "beige"))
Tuple{Count, Textual}

julia> using CategoricalArrays

julia> X = (gender = categorical(['M', 'M', 'F', 'M', 'F']),
julia> table = (gender = categorical(['M', 'M', 'F', 'M', 'F']),
ndevices = [1, 3, 2, 3, 2])

julia> scitype(X)
julia> scitype(table)
Table{Union{AbstractVector{Count}, AbstractVector{Multiclass{2}}}}

```

Column scitypes of a table can also be inspected with [`schema`](@ref).

The behavior of `scitype` is detailed in the [ScientificTypes
documentation](https://juliaai.github.io/ScientificTypes.jl/dev/#Summary-of-the-default-convention).
Key features of the default behavior are:

- `AbstractFloat` has scitype as `Continuous <: Infinite`.

- Any `Integer` has scitype as `Count <: Infinite`.

- Any `CategoricalValue` `x` has scitype as `Multiclass <: Finite` or
`OrderedFactor <: Finite`, depending on the value of `isordered(x)`.

- `String`s and `Char`s do *not* have scitype `Multiclass` or
`OrderedFactor`; they have scitypes `Textual` and `Unknown`
respectively.

- The scientific types of `nothing` and `missing` are `Nothing` and
`Missing`, Julia types that are also regarded as scientific.



!!! note

Third party packages may extend the behavior of `scitype`: Objects
previously having `Unknown` scitype may no longer do so.

See also [`coerce`](@ref), [`autotype`](@ref), [`schema`](@ref).

"""
scitype(X) = ST.scitype(X, CONV)

function ST.scitype(@nospecialize(X), C::DefaultConvention)
return _scitype(X, C, vtrait(X))
return _scitype(X, C, vtrait(X))
end

function _scitype(X, C, ::Val{:other})
@@ -98,7 +133,7 @@ end
function _cols_scitype(cols, sch::Tables.Schema{names, types}) where {names, types}
N = length(names)
if N <= COLS_SPECIALIZATION_THRESHOLD
return __cols_scitype(cols, sch)
return __cols_scitype(cols, sch)
else
scitypes = if types === nothing
Type[scitype(Tables.getcolumn(cols, names[i])) for i in Base.OneTo(N)]
@@ -109,21 +144,21 @@ function _cols_scitype(cols, sch::Tables.Schema{names, types}) where {names, typ
cols, fieldtype(types, i), i, names[i])
) for i in Base.OneTo(N)
]

end
return Table{Union{scitypes...}}
end
end

@inline function __cols_scitype(
cols,
cols,
sch::Tables.Schema{names, types}
) where {names, types}
N = length(names)
if @generated
stypes = if types === nothing
(
:(scitype(Tables.getcolumn(cols, $(Meta.QuoteNode(names[i])))))
:(scitype(Tables.getcolumn(cols, $(Meta.QuoteNode(names[i])))))
for i in Base.OneTo(N)
)
else
@@ -134,17 +169,17 @@ end
cols, $(fieldtype(types, i)), $i, $(Meta.QuoteNode(names[i]))
);
)
end
end
for i in Base.OneTo(N)
)
end
return :(Table{Union{$(stypes...)}})
else
stypes = if types === nothing
stypes = if types === nothing
(scitype(Tables.getcolumn(cols, names[i])) for i in Base.OneTo(N))
else
(
scitype(Tables.getcolumn(cols, fieldtype(types, i), i, names[i]))
scitype(Tables.getcolumn(cols, fieldtype(types, i), i, names[i]))
for i in Base.OneTo(N)
)
end
@@ -225,10 +260,10 @@ Return the element scientific type of an abstract array `A`. By definition, if
"""
elscitype(X) = elscitype(collect(X))
elscitype(X::Arr) = eltype(scitype(X))

"""
scitype_union(A)
Return the type union, over all elements `x` generated by the iterable `A`,
of `scitype(x)`. See also [`scitype`](@ref).
"""
scitype_union(X) = ST.scitype_union(X, DefaultConvention())
scitype_union(X) = ST.scitype_union(X, DefaultConvention())
19 changes: 14 additions & 5 deletions test/convention/scitype.jl
@@ -65,10 +65,10 @@ end
@test scitype(Any[categorical(1:4)...]) == Vec{Multiclass{4}}
@test scitype(categorical([1, missing, 3])) ==
Vec{Union{Multiclass{2},Missing}}

a = ["aa", "bb", "aa", "bb"] |> categorical
@test scitype(a[1]) == Multiclass{2}

# NOTE: the slice here does not contain missings but the machine type
# still contains a missing so the scitype remains with a missing
@test scitype(categorical([1, missing, 3])[1:1]) ==
@@ -169,15 +169,15 @@ end
) == r
# ExtremelyWide row oriented table
@test ST._rows_scitype(
rows,
rows,
Tables.Schema(
Tables.columnnames(iterate(rows, 1)[1]),
(Int, Int, CategoricalValue{Char, UInt32}, Float64);
stored = true
)
) == r

# test schema for column oriented tables with number of columns
# test schema for column oriented tables with number of columns
# exceeding COLS_SPECIALIZATION_THRESHOLD.
nt = NamedTuple{
Tuple(Symbol("x$i") for i in Base.OneTo(ST.COLS_SPECIALIZATION_THRESHOLD + 1))
@@ -189,6 +189,15 @@ end
#issue 146
X = Tables.table(coerce(rand("abc", 5, 3), Multiclass))
@test scitype(X) === Table{AbstractVector{Multiclass{3}}}

# dictionaries are not tables:
_s(str) = SubString(str, 1)
@test !(scitype(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])) <: Table)
@test !(scitype(Dict(_s("a") => [1, 2, 3], _s("b") => [4, 5, 6])) <: Table)

# vectors of dictionaries are not tables:
@test !(scitype([Dict("a" => 1), Dict("a" => 2)]) <: Table)
@test !(scitype([Dict(_s("a") => 1), Dict(_s("a") => 2)]) <: Table)
end

# TODO: re-instate when julia 1.0 is no longer LTS release:
@@ -199,4 +208,4 @@ end
# file = CSV.File("test.csv")
# @test scitype(file) == scitype(X)
# rm("test.csv")
# end
# end
4 changes: 2 additions & 2 deletions test/schema.jl
@@ -89,7 +89,7 @@ end
z = categorical(collect("asdfa")),
w = rand(5)
)
s = schema(X)
s = schema(X)
@test s.scitypes == (Continuous, Count, Multiclass{4}, Continuous)
@test s.types == (Float64, Int64, CategoricalValue{Char,UInt32}, Float64)

@@ -157,4 +157,4 @@ end
)

end