Merge pull request #189 from JuliaAI/exclude-dictionary-tables
Exclude certain kinds of dictionary tables from being interpreted as tables
ablaom committed Oct 12, 2022
2 parents 6ab77cf + fb34924 commit 3f7cbc0
Showing 5 changed files with 99 additions and 34 deletions.
20 changes: 13 additions & 7 deletions docs/src/index.md
@@ -134,12 +134,21 @@ Type `T` | `scitype(x)` for `x::T` | package/module required
`AbstractArray{<:Gray,2}` | `GrayImage{W,H}` where `(W, H) = size(x)` | ColorTypes.jl
`AbstractArray{<:AbstractRGB,2}` | `ColorImage{W,H}` where `(W, H) = size(x)` | ColorTypes.jl
`PersistenceDiagram` | `PersistenceDiagram` | PersistenceDiagramsBase
any table type `T` supported by Tables.jl | `Table{K}` where `K=Union{column_scitypes...}` | Tables.jl
(*) any table type `T` | `Table{K}` where `K=Union{column_scitypes...}` | Tables.jl
`CorpusLoaders.TaggedWord` | `Annotated{Textual}` | CorpusLoaders.jl
`CorpusLoaders.Document{AbstractVector{Q}}` | `Annotated{AbstractVector{Scitype(Q)}}` | CorpusLoaders.jl
`AbstractDict{<:AbstractString,<:Integer}` | `Multiset{Textual}` |
`AbstractDict{<:TaggedWord,<:Integer}` | `Multiset{Annotated{Textual}}` | CorpusLoaders.jl

(*) More precisely, any object `X` for which `Tables.istable(X) == true` has
`scitype(X) = Table{K}`, where `K` is the union of the column scitypes, with the following
exceptions: abstract dictionaries with `AbstractString` keys, and abstract vectors of
abstract dictionaries with `AbstractString` keys, are not considered tables by
ScientificTypes.jl. Prior to Tables.jl 1.8, `Tables.istable(X) == false` held for these
objects, but this behaviour changed in releases 1.8 and 1.10. These changes were breaking
for ScientificTypes.jl, which accordingly enforces the old behaviour as far as
`scitype` is concerned.
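
As an illustrative sketch (not part of this commit), the exclusions described in (*) can be checked at the REPL. It assumes a ScientificTypes.jl release that includes this change; the variable names are hypothetical.

```julia
using ScientificTypes

# Dictionary of columns with string keys: not treated as a table
columns = Dict("a" => [1, 2, 3], "b" => [4, 5, 6])
scitype(columns) <: Table    # false

# Vector of dictionaries (rows) with string keys: not treated as a table
rows = [Dict("a" => 1), Dict("a" => 2)]
scitype(rows) <: Table       # false

# A named tuple of vectors is still a table
nt = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])
scitype(nt) <: Table         # true
```

The two `false` results mirror the tests added to `test/convention/scitype.jl` in this commit.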

*Experimental* and subject to change in new minor or patch release

Here `nlevels(x) = length(levels(x.pool))`.
@@ -149,11 +158,8 @@ Here `nlevels(x) = length(levels(x.pool))`.
- We regard the built-in Julia types `Missing` and `Nothing` as scientific types.
- `Finite{N}`, `Multiclass{N}` and `OrderedFactor{N}` are all parameterized by the number of levels `N`. We export the alias `Binary = Finite{2}`.
- `Image{W,H}`, `GrayImage{W,H}` and `ColorImage{W,H}` are all parameterized by the image width and height dimensions, `(W, H)`.
- `Sampleable{K}` and
`Density{K} <: Sampleable{K}` are parameterized by the sample space scitype.
- On objects for which the default convention has nothing to say, the
`scitype` function returns `Unknown`.

- `Sampleable{K}` and `Density{K} <: Sampleable{K}` are parameterized by the sample space scitype.
- On objects for which the default convention has nothing to say, the `scitype` function returns `Unknown`.
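
A brief REPL sketch of a few of the notes above, assuming ScientificTypes.jl is installed (it is an illustration, not text from the documentation being changed):

```julia
using ScientificTypes

# `Binary` is an exported alias for `Finite{2}`
Binary === Finite{2}    # true

# `Missing` and `Nothing` are regarded as scientific types
scitype(missing)        # Missing
scitype(nothing)        # Nothing

# Objects the default convention says nothing about get `Unknown`;
# `Char` is one such case under the default convention
scitype('c')            # Unknown
```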

### Special note on binary data

@@ -319,7 +325,7 @@ schema(data)
scitype(data)
```

Similarly, any table implementing the Tables interface has scitype
Similarly, any table (see (*) above for the definition) has scitype
`Table{K}`, where `K` is the union of the scitypes of its columns.

Table scitypes are useful for dispatch and type checks, as shown here,
19 changes: 17 additions & 2 deletions src/ScientificTypes.jl
@@ -1,4 +1,4 @@
module ScientificTypes

# Dependencies
using Reexport
@@ -27,11 +27,26 @@ const SCHEMA_SPECIALIZATION_THRESHOLD = Tables.SCHEMA_SPECIALIZATION_THRESHOLD

#---------------------------------------------------------------------------------------
# Define convention

struct DefaultConvention <: Convention end
const CONV = DefaultConvention()

# -------------------------------------------------------------
# vtrait function, returns either `Val{:table}()` or `Val{:other}()`
vtrait(X) = Val{ifelse(Tables.istable(X), :table, :other)}()

# To address https://github.com/JuliaData/Tables.jl/issues/306:
const DictColumnsWithStringKeys = AbstractDict{K, V} where {
    K <: AbstractString,
    V <: AbstractVector
}
const DictRowsWithStringKeys = AbstractVector{T} where {
    T <: AbstractDict{<:AbstractString}
}
_istable(::DictColumnsWithStringKeys) = false
_istable(::DictRowsWithStringKeys) = false
_istable(something_else) = Tables.istable(something_else)

vtrait(X) = Val{ifelse(_istable(X), :table, :other)}()
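
The dispatch pattern above can be tried in isolation. The following is a minimal sketch without the Tables.jl dependency: `fallback_istable` is a hypothetical stand-in for `Tables.istable` that claims every object is a table, so the effect of the two overriding methods is visible.

```julia
# Hypothetical stand-in for `Tables.istable`: pretends everything is a table.
fallback_istable(x) = true

# Same aliases as in the commit.
const DictColumnsWithStringKeys = AbstractDict{K, V} where {
    K <: AbstractString,
    V <: AbstractVector
}
const DictRowsWithStringKeys = AbstractVector{T} where {
    T <: AbstractDict{<:AbstractString}
}

# The more specific methods win over the generic fallback.
_istable(::DictColumnsWithStringKeys) = false
_istable(::DictRowsWithStringKeys) = false
_istable(something_else) = fallback_istable(something_else)

vtrait(X) = Val{ifelse(_istable(X), :table, :other)}()

vtrait(Dict("a" => [1, 2, 3]))           # Val{:other}()
vtrait([Dict("a" => 1), Dict("a" => 2)]) # Val{:other}()
vtrait(Dict(:a => [1, 2, 3]))            # Val{:table}(), symbol keys reach the fallback
```

Because the exclusions are expressed as methods on type aliases, string-keyed dictionaries are intercepted before the fallback is consulted, whatever it returns.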

# -------------------------------------------------------------
# Includes
71 changes: 53 additions & 18 deletions src/convention/scitype.jl
@@ -1,36 +1,71 @@
# -----------------------------------------------------------------------------------------
# This file includes the single argument definition of `scitype` method and corresponding
# convenience functions. It also includes definitions for `scitype/Scitype` of different
# objects with respect to the `DefaultConvention`.
# -----------------------------------------------------------------------------------------
#
#-----------------------------------------------------------------------------------------
"""
The scientific type (interpretation) of `X`, distinct from its
machine type.
scitype(X)
The scientific type (interpretation) of `X`, as distinct from its machine type. Atomic
scientific types (`Continuous`, `Multiclass`, etc.) are mostly abstract types defined in
the package ScientificTypesBase.jl. Scientific types do not ordinarily have instances.
### Examples
```
julia> scitype(3.14)
Continuous
julia> scitype([1, 2, missing])
AbstractVector{Union{Missing, Count}}
julia> scitype((5, "beige"))
Tuple{Count, Textual}
julia> using CategoricalArrays
julia> X = (gender = categorical(['M', 'M', 'F', 'M', 'F']),
julia> table = (gender = categorical(['M', 'M', 'F', 'M', 'F']),
ndevices = [1, 3, 2, 3, 2])
julia> scitype(X)
julia> scitype(table)
Table{Union{AbstractVector{Count}, AbstractVector{Multiclass{2}}}}
```
Column scitypes of a table can also be inspected with [`schema`](@ref).
The behavior of `scitype` is detailed in the [ScientificTypes
documentation](https://juliaai.github.io/ScientificTypes.jl/dev/#Summary-of-the-default-convention).
Key features of the default behavior are:
- Any `AbstractFloat` has scitype `Continuous <: Infinite`.
- Any `Integer` has scitype `Count <: Infinite`.
- Any `CategoricalValue` `x` has scitype `Multiclass <: Finite` or
`OrderedFactor <: Finite`, depending on the value of `isordered(x)`.
- `String`s and `Char`s do *not* have scitype `Multiclass` or
`OrderedFactor`; they have scitypes `Textual` and `Unknown`
respectively.
- The scientific types of `nothing` and `missing` are `Nothing` and
`Missing`, Julia types that are also regarded as scientific.
!!! note
Third party packages may extend the behavior of `scitype`: Objects
previously having `Unknown` scitype may no longer do so.
See also [`coerce`](@ref), [`autotype`](@ref), [`schema`](@ref).
"""
scitype(X) = ST.scitype(X, CONV)

function ST.scitype(@nospecialize(X), C::DefaultConvention)
    return _scitype(X, C, vtrait(X))
end

function _scitype(X, C, ::Val{:other})
@@ -98,7 +133,7 @@ end
function _cols_scitype(cols, sch::Tables.Schema{names, types}) where {names, types}
N = length(names)
if N <= COLS_SPECIALIZATION_THRESHOLD
return __cols_scitype(cols, sch)
else
scitypes = if types === nothing
Type[scitype(Tables.getcolumn(cols, names[i])) for i in Base.OneTo(N)]
@@ -109,21 +144,21 @@ function _cols_scitype(cols, sch::Tables.Schema{names, types}) where {names, typ
cols, fieldtype(types, i), i, names[i])
) for i in Base.OneTo(N)
]

end
return Table{Union{scitypes...}}
end
end

@inline function __cols_scitype(
cols,
sch::Tables.Schema{names, types}
) where {names, types}
N = length(names)
if @generated
stypes = if types === nothing
(
:(scitype(Tables.getcolumn(cols, $(Meta.QuoteNode(names[i])))))
for i in Base.OneTo(N)
)
else
@@ -134,17 +169,17 @@ end
cols, $(fieldtype(types, i)), $i, $(Meta.QuoteNode(names[i]))
);
)
end
for i in Base.OneTo(N)
)
end
return :(Table{Union{$(stypes...)}})
else
stypes = if types === nothing
(scitype(Tables.getcolumn(cols, names[i])) for i in Base.OneTo(N))
else
(
scitype(Tables.getcolumn(cols, fieldtype(types, i), i, names[i]))
for i in Base.OneTo(N)
)
end
@@ -225,10 +260,10 @@ Return the element scientific type of an abstract array `A`. By definition, if
"""
elscitype(X) = elscitype(collect(X))
elscitype(X::Arr) = eltype(scitype(X))

"""
scitype_union(A)
Return the type union, over all elements `x` generated by the iterable `A`,
of `scitype(x)`. See also [`scitype`](@ref).
"""
scitype_union(X) = ST.scitype_union(X, DefaultConvention())
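
For illustration only (not part of the commit), here is how `elscitype` and `scitype_union` behave on small inputs, assuming ScientificTypes.jl is installed:

```julia
using ScientificTypes

# Element scitype of an array: the `eltype` of its scitype
elscitype([1, 2, missing])          # Union{Missing, Count}

# Union of `scitype(x)` over the elements of an iterable
mixed = Any[1, 2.0, "three"]
scitype_union(mixed)                # Union{Continuous, Count, Textual}
```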
19 changes: 14 additions & 5 deletions test/convention/scitype.jl
@@ -65,10 +65,10 @@ end
@test scitype(Any[categorical(1:4)...]) == Vec{Multiclass{4}}
@test scitype(categorical([1, missing, 3])) ==
Vec{Union{Multiclass{2},Missing}}

a = ["aa", "bb", "aa", "bb"] |> categorical
@test scitype(a[1]) == Multiclass{2}

# NOTE: the slice here does not contain missings but the machine type
# still contains a missing so the scitype remains with a missing
@test scitype(categorical([1, missing, 3])[1:1]) ==
@@ -169,15 +169,15 @@ end
) == r
# ExtremelyWide row oriented table
@test ST._rows_scitype(
rows,
Tables.Schema(
Tables.columnnames(iterate(rows, 1)[1]),
(Int, Int, CategoricalValue{Char, UInt32}, Float64);
stored = true
)
) == r

# test schema for column oriented tables with number of columns
# exceeding COLS_SPECIALIZATION_THRESHOLD.
nt = NamedTuple{
Tuple(Symbol("x$i") for i in Base.OneTo(ST.COLS_SPECIALIZATION_THRESHOLD + 1))
@@ -189,6 +189,15 @@ end
#issue 146
X = Tables.table(coerce(rand("abc", 5, 3), Multiclass))
@test scitype(X) === Table{AbstractVector{Multiclass{3}}}

# dictionaries are not tables:
_s(str) = SubString(str, 1)
@test !(scitype(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])) <: Table)
@test !(scitype(Dict(_s("a") => [1, 2, 3], _s("b") => [4, 5, 6])) <: Table)

# vectors of dictionaries are not tables:
@test !(scitype([Dict("a" => 1), Dict("a" => 2)]) <: Table)
@test !(scitype([Dict(_s("a") => 1), Dict(_s("a") => 2)]) <: Table)
end

# TODO: re-instate when julia 1.0 is no longer LTS release:
@@ -199,4 +208,4 @@ end
# file = CSV.File("test.csv")
# @test scitype(file) == scitype(X)
# rm("test.csv")
# end
4 changes: 2 additions & 2 deletions test/schema.jl
@@ -89,7 +89,7 @@ end
z = categorical(collect("asdfa")),
w = rand(5)
)
s = schema(X)
@test s.scitypes == (Continuous, Count, Multiclass{4}, Continuous)
@test s.types == (Float64, Int64, CategoricalValue{Char,UInt32}, Float64)

@@ -157,4 +157,4 @@ end
)

end

