From b3a27a7154d0aab4e1c86445abe9e28d9d7b42e2 Mon Sep 17 00:00:00 2001
From: "Documenter.jl"
Date: Fri, 3 Jan 2025 22:41:08 +0000
Subject: [PATCH] build based on 3e0d056

---
 dev/.documenter-siteinfo.json |  2 +-
 dev/apiindex/index.html       | 14 +++++++-------
 dev/implementation/index.html |  2 +-
 dev/index.html                |  2 +-
 dev/using/index.html          |  2 +-
 5 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json
index cbfb53ff..49b0e678 100644
--- a/dev/.documenter-siteinfo.json
+++ b/dev/.documenter-siteinfo.json
@@ -1 +1 @@
-{"documenter":{"julia_version":"1.11.2","generation_timestamp":"2024-12-30T21:50:08","documenter_version":"1.8.0"}}
\ No newline at end of file
+{"documenter":{"julia_version":"1.11.2","generation_timestamp":"2025-01-03T22:41:03","documenter_version":"1.8.0"}}
\ No newline at end of file
diff --git a/dev/apiindex/index.html b/dev/apiindex/index.html
index c20246d8..01607df0 100644
--- a/dev/apiindex/index.html
+++ b/dev/apiindex/index.html
@@ -1,7 +1,7 @@
 API index · CategoricalArrays

API Index

CategoricalArrays.CategoricalArrayType
CategoricalArray{T}(undef, dims::Dims; levels=nothing, ordered=false)
 CategoricalArray{T}(undef, dims::Int...; levels=nothing, ordered=false)

Construct an uninitialized CategoricalArray with levels of type T <: Union{AbstractChar, AbstractString, Number} and dimensions dims.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

CategoricalArray{T, N, R}(undef, dims::Dims; levels=nothing, ordered=false)
-CategoricalArray{T, N, R}(undef, dims::Int...; levels=nothing, ordered=false)

Similar to definition above, but uses reference type R instead of the default type (UInt32).

CategoricalArray(A::AbstractArray; levels=nothing, ordered=false)

Construct a new CategoricalArray with the values from A and the same element type.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). If levels is omitted and the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If A is already a CategoricalArray, its levels, orderedness and reference type are preserved unless explicitly overridden.

source
CategoricalArrays.CategoricalMatrixType
CategoricalMatrix{T}(undef, m::Int, n::Int; levels=nothing, ordered=false)

Construct an uninitialized CategoricalMatrix with levels of type T <: Union{AbstractChar, AbstractString, Number} and dimensions m × n. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

CategoricalMatrix{T, R}(undef, m::Int, n::Int; levels=nothing, ordered=false)

Similar to definition above, but uses reference type R instead of the default type (UInt32).

CategoricalMatrix(A::AbstractMatrix; levels=nothing, ordered=false)

Construct a CategoricalMatrix with the values from A and the same element type.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). If levels is omitted and the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If A is already a CategoricalMatrix, its levels, orderedness and reference type are preserved unless explicitly overridden.

source
CategoricalArrays.CategoricalValueType
CategoricalValue{T <: Union{AbstractChar, AbstractString, Number}, R <: Integer}

A wrapper around a value of type T corresponding to a level in a CategoricalPool.

CategoricalValue objects are considered as equal to the value of type T they wrap by == and isequal. However, order comparisons like < and isless are only possible if isordered is true for the value's pool, and in that case the order of the pool's levels is used rather than the standard ordering of values of type T.

source
CategoricalArrays.CategoricalValueMethod
CategoricalValue(value, source::Union{CategoricalValue, CategoricalArray})

Return a CategoricalValue object wrapping value and attached to the CategoricalPool of source.

source
CategoricalArrays.CategoricalVectorType
CategoricalVector{T}(undef, m::Int; levels=nothing, ordered=false)

Construct an uninitialized CategoricalVector with levels of type T <: Union{AbstractChar, AbstractString, Number} and length m.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

CategoricalVector{T, R}(undef, m::Int; levels=nothing, ordered=false)

Similar to definition above, but uses reference type R instead of the default type (UInt32).

CategoricalVector(A::AbstractVector; levels=nothing, ordered=false)

Construct a CategoricalVector with the values from A and the same element type.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). If levels is omitted and the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If A is already a CategoricalVector, its levels, orderedness and reference type are preserved unless explicitly overridden.

source
CategoricalArrays.categoricalMethod
categorical(A::AbstractArray; levels=nothing, ordered=false, compress=false)

Construct a categorical array with the values from A.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). If levels is omitted and the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If compress is true, the smallest reference type able to hold the number of unique values in A will be used. While this will reduce memory use, passing this parameter will also introduce a type instability which can affect performance inside the function where the call is made. Therefore, use this option with caution (the one-argument version does not suffer from this problem).

categorical(A::CategoricalArray; compress=false, levels=nothing, ordered=false)

If A is already a CategoricalArray, its levels, orderedness and reference type are preserved unless explicitly overridden.

source
CategoricalArrays.compressMethod
compress(A::CategoricalArray)

Return a copy of categorical array A using the smallest reference type able to hold the number of levels of A.

While this will reduce memory use, this function is type-unstable, which can affect performance inside the function where the call is made. Therefore, use it with caution.

source
CategoricalArrays.cutMethod
cut(x::AbstractArray, breaks::AbstractVector;
+CategoricalArray{T, N, R}(undef, dims::Int...; levels=nothing, ordered=false)

Similar to definition above, but uses reference type R instead of the default type (UInt32).

CategoricalArray(A::AbstractArray; levels=nothing, ordered=false)

Construct a new CategoricalArray with the values from A and the same element type.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). If levels is omitted and the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If A is already a CategoricalArray, its levels, orderedness and reference type are preserved unless explicitly overridden.
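
As an illustration of the levels and ordered keywords described above, here is a minimal sketch. It assumes the CategoricalArrays package is installed; the sample data and the variable names x and y are made up for this example.

```julia
using CategoricalArrays

# Levels are given explicitly: "high" never occurs in the data but is still a level.
x = CategoricalArray(["low", "mid", "low"]; levels=["low", "mid", "high"], ordered=true)

levels(x)      # ["low", "mid", "high"]
isordered(x)   # true
x[1] < x[2]    # true: "low" precedes "mid" in the level order

# Without `levels`, string levels are sorted in ascending order.
y = CategoricalArray(["b", "a", "b"])
levels(y)      # ["a", "b"]
```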

source
CategoricalArrays.CategoricalMatrixType
CategoricalMatrix{T}(undef, m::Int, n::Int; levels=nothing, ordered=false)

Construct an uninitialized CategoricalMatrix with levels of type T <: Union{AbstractChar, AbstractString, Number} and dimensions m × n. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

CategoricalMatrix{T, R}(undef, m::Int, n::Int; levels=nothing, ordered=false)

Similar to definition above, but uses reference type R instead of the default type (UInt32).

CategoricalMatrix(A::AbstractMatrix; levels=nothing, ordered=false)

Construct a CategoricalMatrix with the values from A and the same element type.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). If levels is omitted and the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If A is already a CategoricalMatrix, its levels, orderedness and reference type are preserved unless explicitly overridden.

source
CategoricalArrays.CategoricalValueType
CategoricalValue{T <: Union{AbstractChar, AbstractString, Number}, R <: Integer}

A wrapper around a value of type T corresponding to a level in a CategoricalPool.

CategoricalValue objects are considered as equal to the value of type T they wrap by == and isequal. However, order comparisons like < and isless are only possible if isordered is true for the value's pool, and in that case the order of the pool's levels is used rather than the standard ordering of values of type T.
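
The difference between equality and order comparisons can be sketched as follows. The level names are hypothetical and are deliberately listed in non-alphabetical order, so that the pool's ordering visibly differs from the standard ordering of strings.

```julia
using CategoricalArrays

# "small" is declared before "large", the opposite of alphabetical order.
x = categorical(["small", "large"]; levels=["small", "large"], ordered=true)
small, large = x[1], x[2]

small == "small"          # true: equality looks through the wrapper
isequal(large, "large")   # true
small < large             # true: "small" comes first in the pool's levels
"small" < "large"         # false: plain strings compare alphabetically
```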

source
CategoricalArrays.CategoricalValueMethod
CategoricalValue(value, source::Union{CategoricalValue, CategoricalArray})

Return a CategoricalValue object wrapping value and attached to the CategoricalPool of source.

source
CategoricalArrays.CategoricalVectorType
CategoricalVector{T}(undef, m::Int; levels=nothing, ordered=false)

Construct an uninitialized CategoricalVector with levels of type T <: Union{AbstractChar, AbstractString, Number} and length m.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

CategoricalVector{T, R}(undef, m::Int; levels=nothing, ordered=false)

Similar to definition above, but uses reference type R instead of the default type (UInt32).

CategoricalVector(A::AbstractVector; levels=nothing, ordered=false)

Construct a CategoricalVector with the values from A and the same element type.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). If levels is omitted and the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If A is already a CategoricalVector, its levels, orderedness and reference type are preserved unless explicitly overridden.

source
CategoricalArrays.categoricalMethod
categorical(A::AbstractArray; levels=nothing, ordered=false, compress=false)

Construct a categorical array with the values from A.

The levels keyword argument can be a vector specifying possible values for the data (this is equivalent to but more efficient than calling levels! on the resulting array). If levels is omitted and the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If compress is true, the smallest reference type able to hold the number of unique values in A will be used. While this will reduce memory use, passing this parameter will also introduce a type instability which can affect performance inside the function where the call is made. Therefore, use this option with caution (the one-argument version does not suffer from this problem).

categorical(A::CategoricalArray; compress=false, levels=nothing, ordered=false)

If A is already a CategoricalArray, its levels, orderedness and reference type are preserved unless explicitly overridden.

source
CategoricalArrays.compressMethod
compress(A::CategoricalArray)

Return a copy of categorical array A using the smallest reference type able to hold the number of levels of A.

While this will reduce memory use, this function is type-unstable, which can affect performance inside the function where the call is made. Therefore, use it with caution.
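
A short sketch of the effect on the reference type, assuming the default reference type UInt32 mentioned in the constructor documentation; the data is invented for the example.

```julia
using CategoricalArrays

x = categorical(["a", "b", "a", "c"])   # default reference type: UInt32
y = compress(x)                         # three levels fit in UInt8

x isa CategoricalArray{String, 1, UInt32}   # true
y isa CategoricalArray{String, 1, UInt8}    # true
levels(y) == levels(x)                      # true: the levels are unchanged
y == x                                      # true: the values are unchanged
```

Because the type of y depends on the number of levels found at run time, the compiler cannot infer it, which is the type instability the paragraph above warns about.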

source
CategoricalArrays.cutMethod
cut(x::AbstractArray, breaks::AbstractVector;
     labels::Union{AbstractVector,Function},
     extend::Union{Bool,Missing}=false, allowempty::Bool=false)

Cut a numeric array into intervals at values breaks and return an ordered CategoricalArray indicating the interval into which each entry falls. Intervals are of the form [lower, upper), i.e. the lower bound is included and the upper bound is excluded, except the last interval, which is closed on both ends, i.e. [lower, upper].

If x accepts missing values (i.e. eltype(x) >: Missing) the returned array will also accept them.

Keyword arguments

  • extend::Union{Bool, Missing}=false: when false, an error is raised if some values in x fall outside of the breaks; when true, breaks are automatically added to include all values in x; when missing, values outside of the breaks generate missing entries.
  • labels::Union{AbstractVector, Function}: a vector of strings, characters or numbers giving the names to use for the intervals; or a function f(from, to, i; leftclosed, rightclosed) that generates the labels from the left and right interval boundaries and the group index. Defaults to "[from, to)" (or "[from, to]" for the rightmost interval if extend == true).
  • allowempty::Bool=false: when false, an error is raised if some breaks other than the last one appear multiple times, generating empty intervals; when true, duplicate breaks are allowed and the intervals they generate are kept as unused levels (but duplicate labels are not allowed).
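
The interaction of labels and extend=missing can be sketched like this. The ages data and the label names are hypothetical, and explicit labels are passed so that the example does not depend on the default label format.

```julia
using CategoricalArrays

ages = [3, 17, 42, 95]

# Breaks 0, 18 and 65 define two intervals; 95 lies outside them.
groups = cut(ages, [0, 18, 65]; labels=["minor", "adult"], extend=missing)

levels(groups)      # ["minor", "adult"]
isordered(groups)   # true: cut returns an ordered array
unwrap.(groups)     # ["minor", "minor", "adult", missing]
```

With the default extend=false the same call would be expected to raise an error instead, since 95 falls outside the breaks.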

Examples

julia> using CategoricalArrays
 
@@ -46,9 +46,9 @@
  "grp 1 (-1.0//-0.3333333333333335)"
  "grp 2 (-0.3333333333333335//0.33333333333333326)"
  "grp 3 (0.33333333333333326//1.0)"
- "grp 3 (0.33333333333333326//1.0)"
source
CategoricalArrays.cutMethod
cut(x::AbstractArray, ngroups::Integer;
     labels::Union{AbstractVector{<:AbstractString},Function},
-    allowempty::Bool=false)

Cut a numeric array into ngroups quantiles, determined using quantile.

If x contains missing values, they are automatically skipped when computing quantiles.

Keyword arguments

  • labels::Union{AbstractVector, Function}: a vector of strings, characters or numbers giving the names to use for the intervals; or a function f(from, to, i; leftclosed, rightclosed) that generates the labels from the left and right interval boundaries and the group index. Defaults to "Qi: [from, to)" (or "Qi: [from, to]" for the rightmost interval).
  • allowempty::Bool=false: when false, an error is raised if some quantile breakpoints other than the last one are equal, generating empty intervals; when true, duplicate breaks are allowed and the intervals they generate are kept as unused levels (but duplicate labels are not allowed).
source
CategoricalArrays.decompressMethod
decompress(A::CategoricalArray)

Return a copy of categorical array A using the default reference type (UInt32). If A is using a small reference type (such as UInt8 or UInt16) the decompressed array will have room for more levels.

To avoid the need to call decompress, ensure compress is not called when creating the categorical array.

source
CategoricalArrays.isorderedMethod
isordered(A::CategoricalArray)

Test whether entries in A can be compared using <, > and similar operators, using the ordering of levels.

source
CategoricalArrays.levels!Method
levels!(A::CategoricalArray, newlevels::Vector; allowmissing::Bool=false)

Set the levels of categorical array A. The order of appearance of levels will be respected by levels, which may affect display of results in some operations; if A is ordered (see isordered), it will also be used for order comparisons using <, > and similar operators. Reordering levels will never affect the values of entries in the array.

If A accepts missing values (i.e. eltype(A) >: Missing) and allowmissing=true, entries corresponding to omitted levels will be set to missing. Else, newlevels must include all levels which appear in the data.

source
CategoricalArrays.ordered!Method
ordered!(A::CategoricalArray, ordered::Bool)

Set whether entries in A can be compared using <, > and similar operators, using the ordering of levels. Return the modified A.

source
CategoricalArrays.recodeFunction
recode(a::AbstractArray[, default::Any], pairs::Pair...)

Return a copy of a, replacing elements matching a key of pairs with the corresponding value. The type of the array is chosen so that it can hold all recoded elements (but not necessarily original elements from a).

For each Pair in pairs, if the element is equal to (according to isequal) or in the key (first item of the pair), then the corresponding value (second item) is used. If the element matches no key and default is not provided or nothing, it is copied as-is; if default is specified, it is used in place of the original element. If an element matches more than one key, the first match is used.

recode(a::CategoricalArray[, default::Any], pairs::Pair...)

If a is a CategoricalArray then the ordering of resulting levels is determined by the order of passed pairs and default will be the last level if provided.

Examples

julia> using CategoricalArrays
+    allowempty::Bool=false)

Cut a numeric array into ngroups quantiles, determined using quantile.

If x contains missing values, they are automatically skipped when computing quantiles.

Keyword arguments

  • labels::Union{AbstractVector, Function}: a vector of strings, characters or numbers giving the names to use for the intervals; or a function f(from, to, i; leftclosed, rightclosed) that generates the labels from the left and right interval boundaries and the group index. Defaults to "Qi: [from, to)" (or "Qi: [from, to]" for the rightmost interval).
  • allowempty::Bool=false: when false, an error is raised if some quantile breakpoints other than the last one are equal, generating empty intervals; when true, duplicate breaks are allowed and the intervals they generate are kept as unused levels (but duplicate labels are not allowed).
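
A minimal sketch of quantile-based cutting, with made-up data and label names. Explicit labels are used so that the result does not depend on the default "Qi: ..." label format.

```julia
using CategoricalArrays

x = collect(1:10)

# Two groups split at the median of x.
halves = cut(x, 2; labels=["lower half", "upper half"])

levels(halves)                      # ["lower half", "upper half"]
count(==("lower half"), halves)     # 5
count(==("upper half"), halves)     # 5
```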
source
CategoricalArrays.decompressMethod
decompress(A::CategoricalArray)

Return a copy of categorical array A using the default reference type (UInt32). If A is using a small reference type (such as UInt8 or UInt16) the decompressed array will have room for more levels.

To avoid the need to call decompress, ensure compress is not called when creating the categorical array.

source
CategoricalArrays.isorderedMethod
isordered(A::CategoricalArray)

Test whether entries in A can be compared using <, > and similar operators, using the ordering of levels.

source
CategoricalArrays.levels!Method
levels!(A::CategoricalArray, newlevels::Vector; allowmissing::Bool=false)

Set the levels of categorical array A. The order of appearance of levels will be respected by levels, which may affect display of results in some operations; if A is ordered (see isordered), it will also be used for order comparisons using <, > and similar operators. Reordering levels will never affect the values of entries in the array.

If A accepts missing values (i.e. eltype(A) >: Missing) and allowmissing=true, entries corresponding to omitted levels will be set to missing. Else, newlevels must include all levels which appear in the data.
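
Both behaviours described above, reordering and dropping a level that is in use, can be sketched as follows; the data is invented for the example.

```julia
using CategoricalArrays

x = categorical(["a", "b", "c", "a"])
levels!(x, ["c", "b", "a"])    # reorder the levels; entries keep their values
levels(x)                      # ["c", "b", "a"]
unwrap.(x)                     # ["a", "b", "c", "a"]

# Omitting a level that is in use requires missing support and allowmissing=true.
y = categorical(Union{String, Missing}["a", "b", "a"])
levels!(y, ["a"]; allowmissing=true)
ismissing(y[2])                # true: the "b" entry became missing
```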

source
CategoricalArrays.ordered!Method
ordered!(A::CategoricalArray, ordered::Bool)

Set whether entries in A can be compared using <, > and similar operators, using the ordering of levels. Return the modified A.

source
CategoricalArrays.recodeFunction
recode(a::AbstractArray[, default::Any], pairs::Pair...)

Return a copy of a, replacing elements matching a key of pairs with the corresponding value. The type of the array is chosen so that it can hold all recoded elements (but not necessarily original elements from a).

For each Pair in pairs, if the element is equal to (according to isequal) or in the key (first item of the pair), then the corresponding value (second item) is used. If the element matches no key and default is not provided or nothing, it is copied as-is; if default is specified, it is used in place of the original element. If an element matches more than one key, the first match is used.

recode(a::CategoricalArray[, default::Any], pairs::Pair...)

If a is a CategoricalArray then the ordering of resulting levels is determined by the order of passed pairs and default will be the last level if provided.

Examples

julia> using CategoricalArrays
 
 julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1)
 10-element Vector{Int64}:
@@ -76,7 +76,7 @@
    8
   -1
   -1    
-
source
CategoricalArrays.recode!Function
recode!(dest::AbstractArray, src::AbstractArray[, default::Any], pairs::Pair...)

Fill dest with elements from src, replacing those matching a key of pairs with the corresponding value.

For each Pair in pairs, if the element is equal to (according to isequal) the key (first item of the pair) or to one of its entries if it is a collection, then the corresponding value (second item) is copied to dest. If the element matches no key and default is not provided or nothing, it is copied as-is; if default is specified, it is used in place of the original element. dest and src must be of the same length, but not necessarily of the same type. Elements of src as well as values from pairs will be converted when possible on assignment. If an element matches more than one key, the first match is used.

recode!(dest::CategoricalArray, src::AbstractArray[, default::Any], pairs::Pair...)

If dest is a CategoricalArray then the ordering of resulting levels is determined by the order of passed pairs and default will be the last level if provided.

recode!(dest::AbstractArray, src::AbstractArray{>:Missing}[, default::Any], pairs::Pair...)

If src contains missing values, they are never replaced with default: use missing in a pair to recode them.

source
CategoricalArrays.recode!Method
recode!(a::AbstractArray[, default::Any], pairs::Pair...)

Convenience function for in-place recoding, equivalent to recode!(a, a, ...).

Examples

julia> using CategoricalArrays
+
source
CategoricalArrays.recode!Function
recode!(dest::AbstractArray, src::AbstractArray[, default::Any], pairs::Pair...)

Fill dest with elements from src, replacing those matching a key of pairs with the corresponding value.

For each Pair in pairs, if the element is equal to (according to isequal) the key (first item of the pair) or to one of its entries if it is a collection, then the corresponding value (second item) is copied to dest. If the element matches no key and default is not provided or nothing, it is copied as-is; if default is specified, it is used in place of the original element. dest and src must be of the same length, but not necessarily of the same type. Elements of src as well as values from pairs will be converted when possible on assignment. If an element matches more than one key, the first match is used.

recode!(dest::CategoricalArray, src::AbstractArray[, default::Any], pairs::Pair...)

If dest is a CategoricalArray then the ordering of resulting levels is determined by the order of passed pairs and default will be the last level if provided.

recode!(dest::AbstractArray, src::AbstractArray{>:Missing}[, default::Any], pairs::Pair...)

If src contains missing values, they are never replaced with default: use missing in a pair to recode them.

source
CategoricalArrays.recode!Method
recode!(a::AbstractArray[, default::Any], pairs::Pair...)

Convenience function for in-place recoding, equivalent to recode!(a, a, ...).

Examples

julia> using CategoricalArrays
 
 julia> x = collect(1:10);
 
@@ -93,6 +93,6 @@
    7
    8
   -1
-  -1
source
DataAPI.levelsMethod
levels(x::CategoricalArray; skipmissing=true)
-levels(x::CategoricalValue)

Return the levels of categorical array or value x. This may include levels which do not actually appear in the data (see droplevels!). missing will be included only if it appears in the data and skipmissing=false is passed.

The returned vector is an internal field of x which must not be mutated as doing so would corrupt it.

source
DataAPI.unwrapMethod
unwrap(x::CategoricalValue)
-unwrap(x::Missing)

Get the value wrapped by categorical value x. If x is missing, return missing.

source
+  -1
source
DataAPI.levelsMethod
levels(x::CategoricalArray; skipmissing=true)
+levels(x::CategoricalValue)

Return the levels of categorical array or value x. This may include levels which do not actually appear in the data (see droplevels!). missing will be included only if it appears in the data and skipmissing=false is passed.

The returned vector is an internal field of x which must not be mutated as doing so would corrupt it.
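
The points above (unused levels, droplevels! and skipmissing) can be sketched as follows. The data is made up, and copy is used wherever the levels need to be kept, since the returned vector must not be mutated.

```julia
using CategoricalArrays

x = categorical(["a", "b"]; levels=["a", "b", "c"])
before = copy(levels(x))    # ["a", "b", "c"]: "c" is unused but still a level
droplevels!(x)
after = copy(levels(x))     # ["a", "b"]

z = categorical(["a", missing])
levels(z)                       # ["a"]
levels(z; skipmissing=false)    # ["a", missing]
```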

source
DataAPI.unwrapMethod
unwrap(x::CategoricalValue)
+unwrap(x::Missing)

Get the value wrapped by categorical value x. If x is missing, return missing.
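
A brief sketch with invented data, showing that indexing yields a wrapper while unwrap yields the underlying value:

```julia
using CategoricalArrays

x = categorical(["a", missing])

x[1] isa CategoricalValue    # true: indexing returns the wrapper
unwrap(x[1]) isa String      # true: unwrap returns the wrapped String
unwrap(x[2]) === missing     # true: missing passes through unchanged
```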

source
diff --git a/dev/implementation/index.html b/dev/implementation/index.html
index 0ed8fc58..c6acc009 100644
--- a/dev/implementation/index.html
+++ b/dev/implementation/index.html
@@ -1,2 +1,2 @@
-Implementation details · CategoricalArrays

Implementation details

A CategoricalArray is made of two fields:

  • refs: an integer array that stores the position of the category level in the levels field of CategoricalPool for each CategoricalArray element; 0 denotes a missing value (for CategoricalArray{Union{T, Missing}} only).
  • pool: the CategoricalPool object that maintains the levels of the array.

The CategoricalPool{V,R,C} type keeps track of the levels of type V and associates them with an integer reference code of type R (for internal use). It offers methods to add new levels, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of CategoricalArray are ordered or not is defined by an ordered field of the pool.

Do note that CategoricalPool levels are semi-mutable: it is only allowed to add new levels, but never to remove or reorder existing ones. This ensures existing CategoricalValue objects remain valid and always point to the same level as when they were created. Therefore, CategoricalArrays create a new pool each time some of their levels are removed or reordered. This happens when calling levels!, but also when assigning a CategoricalValue via setindex!, push!, append!, copy! or copyto! (as new levels may be added to the front to preserve relative order of both source and destination levels). Doing so requires updating all reference codes to point to the new pool, and makes it impossible to compare existing ordered CategoricalValue objects with values from the array using < and >.

The type parameters of CategoricalArray{T, N, R <: Integer, V, C, U} are a bit complex:

  • T is the type of array elements without CategoricalValue wrappers; if T >: Missing, then the array supports missing values.
  • N is the number of array dimensions.
  • R is the reference type, the element type of the refs field; it allows optimizing memory usage depending on the number of levels (e.g. a CategoricalArray with fewer than 256 levels can use R = UInt8).
  • V is the type of the levels; it is equal to T for arrays which do not support missing values; for arrays which support missing values, T = Union{V, Missing}.
  • C is the type of categorical values, i.e. of the objects returned when indexing non-missing elements of CategoricalArray. It is always equal to CategoricalValue{V, R}, and only present for technical reasons (to break the recursive dependency between CategoricalArray and CategoricalValue).
  • U can be either Union{} for arrays which do not support missing values, or Missing for those which support them.

Only T, N and R can be specified upon construction. The last three parameters are chosen automatically, but are needed for the definition of the type. In particular, U allows expressing that CategoricalArray{T, N} inherits from AbstractArray{Union{C, U}, N} (which is equivalent to AbstractArray{C, N} for arrays which do not support missing values, and to AbstractArray{Union{C, Missing}, N} for those which support them).

The CategoricalPool type is designed to limit the need to go over all elements of the vector, either for reading or for writing. This is why unused levels are not dropped automatically (this would force checking all elements on every modification or keeping a counts table), but only when droplevels! is called. levels is a (very fast) O(1) operation since it merely returns the (ordered) vector of levels without accessing the data at all.

Scalar operations between CategoricalValue objects or between a CategoricalValue and a CategoricalArray generally require checking whether pools are equal or whether one is a superset of the other. In order to make these operations efficient, CategoricalPool stores a pointer to the last encountered equal pool in the equalto field, and a pointer to the last encountered strict superset pool in the subsetof field. The hash of the levels is computed the first time it is needed and stored in the hash field. These optimizations mean that when looping over values in an array, the cost of comparing pools only has to be paid once.

+Implementation details · CategoricalArrays

Implementation details

A CategoricalArray is made of two fields:

  • refs: an integer array that stores the position of the category level in the levels field of CategoricalPool for each CategoricalArray element; 0 denotes a missing value (for CategoricalArray{Union{T, Missing}} only).
  • pool: the CategoricalPool object that maintains the levels of the array.
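
The relationship between refs and the levels can be sketched as follows, using invented data. levelcode is the public accessor for an element's position among the levels; reading x.refs directly relies on the internal field named above and is shown only for illustration.

```julia
using CategoricalArrays

x = categorical(["b", "a", missing, "b"])

levels(x)        # ["a", "b"]
levelcode.(x)    # [2, 1, missing, 2]: position of each entry in levels(x)

# Internal field access, for illustration only: 0 encodes the missing entry.
x.refs           # UInt32[2, 1, 0, 2]
```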

The CategoricalPool{V,R,C} type keeps track of the levels of type V and associates them with an integer reference code of type R (for internal use). It offers methods to add new levels, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of CategoricalArray are ordered or not is defined by an ordered field of the pool.

Do note that CategoricalPool levels are semi-mutable: it is only allowed to add new levels, but never to remove or reorder existing ones. This ensures existing CategoricalValue objects remain valid and always point to the same level as when they were created. Therefore, CategoricalArrays create a new pool each time some of their levels are removed or reordered. This happens when calling levels!, but also when assigning a CategoricalValue via setindex!, push!, append!, copy! or copyto! (as new levels may be added to the front to preserve relative order of both source and destination levels). Doing so requires updating all reference codes to point to the new pool, and makes it impossible to compare existing ordered CategoricalValue objects with values from the array using < and >.
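
Under the behaviour described above, a CategoricalValue extracted before the levels are reordered should keep seeing the original pool. This is a sketch with invented data, not a guarantee for every package version:

```julia
using CategoricalArrays

x = categorical(["a", "b", "c"])
v = x[1]                          # v is attached to the pool x uses at this point
old = copy(levels(v))             # ["a", "b", "c"]

levels!(x, ["c", "b", "a"])       # reordering gives x a new pool

levels(x)                         # ["c", "b", "a"]
levels(v)                         # ["a", "b", "c"]: v still refers to the old pool
unwrap(v)                         # "a": v still points to the same level
```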

The type parameters of CategoricalArray{T, N, R <: Integer, V, C, U} are a bit complex:

  • T is the type of array elements without CategoricalValue wrappers; if T >: Missing, then the array supports missing values.
  • N is the number of array dimensions.
  • R is the reference type, the element type of the refs field; it allows optimizing memory usage depending on the number of levels (e.g. a CategoricalArray with fewer than 256 levels can use R = UInt8).
  • V is the type of the levels; it is equal to T for arrays which do not support missing values, while for arrays which support missing values, T = Union{V, Missing}.
  • C is the type of categorical values, i.e. of the objects returned when indexing non-missing elements of CategoricalArray. It is always equal to CategoricalValue{V, R}, and only present for technical reasons (to break the recursive dependency between CategoricalArray and CategoricalValue).
  • U can be either Union{} for arrays which do not support missing values, or Missing for those which support them.

Only T, N and R can be specified upon construction. The last three parameters are chosen automatically, but are needed for the definition of the type. In particular, U allows expressing that CategoricalArray{T, N} inherits from AbstractArray{Union{C, U}, N} (which is equivalent to AbstractArray{C, N} for arrays which do not support missing values, and to AbstractArray{Union{C, Missing}, N} for those which support them).
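The type parameters listed above can be checked on small arrays. This is a minimal sketch; it reads the internal `refs` field only to show the reference type, and the variable names are arbitrary.

```julia
using CategoricalArrays

x = categorical(["a", "b", "a"])
y = categorical(["a", missing])

# T, N and R are the leading parameters; V, C and U are filled in automatically
println(typeof(x))
println(eltype(x))   # C, the CategoricalValue type
println(eltype(y))   # Union{C, Missing} when missing values are supported

# compress returns a copy using the smallest reference type that fits the levels
z = compress(x)
println(eltype(z.refs))
```

With only two levels, the compressed copy can use a one-byte reference type instead of the default.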

The CategoricalPool type is designed to limit the need to go over all elements of the vector, either for reading or for writing. This is why unused levels are not dropped automatically (this would force checking all elements on every modification or keeping a counts table), but only when droplevels! is called. levels is a (very fast) O(1) operation since it merely returns the (ordered) vector of levels without accessing the data at all.
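The following sketch illustrates that unused levels stay in the pool until droplevels! is called. The variable name `x` is arbitrary.

```julia
using CategoricalArrays

x = categorical(["a", "b", "c"])
x[3] = "a"              # "c" no longer appears in the data

println(levels(x))      # "c" is still listed as a level

droplevels!(x)          # scans the data and removes unused levels
println(levels(x))
```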

Scalar operations between CategoricalValue objects or between a CategoricalValue and a CategoricalArray generally require checking whether pools are equal or whether one is a superset of the other. To make these operations efficient, CategoricalPool stores a pointer to the last encountered equal pool in the equalto field, and a pointer to the last encountered strict superset pool in the subsetof field. The hash of the levels is computed the first time it is needed and stored in the hash field. These optimizations mean that when looping over values in an array, the cost of comparing pools only has to be paid once.

diff --git a/dev/index.html b/dev/index.html
index e3d44b65..1175f46b 100644
--- a/dev/index.html
+++ b/dev/index.html
@@ -1,2 +1,2 @@
-Overview · CategoricalArrays

Overview

The package provides the CategoricalArray type designed to hold categorical data (either unordered/nominal or ordered/ordinal) efficiently and conveniently. CategoricalArray{T} holds values of type T. The CategoricalArray{Union{T, Missing}} variant can also contain missing values (represented as missing, of the Missing type). When indexed, CategoricalArray{T} returns special CategoricalValue{T} objects rather than the original values of type T. CategoricalValue is a simple wrapper around the categorical levels; it allows very efficient retrieval and comparison of actual values. See the PooledArrays.jl and IndirectArrays.jl packages for simpler array types storing data with a small number of values without wrapping them.

The main feature of CategoricalArray is that it maintains a pool of the levels which can appear in the data. These levels are stored in a specific order: for unordered arrays, this order is only used for pretty printing (e.g. in cross tables or plots); for ordered arrays, it also allows comparing values using the < and > operators: the comparison is then based on the ordering of levels stored in the array. An ordered CategoricalValue can also be compared with a value that, once converted, is equal to one of the levels of this CategoricalValue. Whether an array is ordered can be defined either on construction via the ordered argument, or at any time via the ordered! function. The levels function returns all the levels of CategoricalArray, and the levels! function can be used to set the levels and their order. Levels are also automatically extended when setting an array element to a level not encountered before. However, levels are never removed without manual intervention: use the droplevels! function for this.

+Overview · CategoricalArrays

Overview

The package provides the CategoricalArray type designed to hold categorical data (either unordered/nominal or ordered/ordinal) efficiently and conveniently. CategoricalArray{T} holds values of type T. The CategoricalArray{Union{T, Missing}} variant can also contain missing values (represented as missing, of the Missing type). When indexed, CategoricalArray{T} returns special CategoricalValue{T} objects rather than the original values of type T. CategoricalValue is a simple wrapper around the categorical levels; it allows very efficient retrieval and comparison of actual values. See the PooledArrays.jl and IndirectArrays.jl packages for simpler array types storing data with a small number of values without wrapping them.
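To make the wrapping behavior described above concrete, here is a minimal sketch; the variable names are arbitrary.

```julia
using CategoricalArrays

x = categorical(["a", "b", "a"])
v = x[1]

println(v isa CategoricalValue)   # indexing returns a wrapper, not a String
println(unwrap(v))                # the original value
println(v == "a")                 # the wrapper compares equal to the value

# The Union{T, Missing} variant returns `missing` for missing entries
y = categorical(["a", missing])
println(ismissing(y[2]))
```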

The main feature of CategoricalArray is that it maintains a pool of the levels which can appear in the data. These levels are stored in a specific order: for unordered arrays, this order is only used for pretty printing (e.g. in cross tables or plots); for ordered arrays, it also allows comparing values using the < and > operators: the comparison is then based on the ordering of levels stored in the array. An ordered CategoricalValue can also be compared with a value that, once converted, is equal to one of the levels of this CategoricalValue. Whether an array is ordered can be defined either on construction via the ordered argument, or at any time via the ordered! function. The levels function returns all the levels of CategoricalArray, and the levels! function can be used to set the levels and their order. Levels are also automatically extended when setting an array element to a level not encountered before. However, levels are never removed without manual intervention: use the droplevels! function for this.
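The sketch below walks through the behaviors described above: ordered comparison, comparison with a plain value, and automatic extension of levels. The arrays `sizes` and `colors` are made up for illustration, and the extension step uses an unordered array.

```julia
using CategoricalArrays

# Ordered array: comparisons follow the order of the levels, not the alphabet
sizes = CategoricalArray(["low", "high", "medium"];
                         levels=["low", "medium", "high"], ordered=true)
println(sizes[1] < sizes[2])     # "low" < "high"
println(sizes[1] < "medium")     # comparison with a plain value matching a level

# Unordered array: assigning an unseen value extends the levels
colors = categorical(["red", "blue"])
println(isordered(colors))
colors[1] = "green"
println(levels(colors))

# Orderedness can be changed at any time
ordered!(colors, true)
println(isordered(colors))
```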

diff --git a/dev/using/index.html b/dev/using/index.html
index 6e05c08a..6de24319 100644
--- a/dev/using/index.html
+++ b/dev/using/index.html
@@ -205,4 +205,4 @@
 "c"
 julia> isordered(ab2)
-false

The resulting array is marked as ordered only if all the source array(s) are ordered, with the exception that unordered arrays with no levels do not prompt the result to be marked as unordered. In particular, this allows assignment of a CategoricalValue to an empty CategoricalArray via setindex! to copy the levels of the source value and to mark the result as ordered.

Note that in some cases the two sets of levels may have compatible orderings, but it is not possible to determine in what order levels should appear in the merged set. This is the case for example with ["a", "b", "d"] and ["c", "d", "e"]: there is no way to detect that "c" should be inserted exactly after "b" (lexicographic ordering is not relevant here). In such cases, the resulting array is marked as unordered. This situation can only happen when working with data subsets selected based on non-contiguous subsets of levels.

Exported functions

categorical(A) - Construct a categorical array with values from A

compress(A) - Return a copy of categorical array A using the smallest possible reference type

cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray

decompress(A) - Return a copy of categorical array A using the default reference type

isordered(A) - Test whether entries in A can be compared using <, > and similar operators

ordered!(A, ordered) - Set whether entries in A can be compared using <, > and similar operators

recode(a[, default], pairs...) - Return a copy of a after replacing one or more values

recode!(a[, default], pairs...) - Replace one or more values in a in-place

unwrap(x) - Return the value contained in categorical value x; if x is missing, return missing

levelcode(x) - Return the code of categorical value x, i.e. its index in the set of possible values returned by levels(x).

See API Index for more details.

+false

The resulting array is marked as ordered only if all the source array(s) are ordered, with the exception that unordered arrays with no levels do not prompt the result to be marked as unordered. In particular, this allows assignment of a CategoricalValue to an empty CategoricalArray via setindex! to copy the levels of the source value and to mark the result as ordered.

Note that in some cases the two sets of levels may have compatible orderings, but it is not possible to determine in what order levels should appear in the merged set. This is the case for example with ["a", "b", "d"] and ["c", "d", "e"]: there is no way to detect that "c" should be inserted exactly after "b" (lexicographic ordering is not relevant here). In such cases, the resulting array is marked as unordered. This situation can only happen when working with data subsets selected based on non-contiguous subsets of levels.
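The ambiguous case above can be reproduced with the two level sets from the text. This sketch assumes concatenation with vcat, as in the preceding examples; the variable names are arbitrary.

```julia
using CategoricalArrays

a = CategoricalArray(["a", "b", "d"]; ordered=true)
b = CategoricalArray(["c", "d", "e"]; ordered=true)

ab = vcat(a, b)

# Neither source defines how "b" and "c" relate, so the result is unordered
println(isordered(ab))
println(levels(ab))
```

If the order is known from context, it can be restored afterwards by calling levels! with the intended order and then ordered!(ab, true).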

Exported functions

categorical(A) - Construct a categorical array with values from A

compress(A) - Return a copy of categorical array A using the smallest possible reference type

cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray

decompress(A) - Return a copy of categorical array A using the default reference type

isordered(A) - Test whether entries in A can be compared using <, > and similar operators

ordered!(A, ordered) - Set whether entries in A can be compared using <, > and similar operators

recode(a[, default], pairs...) - Return a copy of a after replacing one or more values

recode!(a[, default], pairs...) - Replace one or more values in a in-place

unwrap(x) - Return the value contained in categorical value x; if x is missing, return missing

levelcode(x) - Return the code of categorical value x, i.e. its index in the set of possible values returned by levels(x).

See API Index for more details.
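As a quick tour of several of the functions listed above, the sketch below applies them to small made-up inputs. The internal `refs` field is read only to show the reference type.

```julia
using CategoricalArrays

x = categorical(["a", "b", "c", "a"])

# compress / decompress switch between the smallest and the default reference type
c = compress(x)
d = decompress(c)
println((eltype(c.refs), eltype(d.refs)))

# levelcode and unwrap work on individual categorical values
println(levelcode(x[2]))
println(unwrap(x[2]))

# recode returns a modified copy and leaves `x` untouched
r = recode(x, "a" => "z")
println(unwrap.(r))

# cut bins numeric data into ordered intervals defined by the given breaks
bins = cut([1, 5, 10], [0, 5, 11])
println(isordered(bins))
```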