diff --git a/dev/composition/index.html b/dev/composition/index.html index 4693adf5..b0bbebbe 100644 --- a/dev/composition/index.html +++ b/dev/composition/index.html @@ -1,2 +1,2 @@ -
Settings
This document was generated with Documenter.jl version 0.27.25 on Monday 6 May 2024. Using Julia version 1.10.3.
Load it with DelimitedFiles and Tables
data_raw, data_header = readdlm(fpath, ',', header=true)
data_table = Tables.table(data_raw; header=Symbol.(vec(data_header)))
Retrieve the conversions:
for (n, st) in zip(names(data), scitype_union.(eachcol(data)))
println(":$n=>$st,")
-end
Copy and paste the result in a coerce
data_table = coerce(data_table, ...)
MLJBase.load_dataset
— Methodload_dataset(fpath, coercions)
Load one of the standard datasets, such as Boston, assuming the file is a comma-separated file with a header.
MLJBase.load_sunspots
— MethodLoad a well-known sunspot time series (table with one column). [https://www.sws.bom.gov.au/Educational/2/3/6](https://www.sws.bom.gov.au/Educational/2/3/6)
MLJBase.@load_ames
— MacroLoad the full version of the well-known Ames Housing task.
MLJBase.@load_boston
— MacroLoad a well-known public regression dataset with Continuous
features.
MLJBase.@load_crabs
— MacroLoad a well-known crab classification dataset with nominal features.
MLJBase.@load_iris
— MacroLoad a well-known public classification task with nominal features.
MLJBase.@load_reduced_ames
— MacroLoad a reduced version of the well-known Ames Housing task
MLJBase.@load_smarket
— MacroLoad S&P Stock Market dataset, as used in [An Introduction to Statistical Learning with applications in R](https://rdrr.io/cran/ISLR/man/Smarket.html), by Witten et al. (2013), Springer-Verlag, New York.
MLJBase.@load_sunspots
— MacroLoad a well-known sunspot time series (single table with one column).
MLJBase.x
— Constantfinalize_Xy(X, y, shuffle, as_table, eltype, rng; clf)
Internal function to finalize the make_*
functions.
MLJBase.augment_X
— Methodaugment_X(X, fit_intercept)
Given a matrix X
, append a column of ones if fit_intercept
is true. See make_regression
.
MLJBase.make_blobs
— FunctionX, y = make_blobs(n=100, p=2; kwargs...)
Generate Gaussian blobs for clustering and classification problems.
Return value
By default, a table X
with p
columns (features) and n
rows (observations), together with a corresponding vector of n
Multiclass
target observations y
, indicating blob membership.
Keyword arguments
shuffle=true
: whether to shuffle the resulting points,
centers=3
: either a number of centers or a c x p
matrix with c
pre-determined centers,
cluster_std=1.0
: the standard deviation(s) of each blob,
center_box=(-10. => 10.)
: the limits of the p
-dimensional cube within which the cluster centers are drawn if they are not provided,
eltype=Float64
: machine type of points (any subtype of AbstractFloat
).
rng=Random.GLOBAL_RNG
: any AbstractRNG
object, or integer to seed a MersenneTwister
(for reproducibility).
as_table=true
: whether to return the points as a table (true) or a matrix (false). If false
the target y
has integer element type.
Example
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
MLJBase.make_circles
— FunctionX, y = make_circles(n=100; kwargs...)
Generate n
labeled points close to two concentric circles for classification and clustering models.
Return value
By default, a table X
with 2
columns and n
rows (observations), together with a corresponding vector of n
Multiclass
target observations y
. The target is either 0
or 1
, corresponding to membership in the smaller or larger circle, respectively.
Keyword arguments
shuffle=true
: whether to shuffle the resulting points,
noise=0
: standard deviation of the Gaussian noise added to the data,
factor=0.8
: ratio of the smaller radius over the larger one,
eltype=Float64
: machine type of points (any subtype of AbstractFloat
).
rng=Random.GLOBAL_RNG
: any AbstractRNG
object, or integer to seed a MersenneTwister
(for reproducibility).
as_table=true
: whether to return the points as a table (true) or a matrix (false). If false
the target y
has integer element type.
Example
X, y = make_circles(100; noise=0.5, factor=0.3)
MLJBase.make_moons
— Function make_moons(n::Int=100; kwargs...)
Generates labeled two-dimensional points lying close to two interleaved semi-circles, for use with classification and clustering models.
Return value
By default, a table X
with 2
columns and n
rows (observations), together with a corresponding vector of n
Multiclass
target observations y
. The target is either 0
or 1
, corresponding to membership in the left or right semi-circle, respectively.
Keyword arguments
shuffle=true
: whether to shuffle the resulting points,
noise=0.1
: standard deviation of the Gaussian noise added to the data,
xshift=1.0
: horizontal translation of the second center with respect to the first one.
yshift=0.3
: vertical translation of the second center with respect to the first one.
eltype=Float64
: machine type of points (any subtype of AbstractFloat
).
rng=Random.GLOBAL_RNG
: any AbstractRNG
object, or integer to seed a MersenneTwister
(for reproducibility).
as_table=true
: whether to return the points as a table (true) or a matrix (false). If false
the target y
has integer element type.
Example
X, y = make_moons(100; noise=0.5)
MLJBase.make_regression
— Functionmake_regression(n, p; kwargs...)
Generate Gaussian input features and a linear response with Gaussian noise, for use with regression models.
Return value
By default, a tuple (X, y)
where table X
has p
columns and n
rows (observations), together with a corresponding vector of n
Continuous
target observations y
.
Keywords
intercept=true
: Whether to generate data from a model with intercept.
n_targets=1
: Number of columns in the target.
sparse=0
: Proportion of the generating weight vector that is sparse.
noise=0.1
: Standard deviation of the Gaussian noise added to the response (target).
outliers=0
: Proportion of the response vector to turn into outliers by adding a random quantity with high variance. (Only applied if binary
is false
.)
as_table=true
: Whether X
(and y
, if n_targets > 1
) should be a table or a matrix.
eltype=Float64
: Element type for X
and y
. Must subtype AbstractFloat
.
binary=false
: Whether the target should be binarized (via a sigmoid).
eltype=Float64
: machine type of points (any subtype of AbstractFloat
).
rng=Random.GLOBAL_RNG
: any AbstractRNG
object, or integer to seed a MersenneTwister
(for reproducibility).
as_table=true
: whether to return the points as a table (true) or a matrix (false).
Example
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
MLJBase.outlify!
— MethodAdd outliers to portion s of vector.
MLJBase.runif_ab
— Methodrunif_ab(rng, n, p, a, b)
Internal function to generate n
points in [a, b]ᵖ
uniformly at random.
MLJBase.sigmoid
— Methodsigmoid(x)
Return the sigmoid computed in a numerically stable way:
$σ(x) = 1/(1+exp(-x))$
MLJBase.sparsify!
— Methodsparsify!(rng, θ, s)
Make portion s
of vector θ
exactly 0.
MLJBase.complement
— Methodcomplement(folds, i)
The complement of the i
th fold of folds
in the concatenation of all elements of folds
. Here folds
is a vector or tuple of integer vectors, typically representing row indices of a vector, matrix or table.
complement(([1,2], [3,], [4, 5]), 2) # [1, 2, 4, 5]
MLJBase.corestrict
— Methodcorestrict(X, folds, i)
The restriction of X
, a vector, matrix or table, to the complement of the i
th fold of folds
, where folds
is a tuple of vectors of row indices.
The method is curried, so that corestrict(folds, i)
is the operator on data defined by corestrict(folds, i)(X) = corestrict(X, folds, i)
.
Example
folds = ([1, 2], [3, 4, 5], [6,])
-corestrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x1, :x2, :x6]
MLJBase.partition
— Methodpartition(X, fractions...;
+end
Copy and paste the result in a coerce
data_table = coerce(data_table, ...)
MLJBase.load_dataset
— Methodload_dataset(fpath, coercions)
Load one of the standard datasets, such as Boston, assuming the file is a comma-separated file with a header.
MLJBase.load_sunspots
— MethodLoad a well-known sunspot time series (table with one column). [https://www.sws.bom.gov.au/Educational/2/3/6](https://www.sws.bom.gov.au/Educational/2/3/6)
MLJBase.@load_ames
— MacroLoad the full version of the well-known Ames Housing task.
MLJBase.@load_boston
— MacroLoad a well-known public regression dataset with Continuous
features.
MLJBase.@load_crabs
— MacroLoad a well-known crab classification dataset with nominal features.
MLJBase.@load_iris
— MacroLoad a well-known public classification task with nominal features.
MLJBase.@load_reduced_ames
— MacroLoad a reduced version of the well-known Ames Housing task
MLJBase.@load_smarket
— MacroLoad S&P Stock Market dataset, as used in [An Introduction to Statistical Learning with applications in R](https://rdrr.io/cran/ISLR/man/Smarket.html), by Witten et al. (2013), Springer-Verlag, New York.
MLJBase.@load_sunspots
— MacroLoad a well-known sunspot time series (single table with one column).
MLJBase.augment_X
— Methodaugment_X(X, fit_intercept)
Given a matrix X
, append a column of ones if fit_intercept
is true. See make_regression
.
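The behaviour described for augment_X can be illustrated with a minimal sketch in plain Julia. The name sketch_augment_X is hypothetical and this is not MLJBase's source; it only assumes that the column of ones is appended on the right and shares the element type of X.

```julia
# Hypothetical stand-in for the documented behaviour of augment_X.
function sketch_augment_X(X::AbstractMatrix, fit_intercept::Bool)
    fit_intercept || return X
    # Append one column of ones with the same element type as X.
    return hcat(X, ones(eltype(X), size(X, 1)))
end

A = [1.0 2.0; 3.0 4.0]
A1 = sketch_augment_X(A, true)    # 2×3 matrix, last column all ones
A0 = sketch_augment_X(A, false)   # returned unchanged
```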
MLJBase.finalize_Xy
— Methodfinalize_Xy(X, y, shuffle, as_table, eltype, rng; clf)
Internal function to finalize the make_*
functions.
MLJBase.make_blobs
— FunctionX, y = make_blobs(n=100, p=2; kwargs...)
Generate Gaussian blobs for clustering and classification problems.
Return value
By default, a table X
with p
columns (features) and n
rows (observations), together with a corresponding vector of n
Multiclass
target observations y
, indicating blob membership.
Keyword arguments
shuffle=true
: whether to shuffle the resulting points,
centers=3
: either a number of centers or a c x p
matrix with c
pre-determined centers,
cluster_std=1.0
: the standard deviation(s) of each blob,
center_box=(-10. => 10.)
: the limits of the p
-dimensional cube within which the cluster centers are drawn if they are not provided,
eltype=Float64
: machine type of points (any subtype of AbstractFloat
).
rng=Random.GLOBAL_RNG
: any AbstractRNG
object, or integer to seed a MersenneTwister
(for reproducibility).
as_table=true
: whether to return the points as a table (true) or a matrix (false). If false
the target y
has integer element type.
Example
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
MLJBase.make_circles
— FunctionX, y = make_circles(n=100; kwargs...)
Generate n
labeled points close to two concentric circles for classification and clustering models.
Return value
By default, a table X
with 2
columns and n
rows (observations), together with a corresponding vector of n
Multiclass
target observations y
. The target is either 0
or 1
, corresponding to membership in the smaller or larger circle, respectively.
Keyword arguments
shuffle=true
: whether to shuffle the resulting points,
noise=0
: standard deviation of the Gaussian noise added to the data,
factor=0.8
: ratio of the smaller radius over the larger one,
eltype=Float64
: machine type of points (any subtype of AbstractFloat
).
rng=Random.GLOBAL_RNG
: any AbstractRNG
object, or integer to seed a MersenneTwister
(for reproducibility).
as_table=true
: whether to return the points as a table (true) or a matrix (false). If false
the target y
has integer element type.
Example
X, y = make_circles(100; noise=0.5, factor=0.3)
MLJBase.make_moons
— Functionmake_moons(n::Int=100; kwargs...)
Generates labeled two-dimensional points lying close to two interleaved semi-circles, for use with classification and clustering models.
Return value
By default, a table X
with 2
columns and n
rows (observations), together with a corresponding vector of n
Multiclass
target observations y
. The target is either 0
or 1
, corresponding to membership in the left or right semi-circle, respectively.
Keyword arguments
shuffle=true
: whether to shuffle the resulting points,
noise=0.1
: standard deviation of the Gaussian noise added to the data,
xshift=1.0
: horizontal translation of the second center with respect to the first one.
yshift=0.3
: vertical translation of the second center with respect to the first one.
eltype=Float64
: machine type of points (any subtype of AbstractFloat
).
rng=Random.GLOBAL_RNG
: any AbstractRNG
object, or integer to seed a MersenneTwister
(for reproducibility).
as_table=true
: whether to return the points as a table (true) or a matrix (false). If false
the target y
has integer element type.
Example
X, y = make_moons(100; noise=0.5)
MLJBase.make_regression
— Functionmake_regression(n, p; kwargs...)
Generate Gaussian input features and a linear response with Gaussian noise, for use with regression models.
Return value
By default, a tuple (X, y)
where table X
has p
columns and n
rows (observations), together with a corresponding vector of n
Continuous
target observations y
.
Keywords
intercept=true
: Whether to generate data from a model with intercept.
n_targets=1
: Number of columns in the target.
sparse=0
: Proportion of the generating weight vector that is sparse.
noise=0.1
: Standard deviation of the Gaussian noise added to the response (target).
outliers=0
: Proportion of the response vector to turn into outliers by adding a random quantity with high variance. (Only applied if binary
is false
.)
as_table=true
: Whether X
(and y
, if n_targets > 1
) should be a table or a matrix.
eltype=Float64
: Element type for X
and y
. Must subtype AbstractFloat
.
binary=false
: Whether the target should be binarized (via a sigmoid).
eltype=Float64
: machine type of points (any subtype of AbstractFloat
).
rng=Random.GLOBAL_RNG
: any AbstractRNG
object, or integer to seed a MersenneTwister
(for reproducibility).
as_table=true
: whether to return the points as a table (true) or a matrix (false).
Example
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
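The generative model described above (Gaussian features, a linear response, Gaussian noise) can be sketched in a few lines of plain Julia. This is an illustration only and not the procedure used by make_regression: the name sketch_regression is hypothetical, the weight vector is assumed to be Gaussian, and the intercept, sparsity, outlier and table options are omitted.

```julia
using Random

# Hypothetical sketch of a linear-Gaussian data generator.
function sketch_regression(rng::AbstractRNG, n::Int, p::Int; noise=0.1)
    X = randn(rng, n, p)                  # Gaussian input features
    θ = randn(rng, p)                     # generating weights (assumed Gaussian)
    y = X * θ .+ noise .* randn(rng, n)   # linear response plus Gaussian noise
    return X, y
end

Xs, ys = sketch_regression(MersenneTwister(0), 100, 5; noise=0.5)
```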
MLJBase.outlify!
— MethodAdd outliers to portion s of vector.
MLJBase.runif_ab
— Methodrunif_ab(rng, n, p, a, b)
Internal function to generate n
points in [a, b]ᵖ
uniformly at random.
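Uniform sampling in the cube [a, b]ᵖ amounts to rescaling draws from [0, 1). The following sketch, with the hypothetical name sketch_runif_ab, assumes one point per row; it is not MLJBase's implementation.

```julia
using Random

# Hypothetical sketch: n points (rows) drawn uniformly from [a, b]^p.
sketch_runif_ab(rng::AbstractRNG, n, p, a, b) = a .+ (b - a) .* rand(rng, n, p)

U = sketch_runif_ab(MersenneTwister(1), 50, 3, -10.0, 10.0)
```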
MLJBase.sigmoid
— Methodsigmoid(x)
Return the sigmoid computed in a numerically stable way: $σ(x) = 1/(1+exp(-x))$
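One common way to evaluate this formula stably is to branch on the sign of x, so that exp is only ever applied to a non-positive argument and cannot overflow. The sketch below uses the hypothetical name stable_sigmoid and is not necessarily how MLJBase computes it.

```julia
# Hypothetical sketch of a numerically stable sigmoid.
function stable_sigmoid(x::Real)
    if x >= 0
        return 1 / (1 + exp(-x))   # exp(-x) <= 1, no overflow
    else
        z = exp(x)                 # exp(x) < 1, no overflow
        return z / (1 + z)         # algebraically equal to 1 / (1 + exp(-x))
    end
end

s_mid = stable_sigmoid(0)
s_low = stable_sigmoid(-1000.0)
s_high = stable_sigmoid(1000.0)
```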
MLJBase.sparsify!
— Methodsparsify!(rng, θ, s)
Make portion s
of vector θ
exactly 0.
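A minimal sketch of the idea, assuming the zeroed entries are chosen uniformly at random and their count is s * length(θ) rounded to the nearest integer. Both choices are assumptions, and sketch_sparsify! is a hypothetical name rather than MLJBase's code.

```julia
using Random

# Hypothetical sketch: zero out a proportion s of the entries of θ, in place.
function sketch_sparsify!(rng::AbstractRNG, θ::AbstractVector, s::Real)
    k = round(Int, s * length(θ))            # number of entries to zero (assumed rounding)
    idx = randperm(rng, length(θ))[1:k]      # k distinct random positions
    θ[idx] .= 0
    return θ
end

w = ones(10)
sketch_sparsify!(MersenneTwister(1), w, 0.3)
```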
MLJBase.complement
— Methodcomplement(folds, i)
The complement of the i
th fold of folds
in the concatenation of all elements of folds
. Here folds
is a vector or tuple of integer vectors, typically representing row indices of a vector, matrix or table.
complement(([1,2], [3,], [4, 5]), 2) # [1, 2, 4, 5]
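The definition can be read as "concatenate every fold except the i-th". A plain-Julia sketch follows, using the hypothetical name sketch_complement and assuming at least two folds; it is not MLJBase's implementation.

```julia
# Hypothetical sketch: concatenate all folds other than fold i.
function sketch_complement(folds, i::Integer)
    others = [f for (k, f) in enumerate(folds) if k != i]
    return reduce(vcat, others)   # errors if there are no other folds
end

c = sketch_complement(([1, 2], [3], [4, 5]), 2)
```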
MLJBase.corestrict
— Methodcorestrict(X, folds, i)
The restriction of X
, a vector, matrix or table, to the complement of the i
th fold of folds
, where folds
is a tuple of vectors of row indices.
The method is curried, so that corestrict(folds, i)
is the operator on data defined by corestrict(folds, i)(X) = corestrict(X, folds, i)
.
Example
folds = ([1, 2], [3, 4, 5], [6,])
+corestrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x1, :x2, :x6]
MLJBase.partition
— Methodpartition(X, fractions...;
shuffle=nothing,
rng=Random.GLOBAL_RNG,
stratify=nothing,
- multi=false)
Splits the vector, matrix or table X
into a tuple of objects of the same type, whose vertical concatenation is X
. The number of rows in each component of the return value is determined by the corresponding fractions
of nrows(X)
, where valid fractions are floats between 0 and 1 whose sum is less than one. The last fraction is not provided, as it is inferred from the preceding ones.
For "synchronized" partitioning of multiple objects, use the multi=true
option described below.
julia> partition(1:1000, 0.8)
+ multi=false)
Splits the vector, matrix or table X
into a tuple of objects of the same type, whose vertical concatenation is X
. The number of rows in each component of the return value is determined by the corresponding fractions
of nrows(X)
, where valid fractions are floats between 0 and 1 whose sum is less than one. The last fraction is not provided, as it is inferred from the preceding ones.
For synchronized partitioning of multiple objects, use the multi=true
option.
julia> partition(1:1000, 0.8)
([1,...,800], [801,...,1000])
julia> partition(1:1000, 0.2, 0.7)
@@ -18,15 +18,13 @@
julia> partition(reshape(1:10, 5, 2), 0.2, 0.4)
([1 6], [2 7; 3 8], [4 9; 5 10])
-X, y = make_blobs() # a table and vector
-Xtrain, Xtest = partition(X, 0.8, stratify=y)
-
-(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.8, rng=123, multi=true)
Keywords
shuffle=nothing
: if set to true
, shuffles the rows before taking fractions.
rng=Random.GLOBAL_RNG
: specifies the random number generator to be used; can be an integer seed. If rng is specified and shuffle === nothing
, then shuffle is interpreted as true.
stratify=nothing
: if a vector is specified, the partition will match the stratification of the given vector. In that case, shuffle
cannot be false
.
multi=false
: if true
then X
is expected to be a tuple
of objects sharing a common length, which are each partitioned separately using the same specified fractions
and the same row shuffling. Returns a tuple of partitions (a tuple of tuples).
MLJBase.restrict
— Methodrestrict(X, folds, i)
The restriction of X
, a vector, matrix or table, to the i
th fold of folds
, where folds
is a tuple of vectors of row indices.
The method is curried, so that restrict(folds, i)
is the operator on data defined by restrict(folds, i)(X) = restrict(X, folds, i)
.
Example
folds = ([1, 2], [3, 4, 5], [6,])
-restrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x3, :x4, :x5]
See also corestrict
MLJBase.skipinvalid
— Methodskipinvalid(itr)
Return an iterator over the elements in itr
skipping missing
and NaN
values. Behaviour is similar to skipmissing
.
skipinvalid(A, B)
For vectors A
and B
of the same length, return a tuple of vectors (A[mask], B[mask])
where mask[i]
is true
if and only if A[i]
and B[i]
are both valid (non-missing
and non-NaN
). Can also be called on other iterators of matching length, such as arrays, but always returns a vector. Does not remove Missing
from the element types if present in the original iterators.
MLJBase.unpack
— Methodunpack(table, f1, f2, ... fk;
+julia> X, y = make_blobs() # a table and vector
+julia> Xtrain, Xtest = partition(X, 0.8, stratify=y)
Here's an example of synchronized partitioning of multiple objects:
julia> (Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.8, rng=123, multi=true)
Keywords
shuffle=nothing
: if set to true
, shuffles the rows before taking fractions.
rng=Random.GLOBAL_RNG
: specifies the random number generator to be used; can be an integer seed. If rng is specified and shuffle === nothing
, then shuffle is interpreted as true.
stratify=nothing
: if a vector is specified, the partition will match the stratification of the given vector. In that case, shuffle
cannot be false
.
multi=false
: if true
then X
is expected to be a tuple
of objects sharing a common length, which are each partitioned separately using the same specified fractions
and the same row shuffling. Returns a tuple of partitions (a tuple of tuples).
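To make the relationship between fractions and row counts concrete, here is a sketch for the unshuffled, unstratified vector case. The name sketch_partition is hypothetical, and the rounding of cumulative fractions to the nearest row is an assumption, not a statement about how partition itself rounds.

```julia
# Hypothetical sketch: split a vector at cumulative fractions of its length.
function sketch_partition(v::AbstractVector, fractions::Real...)
    n = length(v)
    stops = round.(Int, cumsum(collect(fractions)) .* n)  # last row of each leading part
    starts = [1; stops .+ 1]
    stops = [stops; n]                                     # final part takes the remainder
    return Tuple(v[a:b] for (a, b) in zip(starts, stops))
end

two_parts = sketch_partition(1:1000, 0.8)
three_parts = sketch_partition(collect(1:10), 0.2, 0.7)
```

The number of returned parts is always one more than the number of fractions supplied, since the last part is inferred.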
MLJBase.restrict
— Methodrestrict(X, folds, i)
The restriction of X
, a vector, matrix or table, to the i
th fold of folds
, where folds
is a tuple of vectors of row indices.
The method is curried, so that restrict(folds, i)
is the operator on data defined by restrict(folds, i)(X) = restrict(X, folds, i)
.
Example
folds = ([1, 2], [3, 4, 5], [6,])
+restrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x3, :x4, :x5]
See also corestrict
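The pairing of restrict and corestrict, including the curried form, can be sketched for plain vectors as follows. All names prefixed with sketch_ are hypothetical and the code does not handle matrices or tables; it only illustrates the documented semantics.

```julia
# Hypothetical sketches for vectors only.
sketch_restrict(X::AbstractVector, folds, i) = X[folds[i]]

function sketch_corestrict(X::AbstractVector, folds, i)
    # Rows in every fold except fold i.
    rows = reduce(vcat, [f for (k, f) in enumerate(folds) if k != i])
    return X[rows]
end

# Curried forms: fix folds and i, and return an operator on data.
sketch_restrict(folds, i) = X -> sketch_restrict(X, folds, i)
sketch_corestrict(folds, i) = X -> sketch_corestrict(X, folds, i)

folds = ([1, 2], [3, 4, 5], [6])
data = [:x1, :x2, :x3, :x4, :x5, :x6]
inside = sketch_restrict(data, folds, 2)
outside = sketch_corestrict(folds, 2)(data)
```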
MLJBase.skipinvalid
— Methodskipinvalid(itr)
Return an iterator over the elements in itr
skipping missing
and NaN
values. Behaviour is similar to skipmissing
.
skipinvalid(A, B)
For vectors A
and B
of the same length, return a tuple of vectors (A[mask], B[mask])
where mask[i]
is true
if and only if A[i]
and B[i]
are both valid (non-missing
and non-NaN
). Can also be called on other iterators of matching length, such as arrays, but always returns a vector. Does not remove Missing
from the element types if present in the original iterators.
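The two-argument form can be pictured as building one Boolean mask and applying it to both vectors. A minimal sketch follows, with the hypothetical names isinvalid and sketch_skipinvalid; it is not MLJBase's source.

```julia
# Hypothetical sketch: a value is invalid if it is missing or NaN.
isinvalid(x) = ismissing(x) || (x isa Number && isnan(x))

function sketch_skipinvalid(A::AbstractVector, B::AbstractVector)
    length(A) == length(B) || throw(DimensionMismatch("A and B must have equal length"))
    mask = .!(isinvalid.(A) .| isinvalid.(B))   # true where both entries are valid
    return A[mask], B[mask]
end

A = [1.0, missing, 3.0, 4.0]
B = [10.0, 20.0, NaN, 40.0]
A2, B2 = sketch_skipinvalid(A, B)
```

In this sketch, as in the description above, indexing with the mask keeps the original element type, so A2 still admits missing even though none remain.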
MLJBase.unpack
— Methodunpack(table, f1, f2, ... fk;
wrap_singles=false,
shuffle=false,
rng::Union{AbstractRNG,Int,Nothing}=nothing,
- coerce_options...)
Horizontally split any Tables.jl compatible table
into smaller tables or vectors by making column selections determined by the predicates f1
, f2
, ..., fk
. Selection from the column names is without replacement. A predicate is any object f
such that f(name)
is true
or false
for each column name::Symbol
of table
.
Returns a tuple of tables/vectors with length one greater than the number of supplied predicates, with the last component including all previously unselected columns.
julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=["A", "B"])
+ coerce_options...)
Horizontally split any Tables.jl compatible table
into smaller tables or vectors by making column selections determined by the predicates f1
, f2
, ..., fk
. Selection from the column names is without replacement. A predicate is any object f
such that f(name)
is true
or false
for each column name::Symbol
of table
.
Returns a tuple of tables/vectors with length one greater than the number of supplied predicates, with the last component including all previously unselected columns.
julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=["A", "B"])
2×4 DataFrame
Row │ x y z w
│ Int64 Char Float64 String
@@ -51,4 +49,4 @@
julia> W # the column(s) left over
2-element Vector{String}:
"A"
- "B"
Whenever a returned table contains a single column, it is converted to a vector unless wrap_singles=true
.
If coerce_options
are specified then table
is first replaced with coerce(table, coerce_options)
. See ScientificTypes.coerce
for details.
If shuffle=true
then the rows of table
are first shuffled, using the global RNG, unless rng
is specified; if rng
is an integer, it specifies the seed of an automatically generated Mersenne twister. If rng
is specified then shuffle=true
is implicit.
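Selection "without replacement" means each predicate only sees the column names not already claimed by an earlier predicate. The sketch below operates on names alone, under the hypothetical name sketch_select; it does not touch table data and is not the implementation of unpack.

```julia
# Hypothetical sketch: group column names by successive predicates.
function sketch_select(names::Vector{Symbol}, predicates...)
    remaining = copy(names)
    groups = Vector{Vector{Symbol}}()
    for f in predicates
        chosen = filter(f, remaining)      # names this predicate selects
        push!(groups, chosen)
        filter!(!in(chosen), remaining)    # remove them from the pool
    end
    push!(groups, remaining)               # leftover columns form the last group
    return Tuple(groups)
end

groups = sketch_select([:x, :y, :z, :w], ==(:z), !=(:w))
```

The second predicate would also match :z, but :z has already been taken, so it only receives :x and :y.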