tech.ml.dataset Getting Started
What kind of data?
TMD processes tabular data, that is, data logically arranged in rows and columns. Similar to a spreadsheet (but handling much larger datasets) or a database (but much more convenient), TMD accelerates exploring, cleaning, and processing data tables. TMD inherits Clojure's data-orientation and flexible dynamic typing, without compromising on being functional; thereby extending the language's reach to new problems and domains.
> (ds/->dataset "lucy.csv")
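A minimal REPL session sketching the idea, assuming a local lucy.csv file exists (the file name here is only illustrative):

```clojure
(require '[tech.v3.dataset :as ds])

(def lucy (ds/->dataset "lucy.csv"))

(ds/head lucy)          ;; print the first few rows
(ds/column-names lucy)  ;; the dataset's column names
(ds/row-count lucy)     ;; number of rows
```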
tech.ml.dataset Walkthrough
tech.ml.dataset (TMD) is a Clojure library designed to ease working with tabular data, similar to data.table in R or Python's Pandas. TMD takes inspiration from the design of those tools, but does not aim to copy their functionality. Instead, TMD is a building block that increases Clojure's already considerable data processing power.
High Level Design
In TMD, a dataset is logically a map of column name to column data. Column data is typed (e.g., a column of 16 bit integers, or a column of 64 bit floating point numbers), similar to a database. Column names may be any Java object - keywords and strings are typical - and column values may be any Java primitive type, any type supported by tech.datatype, datetimes, or arbitrary objects. Column data is stored contiguously in JVM arrays, and missing values are indicated with bitsets.
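The map-of-columns design can be sketched at the REPL; a small example, assuming only tech.v3.dataset is on the classpath:

```clojure
(require '[tech.v3.dataset :as ds])

;; A dataset behaves like a map of column name -> typed column.
(def d (ds/->dataset [{:x 1 :y 1.5}
                      {:x 2 :y nil}]))

(d :y)          ;; look up the :y column, like a map lookup
(meta (d :y))   ;; column metadata, including its :datatype
(ds/missing d)  ;; row indexes that contain missing values
```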
tech.ml.dataset Quick Reference
This topic summarizes many of the most frequently used TMD functions, together with quick notes about their use. Functions here are linked to further documentation or their source. Note: unless a namespace is specified, each function is accessible via the tech.ml.dataset namespace.
For a more thorough treatment, the API docs list every available function.
Table of Contents
tech.ml.dataset Columns, Readers, and Datatypes
In tech.ml.dataset, columns are composed of three things: data, metadata, and the missing set. The column's datatype is the datatype of the data member. The data member can
TMD 7.028
A Clojure high performance data processing system.
Topics
- tech.ml.dataset Getting Started
- tech.ml.dataset Walkthrough
- tech.ml.dataset Quick Reference
- tech.ml.dataset Columns, Readers, and Datatypes
- tech.ml.dataset And nippy
- tech.ml.dataset Supported Datatypes
Namespaces
tech.v3.dataset
Column major dataset abstraction for efficiently manipulating
in memory datasets.
Public variables and functions:
- ->>dataset
- ->dataset
- add-column
- add-or-update-column
- all-descriptive-stats-names
- append-columns
- assoc-ds
- assoc-metadata
- bind->
- brief
- categorical->number
- categorical->one-hot
- column
- column->dataset
- column-cast
- column-count
- column-labeled-mapseq
- column-map
- column-map-m
- column-names
- columns
- columns-with-missing-seq
- columnwise-concat
- concat
- concat-copying
- concat-inplace
- data->dataset
- dataset->data
- dataset-name
- dataset-parser
- dataset?
- descriptive-stats
- drop-columns
- drop-missing
- drop-rows
- empty-dataset
- ensure-array-backed
- filter
- filter-column
- filter-dataset
- group-by
- group-by->indexes
- group-by-column
- group-by-column->indexes
- group-by-column-consumer
- has-column?
- head
- induction
- major-version
- mapseq-parser
- mapseq-reader
- mapseq-rf
- min-n-by-column
- missing
- new-column
- new-dataset
- order-column-names
- pmap-ds
- print-all
- rand-nth
- remove-column
- remove-columns
- remove-rows
- rename-columns
- replace-missing
- replace-missing-value
- reverse-rows
- row-at
- row-count
- row-map
- row-mapcat
- rows
- rowvec-at
- rowvecs
- sample
- select
- select-by-index
- select-columns
- select-columns-by-index
- select-missing
- select-rows
- set-dataset-name
- shape
- shuffle
- sort-by
- sort-by-column
- tail
- take-nth
- unique-by
- unique-by-column
- unordered-select
- unroll-column
- update
- update-column
- update-columns
- update-columnwise
- update-elemwise
- value-reader
- write!
tech.v3.dataset.categorical
Conversions of categorical values into numbers and back. Two forms of conversion
are supported: a straight value->integer map and one-hot encoding.
tech.ml.dataset And nippy
We are big fans of the nippy system for
freezing/thawing data. So we were pleasantly surprised by how well it performs
with datasets and how easy it was to extend the dataset object to support nippy
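A sketch of round-tripping a dataset through nippy, assuming com.taoensso/nippy is on the classpath; since TMD registers freeze/thaw support for datasets, plain nippy calls work:

```clojure
(require '[tech.v3.dataset :as ds]
         '[taoensso.nippy :as nippy])

(def d (ds/->dataset [{:a 1} {:a 2}]))

;; Serialize to a byte array and back.
(def frozen (nippy/freeze d))
(def d2 (nippy/thaw frozen))
```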
tech.ml.dataset Supported Datatypes
tech.ml.dataset supports a wide range of datatypes and has a system for expanding the supported datatype set, aliasing new names to existing datatypes, and packing object datatypes into primitive containers. Let's walk through each of these topics
tech.v3.dataset.categorical
Conversions of categorical values into numbers and back. Two forms of conversion
are supported: a straight value->integer map and one-hot encoding.
The functions in this namespace manipulate the metadata on the columns of the dataset, which can be inspected via clojure.core/meta
dataset->categorical-maps
(dataset->categorical-maps dataset)
Given a dataset, return a sequence of categorical map entries.
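A sketch of a value->integer conversion, using a small hypothetical dataset with a string column; the mapping is recorded in the column's metadata:

```clojure
(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset [{:fruit "apple"}
                      {:fruit "pear"}
                      {:fruit "apple"}]))

;; Map the distinct string values of :fruit onto numbers.
(def numeric (ds/categorical->number d [:fruit]))

;; The categorical map lives in the column metadata.
(meta (numeric :fruit))
```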
tech.v3.dataset.clipboard
Optional namespace that copies a dataset to the clipboard for pasting into
applications such as Excel or Google Sheets.
Reading defaults to 'csv' format while writing defaults to 'tsv' format.
clipboard
(clipboard)
Get the system clipboard.
tech.v3.dataset.column-filters
Queries to select column subsets that have various properties such as all numeric
columns, all feature columns, or columns that have a specific datatype.
Further a few set operations (union, intersection, difference) are provided
to further manipulate subsets of columns.
tech.v3.dataset.column
clone
(clone col)
Clone this column, not changing anything.
column-map
(column-map map-fn res-dtype & args)
Map a scalar function across one or more columns.
This is the semi-missing-set aware version of tech.v3.datatype/emap. This function
is never lazy.
tech.v3.dataset
Column major dataset abstraction for efficiently manipulating
in memory datasets.
->>dataset
(->>dataset options dataset)
(->>dataset dataset)
Please see documentation of ->dataset. Options are the same.
->dataset
(->dataset dataset options)
(->dataset dataset)
Create a dataset from either csv/tsv or a sequence of maps.
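Both construction paths can be sketched quickly; the file path and :key-fn option below are illustrative:

```clojure
(require '[tech.v3.dataset :as ds])

;; From a sequence of maps; column datatypes are inferred.
(ds/->dataset [{:a 1 :b "x"}
               {:a 2 :b "y"}])

;; From a CSV path, with the common :key-fn option to
;; keywordize the column names (hypothetical file).
(ds/->dataset "data.csv" {:key-fn keyword})
```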
tech.v3.dataset.io.csv
CSV parsing based on charred.api/read-csv.
csv->dataset
(csv->dataset input & [options])
Read a csv into a dataset. Same options as tech.v3.dataset/->dataset.
csv->dataset-seq
(csv->dataset-seq input & [options])
Read a csv into a lazy sequence of datasets. All options of tech.v3.dataset/->dataset
are supported aside from :n-initial-skip-rows
with an additional option of
tech.v3.dataset.io.datetime
Helpful and well tested string->datetime pathways.
datetime-formatter-or-str->parser-fn
(datetime-formatter-or-str->parser-fn datatype format-string-or-formatter)
Given a datatype and one of fn?, string?, or DateTimeFormatter,
return a function that takes strings and returns datetime objects
tech.v3.dataset.io.string-row-parser
Parsing functions based on raw data that is represented by a sequence
of string arrays.
partition-all-rows
(partition-all-rows {:keys [header-row?], :or {header-row? true}} n row-seq)
Given a sequence of rows, partition into an undefined number of partitions of at most
N rows but keep the header row as the first for all sequences.
tech.v3.dataset.io.univocity
Bindings to univocity. Transforms CSVs and TSVs into sequences
of string arrays that are then passed into tech.v3.dataset.io.string-row-parser
methods.
create-csv-parser
(create-csv-parser {:keys [header-row? num-rows column-whitelist column-blacklist column-allowlist column-blocklist separator n-initial-skip-rows], :or {header-row? true}, :as options})
Create an implementation of univocity csv parser.
tech.v3.dataset.join
Implementation of join algorithms, both exact (hash-join) and near.
hash-join
(hash-join colname lhs rhs)
(hash-join colname lhs rhs {:keys [operation-space], :or {operation-space :int32}, :as options})
Join by column. For efficiency, lhs should be smaller than rhs.
colname - may be a single item or a tuple, in which case it destructures as:
(let [[lhs-colname rhs-colname] colname] ...)
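A minimal sketch of both calling conventions, using two small hypothetical datasets:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

(def lhs (ds/->dataset [{:id 1 :name "ann"} {:id 2 :name "bob"}]))
(def rhs (ds/->dataset [{:id 1 :score 10}  {:id 2 :score 20}]))

;; Join on a column name shared by both datasets.
(ds-join/hash-join :id lhs rhs)

;; Or pass a tuple when the key columns are named differently,
;; destructured as [lhs-colname rhs-colname]:
;; (ds-join/hash-join [:id :other-id] lhs rhs)
```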
tech.v3.dataset.math
Various mathematical transformations of datasets such as (inefficiently)
building simple tables, pca, and normalizing columns to have mean of 0 and variance of 1.
More in-depth transformations are found at tech.v3.dataset.neanderthal.
correlation-table
(correlation-table dataset & {:keys [correlation-type colname-seq]})
Return a map of colname to a sorted list of [colname coefficient] tuples.
tech.v3.dataset.metamorph
This is an auto-generated api system - it scans the namespaces and changes each function
to be metamorph-compliant, which means transforming an argument that is just a dataset into
an argument that is a metamorph context - a map of {:metamorph/data ds}. The functions also return
their result as a metamorph context.
tech.v3.dataset.modelling
Methods related specifically to machine learning such as setting the inference
target. This file integrates tightly with tech.v3.dataset.categorical which provides
categorical -> number and one-hot transformation pathways.
The functions in this namespace manipulate the metadata on the columns of the dataset, which can be inspected via clojure.core/meta
tech.v3.dataset.neanderthal
Conversion of a dataset to/from a neanderthal dense matrix as well as various
dataset transformations such as pca, covariance and correlation matrices.
Please include these additional dependencies in your project:
[uncomplicate/neanderthal "0.45.0"]
tech.v3.dataset.print
dataset->str
(dataset->str ds options)
(dataset->str ds)
Convert a dataset to a string. Prints a single line header and then calls
dataset-data->str.
For options documentation see dataset-data->str.
dataset-data->str
(dataset-data->str dataset)
(dataset-data->str dataset options)
Convert the dataset values to a string.
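A small sketch of rendering a dataset to a string, with a hypothetical two-row dataset:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.print :as ds-print])

(def d (ds/->dataset [{:a 1 :b 2} {:a 3 :b 4}]))

;; Render the dataset, header line included, as a printable string.
(println (ds-print/dataset->str d))
```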
tech.v3.dataset.reductions.apache-data-sketch
Reduction reducers based on the apache data sketch family of algorithms.
tech.v3.dataset.reductions
Specific high performance reductions intended to be performed over a sequence
of datasets. This allows aggregations to be done in situations where the dataset is
larger than what will fit in memory on a normal machine. Due to this fact, summation
is implemented using the Kahan algorithm and various statistical methods are done using
distinct
(distinct colname finalizer)
(distinct colname)
Create a reducer that will return a set of values.
distinct-int32
(distinct-int32 colname finalizer)
(distinct-int32 colname)
Get the set of distinct items given you know the space is no larger than int32
space. The optional finalizer allows you to post-process the data.
group-by-column-agg
(group-by-column-agg colname agg-map options ds-seq)
(group-by-column-agg colname agg-map ds-seq)
Group a sequence of datasets by a column and aggregate down into a new dataset.
colname - Either a single scalar column name or a vector of column names to group by.
group-by-column-agg-rf
(group-by-column-agg-rf colname agg-map)
(group-by-column-agg-rf colname agg-map options)
Produce a transduce-compatible rf that will perform the group-by-column-agg pathway.
See documentation for group-by-column-agg.
tech.v3.dataset.reductions-test> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'tech.v3.dataset.reductions-test/stocks
prob-cdf
(prob-cdf colname cdf)
(prob-cdf colname cdf k)
See docs for tech.v3.dataset.reductions.apache-data-sketch/prob-cdf
- k - defaults to 128. This produces a normalized rank error of about 1.7%
prob-interquartile-range
(prob-interquartile-range colname k)
(prob-interquartile-range colname)
See docs for tech.v3.dataset.reductions.apache-data-sketch/prob-interquartile-range
- k - defaults to 128. This produces a normalized rank error of about 1.7%
prob-median
(prob-median colname)
(prob-median colname k)
See docs for tech.v3.dataset.reductions.apache-data-sketch/prob-median
- k - defaults to 128. This produces a normalized rank error of about 1.7%
prob-quantile
(prob-quantile colname quantile)
(prob-quantile colname quantile k)
See docs for tech.v3.dataset.reductions.apache-data-sketch/prob-quantile
- k - defaults to 128. This produces a normalized rank error of about 1.7%
prob-set-cardinality
(prob-set-cardinality colname options)
(prob-set-cardinality colname)
See docs for tech.v3.dataset.reductions.apache-data-sketch/prob-set-cardinality.
Options:
:hll-lgk
- defaults to 12, this is log-base2 of k, so k = 4096. lgK can be
slower than the other two, especially during union operations.
:datatype
- One of :float64, :int64, :string
reducer
(reducer column-name init-val-fn rfn merge-fn finalize-fn)
(reducer column-name rfn)
Make a group-by-agg reducer.
column-name
- Single column name or multiple columns.
init-val-fn
- Function to produce initial accumulators
finalize-fn
- finalize the result after aggregation. Optional, will be replaced
with identity if not provided.
reducer->column-reducer
(reducer->column-reducer reducer cname)
(reducer->column-reducer reducer op-space cname)
Given a hamf parallel reducer and a column name, return a dataset reducer of one column.
reservoir-dataset
(reservoir-dataset reservoir-size)
(reservoir-dataset reservoir-size options)
reservoir-desc-stat
(reservoir-desc-stat colname reservoir-size stat-name options)
(reservoir-desc-stat colname reservoir-size stat-name)
Calculate a descriptive statistic using reservoir sampling. A list of statistic
names are found in tech.v3.datatype.statistics/all-descriptive-stats-names.
Options are the options used in reservoir-sampler.
Note that this method will not convert datetime objects to milliseconds for you as
in descriptive-stats.
row-count
(row-count)
Create a simple reducer that returns the number of times reduceIndex was called.