Add data set for text analysis #19

ablaom · 2024-01-08T20:53:56Z

Taken from the MLJText.jl requirements for transformers:

Generate a vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element would be one of the following:

- A vector of abstract strings (tokens), e.g., ["I", "like", "Sam", ".", "Sam", "is", "nice", "."] (scitype AbstractVector{Textual})
- A dictionary of counts, indexed on abstract strings, e.g., Dict("I"=>1, "Sam"=>2, "Sam is"=>1) (scitype Multiset{Textual})
- A dictionary of counts, indexed on plain ngrams, e.g., Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1) (scitype Multiset{<:NTuple{N,Textual} where N}); here a plain ngram is a tuple of abstract strings.
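
For concreteness, here is a minimal sketch of how one element of each of the three forms could be built from a single tokenized sentence. It uses only Base Julia; the helper names `plain_ngrams` and `count_items` are hypothetical, made up for this illustration, and are not part of MLJText.jl or any other package. It also does not check scitypes, which would need ScientificTypes.jl.

```julia
# Form 1: a tokenized document, i.e. a vector of strings.
tokens = ["I", "like", "Sam", ".", "Sam", "is", "nice", "."]

# Hypothetical helper: all contiguous n-grams of `tokens`, each as a
# tuple of strings (a "plain ngram").
plain_ngrams(tokens, n) =
    [Tuple(tokens[i:i+n-1]) for i in 1:length(tokens)-n+1]

# Hypothetical helper: count how often each item occurs.
function count_items(items)
    counts = Dict{eltype(items),Int}()
    for item in items
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

# Form 3: counts indexed on plain ngrams (unigrams and bigrams together).
ngram_counts = count_items(vcat(plain_ngrams(tokens, 1), plain_ngrams(tokens, 2)))

# Form 2: counts indexed on strings, obtained here by joining each
# ngram's tokens with a space (an assumption about the string encoding).
string_counts = Dict(join(k, " ") => v for (k, v) in ngram_counts)
```

A data set for this issue would then be a vector of such elements, one per document, all in the same form.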
