@@ -119,7 +119,7 @@ The transformer converts a collection of documents, tokenized or pre-parsed as b
words/ngrams, to a matrix of [Okapi BM25 document-word
statistics](https://en.wikipedia.org/wiki/Okapi_BM25). The BM25 scoring function uses both
term frequency (TF) and inverse document frequency (IDF, defined below), as in
- [`TfidfTransformer`](ref), but additionally adjusts for the probability that a user will
+ [`TfidfTransformer`](@ref), but additionally adjusts for the probability that a user will
consider a search result relevant, based on the terms in the search query and those in
each document.

@@ -137,21 +137,21 @@ In MLJ or MLJBase, bind an instance `model` to data with

mach = machine(model, X)

- $DOC_IDF
+ $DOC_TRANSFORMER_INPUTS

Train the machine using `fit!(mach, rows=...)`.

# Hyper-parameters

- - `max_doc_freq=1.0`: Restricts the vocabulary that the transformer will consider.
- Terms that occur in `> max_doc_freq` documents will not be considered by the
- transformer. For example, if `max_doc_freq` is set to 0.9, terms that are in more than
- 90% of the documents will be removed.
+ - `max_doc_freq=1.0`: Restricts the vocabulary that the transformer will consider. Terms
+ that occur in `> max_doc_freq` documents will not be considered by the transformer. For
+ example, if `max_doc_freq` is set to 0.9, terms that are in more than 90% of the
+ documents will be removed.

- - `min_doc_freq=0.0`: Restricts the vocabulary that the transformer will consider.
- Terms that occur in `< max_doc_freq` documents will not be considered by the
- transformer. A value of 0.01 means that only terms that are at least in 1% of the
- documents will be included.
+ - `min_doc_freq=0.0`: Restricts the vocabulary that the transformer will consider. Terms
+ that occur in `< min_doc_freq` documents will not be considered by the transformer. A
+ value of 0.01 means that only terms that occur in at least 1% of the documents will be
+ included.

- `κ=2`: The term frequency saturation characteristic. Higher values represent slower
saturation. What we mean by saturation is the degree to which a term occurring extra
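
As a reading aid for the `max_doc_freq` and `min_doc_freq` descriptions in this diff, the vocabulary restriction can be sketched in plain Julia. This is a minimal illustration under stated assumptions, not MLJText's implementation: the helper name `filter_vocabulary` is hypothetical, documents are assumed to be pre-tokenized vectors of strings, and both bounds are treated as inclusive (only terms strictly above `max_doc_freq` or strictly below `min_doc_freq` are dropped, as the docstring states).

```julia
# Hypothetical helper (not part of MLJText): keep only the terms whose document
# frequency, i.e. the fraction of documents containing the term, lies within
# [min_doc_freq, max_doc_freq].
function filter_vocabulary(docs::Vector{Vector{String}};
                           min_doc_freq::Real=0.0,
                           max_doc_freq::Real=1.0)
    n_docs = length(docs)
    doc_counts = Dict{String,Int}()
    for doc in docs
        # Set(doc) ensures a term is counted at most once per document.
        for term in Set(doc)
            doc_counts[term] = get(doc_counts, term, 0) + 1
        end
    end
    kept = String[]
    for (term, count) in doc_counts
        freq = count / n_docs
        if min_doc_freq <= freq <= max_doc_freq
            push!(kept, term)
        end
    end
    return sort(kept)
end

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

# "the" occurs in every document (frequency 1.0 > 0.9) and is dropped;
# "sat" and "dog" occur in one of three documents (about 0.33 < 0.5) and are dropped.
vocab = filter_vocabulary(docs; min_doc_freq=0.5, max_doc_freq=0.9)
```

With the defaults (`min_doc_freq=0.0`, `max_doc_freq=1.0`) no term is removed, which matches the default values listed in the docstring.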