From 39ae22ed642900dfa688aef1596cc014c86175de Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Danie=CC=88l=20de=20Kok?= <me@danieldk.eu>
Date: Sat, 27 Apr 2024 10:02:15 +0200
Subject: [PATCH 1/2] First stab at v4 page

---
 website/docs/api/entitylinker.mdx |   4 +-
 website/docs/usage/v4.mdx         | 191 ++++++++++++++++++++++++++++++
 website/meta/sidebars.json        |   4 +-
 3 files changed, 195 insertions(+), 4 deletions(-)
 create mode 100644 website/docs/usage/v4.mdx

diff --git a/website/docs/api/entitylinker.mdx b/website/docs/api/entitylinker.mdx
index 08e2151c03f..37459a3ec08 100644
--- a/website/docs/api/entitylinker.mdx
+++ b/website/docs/api/entitylinker.mdx
@@ -74,10 +74,10 @@ architectures and their arguments and hyperparameters.
 Prior to spaCy v4.0 `get_candidates()` returns a single `Iterable` of candidates
 for one specific mention, i. e. the function was typed as
 `Callable[[KnowledgeBase, Span], Iterable[Candidate]]`. To retrieve candidates
-batch-wise, spaCy >= 3.5 exposes `get_candidates_batched()`, which identifies
+batch-wise, spaCy >= 3.5 exposes `get_candidates_batch()`, which identifies
 candidates for an arbitrary number of spans:
 `Callable[[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]]`. The
-main difference between `get_candidates_batched()` and `get_candidates()` in
+main difference between `get_candidates_batch()` and `get_candidates()` in
 spaCy >= 4.0 is that the latter considers the grouping of provided mention spans
 per `Doc` instance.
 
diff --git a/website/docs/usage/v4.mdx b/website/docs/usage/v4.mdx
new file mode 100644
index 00000000000..b9a54ee8058
--- /dev/null
+++ b/website/docs/usage/v4.mdx
@@ -0,0 +1,191 @@
+---
+title: What's New in v4.0
+teaser: New features and how to upgrade
+menu:
+  - ['New Features', 'features']
+  - ['Upgrading Notes', 'upgrading']
+---
+
+## New features {id="features",hidden="true"}
+
+spaCy v4.0 supports more flexible learning rates and adds experimental support
+for model distillation. This release also fixes some long-standing issues that
+require minor API changes.
+
+spaCy v4.0 drops support for Python 3.7 and 3.8.
+
+### Flexible learning rates {id="learn-rate"}
+
+Thinc 9 adds support for more flexible learning rates that can use the step,
+parameter names, and results from prior evaluations. spaCy v4 makes use of these
+flexible learning rates by passing the aggregate score of the most recent
+evaluation to the learning rate schedule. This makes it possible for schedules
+like [`plateau`](https://thinc.ai/docs/api-schedules#plateau) to adjust the
+learning rate when training is stagnant.
+
+### Experimental support for model distillation {id="distillation"}
+
+spaCy v4 lays the groundwork for model distillation. Distillation trains a
+_student_ model on the predictions of a _teacher_ model using an unannotated
+corpus. One of the more exciting applications of distillation is extracting
+small, task-focused models from large, pretrained transformer models.
+
+Distillation support consists of several parts:
+
+- [`TrainablePipe`](/api/pipe) now provides a [`distill`](/api/pipe#distill)
+  method. This can be used to perform a distillation step, where a student is
+  updated to mimic the outputs of the teacher.
+- A configuration section called `distillation` for configuring various
+  distillation settings.
+- The distillation loop.
+- The [`distill`](/api/cli#distill) subcommand to run distillation from the
+  command-line.
+
+Most of the trainable pipeline components are updated to support distillation.
+
+### Saving activations {id="save-activation"}
+
+Trainable pipes can now save the pipe's model activations for a document in the
+[`Doc.activations`](/api/doc#attributes) dictionary. You can use this
+functionality to get programmatic access to e.g. the probability distribution of
+a pipe's classifier.
+
+The following activations are currently available:
+
+- `EditTreeLemmatizer`: `probabilities` and `tree_ids`
+- `EntityLinker`: `ents` and `scores`
+- `Morphologizer`: `probabilities` and `label_ids`
+- `SentenceRecognizer`: `probabilities` and `label_ids`
+- `SpanCategorizer`: `indices` and `scores`
+- `Tagger`: `probabilities` and `label_ids`
+- `TextCategorizer`: `probabilities`
+
+> #### Example
+>
+> ```python
+> import spacy
+> nlp = spacy.load("de_core_news_lg")
+> nlp.get_pipe("tagger").save_activations = True
+> doc = nlp("Hallo Welt!")
+> assert "tagger" in doc.activations
+> assert "probabilities" in doc.activations["tagger"]
+> ```
+
+### Additional features and improvements {id="additional-features-and-improvements"}
+
+- The `--code` option that is used by several CLI subcommands now accepts
+  multiple files to load by separating them with a comma.
+- `spacy download` does not redownload models that are already installed.
+- When modifying a `Span` that was retrieved through a `SpanGroup`, the change
+  is now reflected in the `SpanGroup`.
+- Lookups can now be downloaded from a URL using
+  `spacy.LookupsDataLoaderFromURL.v1`.
+
+## Notes about upgrading from v3.7 {id="upgrading"}
+
+This release drops support for Python 3.7 and 3.8. Most configuration files from
+spaCy 3.7 can be used with spaCy 4.0 without any modifications (excepting
+configurations that use `EntityLinker.v1`, see below). However, spaCy 4.0
+introduces some (minor) API changes that are discussed in the remainder of this
+section.
+
+### Removal of the `EntityRuler` class
+
+The `EntityRuler` class is removed. The entity ruler is now implemented as a
+special case of the `SpanRuler` component.
+
+See the [migration guide](/api/entityruler#migrating) for differences between
+the v3 `EntityRuler` and v4 `SpanRuler` implementations of the `entity_ruler`
+component.
+
+### Renamed language codes: `is` to `isl` and `xx` to `mul`
+
+The language code for Icelandic has been changed from `is` to `isl` to avoid
+incompatibilities with the Python `is` keyword. The language code for
+multilingual models has been changed from `xx` to `mul`. Existing code that uses
+these language codes should be adjusted accordingly.
+
+### Removal of the `sentiment` attribute
+
+The `sentiment` attribute is removed the `Token`, `Span`, `Doc` and `Lexeme`
+classes. If you used this attribute in a `sentiment` analysis component, we
+recommend you to store the sentiment analysis in an
+[extension attribute](/usage/processing-pipelines#custom-components-attributes)
+instead.
+
+### Removal of `get_candidates_batch`
+
+Prior to spaCy v4, `get_candidates()` returned an `Iterable` of candidates for a
+specific mention. spaCy >= 3.5 provides `get_candidates_batch()` for looking up
+multiple mentions: given an `Iterable[Span]` of mentions, it returns the
+candidates for each mention.
+
+spaCy v4 replaces both functions by a single function
+[`get_candidates`](/api/entitylinker#config) that does doc-wise batching. For an
+`Iterator[SpanGroup]` it returns for each mention in the spangroup the
+candidates. The batching is by doc since the [`Span`](/api/span)s in a
+[`SpanGroup`](/api/spangroup) belong to the same [`Doc`](/api/doc).
+
+### Removal of pool argument from `Vocab.get` and `Vocab.get_by_orth`
+
+The memory pool argument was removed from the `Vocab.get` and
+`Vocab.get_by_orth` Cython cdef methods. These methods can now be called without
+providing the memory pool as an argument.
+
+### Optional arguments of `Span.char_span` are now keyword-only
+
+> #### Example
+>
+> ```python
+> doc = nlp("I like New York")
+> # Permitted in spaCy 3
+> span = doc[1:4].char_span(5, 13, "GPE", 42)
+> # spaCy 4
+> span = doc[1:4].char_span(5, 13, "GPE", kb_id=42)
+> ```
+
+The optional arguments for [`Span.char_span`](/api/span#char_span) are now
+keyword-only. Existing code that uses a positional argument to pass an optional
+argument to `char_span` needs to be updated to pass a keyword argument.
+
+### Remove backoff from `Doc.vector` to `Doc.tensor`
+
+In spaCy v3 and earlier, small (`sm`) pipeline packages supported
+[`Doc.vector`](/api/doc#vector) and [`Token.vector`](/api/token#vector) by
+backing off to context-sensitive tensors from the `tok2vec` component. These
+tensors do not work well for this purpose and this backoff has been removed in
+spaCy v4.
+
+### Multiple spans returned as `Tuple[Span]`
+
+In spaCy v3 some methods that returned multiple `Span` objects would return an
+`Iterator[Span]`, while others would return `Tuple[Span]`. In spaCy v4 such
+methods always return `Tuple[Span]`.
+
+### Support for `EntityLinker.v1` is dropped
+
+Support for `EntityLinker.v1` is dropped; migrate to `EntityLinker.v2`.
+
+### `spacy[apple]` removed from extras
+
+The `thinc-apple-ops` package has been merged into Thinc v9. spaCy v4 always
+uses Apple ops on Macs, so the `apple` extra is not needed anymore.
+
+### Pipeline package version compatibility {id="version-compat"}
+
+spaCy v3.x pipelines are not compatible with spaCy v4.0 and need to be
+retrained.
+
+### Updating v3.7 configs
+
+To update a config from spaCy v3.7 with the new v4.0 settings, run
+[`init fill-config`](/api/cli#init-fill-config):
+
+```cli
+$ python -m spacy init fill-config config-v3.7.cfg config-v4.0.cfg
+```
+
+In many cases ([`spacy train`](/api/cli#train),
+[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
+automatically, but you'll need to fill in the new settings to run
+[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json
index 2df120ffa7b..4732c7bd1fd 100644
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -9,9 +9,9 @@
                     { "text": "Models & Languages", "url": "/usage/models" },
                     { "text": "Facts & Figures", "url": "/usage/facts-figures" },
                     { "text": "spaCy 101", "url": "/usage/spacy-101" },
+                    { "text": "New in v4.0", "url": "/usage/v4" },
                     { "text": "New in v3.7", "url": "/usage/v3-7" },
-                    { "text": "New in v3.6", "url": "/usage/v3-6" },
-                    { "text": "New in v3.5", "url": "/usage/v3-5" }
+                    { "text": "New in v3.6", "url": "/usage/v3-6" }
                 ]
             },
             {

From 8cbdd5c8013e66963f1b8b1127b0e8a290e3dbe3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dani=C3=ABl=20de=20Kok?= <me@github.danieldk.eu>
Date: Wed, 5 Jun 2024 19:20:22 +0200
Subject: [PATCH 2/2] Fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
---
 website/docs/usage/v4.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/website/docs/usage/v4.mdx b/website/docs/usage/v4.mdx
index b9a54ee8058..75a995cee0e 100644
--- a/website/docs/usage/v4.mdx
+++ b/website/docs/usage/v4.mdx
@@ -107,7 +107,7 @@ these language codes should be adjusted accordingly.
 
 ### Removal of the `sentiment` attribute
 
-The `sentiment` attribute is removed the `Token`, `Span`, `Doc` and `Lexeme`
+The `sentiment` attribute is removed from the `Token`, `Span`, `Doc` and `Lexeme`
 classes. If you used this attribute in a `sentiment` analysis component, we
 recommend you to store the sentiment analysis in an
 [extension attribute](/usage/processing-pipelines#custom-components-attributes)
@@ -123,7 +123,7 @@ mention the candidates.
 spaCy v4 replaces both functions by a single function
 [`get_candidates`](/api/entitylinker#config) that does doc-wise batching. For an
 `Iterator[SpanGroup]` it returns for each mention in the spangroup the
-candidates. The batching is by doc since the [`Span`](/api/span)s in a
+candidates. The batching is by doc since the [`Span`](/api/span) objects in a
 [`SpanGroup`](/api/spangroup) belong to the same [`Doc`](/api/doc).
 
 ### Removal of pool argument from `Vocab.get` and `Vocab.get_by_orth`