From 39ae22ed642900dfa688aef1596cc014c86175de Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Danie=CC=88l=20de=20Kok?=
Date: Sat, 27 Apr 2024 10:02:15 +0200
Subject: [PATCH 1/2] First stab at v4 page

---
 website/docs/api/entitylinker.mdx |   4 +-
 website/docs/usage/v4.mdx         | 191 ++++++++++++++++++++++++++++++
 website/meta/sidebars.json        |   4 +-
 3 files changed, 195 insertions(+), 4 deletions(-)
 create mode 100644 website/docs/usage/v4.mdx

diff --git a/website/docs/api/entitylinker.mdx b/website/docs/api/entitylinker.mdx
index 08e2151c03f..37459a3ec08 100644
--- a/website/docs/api/entitylinker.mdx
+++ b/website/docs/api/entitylinker.mdx
@@ -74,10 +74,10 @@ architectures and their arguments and hyperparameters.
 Prior to spaCy v4.0 `get_candidates()` returns a single `Iterable` of candidates
 for one specific mention, i. e. the function was typed as
 `Callable[[KnowledgeBase, Span], Iterable[Candidate]]`. To retrieve candidates
-batch-wise, spaCy >= 3.5 exposes `get_candidates_batched()`, which identifies
+batch-wise, spaCy >= 3.5 exposes `get_candidates_batch()`, which identifies
 candidates for an arbitrary number of spans:
 `Callable[[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]]`. The
-main difference between `get_candidates_batched()` and `get_candidates()` in
+main difference between `get_candidates_batch()` and `get_candidates()` in
 spaCy >= 4.0 is that the latter considers the grouping of provided mention spans
 per `Doc` instance.
 
diff --git a/website/docs/usage/v4.mdx b/website/docs/usage/v4.mdx
new file mode 100644
index 00000000000..b9a54ee8058
--- /dev/null
+++ b/website/docs/usage/v4.mdx
@@ -0,0 +1,191 @@
+---
+title: What's New in v4.0
+teaser: New features and how to upgrade
+menu:
+  - ['New Features', 'features']
+  - ['Upgrading Notes', 'upgrading']
+---
+
+## New features {id="features",hidden="true"}
+
+spaCy v4.0 supports more flexible learning rates and adds experimental support
+for model distillation. This release also fixes some long-standing issues that
+require minor API changes.
+
+spaCy v4.0 drops support for Python 3.7 and 3.8.
+
+### Flexible learning rates {id="learn-rate"}
+
+Thinc 9 adds support for more flexible learning rates that can use the step,
+parameter names, and results from prior evaluations. spaCy v4 makes use of these
+flexible learning rates by passing the aggregate score of the most recent
+evaluation to the learning rate schedule. This makes it possible for schedules
+like [`plateau`](https://thinc.ai/docs/api-schedules#plateau) to adjust the
+learning rate when training is stagnant.
+
+### Experimental support for model distillation {id="distillation"}
+
+spaCy v4 lays the groundwork for model distillation. Distillation trains a
+_student_ model on the predictions of a _teacher_ model using an unannotated
+corpus. One of the more exciting applications of distillation is extracting
+small, task-focused models from large, pretrained transformer models.
+
+Support for distillation consists of several parts:
+
+- [`TrainablePipe`](/api/pipe) now provides a [`distill`](/api/pipe#distill)
+  method. This can be used to perform a distillation step, where a student is
+  updated to mimic the outputs of the teacher.
+- A configuration section called `distillation` for configuring various
+  distillation settings.
+- The distillation loop.
+- The [`distill`](/api/cli#distill) subcommand to run distillation from the
+  command-line.
+
+Most of the trainable pipeline components are updated to support distillation.
+
+### Saving activations {id="save-activation"}
+
+Trainable pipes can now save the pipe's model activations for a document in the
+[`Doc.activations`](/api/doc#attributes) dictionary. You can use this
+functionality to get programmatic access to e.g. the probability distribution of
+a pipe's classifier.
+
+The following activations are currently available:
+
+- `EditTreeLemmatizer`: `probabilities` and `tree_ids`
+- `EntityLinker`: `ents` and `scores`
+- `Morphologizer`: `probabilities` and `label_ids`
+- `SentenceRecognizer`: `probabilities` and `label_ids`
+- `SpanCategorizer`: `indices` and `scores`
+- `Tagger`: `probabilities` and `label_ids`
+- `TextCategorizer`: `probabilities`
+
+> #### Example
+>
+> ```python
+> import spacy
+> nlp = spacy.load("de_core_news_lg")
+> nlp.get_pipe("tagger").save_activations = True
+> doc = nlp("Hallo Welt!")
+> assert "tagger" in doc.activations
+> assert "probabilities" in doc.activations["tagger"]
+> ```
+
+### Additional features and improvements {id="additional-features-and-improvements"}
+
+- The `--code` option that is used by several CLI subcommands now accepts
+  multiple files to load by separating them with a comma.
+- `spacy download` does not redownload models that are already installed.
+- When modifying a `Span` that was retrieved through a `SpanGroup`, the change
+  is now reflected in the `SpanGroup`.
+- Lookups can now be downloaded from a URL using
+  `spacy.LookupsDataLoaderFromURL.v1`.
+
+## Notes about upgrading from v3.7 {id="upgrading"}
+
+This release drops support for Python 3.7 and 3.8. Most configuration files from
+spaCy 3.7 can be used with spaCy 4.0 without any modifications (excepting
+configurations that use `EntityLinker.v1`, see below). However, spaCy 4.0
+introduces some (minor) API changes that are discussed in the remainder of this
+section.
+
+### Removal of the `EntityRuler` class
+
+The `EntityRuler` class is removed. The entity ruler is implemented as a special
+case of the `SpanRuler` component.
+
+See the [migration guide](/api/entityruler#migrating) for differences between
+the v3 `EntityRuler` and v4 `SpanRuler` implementations of the `entity_ruler`
+component.
+
+### Renamed language codes: `is` -> `isl` and `xx` to `mul`
+
+The language code for Icelandic has been changed from `is` to `isl` to avoid
+incompatibilities with the Python `is` keyword. The language code for
+multilingual models has been changed from `xx` to `mul`. Existing code that uses
+these language codes should be adjusted accordingly.
+
+### Removal of the `sentiment` attribute
+
+The `sentiment` attribute is removed the `Token`, `Span`, `Doc` and `Lexeme`
+classes. If you used this attribute in a `sentiment` analysis component, we
+recommend you to store the sentiment analysis in an
+[extension attribute](/usage/processing-pipelines#custom-components-attributes)
+instead.
+
+### Removal of `get_candidates_batch`
+
+Prior to spaCy v4, `get_candidates()` returned an `Iterable` of candidates for a
+specific mention. spaCy >= 3.5 provides `get_candidates_batch()` for looking up
+multiple mentions — given an `Iterable[Span]` of mentions, it returns for each
+mention the candidates.
+
+spaCy v4 replaces both functions by a single function
+[`get_candidates`](/api/entitylinker#config) that does doc-wise batching. For an
+`Iterator[SpanGroup]` it returns for each mention in the spangroup the
+candidates. The batching is by doc since the [`Span`](/api/span)s in a
+[`SpanGroup`](/api/spangroup) belong to the same [`Doc`](/api/doc).
+
+### Removal of pool argument from `Vocab.get` and `Vocab.get_by_orth`
+
+The memory pool argument was removed from the `Vocab.get` and
+`Vocab.get_by_orth` Cython cdef methods. These methods can now be called without
+providing the memory pool as an argument.
+
+### Optional arguments of `Span.char_span` are now keyword-only
+
+> #### Example
+>
+> ```python
+> doc = nlp("I like New York")
+> # Permitted in spaCy 3
+> span = doc[1:4].char_span(5, 13, "GPE", 42)
+> # spaCy 4
+> span = doc[1:4].char_span(5, 13, "GPE", kb_id=42)
+> ```
+
+The optional arguments for [`Span.char_span`](/api/span#char_span) are now
+keyword-only. Existing code that uses a positional argument to pass an optional
+argument to `char_span` needs to be updated to pass a keyword argument.
+
+### Remove backoff from `Doc.vector` to `Doc.tensor`
+
+In spaCy v3 and earlier, small (`sm`) pipeline packages supported
+[`Doc.vector`](/api/doc#vector) and [`Token.vector`](/api/token#vector) by
+backing off to context-sensitive tensors from the `tok2vec` component. These
+tensors do not work well for this purpose and this backoff has been removed in
+spaCy v4.
+
+### Multiple spans returned as `Tuple[Span]`
+
+In spaCy v3 some methods that returned multiple `Span` objects would return an
+`Iterator[Span]`, while others would return `Tuple[Span]`. In spaCy v4 such
+methods always return `Tuple[Span]`.
+
+### Support for `EntityLinker.v1` is dropped
+
+Support for `EntityLinker.v1` is dropped, migrate to `EntityLinker.v2`.
+
+### `spacy[apple]` removed from extras
+
+The `thinc-apple-ops` package has been merged into Thinc v9. spaCy v4 always
+uses Apple ops on Macs, so the `apple` extra is not needed anymore.
+
+### Pipeline package version compatibility {id="version-compat"}
+
+spaCy v3.x pipelines are not compatible with spaCy v4.0 and need to be
+retrained.
+
+### Updating v3.7 configs
+
+To update a config from spaCy v3.7 with the new v4.0 settings, run
+[`init fill-config`](/api/cli#init-fill-config):
+
+```cli
+$ python -m spacy init fill-config config-v3.7.cfg config-v4.0.cfg
+```
+
+In many cases ([`spacy train`](/api/cli#train),
+[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
+automatically, but you'll need to fill in the new settings to run
+[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json
index 2df120ffa7b..4732c7bd1fd 100644
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -9,9 +9,9 @@
           { "text": "Models & Languages", "url": "/usage/models" },
           { "text": "Facts & Figures", "url": "/usage/facts-figures" },
           { "text": "spaCy 101", "url": "/usage/spacy-101" },
+          { "text": "New in v4.0", "url": "/usage/v4" },
           { "text": "New in v3.7", "url": "/usage/v3-7" },
-          { "text": "New in v3.6", "url": "/usage/v3-6" },
-          { "text": "New in v3.5", "url": "/usage/v3-5" }
+          { "text": "New in v3.6", "url": "/usage/v3-6" }
         ]
       },
       {

From 8cbdd5c8013e66963f1b8b1127b0e8a290e3dbe3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dani=C3=ABl=20de=20Kok?=
Date: Wed, 5 Jun 2024 19:20:22 +0200
Subject: [PATCH 2/2] Fixes

Co-authored-by: Sofie Van Landeghem
---
 website/docs/usage/v4.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/website/docs/usage/v4.mdx b/website/docs/usage/v4.mdx
index b9a54ee8058..75a995cee0e 100644
--- a/website/docs/usage/v4.mdx
+++ b/website/docs/usage/v4.mdx
@@ -107,7 +107,7 @@ these language codes should be adjusted accordingly.
 
 ### Removal of the `sentiment` attribute
 
-The `sentiment` attribute is removed the `Token`, `Span`, `Doc` and `Lexeme`
+The `sentiment` attribute is removed from the `Token`, `Span`, `Doc` and `Lexeme`
 classes. If you used this attribute in a `sentiment` analysis component, we
 recommend you to store the sentiment analysis in an
 [extension attribute](/usage/processing-pipelines#custom-components-attributes)
@@ -123,7 +123,7 @@ mention the candidates.
 spaCy v4 replaces both functions by a single function
 [`get_candidates`](/api/entitylinker#config) that does doc-wise batching. For an
 `Iterator[SpanGroup]` it returns for each mention in the spangroup the
-candidates. The batching is by doc since the [`Span`](/api/span)s in a
+candidates. The batching is by doc since the [`Span`](/api/span) objects in a
 [`SpanGroup`](/api/spangroup) belong to the same [`Doc`](/api/doc).
 
 ### Removal of pool argument from `Vocab.get` and `Vocab.get_by_orth`