Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add commands for automatically modifying configs #12020

Open
wants to merge 25 commits into
base: v4
Choose a base branch
from

Conversation

polm
Copy link
Contributor

@polm polm commented Dec 23, 2022

Description

This continues work started in explosion/projects#147, which provides features for automatically manipulating pipelines and configs. The functions included are:

  • merge: combine components from two pipelines and handle listeners
  • use_transformer: use transformer as feature source
  • use_tok2vec: use CNN tok2vec as feature source
  • resume: make a version of a config for resuming training

Currently these are all grouped under a new spacy configure command. That may not be the best place for them; in particular, merge may belong elsewhere, since it outputs a pipeline rather than a config.

The current state of the PR is that the commands run, but there's only one small test, and docs haven't been written yet. Docs can be started but will depend somewhat on how the naming issues work out.

Types of change

enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

This continues work started in
explosion/projects#147,
which provides features for automatically manipulating pipelines and
configs. The functions included are:

- merge: combine components from two pipelines and handle listeners
- use_transformer: use transformer as feature source
- use_tok2vec: use CNN tok2vec as feature source
- resume: make a version of a config for resuming training

Currently these are all grouped under a new `spacy configure` command.
That may not be the best place for them; in particular, `merge` may
belong elsewhere, since it outputs a pipeline rather than a config.

The current state of the PR is that the commands run, but there's only
one small test, and docs haven't been written yet. Docs can be started
but will depend somewhat on how the naming issues work out.
@polm polm added enhancement Feature requests and improvements feat / pipeline Feature: Processing pipeline and components feat / cli Feature: Command-line interface feat / config Feature: Training config labels Dec 23, 2022
@polm polm closed this Dec 28, 2022
@polm polm reopened this Dec 28, 2022
polm added 5 commits January 11, 2023 16:06
This also change the `output_file` arg to match other commands.
This removes one old print statement and some old TODOs. Some TODOs are
left as future work.
@polm polm marked this pull request as ready for review January 13, 2023 05:31
@polm
Copy link
Contributor Author

polm commented Jan 13, 2023

This might require more adjustment - for example, maybe merge should be split out into a separate command - but it is substantially complete and ready for review. Any feedback on how to make the design clearer would be welcome.

Copy link
Contributor

@rmitsch rmitsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, I think this functionality can be super handy! I just did a superficial pass now and will continue reviewing at a later point.

spacy/cli/configure.py Show resolved Hide resolved
spacy/cli/configure.py Outdated Show resolved Hide resolved
spacy/cli/configure.py Show resolved Hide resolved
spacy/cli/configure.py Outdated Show resolved Hide resolved
spacy/cli/configure.py Outdated Show resolved Hide resolved
website/docs/api/cli.mdx Outdated Show resolved Hide resolved
@rmitsch
Copy link
Contributor

rmitsch commented Feb 8, 2023

Currently these are all grouped under a new spacy configure command. That may not be the best place for them; in particular, merge may belong elsewhere, since it outputs a pipeline rather than a config.

I agree - having merge under spacy configure would be confusing. I'd prefer it as a separate top-level command.

@polm
Copy link
Contributor Author

polm commented Feb 9, 2023

Thanks for the feedback, I moved merge to a separate top-level command.

Copy link
Contributor

@rmitsch rmitsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the command naming and some smaller stuff this looks good, but I'd definitely get feedback from a third party before moving forward.

spacy/cli/configure.py Outdated Show resolved Hide resolved
spacy/cli/configure.py Outdated Show resolved Hide resolved
spacy/cli/configure.py Outdated Show resolved Hide resolved
spacy/cli/configure.py Show resolved Hide resolved
spacy/cli/merge.py Outdated Show resolved Hide resolved
spacy/cli/merge.py Outdated Show resolved Hide resolved
website/docs/api/cli.mdx Show resolved Hide resolved
website/docs/api/cli.mdx Show resolved Hide resolved
website/docs/api/cli.mdx Show resolved Hide resolved
Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of features in this PR and ideally we would have split this up into separate PRs to make reviewing easier and the commits in the history more atomic. I guess we can keep it as is for now, but we'll want to review this carefully as there's a lot going on :-)

"pooling": {"@layers": "reduce_mean.v1"},
}
nlp.config["components"][listener]["model"]["tok2vec"] = listener_config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also need to update the [training] block.

"upstream": "tok2vec",
}
nlp.config["components"][listener]["model"]["tok2vec"] = listener_config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may also need to update the [training] block. (I know that tok2vec->transformer doesn't work. I'm not 100% sure it doesn't work the other way around, but probably the tok2vec defaults are better.)

Comment on lines +13 to +22
TOK2VEC_ARCHS = [
("spacy", "Tok2Vec"),
("spacy", "HashEmbedCNN"),
("spacy-transformers", "TransformerModel"),
]
# These are the listeners.
LISTENER_ARCHS = [
("spacy", "Tok2VecListener"),
("spacy-transformers", "TransformerListener"),
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a more general way to determine these lists?

@svlandeg svlandeg changed the base branch from master to main January 29, 2024 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface feat / config Feature: Training config feat / pipeline Feature: Processing pipeline and components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants