
Sentence Segmentation Issue with spacy_llm and en_core_web_trf #494

Open
neostrange opened this issue Jan 19, 2025 · 0 comments
neostrange commented Jan 19, 2025

Description:

I'm encountering a problem with sentence segmentation when integrating spacy_llm components into a spaCy pipeline that is based on en_core_web_trf.

Observed Behavior:

  • Sentence segmentation fails when spacy_llm components are added to the pipeline.
  • The issue does not occur when using spacy_llm components in a blank pipeline.

Environment:

Using the latest versions of spaCy and en_core_web_trf.

  • Config file (example):

```ini
[paths]
examples = "examples.json"

[nlp]
lang = "en"
pipeline = ["transformer", "tagger", "parser", "lemmatizer", "llm", "llm_rel"]

[components]

[components.transformer]
source = "en_core_web_trf"

[components.tagger]
source = "en_core_web_trf"

[components.parser]
source = "en_core_web_trf"

[components.lemmatizer]
source = "en_core_web_trf"

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["DISH", "INGREDIENT", "EQUIPMENT", "PERSON", "LOCATION"]
description = "Entities are the names of food dishes,
    ingredients, and any kind of cooking equipment.
    Adjectives, verbs, adverbs are not entities.
    Pronouns are not entities."

[components.llm.task.label_definitions]
DISH = "Known food dishes, e.g. Lobster Ravioli, garlic bread"
INGREDIENT = "Individual parts of a food dish, including herbs and spices."
EQUIPMENT = "Any kind of cooking equipment, e.g. oven, cooking pot, grill"

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.json"

[components.llm.model]
@llm_models = "spacy.Ollama.3.1.8b"

[components.llm_rel]
factory = "llm_rel"

[components.llm_rel.task]
@llm_tasks = "spacy.REL.v1"
labels = LivesIn,Visits

[components.llm_rel.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.jsonl"

[components.llm_rel.model]
@llm_models = "spacy.Ollama.3.1.8b"
```


Steps to Reproduce:

  1. Load the en_core_web_trf pipeline with the modified config.
  2. Process a text with the modified pipeline.
  3. Observe the lack of sentence segmentation.

Troubleshooting:

  • Tried explicitly adding a sentencizer to the pipeline.
  • Experimented with different component orders.
  • Verified the config loading process.
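For contrast, here is a minimal baseline on a blank pipeline (the case where the issue does not occur for me). A sentencizer by itself is enough to set the sentence boundaries that E030 complains about, so the problem seems specific to the sourced en_core_web_trf components:

```python
import spacy

# Baseline: on a blank pipeline the sentencizer sets token.is_sent_start,
# so doc.sents is available and no E030 is raised.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("Boil the pasta. Drain it in a colander.")
print(doc.has_annotation("SENT_START"))
print([sent.text for sent in doc.sents])
```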

If I run this code:

```python
self.nlp = spacy.load('en_core_web_trf')
self.nlp = assemble(config_path=self.config_path, overrides={"paths.examples": str(self.examples_path)})

print("config: ", self.nlp.config.to_str())
print("PIPELINE: ", self.nlp.pipeline)
```
  • It gives me the following pipeline configuration:

```
PIPELINE: [('transformer', <spacy_curated_transformers.pipeline.transformer.CuratedTransformer object at 0x7f2b36231960>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f2b64d6e1a0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f2af922fed0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f2b34e629c0>), ('llm', <spacy_llm.pipeline.llm.LLMWrapper object at 0x7f2b2ae4e2c0>), ('llm_rel', <spacy_llm.pipeline.llm.LLMWrapper object at 0x7f2b52016740>)]
```

But while processing text, it gives me the following error:

```
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.
```
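One way to narrow this down (a general diagnostic sketch, not specific to spacy_llm) is `nlp.analyze_pipes`, which reports which component is expected to assign `token.is_sent_start` — the annotation the error says is missing. Shown here on a minimal pipeline; the same call on the assembled pipeline should reveal whether any component still claims to assign sentence boundaries:

```python
import spacy

# Diagnostic: analyze_pipes summarizes what each component assigns and
# requires. The sentencizer (like the parser) assigns token.is_sent_start.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
analysis = nlp.analyze_pipes(pretty=False)
print(analysis["summary"]["sentencizer"]["assigns"])
```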

I would appreciate any guidance or assistance in resolving this issue. Thank you!
