
Sentence Segmentation Issue with spacy_llm and en_core_web_trf #494

Open
neostrange opened this issue Jan 19, 2025 · 0 comments
neostrange commented Jan 19, 2025

Description:

I'm encountering a problem with sentence segmentation when integrating spacy_llm components into a spaCy pipeline that is based on en_core_web_trf.

Observed Behavior:

  • Sentence segmentation fails when spacy_llm components are added to the pipeline.
  • The issue does not occur when using spacy_llm components in a blank pipeline.

Environment:

Using the latest versions of spaCy and en_core_web_trf.

  • Config file (example):

```ini
[paths]
examples = "examples.json"

[nlp]
lang = "en"
pipeline = ["transformer", "tagger", "parser", "lemmatizer", "llm", "llm_rel"]

[components]

[components.transformer]
source = "en_core_web_trf"

[components.tagger]
source = "en_core_web_trf"

[components.parser]
source = "en_core_web_trf"

[components.lemmatizer]
source = "en_core_web_trf"

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["DISH", "INGREDIENT", "EQUIPMENT", "PERSON", "LOCATION"]
description = "Entities are the names of food dishes,
    ingredients, and any kind of cooking equipment.
    Adjectives, verbs, adverbs are not entities.
    Pronouns are not entities."

[components.llm.task.label_definitions]
DISH = "Known food dishes, e.g. Lobster Ravioli, garlic bread"
INGREDIENT = "Individual parts of a food dish, including herbs and spices."
EQUIPMENT = "Any kind of cooking equipment, e.g. oven, cooking pot, grill"

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.json"

[components.llm.model]
@llm_models = "spacy.Ollama.3.1.8b"

[components.llm_rel]
factory = "llm_rel"

[components.llm_rel.task]
@llm_tasks = "spacy.REL.v1"
labels = LivesIn,Visits

[components.llm_rel.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.jsonl"

[components.llm_rel.model]
@llm_models = "spacy.Ollama.3.1.8b"
```


Steps to Reproduce:

  1. Load the en_core_web_trf pipeline with the modified config.
  2. Process a text with the modified pipeline.
  3. Observe the lack of sentence segmentation.

Troubleshooting:

  • Tried explicitly adding a sentencizer to the pipeline.
  • Experimented with different component orders.
  • Verified the config loading process.
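For contrast, here is a minimal baseline on a blank pipeline (the case where the issue does not occur for me). A sentencizer by itself is enough to set the sentence boundaries that E030 complains about, so the problem seems specific to the sourced en_core_web_trf components:

```python
import spacy

# Baseline: on a blank pipeline the sentencizer sets token.is_sent_start,
# so doc.sents is available and no E030 is raised.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("Boil the pasta. Drain it in a colander.")
print(doc.has_annotation("SENT_START"))
print([sent.text for sent in doc.sents])
```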

If I run this code:

```python
self.nlp = spacy.load('en_core_web_trf')
self.nlp = assemble(config_path=self.config_path, overrides={"paths.examples": str(self.examples_path)})

print("config: ", self.nlp.config.to_str())
print("PIPELINE: ", self.nlp.pipeline)
```
  • It gives me the following pipeline configuration:

```
PIPELINE: [('transformer', <spacy_curated_transformers.pipeline.transformer.CuratedTransformer object at 0x7f2b36231960>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f2b64d6e1a0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f2af922fed0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f2b34e629c0>), ('llm', <spacy_llm.pipeline.llm.LLMWrapper object at 0x7f2b2ae4e2c0>), ('llm_rel', <spacy_llm.pipeline.llm.LLMWrapper object at 0x7f2b52016740>)]
```

But while processing text, it gives me the following error:

```
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.
```
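One way to narrow this down (a general diagnostic sketch, not specific to spacy_llm) is `nlp.analyze_pipes`, which reports which component is expected to assign `token.is_sent_start` — the annotation the error says is missing. Shown here on a minimal pipeline; the same call on the assembled pipeline should reveal whether any component still claims to assign sentence boundaries:

```python
import spacy

# Diagnostic: analyze_pipes summarizes what each component assigns and
# requires. The sentencizer (like the parser) assigns token.is_sent_start.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
analysis = nlp.analyze_pipes(pretty=False)
print(analysis["summary"]["sentencizer"]["assigns"])
```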

I would appreciate any guidance or assistance in resolving this issue. Thank you!
