Avoid ValueError: substring not found #65

Open
wants to merge 1 commit into
base: master

Yueeeeeeee

In some cases the answer can't be found in the input text, and a ValueError is raised. This PR adds a try/except to avoid such errors.
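
A minimal sketch of the try/except idea, assuming the highlighting step is pulled out into a standalone helper. The name `highlight_answer` and the choice to return `None` for an unmatched answer are illustrative assumptions, not the code in this PR; the string handling mirrors lines 140-144 of pipelines.py as shown in the stack trace further down.

```python
from typing import Optional


def highlight_answer(sent: str, answer_text: str) -> Optional[str]:
    """Wrap answer_text in <hl> markers inside sent.

    Hypothetical helper: returns None instead of raising when the
    answer is not a substring of the sentence.
    """
    answer_text = answer_text.strip()
    try:
        # str.index raises ValueError when the substring is missing.
        ans_start_idx = sent.index(answer_text)
    except ValueError:
        return None
    ans_end_idx = ans_start_idx + len(answer_text)
    return f"{sent[:ans_start_idx]} <hl> {answer_text} <hl> {sent[ans_end_idx:]}"
```

A caller could then skip any sentence/answer pair for which the helper returns `None`, rather than letting the whole pipeline call fail.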

@violetcodes

In my case, the substring was not found because the answers were padded (like Ans Entity). Strangely, I only hit this error when running in Jupyter; when I ran the same code from the terminal, no such error occurred.

@matt-mkidd-ko

matt-mkidd-ko commented Apr 20, 2021

Yes please! I have found the same error but hadn't fully worked out why just yet.

Here is a minimal example:

from pipelines import pipeline

# load in the multi task qa qg
MODEL = pipeline("multitask-qa-qg")

# problem text
text = 'The herb is generally safe to use. There is limited research to suggest that stinging nettle is an effective remedy. Researchers need to do more studies before they can confirm the health benefits of stinging nettle.'

MODEL(text)

Full stack trace:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-59-1ab007d28390> in <module>()
      7 text = 'The herb is generally safe to use. There is limited research to suggest that stinging nettle is an effective remedy. Researchers need to do more studies before they can confirm the health benefits of stinging nettle.'
      8 
----> 9 MODEL(text)

2 frames

/content/question_generation/pipelines.py in _prepare_inputs_for_qg_from_answers_hl(self, sents, answers)
    140                 answer_text = answer_text.strip()
    141 
--> 142                 ans_start_idx = sent.index(answer_text)
    143 
    144                 sent = f"{sent[:ans_start_idx]} <hl> {answer_text} <hl> {sent[ans_start_idx + len(answer_text): ]}"

ValueError: substring not found

@Yueeeeeeee
Author

In this specific case, the error occurs because, for the sentence "Researchers need to do more studies before they can confirm the health benefits of stinging nettle.", the generated answer is "Do more studies" rather than "do more studies". The call ans_start_idx = sent.index(answer_text) (line 142) is case-sensitive, so searching for "Do more studies" raises this ValueError.

Since the T5 model is uncased anyway, a simple solution is to replace lines 137 and 140 in pipelines.py with, respectively:
sent = sents[i].lower()
answer_text = answer_text.strip().lower()

This should solve your problem :)
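
As an alternative to lowercasing `sent` itself, here is a hedged sketch that only lowercases for the search and keeps the sentence's original casing in the highlighted output. The function name `highlight_ignoring_case` is hypothetical and not part of pipelines.py. It assumes lowercasing does not change string length, which holds for typical English text but not for every Unicode character.

```python
from typing import Optional


def highlight_ignoring_case(sent: str, answer_text: str) -> Optional[str]:
    """Locate answer_text in sent case-insensitively and wrap it in <hl>.

    Hypothetical helper. Assumes str.lower() preserves length, so that
    indices into the lowercased copy are valid for the original string.
    """
    answer_text = answer_text.strip()
    # str.find returns -1 instead of raising when nothing matches.
    start = sent.lower().find(answer_text.lower())
    if start == -1:
        return None
    end = start + len(answer_text)
    # Slice the original sentence so its casing is preserved.
    return f"{sent[:start]} <hl> {sent[start:end]} <hl> {sent[end:]}"
```

With the sentence from the example above, `highlight_ignoring_case(sentence, "Do more studies")` would highlight the lowercase "do more studies" span as it appears in the text.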

@mukulmalik18

mukulmalik18 commented May 19, 2021

The error was mainly caused by the "<pad>" token appearing at the beginning of some answers, which meant the answer couldn't be found in "sent".

So I added the following line at line 141 to remove the token from the answer (this requires import re at the top of pipelines.py):

answer_text = re.sub("<pad> | <pad>", "", answer_text)

Since this addition, the code has worked on all the examples that I've seen so far.

Cheers!
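
A small self-contained sketch of the "<pad>" cleanup described above, written as a helper so it can be tried outside the pipeline. The name `strip_pad` is hypothetical, and the pattern is slightly broader than the one-liner in the comment: it also removes a "<pad>" token that has no adjacent space.

```python
import re

# Match the literal token together with any whitespace around it.
PAD_PATTERN = re.compile(r"\s*<pad>\s*")


def strip_pad(answer_text: str) -> str:
    """Remove <pad> tokens from a generated answer (hypothetical helper)."""
    # Replace each token with a single space so neighbouring words
    # do not get glued together, then trim the ends.
    return PAD_PATTERN.sub(" ", answer_text).strip()
```

Calling `strip_pad` on the answer before `sent.index(answer_text)` would play the same role as the `re.sub` line added at line 141.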
