context-free grammar generation stuck forever #1312

Open
htang6 opened this issue Dec 3, 2024 · 3 comments

@htang6

htang6 commented Dec 3, 2024

Describe the issue as clearly as possible:

I can run the context-free grammar example correctly, but when I use my own context-free grammar for a domain-specific language, generation gets stuck after calling the generator. I think my grammar is correct because it passes Lark compilation.
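A minimal sketch of what I mean by that check, assuming lark is installed and snl_grammar is the grammar string from the reproduction below:

from lark import Lark

# The grammar compiles without raising, which is why I believe it is
# syntactically valid Lark. (Compiling alone doesn't guarantee it accepts
# the sentences I expect, so a parser.parse(...) smoke test would also help.)
parser = Lark(snl_grammar, start="start")
print("grammar compiled")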

Steps/code to reproduce the bug:

import outlines

snl_grammar = r'''
start: if_else
if_else: "如果" conditions "那么" conditions
conditions: condition (("并且"|"或者") condition)*
condition: name "的" property property_cmp (("并且"|"或者")property_cmp)*
property_cmp: num_cmp|str_cmp
num_cmp: (num_comp_op number) | (num_comp_op property_val_expr)|(num_comp_op simple_expr)
simple_expr: (name "的" property)
property_val_expr: (number num_cal_op simple_expr)|(simple_expr num_cal_op number)|(simple_expr num_cal_op simple_expr)
str_cmp: str_comp_op ESCAPED_STRING
num_comp_op: ">"|"<"|">="|"<="
num_cal_op: "+"|"-"|"*"|"/"
str_comp_op: "包含"|"不包含"|"匹配"|"不匹配"
property: name
name: WORD
number: SIGNED_NUMBER
LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
WORD: LETTER+

%import common.SIGNED_NUMBER
%import common.WS
%import common.ESCAPED_STRING
%ignore WS
'''

prompt_test = '''
The following is a context free grammar for a domain specific language:
start: if_else
if_else: "如果" conditions "那么" conditions
conditions: condition (("并且"|"或者") condition)*
condition: name "的" property property_cmp (("并且"|"或者")property_cmp)*
property_cmp: num_cmp|str_cmp
num_cmp: (num_comp_op number) | (num_comp_op property_val_expr)|(num_comp_op simple_expr)
simple_expr: (name "的" property)
property_val_expr: (number num_cal_op simple_expr)|(simple_expr num_cal_op number)|(simple_expr num_cal_op simple_expr)
str_cmp: str_comp_op ESCAPED_STRING
num_comp_op: ">"|"<"|">="|"<="
num_cal_op: "+"|"-"|"*"|"/"
str_comp_op: "包含"|"不包含"|"匹配"|"不匹配"
property: name
name: WORD
number: SIGNED_NUMBER
LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
WORD: LETTER+

%import common.SIGNED_NUMBER
%import common.WS
%import common.ESCAPED_STRING
%ignore WS

Please convert the following text to domain specific language

Text:
4.2.6 管廊的柱距应满足大多数管道的跨距要求,宜为6m~9m。
Output:

'''

import time
start = time.time()
model = outlines.models.transformers("/home/yd/llm_weights/Qwen2.5-7B-Instruct")
generator = outlines.generate.cfg(model, snl_grammar)
sequence = generator(prompt_test)
print(sequence)
total = time.time() - start

print(total)

Expected result:

It should output a valid sentence based on my CFG.

Error message:

No error message; it gets stuck after printing:
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.41it/s]
/home/user/.conda/envs/hc_general/lib/python3.12/site-packages/outlines/fsm/guide.py:110: UserWarning: Outlines' public *community-contributed* CFG structured generation is experimental. Please review https://dottxt-ai.github.io/outlines/latest/reference/generation/cfg#disclaimer
  warnings.warn(

Outlines/Python version information:

Version information

Outlines 0.1.7, Python 3.12.1

Context for the issue:

Currently I'm trying to automatically convert text to a domain-specific language. The text and the DSL are both in Chinese, and I want to use CFG-constrained decoding to improve generation accuracy, but neither Outlines nor vLLM seems to work for this.

@htang6 htang6 added the bug label Dec 3, 2024
@cpfiffer
Contributor

cpfiffer commented Dec 5, 2024

It's hard to debug this without more information about the output your DSL is supposed to produce. In general, with infinite generation like this, it's likely a small issue with the CFG: it may be syntactically valid but not semantically valid.

To help debug, I'd try limiting token generation and inspecting it to see if it's what you expect:

sequence = generator(prompt_test, max_tokens=10)

I got this, but have no clue what it means.

如果管廊的柱距小于等于 (roughly: "If the pipe gallery's column spacing is less than or equal to", truncated at the token limit)
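A slightly fuller version of the same idea, as a sketch (reusing generator and prompt_test from the reproduction above): increase max_tokens step by step and time each call, to see whether generation keeps making progress or stalls at a particular point.

import time

# Hypothetical probe, not from the original report: grow the token budget
# gradually and time each constrained call.
for n in (5, 10, 20, 40):
    t0 = time.time()
    partial = generator(prompt_test, max_tokens=n)
    print(f"max_tokens={n}: {time.time() - t0:.1f}s -> {partial!r}")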

@htang6
Author

htang6 commented Dec 7, 2024

Thanks! I can get a result after restricting the length to a smaller number like 10. But the generation is pretty slow: for Qwen-7B on an NVIDIA 4090, it takes 57s to generate 10 characters. I don't think the CFG is very complicated, though; it defines a simple language similar to a Python if-else statement, with the syntax replaced by Chinese words. Is this speed normal?

Contributor

cpfiffer commented Dec 9, 2024

Depends. You can benchmark the speed against

generator = outlines.generate.text(...)

which is unstructured. If the generation time is similar, there's likely not much to do on the Outlines side.
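A rough benchmark sketch, assuming model, snl_grammar, and prompt_test from the reproduction above (the 10-token budget is arbitrary):

import time
import outlines

# Same prompt and token budget, once unstructured and once CFG-constrained.
unstructured = outlines.generate.text(model)
constrained = outlines.generate.cfg(model, snl_grammar)

for name, gen in [("text", unstructured), ("cfg", constrained)]:
    t0 = time.time()
    out = gen(prompt_test, max_tokens=10)
    print(f"{name}: {time.time() - t0:.1f}s -> {out!r}")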

However, you should note that the Outlines CFG tooling is community provided, and it's not the most performant tooling out there.
