context-free grammar generation stuck forever #1312

Open
htang6 opened this issue Dec 3, 2024 · 3 comments

@htang6

htang6 commented Dec 3, 2024

Describe the issue as clearly as possible:

I can run the context-free grammar example correctly, but when I use my own context-free grammar for a domain-specific language, generation gets stuck after calling the generator. I think my grammar is correct because it passes Lark compilation.
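A minimal sketch of what I mean by that check, assuming lark is installed and snl_grammar is the grammar string from the reproduction below:

from lark import Lark

# The grammar compiles without raising, which is why I believe it is
# syntactically valid Lark. (Compiling alone doesn't guarantee it accepts
# the sentences I expect, so a parser.parse(...) smoke test would also help.)
parser = Lark(snl_grammar, start="start")
print("grammar compiled")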

Steps/code to reproduce the bug:

import outlines

snl_grammar = r'''
start: if_else
if_else: "如果" conditions "那么" conditions
conditions: condition (("并且"|"或者") condition)*
condition: name "的" property property_cmp (("并且"|"或者")property_cmp)*
property_cmp: num_cmp|str_cmp
num_cmp: (num_comp_op number) | (num_comp_op property_val_expr)|(num_comp_op simple_expr)
simple_expr: (name "的" property)
property_val_expr: (number num_cal_op simple_expr)|(simple_expr num_cal_op number)|(simple_expr num_cal_op simple_expr)
str_cmp: str_comp_op ESCAPED_STRING
num_comp_op: ">"|"<"|">="|"<="
num_cal_op: "+"|"-"|"*"|"/"
str_comp_op: "包含"|"不包含"|"匹配"|"不匹配"
property: name
name: WORD
number: SIGNED_NUMBER
LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
WORD: LETTER+

%import common.SIGNED_NUMBER
%import common.WS
%import common.ESCAPED_STRING
%ignore WS
'''

prompt_test = '''
The following is a context free grammar for a domain specific language:
start: if_else
if_else: "如果" conditions "那么" conditions
conditions: condition (("并且"|"或者") condition)*
condition: name "的" property property_cmp (("并且"|"或者")property_cmp)*
property_cmp: num_cmp|str_cmp
num_cmp: (num_comp_op number) | (num_comp_op property_val_expr)|(num_comp_op simple_expr)
simple_expr: (name "的" property)
property_val_expr: (number num_cal_op simple_expr)|(simple_expr num_cal_op number)|(simple_expr num_cal_op simple_expr)
str_cmp: str_comp_op ESCAPED_STRING
num_comp_op: ">"|"<"|">="|"<="
num_cal_op: "+"|"-"|"*"|"/"
str_comp_op: "包含"|"不包含"|"匹配"|"不匹配"
property: name
name: WORD
number: SIGNED_NUMBER
LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
WORD: LETTER+

%import common.SIGNED_NUMBER
%import common.WS
%import common.ESCAPED_STRING
%ignore WS

Please convert the following text to domain specific language

Text:
4.2.6 管廊的柱距应满足大多数管道的跨距要求,宜为6m~9m。
Output:

'''

import time
start = time.time()
model = outlines.models.transformers("/home/yd/llm_weights/Qwen2.5-7B-Instruct")
generator = outlines.generate.cfg(model, snl_grammar)
sequence = generator(prompt_test)
print(sequence)
total = time.time() - start

print(total)

Expected result:

It should output a valid sentence based on my CFG.

Error message:

No error message; it gets stuck after printing:
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.41it/s]
/home/user/.conda/envs/hc_general/lib/python3.12/site-packages/outlines/fsm/guide.py:110: UserWarning: Outlines' public *community-contributed* CFG structured generation is experimental. Please review https://dottxt-ai.github.io/outlines/latest/reference/generation/cfg#disclaimer
  warnings.warn(

Outlines/Python version information:

Version information

Outlines 0.1.7, Python 3.12.1

Context for the issue:

Currently I'm trying to automatically convert text to a domain-specific language. The text and the DSL are both in Chinese, and I want to use CFG-constrained decoding to improve generation accuracy, but neither Outlines nor vLLM seems to work for this.

@htang6 htang6 added the bug label Dec 3, 2024
@cpfiffer
Contributor

cpfiffer commented Dec 5, 2024

It's hard to debug this without more information about the output your DSL is supposed to produce. In general, with infinite generation like this, it's likely a small issue with the CFG: it may be syntactically valid but not semantically valid.

To help debug, I'd try limiting token generation and inspecting it to see if it's what you expect:

sequence = generator(prompt_test, max_tokens=10)

I got this, but have no clue what it means.

如果管廊的柱距小于等于 (roughly: "If the pipe gallery's column spacing is less than or equal to", truncated at the token limit)
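A slightly fuller version of the same idea, as a sketch (reusing generator and prompt_test from the reproduction above): increase max_tokens step by step and time each call, to see whether generation keeps making progress or stalls at a particular point.

import time

# Hypothetical probe, not from the original report: grow the token budget
# gradually and time each constrained call.
for n in (5, 10, 20, 40):
    t0 = time.time()
    partial = generator(prompt_test, max_tokens=n)
    print(f"max_tokens={n}: {time.time() - t0:.1f}s -> {partial!r}")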

@htang6
Author

htang6 commented Dec 7, 2024

Thanks! I can get a result after restricting the length to a smaller number like 10. But the generation is pretty slow: for Qwen-7B on an NVIDIA 4090, it takes 57s to generate 10 characters. I don't think the CFG is very complicated, though; it defines a simple language similar to a Python if-else statement, with the syntax replaced by Chinese words. Is this speed normal?

Contributor

cpfiffer commented Dec 9, 2024

Depends. You can benchmark the speed against

generator = outlines.generate.text(...)

which is unstructured. If the generation time is similar, there's likely not much to do on the Outlines side.
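A rough benchmark sketch, assuming model, snl_grammar, and prompt_test from the reproduction above (the 10-token budget is arbitrary):

import time
import outlines

# Same prompt and token budget, once unstructured and once CFG-constrained.
unstructured = outlines.generate.text(model)
constrained = outlines.generate.cfg(model, snl_grammar)

for name, gen in [("text", unstructured), ("cfg", constrained)]:
    t0 = time.time()
    out = gen(prompt_test, max_tokens=10)
    print(f"{name}: {time.time() - t0:.1f}s -> {out!r}")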

However, you should note that the Outlines CFG tooling is community provided, and it's not the most performant tooling out there.
