Add `with_sequence` for decode stream #1725

ArthurZucker · 2025-01-21T10:37:35Z

No description provided.

njhill · 2025-01-21T16:38:23Z

Thank you for this @ArthurZucker!

What do you think about having a version of `step` that can take a sequence of tokens? That could be used for prefilling and also for advancing the stream with chunks of tokens when needed.

I'm also thinking through how this would be used in practice. For very long prompts, we ideally don't want to decode the whole thing, since we would typically have just tokenized the prompt text ourselves. But we do need the last couple of prompt tokens so that the stream continues the prompt text cleanly: the original prompt concatenated with the streamed strings should be exactly equal to the result of decoding all of the tokens together.

Perhaps that's up to the user of the API to sort out, but it might be nice for the prefilled tokens to be excluded from the subsequent step output (or at least have the option for that).
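
To make the concern above concrete, here is a minimal, self-contained sketch in plain Python. It does not use the `tokenizers` API and says nothing about how `with_sequence` is implemented: `VOCAB`, `decode`, and `ToyDecodeStream` are hypothetical names invented for illustration, and the toy decoder simply strips the leading space of the first piece, the way some decoders drop a prefix space. It shows why a stream that starts with no prompt context can break the concatenation property, while one prefilled with the last prompt token (and excluding it from the output) preserves it.

```python
from typing import Dict, List, Sequence

# Hypothetical toy vocabulary: every piece carries its own leading space.
VOCAB: Dict[int, str] = {0: " The", 1: " quick", 2: " brown", 3: " fox", 4: " jumps"}


def decode(ids: Sequence[int]) -> str:
    """Toy decoder: join the pieces, then drop the leading space of the first one."""
    text = "".join(VOCAB[i] for i in ids)
    return text[1:] if text.startswith(" ") else text


class ToyDecodeStream:
    """Illustrative incremental decoder (an assumption, not the real DecodeStream)."""

    def __init__(self, prefill_ids: Sequence[int] = ()) -> None:
        self.ids: List[int] = list(prefill_ids)
        # Text produced by the prefilled tokens counts as already emitted,
        # so it is excluded from the output of later steps.
        self.emitted = len(decode(self.ids))

    def step(self, token_id: int) -> str:
        self.ids.append(token_id)
        text = decode(self.ids)  # re-decodes everything held: fine for a sketch
        new_text = text[self.emitted:]
        self.emitted = len(text)
        return new_text


prompt_ids = [0, 1, 2]
generated_ids = [3, 4]
prompt_text = decode(prompt_ids)

# No prompt context: the first generated piece loses its leading space.
naive = ToyDecodeStream()
naive_out = "".join(naive.step(t) for t in generated_ids)

# Prefilled with only the last prompt token, not the whole prompt.
prefilled = ToyDecodeStream(prefill_ids=prompt_ids[-1:])
prefilled_out = "".join(prefilled.step(t) for t in generated_ids)

print(repr(prompt_text + naive_out))
print(repr(prompt_text + prefilled_out))
```

In this sketch only the final prompt token is needed as context; a real decoder may need more than one trailing token, which is why the number of prefilled tokens is left to the caller here.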

ArthurZucker · 2025-01-22T11:30:50Z

For sure! I am actually a lot less familiar than you with the actual use cases, so I'm super thankful for the feedback!
Indeed, it makes sense that you don't want it all. I was wondering whether this is also compatible with batches in general, since each sample needs its own stream with the current implementation.

alvarobartt · 2025-02-11T09:30:36Z

bindings/python/py_src/tokenizers/pre_tokenizers/__init__.pyi

            otherwise we consider is as a string pattern. For example `pattern="|"`
            means you want to split on `|` (imagine a csv file for example), while
-            `pattern=tokenizers.Regex("1|2")` means you split on either '1' or '2'.
+            `patter=tokenizer.Regex("1|2")` means you split on either '1' or '2'.


Suggested change

-            `patter=tokenizer.Regex("1|2")` means you split on either '1' or '2'.
+            `pattern=tokenizer.Regex("1|2")` means you split on either '1' or '2'.

nits

b7947d1

ArthurZucker mentioned this pull request Jan 21, 2025

Decode stream python #1678

Merged

ArthurZucker added 6 commits January 21, 2025 11:44

with

2ce721b

update

46be059

update

69206e2

zut

24d1068

& bad

3e19357

stub

d1a7c66

ArthurZucker changed the title ~~Add form sequence for decode stream~~ Add with_sequence for decode stream Jan 21, 2025

alvarobartt reviewed Feb 11, 2025
