
OCR source text extraction guide #1381

Open
prhbrt opened this issue Jan 17, 2025 · 9 comments

Comments

@prhbrt

prhbrt commented Jan 17, 2025

I created a Guide to extract spans from source texts, as if they are fields from a registry.

https://github.com/UG-Team-Data-Science/outlines-ocr-guidance

Would this be something to integrate in outlines?
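For readers new to the idea: constraining generation to verbatim spans of a source text can be illustrated, naively, by building a regex over every substring. This is a minimal sketch of the concept only, not the guide's actual implementation:

```python
import re

def substring_pattern(text):
    """Naive regex matching any non-empty contiguous substring of `text`.

    O(n^2) alternatives -- this is only to illustrate the idea; it also
    shows why a span-tracking approach beats one giant compiled regex.
    """
    subs = {text[i:j] for i in range(len(text))
            for j in range(i + 1, len(text) + 1)}
    # Longest alternatives first, so the regex prefers maximal matches.
    alts = sorted(subs, key=len, reverse=True)
    return "^(?:" + "|".join(map(re.escape, alts)) + ")$"

pattern = re.compile(substring_pattern("tax record"))
```

Any output accepted by this pattern is guaranteed to occur verbatim in the source, which is the anti-hallucination property the guide is after.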

@rlouf
Member

rlouf commented Jan 17, 2025

@cpfiffer

@cpfiffer
Contributor

Wow, cool! You've built a lot of stuff.

What does registry mean here? I'm not familiar with some of the language.

@prhbrt
Author

prhbrt commented Jan 20, 2025

Wow, cool! You've built a lot of stuff.

What does registry mean here? I'm not familiar with some of the language.

Good question :) A registry is a printed database, in the sense of how they existed before computers. This is one page of a registry, in this case of German patents from 1890. It lists the technological class, an entry code, a description, and the inventors and their addresses. For this dataset we had data between 1888 and 1945, and the researchers wanted to find relations between inventor gender and technological class over time. For the years where first names or marital status were included, we could do this.

Image

But we have researchers who have such datasets for historic Russian companies,

Image

Tax records,

Image

Russian Gulag prisoners,

Image

Parliamentary notes,

Image

Historic balance sheets

Image

etc. Currently the pipeline is: basic pre-processing, straightening, removing bastard characters, layout detection, OCR, regular expressions, and finally an Excel sheet for the researcher.

We're trying to help researchers with a friendly way to digitize their corpora, and LLMs would fit well as a replacement for the regular expressions if the records are reasonably textual rather than tabular, like the first example of patents. However, we want an approach that doesn't hallucinate records.
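To make the regex step concrete, here is a toy sketch of extracting fields from one OCR'd registry line. The line and the field layout below are invented for illustration, not the actual format of any of the registries above:

```python
import re

# Hypothetical OCR'd registry line; the field layout is an assumption.
line = "Klasse 12c  Nr. 54321  Verfahren zur Herstellung  Schmidt, Anna, Berlin"

RECORD = re.compile(
    r"Klasse (?P<klasse>\S+)\s+"
    r"Nr\. (?P<nr>\d+)\s+"
    r"(?P<description>.+?)\s{2,}"   # description ends at a wide gap
    r"(?P<inventor>.+)$"
)
record = RECORD.match(line).groupdict()
```

Brittleness like the `\s{2,}` column separator is exactly where constrained LLM extraction could replace hand-written patterns.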

@prhbrt
Author

prhbrt commented Jan 20, 2025

PS: I moved the repository from my private account to my team's.

https://github.com/UG-Team-Data-Science/outlines-ocr-guidance

@cpfiffer
Contributor

That's very helpful, thank you!

I'm not sure exactly where I'd put this in Outlines, since it's a heavier application. It is also an extremely clever use of Outlines, so I feel like it would be a shame not to showcase it somehow.

I could imagine

  • A small cookbook walking through a simplified version of this
  • Adding demonstration code to our demos repo
  • A blog post, though this takes a significant amount of time on both sides and may not be preferable

Of these, I think linking/adding this to the demos repo would be great, since the code is mostly already in demo-shape.

@prhbrt
Author

prhbrt commented Jan 22, 2025

Thanks for your suggestions, they all seem like good ideas to me.

I think a demo would work great, however I would also be happy to work towards a blog post, as it supports my personal agenda of getting ideas and attention for scaling this up to more complicated registries. Now it just works on simple toy problems.

However, I can imagine you want to consider whether this is something you want to devote time and/or spotlight to, so I wouldn't hold it against you if you'd pass.

@cpfiffer
Contributor

Okay -- here's a question for you.

Could we reduce this into a cookbook-style document that narrows focus to the structured gen + FSM approach you used? To me, that's the core innovation here, and one I think we should find a way to amplify.

If you're interested in a cookbook, I don't think we should focus on the OCR so much as just the "organization" of tokens. A blog post would be a fine fit for the specifics of your application, but I think I'm too bandwidth-limited to shepherd it through for the moment. If you write a post on your blog we'd love to boost it!

For context, I've noticed that we often have requests in the Discord for the ability to match substrings in a document for highlighting/citation purposes. It's hard to write this regular expression, and can be difficult to compile in advance.

What you've built is essentially simple, clean code to get around the complexity required when using the raw regex/json interface. I want to find a way to communicate that to everyone.

I don't necessarily want to implement sentence substring matching, but I do want to start building an arsenal of tools to help people understand how to mess with the Guide and other slightly more internal tools.

Thoughts?

@prhbrt
Author

prhbrt commented Jan 30, 2025

Could we reduce this into a cookbook-style document that narrows focus to the structured gen + FSM approach you used?

Yes.

Do you have a particular suggestion for a substring extraction problem? We used NER to extract locations from (Dutch) parliamentary questions (and track local representation) site (long load). A cookbook could do this via an LLM, or find the actual municipality of the mention from Google search hits.

I tried something, but performance isn't too good.

Alternatively, I also have a climate litigation corpus, suitable for extracting excerpts listing particular arguments.

If you're interested in a cookbook, I don't think we should focus on the OCR so much as just the "organization" of tokens.

The essence is substring-generation though, otherwise the feature adds nothing to what the pydantic generation already does. Or am I missing something?

requests in the Discord for the ability to match substrings in a document for highlighting/citation purposes.

Token classification via outlines might be cool, and would be helpful for researchers I help. Then I'd also like to extract span locations. Is there a risk in tracking and deep-copying spans in Guide, considering e.g. beam-search samplers?

This would also allow cookbook readers to get acquainted with shadowing states.

It's hard to write this regular expression, and can be difficult to compile in advance.

I have the same quadratic state-space explosion, with [start, end), but I avoid a gigantic regex.
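To make the [start, end) bookkeeping concrete, here is a toy sketch, assuming word-level tokens and not taken from the repository's actual Guide code, of tracking live start positions instead of compiling one giant substring regex:

```python
def allowed_next(source_tokens, generated):
    """Tokens that keep `generated` a prefix of some contiguous span.

    Instead of a quadratic regex over all [start, end) spans, keep the
    set of start positions still consistent with what was generated --
    a tiny stand-in for a substring-tracking Guide.
    """
    n, k = len(source_tokens), len(generated)
    if k == 0:
        return set(source_tokens)
    alive = [i for i in range(n - k + 1)
             if source_tokens[i:i + k] == generated]
    return {source_tokens[i + k] for i in alive if i + k < n}

doc = "the red dog wore a red hat".split()
```

Each decoding step only needs the surviving start positions, so the state stays linear in the document length for a given span prefix.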

Thoughts?
I am still confused about Write as per this issue from when outlines wasn't rustified yet.

Before making a cookbook, I'd like to know if this is a bug or PEBKAC. Could we have a Write that adds tokens without LLM autoregression?

Thoughts?

Chain-of-thought prompting may significantly improve accuracy. Having a Write in my toolbox would let me generate efficiently. E.g., an FSM could Write this to trigger a chain of thought:

# the telephone number may include numbers (0-9), spaces, pound signs (#) and plus signs (+), but nothing else.
 - phone:
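The field constraint described in that comment line can be captured by a small character-class regex. A sketch, with invented example numbers:

```python
import re

# Digits, spaces, pound signs (#) and plus signs (+), but nothing else,
# as the written-out comment above specifies.
PHONE_FIELD = re.compile(r"[0-9 #+]+")
```

The Write'd comment would prime the model, while the regex enforces the field format during generation.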

A power feature might be an ExpiringWrite(token, expire_after) to save token bandwidth. I.e., its tokens are cleared from the generated text after a while, assuming the chain of thought was already primed. This may slightly interfere with the LLM's KV caching, and I don't know if taking away the LLM's 'notes' mid-generation has other adverse effects.

@cpfiffer
Contributor

Good cookbooks don't typically use meaningful datasets. They use something very simple and illustrate how it could be scaled in quality, scale, focus, etc.

If you want to showcase substring extraction, we'd basically want a single short string people can play with.

The simplest possible example I could suggest is something like

Clifford the red dog wore a tiny little hat made of feathers. It wasn't a good looking hat, but it made him feel good.

Then you might want to ask a question like "What is Clifford's hat made of?" I would want a model to return a substring containing either "made of feathers" or "a tiny little hat made of feathers".
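A cookbook could pair that question with a trivial groundedness check: whatever the model returns must be a verbatim substring of the passage. A minimal sketch of the check (the model call itself is left out):

```python
doc = ("Clifford the red dog wore a tiny little hat made of feathers. "
       "It wasn't a good looking hat, but it made him feel good.")

def is_grounded(answer, source):
    """A span extractor that cannot hallucinate must return answers
    that appear verbatim in the source text."""
    return answer in source
```

With substring-constrained generation this check holds by construction; with free-form generation it becomes the test that catches hallucinated citations.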

The essence is substring-generation though, otherwise the feature adds nothing to what the pydantic generation already does. Or am I missing something?

Do you mean the OCR part? I guess I'm not sure quite what you mean. My statement there was that OCR is an application but that core technique you've used is significantly more general. I suggested using a simpler approach to demonstrate the core technique without using OCR as a motivating example.

There's some other stuff in your response, like Write, that is probably better handled in separate issues. Is #942 up to date? We can flag a maintainer to see if they can clarify anything over there.
