
OCR source text extraction guide #1381

Open
prhbrt opened this issue Jan 17, 2025 · 9 comments

Comments

@prhbrt

prhbrt commented Jan 17, 2025

I created a Guide to extract spans from source texts, as if they are fields from a registry.

https://github.com/UG-Team-Data-Science/outlines-ocr-guidance

Would this be something to integrate in outlines?
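For readers new to the idea: constraining generation to verbatim spans of a source text can be illustrated, naively, by building a regex over every substring. This is a minimal sketch of the concept only, not the guide's actual implementation:

```python
import re

def substring_pattern(text):
    """Naive regex matching any non-empty contiguous substring of `text`.

    O(n^2) alternatives -- this is only to illustrate the idea; it also
    shows why a span-tracking approach beats one giant compiled regex.
    """
    subs = {text[i:j] for i in range(len(text))
            for j in range(i + 1, len(text) + 1)}
    # Longest alternatives first, so the regex prefers maximal matches.
    alts = sorted(subs, key=len, reverse=True)
    return "^(?:" + "|".join(map(re.escape, alts)) + ")$"

pattern = re.compile(substring_pattern("tax record"))
```

Any output accepted by this pattern is guaranteed to occur verbatim in the source, which is the anti-hallucination property the guide is after.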

@rlouf
Member

rlouf commented Jan 17, 2025

@cpfiffer

@cpfiffer
Contributor

Wow, cool! You've built a lot of stuff.

What does registry mean here? I'm not familiar with some of the language.

@prhbrt
Author

prhbrt commented Jan 20, 2025

Wow, cool! You've built a lot of stuff.

What does registry mean here? I'm not familiar with some of the language.

Good question :) A registry is a printed database, in the sense of how they existed before computers. This is one page of a registry, in this case of German patents from 1890. It lists the technological class, an entry code, a description, and the inventors and their addresses. For this dataset we had data between 1888 and 1945, and the researchers wanted to find relations between inventor gender and technological class over time. For the years where first names or marital status were included, we could do this.

Image

But we have researchers who have such datasets for historic Russian companies,

Image

Tax records,

Image

Russian Gulag prisoners,

Image

Parliamentary notes,

Image

Historic balance sheets

Image

etc. Currently the pipeline is: basic pre-processing, straightening, removing bastard characters, layout detection, OCR, regular expressions, and finally an Excel sheet for the researcher.

We're trying to help researchers with a friendly way to digitize their corpora, and LLMs would fit well as a replacement for the regular expressions if the records are reasonably textual rather than tabular, like the first example of patents. However, we want an approach that doesn't hallucinate records.
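To make the regex step concrete, here is a toy sketch of extracting fields from one OCR'd registry line. The line and the field layout below are invented for illustration, not the actual format of any of the registries above:

```python
import re

# Hypothetical OCR'd registry line; the field layout is an assumption.
line = "Klasse 12c  Nr. 54321  Verfahren zur Herstellung  Schmidt, Anna, Berlin"

RECORD = re.compile(
    r"Klasse (?P<klasse>\S+)\s+"
    r"Nr\. (?P<nr>\d+)\s+"
    r"(?P<description>.+?)\s{2,}"   # description ends at a wide gap
    r"(?P<inventor>.+)$"
)
record = RECORD.match(line).groupdict()
```

Brittleness like the `\s{2,}` column separator is exactly where constrained LLM extraction could replace hand-written patterns.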

@prhbrt
Author

prhbrt commented Jan 20, 2025

PS: I moved the repository from my private account to my team's.

https://github.com/UG-Team-Data-Science/outlines-ocr-guidance

@cpfiffer
Contributor

That's very helpful, thank you!

I'm not sure exactly where I'd put this in Outlines, since it's a heavier application. It is also an extremely clever use of Outlines, so I feel like it would be a shame not to showcase it somehow.

I could imagine

  • A small cookbook walking through a simplified version of this
  • Adding demonstration code to our demos repo
  • A blog post, though this takes a significant amount of time on both sides and may not be preferable

Of these, I think linking/adding this to the demos repo would be great, since the code is mostly already in demo-shape.

@prhbrt
Author

prhbrt commented Jan 22, 2025

Thanks for your suggestions, they all seem like good ideas to me.

I think a demo would work great, however I would also be happy to work towards a blog post, as it supports my personal agenda of getting ideas and attention for scaling this up to more complicated registries. Now it just works on simple toy problems.

However, I can imagine you want to consider whether this is something you want to devote time and/or spotlight to, so I wouldn't hold it against you if you'd pass.

@cpfiffer
Contributor

Okay -- here's a question for you.

Could we reduce this into a cookbook-style document that narrows focus to the structured gen + FSM approach you used? To me, that's the core innovation here, and one I think we should find a way to amplify.

If you're interested in a cookbook, I don't think we should focus on the OCR so much as just the "organization" of tokens. A blog post would be a fine fit for the specifics of your application, but I think I'm too bandwidth-limited to shepherd it through for the moment. If you write a post on your blog we'd love to boost it!

For context, I've noticed that we often have requests in the Discord for the ability to match substrings in a document for highlighting/citation purposes. It's hard to write this regular expression, and can be difficult to compile in advance.

What you've built is essentially simple, clean code to get around the complexity required when using the raw regex/json interface. I want to find a way to communicate that to everyone.

I don't necessarily want to implement sentence substring matching, but I do want to start building an arsenal of tools to help people understand how to mess with the Guide and other slightly more internal tools.

Thoughts?

@prhbrt
Author

prhbrt commented Jan 30, 2025

Could we reduce this into a cookbook-style document that narrows focus to the structured gen + FSM approach you used?

Yes.

Do you have a particular suggestion for a substring extraction problem? We used NER to extract locations from (Dutch) parliamentary questions (and track local representation) site (long load). A cookbook could do this via an LLM, or find the actual municipality of the mention from Google search hits.

I tried something, but performance isn't too good.

Alternatively, I also have a climate litigation corpus, suitable for extracting excerpts listing particular arguments.

If you're interested in a cookbook, I don't think we should focus on the OCR so much as just the "organization" of tokens.

The essence is substring-generation though, otherwise the feature adds nothing to what the pydantic generation already does. Or am I missing something?

requests in the Discord for the ability to match substrings in a document for highlighting/citation purposes.

Token classification via outlines might be cool, and would be helpful for researchers I help. Then I'd also like to extract span locations. Is there a risk in tracking and deep-copying spans in Guide, considering e.g. beam-search samplers?

This would also allow cookbook readers to get acquainted with shadowing states.

It's hard to write this regular expression, and can be difficult to compile in advance.

I have the same quadratic state-space explosion, with [start, end), but I avoid a gigantic regex.
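To make the [start, end) bookkeeping concrete, here is a toy sketch, assuming word-level tokens and not taken from the repository's actual Guide code, of tracking live start positions instead of compiling one giant substring regex:

```python
def allowed_next(source_tokens, generated):
    """Tokens that keep `generated` a prefix of some contiguous span.

    Instead of a quadratic regex over all [start, end) spans, keep the
    set of start positions still consistent with what was generated --
    a tiny stand-in for a substring-tracking Guide.
    """
    n, k = len(source_tokens), len(generated)
    if k == 0:
        return set(source_tokens)
    alive = [i for i in range(n - k + 1)
             if source_tokens[i:i + k] == generated]
    return {source_tokens[i + k] for i in alive if i + k < n}

doc = "the red dog wore a red hat".split()
```

Each decoding step only needs the surviving start positions, so the state stays linear in the document length for a given span prefix.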

Thoughts?
I am still confused about Write as per this issue from when outlines wasn't rustified yet.

Before making a cookbook, I'd like to know if this is a bug or PEBKAC. Could we have a Write that adds tokens without LLM autoregression?

Thoughts?

Chain-of-thought prompting may significantly improve accuracy. Having a Write in my toolbox would let me generate efficiently. E.g., an FSM could Write this to trigger a chain of thought:

# the telephone number may include numbers (0-9), spaces, pound signs (#) and plus signs (+), but nothing else.
 - phone:
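The field constraint described in that comment line can be captured by a small character-class regex. A sketch, with invented example numbers:

```python
import re

# Digits, spaces, pound signs (#) and plus signs (+), but nothing else,
# as the written-out comment above specifies.
PHONE_FIELD = re.compile(r"[0-9 #+]+")
```

The Write'd comment would prime the model, while the regex enforces the field format during generation.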

A power feature might be an ExpiringWrite(token, expire_after) to save token bandwidth. I.e., its tokens are cleared from the generated text after a while, assuming the chain of thought was already primed. This may slightly interfere with the LLM's KV caching, and I don't know if taking away the LLM's 'notes' mid-generation has other adverse effects.

@cpfiffer
Contributor

Good cookbooks don't typically use meaningful datasets. They use something very simple and illustrate how it could be scaled in quality, scale, focus, etc.

If you want to showcase substring extraction, we'd basically want a single short string people can play with.

The simplest possible example I could suggest is something like

Clifford the red dog wore a tiny little hat made of feathers. It wasn't a good looking hat, but it made him feel good.

Then you might want to ask a question like "What is Clifford's hat made of?" I would want a model to return a substring containing either "made of feathers" or "a tiny little hat made of feathers".
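A cookbook could pair that question with a trivial groundedness check: whatever the model returns must be a verbatim substring of the passage. A minimal sketch of the check (the model call itself is left out):

```python
doc = ("Clifford the red dog wore a tiny little hat made of feathers. "
       "It wasn't a good looking hat, but it made him feel good.")

def is_grounded(answer, source):
    """A span extractor that cannot hallucinate must return answers
    that appear verbatim in the source text."""
    return answer in source
```

With substring-constrained generation this check holds by construction; with free-form generation it becomes the test that catches hallucinated citations.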

The essence is substring-generation though, otherwise the feature adds nothing to what the pydantic generation already does. Or am I missing something?

Do you mean the OCR part? I guess I'm not sure quite what you mean. My statement there was that OCR is an application but that core technique you've used is significantly more general. I suggested using a simpler approach to demonstrate the core technique without using OCR as a motivating example.

There's some other stuff in your response, like Write, that is probably better handled in separate issues. Is #942 up to date? We can flag a maintainer to see if they can clarify anything over there.
