OCR source text extraction guide #1381
Wow, cool! You've built a lot of stuff. What does registry mean here? I'm not familiar with some of the language.
Good question :) A registry is a printed database, in the sense of how they existed before computers. This is one page of a registry, in this case of German patents from 1890. It lists the technological class, an entry code, a description, and the inventors and their addresses. For this data set we had data between 1888 and 1945, and the researchers wanted to find relations between inventor gender and technological classes over time. For the years where the first names or the marital status were included, we could do this. But we have researchers with similar datasets for historic Russian company tax records, Russian Gulag prisoners, parliamentary notes, historic balance sheets, etc. Currently the pipeline is: basic pre-processing, straightening, removing stray characters, layout detection, OCR, regular expressions, and an Excel sheet for the researcher. We're trying to give researchers a friendly way to digitize their corpora, and LLMs would fit well as a replacement for the regular expressions if the records are reasonably textual rather than tabular, like the first one of patents. However, we want an approach that doesn't hallucinate records.
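As a plain-Python sketch of the regular-expression step of that pipeline: the registry line below and the field names are invented for illustration, since the real 1890 layout isn't shown in this thread.

```python
import re

# Hypothetical registry line; the real layout of the 1890 patent
# registry is not shown here, so this is purely illustrative.
line = "Klasse 12, Nr. 54321. Verfahren zur Herstellung von Farbstoffen. A. Mueller, Berlin."

# One named group per field, in the spirit of the regex step of the
# pipeline (class, entry code, description, inventor, address).
pattern = re.compile(
    r"Klasse (?P<klasse>\d+), Nr\. (?P<nr>\d+)\. "
    r"(?P<beschreibung>[^.]+)\. "
    r"(?P<erfinder>[^,]+), (?P<ort>[^.]+)\."
)

record = pattern.match(line).groupdict()
print(record["klasse"], record["ort"])  # 12 Berlin
```

An LLM-based replacement would take over exactly this step: turning one noisy OCR'd line into the same named fields, without a hand-written pattern per registry.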
PS: I moved the repository from my private account to my team's: https://github.com/UG-Team-Data-Science/outlines-ocr-guidance
That's very helpful, thank you! I'm not sure exactly where I'd put this in Outlines, since it's a heavier application. It is also an extremely clever use of Outlines, so I feel like it would be a shame not to showcase it somehow. I could imagine a few options.
Of these, I think linking/adding this to the demos repo would be great, since the code is mostly already in demo shape.
Thanks for your suggestions; they all seem like good ideas to me. I think a demo would work great, but I would also be happy to work towards a blog post, as it supports my personal agenda of getting ideas and attention for scaling this up to more complicated registries. Right now it only works on simple toy problems. However, I can imagine you want to consider whether this is something you want to spend time and/or spotlight on, so I wouldn't hold it against you if you passed.
Okay -- here's a question for you. Could we reduce this to a cookbook-style document that narrows the focus to the structured generation + FSM approach you used? To me, that's the core innovation here, and one I think we should find a way to amplify. If you're interested in a cookbook, I don't think we should focus on the OCR so much as on the "organization" of tokens. A blog post would be a fine fit for the specifics of your application, but I think I'm too bandwidth-limited to shepherd it through for the moment. If you write a post on your blog we'd love to boost it! For context, I've noticed that we often get requests in the Discord for the ability to match substrings in a document for highlighting/citation purposes. It's hard to write this regular expression, and it can be difficult to compile in advance. What you've built is essentially simple, clean code to get around the complexity required when using the raw regex/JSON interface. I want to find a way to communicate that to everyone. I don't necessarily want to implement sentence substring matching, but I do want to start building an arsenal of tools to help people understand how to mess with the … Thoughts?
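A minimal illustration of why that substring-matching regex is painful to write and compile: enumerating every contiguous word span of a document as a regex alternation grows quadratically with document length. This is plain Python with `re`, a sketch of the problem rather than Outlines' internals:

```python
import re

def substring_regex(text):
    """Build a regex that matches any contiguous word span of `text`.

    The alternation has O(n^2) branches in the number of words, which is
    why such patterns are hard to write by hand and expensive to compile
    in advance for long documents.
    """
    words = text.split()
    spans = [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]
    # Longest alternatives first, so the engine prefers maximal spans.
    spans.sort(key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in spans))

pattern = substring_regex("the cat sat on the mat")
print(bool(pattern.fullmatch("sat on the")))  # a contiguous span
print(bool(pattern.fullmatch("cat mat")))     # not contiguous
```

For a six-word toy sentence this is already 21 alternatives; for a real document the pattern is unmanageable, which is the gap the FSM approach sidesteps.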
Yes. Do you have a particular suggestion for a substring extraction problem? We used NER to extract locations from (Dutch) parliamentary questions (and track local representation): site (long load). A cookbook could do this via an LLM, or find the actual municipality of the mention from Google search hits. I tried something, but performance isn't too good. Alternatively, I also have a climate litigation corpus, suitable for extracting excerpts listing particular arguments.
The essence is substring generation, though; otherwise the feature adds nothing to what the Pydantic generation already does. Or am I missing something?
Token classification via Outlines might be cool, and would be helpful for the researchers I help. Then I'd also like to extract span locations. Is there a risk in tracking and deep-copying spans in …? This would allow cookbook readers to also get acquainted with shadowing states.
I have the same quadratic state-space explosion with …
Before making a cookbook, I'd like to know whether this is a bug or PEBKAC. Could we have a …?
Chain-of-thought prompting may significantly improve accuracy. Having a …
A power-feature might be …
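On extracting span locations: one post-hoc sketch, which avoids tracking or deep-copying anything in the generator's state, is to recover the character offsets of the produced substring afterwards (the function name is mine, not an Outlines API):

```python
def span_locations(source, extracted):
    """Return (start, end) character offsets of every occurrence of an
    extracted substring in the source text; a post-hoc way to get span
    locations without touching the generator's internal state."""
    locations, start = [], source.find(extracted)
    while start != -1:
        locations.append((start, start + len(extracted)))
        start = source.find(extracted, start + 1)
    return locations

print(span_locations("the cat sat on the mat", "the"))  # [(0, 3), (15, 18)]
```

This sidesteps the deep-copy question above at the cost of ambiguity when the extracted string occurs more than once, as in the example.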
Good cookbooks don't typically use meaningful datasets. They use something very simple and illustrate how it could be extended in quality, scale, focus, etc. If you want to showcase substring extraction, we'd basically want a single short string people can play with. The simplest possible example I could suggest is something like …
Then you might want to ask a question like "What is Clifford's hat made of?". I would want a model to return a substring containing either "made of feathers" or "a tiny little hat made of feathers".
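A sketch of that acceptance check in plain Python: the example sentence was elided above, so the one below is a hypothetical stand-in, and only the quoted fragments come from the discussion. The point is that both desired answers are contiguous word spans of the source, i.e. strings a substring-constrained generator would be allowed to emit.

```python
import re

# Hypothetical stand-in for the elided example sentence.
sentence = "Clifford wore a tiny little hat made of feathers"

def is_word_span(text, candidate):
    """True if `candidate` occurs in `text` as a whole-word contiguous
    span, so a substring-constrained generator could produce it."""
    return re.search(rf"\b{re.escape(candidate)}\b", text) is not None

print(is_word_span(sentence, "made of feathers"))                    # True
print(is_word_span(sentence, "a tiny little hat made of feathers"))  # True
print(is_word_span(sentence, "hat of feathers"))                     # False
```

The last case shows the crucial property: a model that stitches non-adjacent words together ("hat of feathers") fails the check, which is exactly the hallucination the constrained approach rules out.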
Do you mean the OCR part? I guess I'm not sure quite what you mean. My statement there was that OCR is an application, but the core technique you've used is significantly more general. I suggested using a simpler approach to demonstrate the core technique without using OCR as a motivating example. There's some other stuff in your response, like …
I created a guide to extract spans from source texts, as if they were fields from a registry.
https://github.com/UG-Team-Data-Science/outlines-ocr-guidance
Would this be something to integrate into Outlines?