Adds `ftfy` to `DictionaryWordPredictor` to fix unicode oddities #149
base: main
Conversation
When you consider the underlying character stream from the PDF document, it's interesting to note that introducing this into a predictor rather than into a parser may cause unexpected alignment issues if I really care about aligning characters to the original document. Examples of this may be when I want to use different x/y tolerances for detecting words (though this doesn't seem a possible concern with the current …).

Does this have consequences for referencing …? The overall point may be moot since …. Could a different setup be something like:
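The alignment concern above can be made concrete: expanding a one-character ligature into two characters shifts every character offset after it, so spans computed against the original stream point one character off. A minimal sketch (the example string is illustrative, not from the PR):

```python
# Sketch of the alignment concern: fixing a ligature changes string
# length, so offsets computed against the original PDF character
# stream no longer line up with the fixed text.
original = "Veri\ufb01cation of results"  # "\ufb01" is the single-char "fi" ligature
fixed = original.replace("\ufb01", "fi")  # what a ligature fix does

# A span that pointed at "results" in the original stream...
start = original.index("results")
end = start + len("results")
assert original[start:end] == "results"

# ...is off by one in the fixed text.
print(repr(fixed[start:end]))  # ' result', no longer 'results'
```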
Fixing ligatures is not so much a model as a detect-and-replace configuration. The dictionary word predictor is a predictor from one perspective because it can be "trained" on the PDF of concern to build a local dictionary and capture words like ….

Another thought on global token fixing is simply that most(?) models are likely to work better with ….
Yes, but that's already the case with WordPredictor. A sequence of tokens …
I'm starting to think …. At a higher level, this seems to boil down to -- what are …? Fixing ligatures, in spirit, is doing the same thing the DictWordPredictor is trying to do -- that is, create ….

Maybe what we should do is rename DictWordPredictor to just a generic WordPredictor. Implementation-wise, it would have separate internal methods for handling the dict-aspect of forming words, as well as the ligature transformation.

In the long run, I'm thinking more and more this is a task for an efficient PostEditingModel that scans PDF tokens & outputs proposed edits to form words. Thoughts?
That's a good point. I think the key difference with "-" is that one is removing characters, and in theory you can still index into symbols using all Span start/end. So the individual character indices in symbols can still line up for spans; there is simply no longer a span that includes the "-". De-hyphenation compresses a span or discards symbols from the Document as not useful to meaning.

To your point, if we are keeping SpanGroup.(whatever_method_reaches_doc_symbols) -> just the original characters, then everything seems OK. My original comment forgot that we tend to build up new items as we go (symbols, then tokens, then "words" are potentially entirely separate). Seems OK to skip renaming, etc. for now.
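The "keep the original symbols, put the cleaned text on derived objects" idea can be sketched in a few lines. The `Span` class below is a hypothetical stand-in for the real Document/Span types, not mmda's actual API:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the real Span type (illustration only).
@dataclass
class Span:
    start: int
    end: int

symbols = "Veri\ufb01cation"   # original PDF character stream, left untouched
span = Span(0, len(symbols))   # spans always index into symbols

# The derived "word" carries the cleaned text; symbols stay aligned,
# so the original characters remain recoverable from the span.
word_text = symbols[span.start:span.end].replace("\ufb01", "fi")

assert symbols[span.start:span.end] == "Veri\ufb01cation"  # original preserved
assert word_text == "Verification"                          # cleaned view
```

Under this arrangement, fixing ligatures on the derived word text never perturbs span indices, which is the property the comment above is relying on.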
This PR adds `ftfy` to `DictionaryWordPredictor` to mitigate some issues in character parsing from pdfplumber. In short, it calls `ftfy.fix_text` to replace corrupted or low-frequency characters such as ligatures (e.g. "Veriﬁcation", where "ﬁ" is a single character) with more common representations.
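For reference, the ligature-folding part of what `ftfy.fix_text` does here can be approximated with the standard library's NFKC normalization. A minimal sketch (stdlib only, so it stands in for the ftfy call rather than reproducing it exactly):

```python
import unicodedata

# "Veri\ufb01cation" contains the single-character "fi" ligature (U+FB01),
# the kind of glyph pdfplumber can surface from a PDF's character stream.
raw = "Veri\ufb01cation"
assert len(raw) == 11  # the ligature counts as one character

# ftfy.fix_text(raw) would replace the ligature; NFKC normalization
# folds compatibility characters like ligatures the same way.
fixed = unicodedata.normalize("NFKC", raw)
assert fixed == "Verification"
assert len(fixed) == 12
```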