spaces are randomly sprinkled through output #64

Open
eflister opened this issue Nov 23, 2020 · 3 comments

@eflister

i'm reading data out of a 3 column table in some weekly covid report pdfs. it usually works fine, but in two out of ~30 pdfs, pdf-toolbox sprinkles spaces into a few of the numbers, apparently at random. here's an example:
randomspaces.pdf

the table covers the final few pages of the pdf.  it lists zip codes in numerical order.  most come out fine, but here's some code that prints out the lines with the extra spaces. i contrast it with pdftotext, a binding to poppler. both it and the command line pdftotext that comes with poppler show the correct output without spaces.

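{-# LANGUAGE OverloadedStrings #-}

-- imports added for completeness; the module names for pdf-toolbox-document
-- and pdftotext are assumptions based on those packages
import qualified Data.Text as T
import Data.Maybe (fromJust)
import Pdf.Document
import Pdftotext

main = do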
  let f = "randomspaces.pdf"
      check t = do
        -- skip to the table we're interested in
        let table = dropWhile (not . T.isInfixOf (T.toCaseFold "cases by ZIP")) (T.lines $ T.toCaseFold t)

        -- pdftotext sees a few more lines than pdf-toolbox, having to do with blank lines, headers/footers, etc
        putStrLn $ "\nlines: " ++ show (length table)
        -- mapM_ print table

        -- display lines with offending spaces
        putStrLn "bads:"
        mapM_ print $ filter (and . ([ T.isPrefixOf "97" . T.concat
                                     , not . T.isInfixOf "n/a" . T.concat
                                     , (/= 3) . length
                                     ] <*>) . pure) $ T.words <$> table

  -- pdf-toolbox puts random spaces in 10 different lines
  withPdfFile f $ \pdf -> check =<< extract pdf =<< catalogPageNode =<< documentCatalog =<< document pdf

  -- spaces not present in poppler bindings
  check =<< pdftotext Physical <$> fromJust <$> openFile f

extract pdf = (T.concat <$>) . (traverse ((extract' =<<) . loadPageNode pdf) =<<) . pageNodeKids
  where extract' (PageTreeLeaf tn) = putStr "." >> pageExtractText tn
        extract' (PageTreeNode tn) = do
         (putStr . show) =<< pageNodeNKids tn
         extract pdf tn

output:

lines: 385
bads:
["970","34","114","603.0"]
["97060","396","186","5.6"]
["971","33","18","450.0"]
["97210","59","5","41.9"]
["973","05","1082","2693.2"]
["97405","154","344.","9"]
["97470","73","36","5.3"]
["97520","114","4","65.9"]
["97603","200","677.","4"]
["979","13","337","6097.3"]

lines: 390
bads:
@Yuras
Owner

Yuras commented Nov 25, 2020

Thank you for the bug report.
There is some fuzzy logic that inserts missing spaces (and also newlines):

else Text.Builder.singleton ' '
You can get the actual glyphs using this function:
pageExtractGlyphs :: Page -> IO [Span]
but you'll have to deal with missing spaces instead :)
The reason for the bug seems to be a failure to parse a font. I'll take a closer look later this week (I hope).
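
To illustrate how a gap-based heuristic can produce this kind of output, here is a minimal sketch. It is not pdf-toolbox's actual code: the `Glyph` record, `layout`, `joinGlyphs`, the threshold, and all the numbers are hypothetical. It assumes a space is inserted whenever the horizontal gap between two neighbouring glyph boxes exceeds a threshold, so an underestimated glyph width makes a small positioning adjustment look like a word break.

```haskell
-- Hypothetical glyph record; pdf-toolbox's real Span/Glyph types differ.
data Glyph = Glyph
  { glyphChar  :: Char
  , glyphLeft  :: Double  -- x coordinate of the left edge
  , glyphRight :: Double  -- x coordinate of the right edge (left + width)
  }

-- Build glyph boxes from (character, x origin) pairs and a width lookup.
layout :: (Char -> Double) -> [(Char, Double)] -> [Glyph]
layout width = map (\(c, x) -> Glyph c x (x + width c))

-- Emit a space before any glyph whose gap to the previous glyph
-- is larger than the threshold.
joinGlyphs :: Double -> [Glyph] -> String
joinGlyphs _ [] = []
joinGlyphs threshold gs@(g : _) =
  glyphChar g : concat (zipWith step gs (drop 1 gs))
  where
    step prev cur
      | glyphLeft cur - glyphRight prev > threshold = [' ', glyphChar cur]
      | otherwise = [glyphChar cur]

-- Made-up origins in 1/1000 em: digits advance by 556 units,
-- with one extra 50-unit nudge before the '3'.
origins :: [(Char, Double)]
origins = [('9', 0), ('7', 556), ('0', 1112), ('3', 1718), ('4', 2274)]

main :: IO ()
main = do
  -- With the true width (556) the largest gap is 50, below the threshold.
  putStrLn (joinGlyphs 200 (layout (const 556) origins))  -- prints 97034
  -- With a wrong fallback width (400) the nudged gap becomes 206.
  putStrLn (joinGlyphs 200 (layout (const 400) origins))  -- prints 970 34
```

With correct widths the same threshold leaves the number intact, which is consistent with the bug only showing up when the font's widths are unavailable.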

@Yuras
Owner

Yuras commented Nov 30, 2020

So I checked the file. It uses the standard fonts, which don't include char widths, so we extract glyphs with incorrect bounding boxes.
To fix it we need to include AFM files (like there) and parse them to get widths for the standard fonts. I'll try to find time for that, but it'd be faster if someone else took care of it.
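
As a rough idea of what the AFM parsing could look like, here is a minimal sketch that pulls character codes and widths out of AFM char-metric lines of the form `C <code> ; WX <width> ; N <name> ; B ... ;`. The helper names (`splitOn`, `parseCharMetric`, `widthTable`) are hypothetical and not part of pdf-toolbox, the sample metrics are illustrative only, and a real implementation would also need to handle encodings, glyph names, and the rest of the AFM format.

```haskell
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Parse one char-metric line into (character code, advance width).
-- Lines without both a "C" and a "WX" field, and unencoded glyphs
-- (code -1), yield Nothing.
parseCharMetric :: String -> Maybe (Int, Double)
parseCharMetric line = do
  let fields = map words (splitOn ';' line)
  code  <- lookupField "C" fields >>= readMaybe
  width <- lookupField "WX" fields >>= readMaybe
  if code >= 0 then Just (code, width) else Nothing
  where
    lookupField key fs = case [v | [k, v] <- fs, k == key] of
      (v : _) -> Just v
      []      -> Nothing

-- Collect all parseable char metrics from the text of an AFM file.
widthTable :: String -> [(Int, Double)]
widthTable = mapMaybe parseCharMetric . lines

-- Illustrative sample in AFM syntax, not a complete font file.
sampleAfm :: String
sampleAfm = unlines
  [ "StartCharMetrics 3"
  , "C 32 ; WX 278 ; N space ; B 0 0 0 0 ;"
  , "C 48 ; WX 556 ; N zero ; B 37 -19 519 703 ;"
  , "C -1 ; WX 556 ; N Euro ; B 0 0 0 0 ;"
  , "EndCharMetrics"
  ]

main :: IO ()
main = do
  let table = widthTable sampleAfm
  print (lookup 48 table)
  print (lookup 32 table)
```

The widths in AFM files are given in 1/1000 of the font size, so they would still need to be scaled before being used for glyph bounding boxes.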

@eflister
Author

thanks for looking into it! it's not urgent for me, i have other solutions, but great to know the cause :)
