spaces are randomly sprinkled through output #64

eflister · 2020-11-23T01:36:00Z

i'm reading data out of a 3 column table in some pdf weekly covid reports. it usually works fine, but in two out of ~30 pdf's, pdf-toolbox has started sprinkling spaces into a few of the numbers apparently randomly. here's an example:
randomspaces.pdf

the table covers the final few pages of the pdf. it lists zip codes in numerical order. most come out fine, but here's some code that prints out the lines with the extra spaces. i contrast it with pdftotext, a binding to poppler. both it and the command line pdftotext that comes with poppler show the correct output without spaces.

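{-# LANGUAGE OverloadedStrings #-}

import Data.Maybe (fromJust)
import qualified Data.Text as T
import Pdf.Document   -- from pdf-toolbox-document
import Pdftotext      -- from the pdftotext package (poppler binding)

main :: IO ()
main = do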
  let f = "randomspaces.pdf"
      check t = do
        -- skip to the table we're interested in
        let table = dropWhile (not . T.isInfixOf (T.toCaseFold "cases by ZIP")) (T.lines $ T.toCaseFold t)

        -- pdftotext sees a few more lines than pdf-toolbox, having to do with blank lines, headers/footers, etc
        putStrLn $ "\nlines: " ++ show (length table)
        -- mapM_ print table

        -- display lines with offending spaces
        putStrLn "bads:"
        mapM_ print $ filter (and . ([ T.isPrefixOf "97" . T.concat
                                     , not . T.isInfixOf "n/a" . T.concat
                                     , (/= 3) . length
                                     ] <*>) . pure) $ T.words <$> table

  -- pdf-toolbox puts random spaces in 10 different lines
  withPdfFile f $ \pdf -> check =<< extract pdf =<< catalogPageNode =<< documentCatalog =<< document pdf

  -- spaces not present in poppler bindings
  check =<< pdftotext Physical <$> fromJust <$> openFile f

extract pdf = (T.concat <$>) . (traverse ((extract' =<<) . loadPageNode pdf) =<<) . pageNodeKids
  where extract' (PageTreeLeaf tn) = putStr "." >> pageExtractText tn
        extract' (PageTreeNode tn) = do
         (putStr . show) =<< pageNodeNKids tn
         extract pdf tn

output:

lines: 385
bads:
["970","34","114","603.0"]
["97060","396","186","5.6"]
["971","33","18","450.0"]
["97210","59","5","41.9"]
["973","05","1082","2693.2"]
["97405","154","344.","9"]
["97470","73","36","5.3"]
["97520","114","4","65.9"]
["97603","200","677.","4"]
["979","13","337","6097.3"]

lines: 390
bads:

Yuras · 2020-11-25T14:45:05Z

Thank you for the bug report.
There is a fuzzy logic that inserts missing spaces (and also newlines):

pdf-toolbox/document/lib/Pdf/Document/Page.hs

Line 293 in f1d2047

else Text.Builder.singleton ' '

You can get the actual glyphs using this function:

pdf-toolbox/document/lib/Pdf/Document/Page.hs

Line 205 in f1d2047

pageExtractGlyphs :: Page -> IO [Span]

, but you'll have to deal with missing spaces instead :)
The reason for the bug seems to be a failure to parse a font. I'll take a closer look later this week (I hope).
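
To illustrate why wrong glyph widths lead to stray spaces, here is a minimal sketch of a gap-based space heuristic. It is not the actual code in Page.hs: the `Glyph` record, `joinGlyphs`, and the threshold value are all hypothetical, and the "missing widths" case is exaggerated (every width set to 0) to make the effect obvious.

```haskell
import Data.List (sortOn)

-- Hypothetical glyph record (NOT pdf-toolbox's own types):
-- left edge, advance width, and the character drawn.
data Glyph = Glyph { gLeft :: Double, gWidth :: Double, gChar :: Char }

-- Emit a space whenever the gap between the right edge of one glyph
-- and the left edge of the next one exceeds the threshold.
joinGlyphs :: Double -> [Glyph] -> String
joinGlyphs threshold = go . sortOn gLeft
  where
    go (a : b : rest)
      | gLeft b - (gLeft a + gWidth a) > threshold = gChar a : ' ' : go (b : rest)
      | otherwise = gChar a : go (b : rest)
    go gs = map gChar gs

-- "97034" laid out with a 5.56-unit advance per digit (assumed value).
correctWidths :: [Glyph]
correctWidths = zipWith (\i c -> Glyph (5.56 * i) 5.56 c) [0 ..] "97034"

-- Same positions, but the widths are unknown and treated as 0,
-- standing in for a font whose metrics could not be read.
missingWidths :: [Glyph]
missingWidths = [g { gWidth = 0 } | g <- correctWidths]

main :: IO ()
main = do
  putStrLn (joinGlyphs 2 correctWidths)  -- digits stay together
  putStrLn (joinGlyphs 2 missingWidths)  -- every gap now looks like a space
```

With correct widths the computed gaps are about zero and no space is emitted. With the widths zeroed, each gap equals the full advance and crosses the threshold, which is the same kind of failure as the split numbers in the report.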

Yuras · 2020-11-30T14:38:27Z

So I checked the file. It uses the standard fonts, which don't include char widths, so we extract glyphs with incorrect bounding boxes.
To fix it we need to include AFM files (like there) and parse them to get widths for the standard fonts. I'll try to find time for that, but it'd be faster if someone else took care of it.
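
A rough sketch of what reading widths out of an AFM file could look like, assuming the usual character-metric line layout (`C <code> ; WX <width> ; N <name> ; B <bbox> ;`). The function names and the sample lines are illustrative only; nothing here is existing pdf-toolbox code, and a real parser would need to handle the rest of the AFM format.

```haskell
import Data.Maybe (mapMaybe)

-- Split a string on a delimiter character.
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn d rest

-- Parse one character-metric line into (glyph name, advance width).
-- Lines without both a WX and an N field yield Nothing.
parseCharMetric :: String -> Maybe (String, Double)
parseCharMetric line = do
  let fields = map words (splitOn ';' line)
  wx   <- lookupField "WX" fields
  name <- lookupField "N" fields
  case reads wx of
    [(w, "")] -> Just (name, w)
    _         -> Nothing
  where
    -- Find the value of a two-token field such as ["WX", "556"].
    lookupField key fs = case [v | [k, v] <- fs, k == key] of
      (v : _) -> Just v
      []      -> Nothing

-- Collect widths from every line that parses as a character metric.
parseWidths :: String -> [(String, Double)]
parseWidths = mapMaybe parseCharMetric . lines

-- Illustrative input in AFM layout (not copied from a real font file).
sample :: String
sample = unlines
  [ "StartCharMetrics 2"
  , "C 32 ; WX 278 ; N space ; B 0 0 0 0 ;"
  , "C 48 ; WX 556 ; N zero ; B 37 -19 519 703 ;"
  , "EndCharMetrics"
  ]

main :: IO ()
main = print (lookup "zero" (parseWidths sample))
```

The resulting name-to-width table is what would be consulted for a standard font that carries no widths of its own.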

eflister · 2020-11-30T16:50:03Z

thanks for looking into it! it's not urgent for me, i have other solutions, but great to know the cause :)
