i'm reading data out of a 3-column table in some weekly covid report pdfs. it usually works fine, but in two out of ~30 pdfs, pdf-toolbox has started sprinkling spaces into a few of the numbers, apparently at random. here's an example: randomspaces.pdf
the table covers the final few pages of the pdf. it lists zip codes in numerical order. most come out fine, but here's some code that prints out the lines with the extra spaces. i contrast it with pdftotext, a binding to poppler. both it and the command line pdftotext that comes with poppler show the correct output without spaces.
main = do
  let f = "randomspaces.pdf"
      check t = do
        -- skip to the table we're interested in
        let table = dropWhile (not . T.isInfixOf (T.toCaseFold "cases by ZIP")) (T.lines $ T.toCaseFold t)
        -- pdftotext sees a few more lines than pdf-toolbox, having to do with blank lines, headers/footers, etc
        putStrLn $ "\nlines: " ++ show (length table)
        -- mapM_ print table
        -- display lines with offending spaces
        putStrLn "bads:"
        mapM_ print $ filter (and . ([ T.isPrefixOf "97" . T.concat
                                     , not . T.isInfixOf "n/a" . T.concat
                                     , (/= 3) . length
                                     ] <*>) . pure) $ T.words <$> table
  -- pdf-toolbox puts random spaces in 10 different lines
  withPdfFile f $ \pdf -> check =<< extract pdf =<< catalogPageNode =<< documentCatalog =<< document pdf
  -- spaces not present in poppler bindings
  check =<< pdftotext Physical <$> fromJust <$> openFile f

extract pdf = (T.concat <$>) . (traverse ((extract' =<<) . loadPageNode pdf) =<<) . pageNodeKids
  where extract' (PageTreeLeaf tn) = putStr "." >> pageExtractText tn
        extract' (PageTreeNode tn) = do
          (putStr . show) =<< pageNodeNKids tn
          extract pdf tn
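for anyone squinting at the point-free filter in check: it keeps a row when its words, glued back together, start with "97", the row has no "n/a" in it, and it does not split into exactly 3 words. here's a rough, more readable sketch of the same test. the names isBadRow and badRows and the sample rows are made up for illustration and are not from the real report.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- | A row (already split into words) is suspicious when it belongs to a
-- "97..." ZIP code, is not an "n/a" row, and does not have exactly 3 cells.
isBadRow :: [T.Text] -> Bool
isBadRow ws =
  let joined = T.concat ws
  in  "97" `T.isPrefixOf` joined
        && not ("n/a" `T.isInfixOf` joined)
        && length ws /= 3

-- | Split each line into words and keep only the suspicious rows.
badRows :: [T.Text] -> [[T.Text]]
badRows = filter isBadRow . map T.words

main :: IO ()
main =
  -- Hypothetical sample lines: a clean row, a row with a stray space
  -- inside the ZIP code, and an "n/a" row that should be ignored.
  mapM_ print (badRows ["97201 12 3.4", "972 01 12 3.4", "97202 n/a n/a extra"])
```

only the middle sample row is kept, since the stray space turns its 3 cells into 4 words.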
, but you'll have to deal with missing spaces instead :)
The reason for the bug seems to be a failure to parse a font. I'll take a closer look later this week (I hope)
So I checked the file. It uses the standard fonts, which don't include char widths, so we extract glyphs with incorrect bounding boxes.
To fix it we need to include AFM files (like there) and parse them to get widths for the standard fonts. I'll try to find time for that, but it'd be faster if someone else took care of it.
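To give an idea of the parsing involved, here is a minimal sketch, not pdf-toolbox code. It assumes the usual AFM character-metrics line shape, `C <code> ; WX <width> ; N <glyph name> ; B <bbox> ;`, with widths in 1/1000 of a text-space unit. The names `splitOn`, `parseCharMetric` and `glyphWidths` are hypothetical, and the sample lines are only illustrative of the format.

```haskell
import qualified Data.Map.Strict as Map
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- | Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- | Parse one character-metrics line into (glyph name, width).
-- Lines without both a WX and an N field yield Nothing.
parseCharMetric :: String -> Maybe (String, Double)
parseCharMetric line = do
  let fields = map words (splitOn ';' line)
  width <- lookupField "WX" fields >>= readMaybe
  name  <- lookupField "N" fields
  pure (name, width)
  where
    -- Only two-word fields such as ["WX", "278"] are considered.
    lookupField key fs = case [v | [k, v] <- fs, k == key] of
      (v : _) -> Just v
      []      -> Nothing

-- | Collect glyph widths from the text of an AFM file.
-- Header, kerning and other non-metric lines fail to parse and are skipped.
glyphWidths :: String -> Map.Map String Double
glyphWidths = Map.fromList . mapMaybe parseCharMetric . lines

main :: IO ()
main = do
  -- Illustrative sample in the assumed AFM line format.
  let sample = unlines
        [ "StartCharMetrics 2"
        , "C 32 ; WX 278 ; N space ; B 0 0 0 0 ;"
        , "C 48 ; WX 556 ; N zero ; B 37 -19 519 703 ;"
        , "EndCharMetrics"
        ]
  print (Map.toList (glyphWidths sample))
```

Keying the map by glyph name rather than character code is a guess at what would be convenient, since the standard fonts can be used with different encodings. A real fix would still need to map those names onto the codes the page content actually uses.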