Releases: UglyToad/PdfPig

Put Together In The Same Factory

14 Aug 08:25

This release fixes a major regression in 0.0.7 which broke consuming documents via streams. It also adds new features:

  • Document Layout Analysis: Adds the Docstrum (Doc Spectrum) algorithm for page segmentation.
  • Document segmentation approaches (Docstrum and RecursiveXYCut) implement the IPageSegmenter interface which now returns a list of TextBlocks. XYLeaf and XYNode are now internal.
  • TextEdgesExtractor is a new class which can be used to detect shared alignment in sections of text.
  • Letters now have a Color property. Its value is one of the types implementing IColor: GrayColor, RGBColor or CMYKColor. Other color spaces are not currently supported and default to GrayColor.Black.
  • PdfDocument now has a TryGetXmpMetadata(out XmpMetadata metadata) method which will retrieve the XML XMP Metadata object from the document if one is present.
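
As an illustration of what "shared alignment" means for a text edges detector, here is a minimal Python sketch that finds left edges shared by several text boxes. It is not PdfPig's implementation: the function name, the (left, bottom, right, top) box layout, and the tolerance and min_count parameters are all assumptions made for this example.

```python
from collections import defaultdict


def shared_left_edges(boxes, tolerance=1.0, min_count=3):
    """Return x positions where at least min_count boxes share a left edge.

    boxes is a list of (left, bottom, right, top) tuples (hypothetical layout).
    Left edges are bucketed by rounding to the nearest multiple of tolerance,
    so two edges that straddle a bucket boundary may be counted separately.
    """
    buckets = defaultdict(list)
    for left, _bottom, _right, _top in boxes:
        buckets[round(left / tolerance)].append(left)
    # Report the mean x position of every bucket that is populated enough.
    return sorted(
        sum(values) / len(values)
        for values in buckets.values()
        if len(values) >= min_count
    )
```

The same bucketing could be applied to right edges or midpoints to detect right-aligned or centred text.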

And You Have The Milk

03 Aug 15:16

This release primarily focuses on more bug-fixing to improve stability of extracting text content. The main new features are full support for encrypted documents, Document Layout Analysis tools and early-access path information.

  • Fix a bug using DefaultWordExtractor where the Letters collection on all words would be empty.
  • Supports UTF-16 encoded strings in document content, such as document information dictionaries, and in HexToken based strings.
  • Supports all forms of document encryption up to and including revision 6 in PDF 2.0 spec.
  • Prevents crashes where PDF contains circular object references.
  • The new DocumentLayoutAnalysis namespace supports nearest-neighbour word extraction and recursive X-Y cut document segmentation. RecursiveXYCut.GetBlocks implements the Recursive X-Y cut algorithm https://en.wikipedia.org/wiki/Recursive_X-Y_cut. NearestNeighbourWordExtractor can be provided to Page.GetWords for a different word extraction technique.
  • Fix bug where some letters had a width or height of zero.
  • More tolerant search for cross-reference offsets. If the cross-reference offsets are incorrect, the parser now searches for the corresponding object instead.
  • Handle a case where CidFonts contained hex rather than string tokens for registry-ordering-supplement information.
  • Support cross-reference tables even if they appear after the first %%EOF end of file marker.
  • Support rotated pages. Page now contains a Rotation property indicating if the page is rotated at the top level. Valid values for rotation are 0, 90, 180 and 270. The currently reported PageSize does not take rotation into account yet. This also adds support for properly rotating letters and page content.
  • Change the internal letter point size calculation. Page.ExperimentalAccess.GetPointSize(Letter letter) now reports the point size using an updated calculation which handles rotated letters.
  • Map character codes directly to ASCII character values where there's no corresponding Unicode value. This matches PDFBox 1.8/9 behaviour where if no Unicode value can be found, the integer value is mapped directly to a character.
  • Expose PdfPath information from the page's content stream. This gives early access to path/geometry information parsed from the page's content. Use Page.ExperimentalAccess.Paths to access lines, rectangles, curves, etc. declared by the page.
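
To give a feel for the recursive X-Y cut segmentation mentioned in the list above, here is a simplified Python sketch of the idea: repeatedly split a set of boxes at the widest empty gap, first along x and then along y, until no gap is wide enough. This is not the code behind RecursiveXYCut.GetBlocks. The function names, the (left, bottom, right, top) box layout and the min_gap parameter are assumptions for illustration only.

```python
def _split_on_gap(boxes, lo, hi, min_gap):
    """Split boxes at the widest empty gap along one axis.

    lo and hi are the tuple indices of the low and high coordinate on that
    axis. Returns (first, second), or None if no gap is at least min_gap wide.
    """
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_gap, best_index = 0.0, None
    reach = ordered[0][hi]  # furthest extent covered by the boxes seen so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap > best_gap:
            best_gap, best_index = gap, i
        reach = max(reach, ordered[i][hi])
    if best_index is None or best_gap < min_gap:
        return None
    return ordered[:best_index], ordered[best_index:]


def recursive_xy_cut(boxes, min_gap=10.0):
    """Group (left, bottom, right, top) boxes into blocks separated by gaps."""
    if not boxes:
        return []
    if len(boxes) == 1:
        return [list(boxes)]
    # Try a cut along x (indices 0 and 2), then along y (indices 1 and 3).
    for lo, hi in ((0, 2), (1, 3)):
        parts = _split_on_gap(boxes, lo, hi, min_gap)
        if parts is not None:
            first, second = parts
            return recursive_xy_cut(first, min_gap) + recursive_xy_cut(second, min_gap)
    return [list(boxes)]  # no gap is wide enough, so this is a leaf block
```

With a larger min_gap, only wide gutters such as column gaps produce cuts; with a smaller one, individual lines or words end up in their own blocks.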

Cows In The North

19 May 12:40

This release focuses on stability improvements and has been tested on far more document types than previous releases. The two main new features are support for full framework versions of .NET back to .NET 4.5, which makes this library available to more users, and initial support for encrypted documents using the most basic form of document encryption.

The release may contain a bug in System Font loading which has not been replicated but may make the library crash on some systems. Please file a bug report if you encounter an error on this package version.

  • Adds the ability to access all raw operations in a page's content stream. This is the set of instructions which form the graphical features on the page. Access using page.Operations.
  • Supports defining operations on a PdfPageBuilder directly using builder.Advanced.Operations.
  • Support for full framework .NET versions back to .NET 4.5.
  • Support for Compact Font Format CID fonts.
  • Support for Standard 14 fonts which are incorrectly declared as TrueType fonts.
  • Performance improvements for System Fonts, which are used where the document relies on fonts installed on the host operating system. This has only been tested on Windows.
  • Many stability fixes for all font types and parsing documents.
  • Text direction added to letter and word. Indicates the rotation of the text.
  • Add support for encrypted documents. Documents using the newer AES encryption will still throw, but RC4 encryption is now supported. A password may be supplied in ParsingOptions.
  • Support for the LZW filter, which was the last filter left to be implemented.
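
For readers unfamiliar with RC4, the cipher named in the encryption item above, here is a minimal Python sketch of the textbook algorithm. It shows only the stream cipher itself and is not PdfPig's code. PDF's standard security handler additionally derives keys from the password and object identifiers, which this sketch omits. The function name rc4 is chosen for this example.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with RC4 (the operation is symmetric)."""
    # Key-scheduling algorithm: permute the state array using the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    # Pseudo-random generation: XOR each input byte with a keystream byte.
    output = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        output.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(output)
```

Because RC4 is a stream cipher, applying the function twice with the same key returns the original bytes.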

Cows In The South

30 Dec 16:45

Adds support for creating new documents and provides access to per-page annotations.

Red Cake with Great Big Red Cherries

27 Nov 07:57
  • Reworks the public API of Letter to provide height information. See the Letters page on the wiki.
  • Adds support for Type 1 fonts that use Compact Font Format, including retrieving height information.
  • Bug fixes, stability improvements and performance improvements.
  • PdfDocument now has a Structure property. This is an UglyToad.PdfPig.Structure object which provides access to the tokenized content of the PDF file and the merged Cross Reference Table in the document. Any objects in the PDF file may be accessed by object reference number allowing consumers to work around missing functionality. All tokens used internally when interpreting PDF documents are available on the public API.
  • Page now has an IEnumerable<Word> GetWords() method which uses a default word extractor to attempt merging letters into words based on heuristics using letter positions. Consumers may provide their own IWordExtractor to the method to improve on the very basic approach used in this release, or continue using the raw letters.
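
As a rough illustration of merging letters into words from their positions, here is a minimal Python sketch that splits a single line of letters wherever the horizontal gap is large relative to the letter width. It is not the default word extractor's actual logic. The (char, left, right) letter layout, the function name and the gap_ratio threshold are assumptions for this example.

```python
def group_letters_into_words(letters, gap_ratio=0.3):
    """Merge (char, left, right) letters on one line into words.

    A new word starts at a whitespace letter, or when the gap to the previous
    letter exceeds gap_ratio times the current letter's width.
    """
    words = []
    current = ""
    previous_right = None
    for char, left, right in sorted(letters, key=lambda letter: letter[1]):
        if char.isspace():
            # Explicit whitespace always ends the current word.
            if current:
                words.append(current)
                current = ""
            previous_right = None
            continue
        width = right - left
        if previous_right is not None and left - previous_right > gap_ratio * width:
            words.append(current)
            current = ""
        current += char
        previous_right = right
    if current:
        words.append(current)
    return words
```

A custom extractor could replace this threshold with any other rule, for example one that also takes letter height, baseline or rotation into account.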

Version 0.0.1

26 Feb 21:31

The first non pre-release version.

Alpha 002

22 Jan 21:19
Pre-release

Fixes an issue where the only encoding present is embedded in the font program.
Supports reading from streams.

Very Stable Genius

10 Jan 22:51
Pre-release

The initial alpha release.