Get Layered PDF using pdfrw #244

Open
rafayaar opened this issue Apr 15, 2024 · 1 comment


@rafayaar

I have a few concerns:

1- When reading a PDF with PdfReader, we get a reader object which, when printed, shows quite detailed metadata. I can't see the text content in that metadata.

2- '/Parent': {...} shows up in the metadata. I don't understand why the Ellipsis operator is used here.

3- I am trying to get the layers of a PDF, so that I can get the background and the text content along with font family, weight, dimensions, etc., as well as images, graphics and everything else. Is there any way I can do that?

@sl2c

sl2c commented May 13, 2024

> I have a few concerns:
>
> 1- When reading a PDF with PdfReader, we get a reader object which, when printed, shows quite detailed metadata. I can't see the text content in that metadata.

If by "text content" you mean the text you see when a PDF page is rendered on screen, then that text, like everything else that gets rendered, is stored in the streams attached to the PDF dictionaries. Printing an instance of PdfReader only prints the dictionary headers. This is because pdfrw works almost entirely with the headers: it has only rudimentary support for decompressing streams and no support for parsing them. For all of that I can recommend pdfrwx, which I am actively developing at the moment.
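To make the header/stream distinction concrete, here is a minimal sketch that uses only the Python standard library, not pdfrw. The `stream_object` dictionary and its "header"/"stream" keys are a made-up stand-in for a PDF stream object, and the content bytes are an invented example. It assumes the stream is compressed with /FlateDecode, which is plain zlib.

```python
import zlib

# An invented page content stream: select font /F1 at 12 pt, move, show text.
content = b"BT /F1 12 Tf 72 720 Td (Hello, layers) Tj ET"
compressed = zlib.compress(content)

# Hypothetical stand-in for a PDF stream object: a dictionary header plus raw bytes.
stream_object = {
    "header": {"/Filter": "/FlateDecode", "/Length": len(compressed)},
    "stream": compressed,
}

# Printing the header alone shows no page text, only bookkeeping entries.
print(stream_object["header"])

# The text only becomes visible after decompressing the stream bytes.
decoded = zlib.decompress(stream_object["stream"])
print(decoded.decode("latin-1"))
```

The first print shows only the filter and length; the text operators appear only in the second one.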

> 2- '/Parent': {...} shows up in the metadata. I don't understand why the Ellipsis operator is used here.

It's a good idea to recursively print only the objects that are "below" the object being printed; otherwise, following a reference such as /Parent back up the tree soon leads to infinite recursion. So the {...} is not Python's Ellipsis, just a placeholder for a dictionary whose contents are not printed. Besides this, the __repr__() function also does not explicitly recurse into some branches that are too large to print, in order not to clutter the output.
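The same placeholder shows up with ordinary Python dictionaries that refer back to each other, so it can be reproduced without pdfrw. The sketch below builds a made-up page/parent pair from plain dicts and then shows a hypothetical `summarize` helper that cuts off printing below a given depth. This is only an illustration of the idea, not pdfrw's actual __repr__() code.

```python
# A page that points to its parent, and a parent that lists the page as a kid.
pages = {"/Type": "/Pages", "/Kids": []}
page = {"/Type": "/Page", "/Parent": pages}
pages["/Kids"].append(page)

# Python's built-in repr detects the cycle and prints {...} instead of looping.
printed = repr(page)
print(printed)


def summarize(obj, depth=0, max_depth=1):
    """Render nested dicts/lists, replacing anything deeper than max_depth."""
    if isinstance(obj, dict):
        if depth > max_depth:
            return "{...}"
        items = ", ".join(
            f"{key!r}: {summarize(value, depth + 1, max_depth)}"
            for key, value in obj.items()
        )
        return "{" + items + "}"
    if isinstance(obj, list):
        if depth > max_depth:
            return "[...]"
        return "[" + ", ".join(summarize(v, depth + 1, max_depth) for v in obj) + "]"
    return repr(obj)


short = summarize(page)
print(short)
```

Limiting the depth keeps the output short and also guarantees termination, because the walk never gets far enough to loop back through /Parent and /Kids.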

> 3- I am trying to get the layers of a PDF, so that I can get the background and the text content along with font family, weight, dimensions, etc., as well as images, graphics and everything else. Is there any way I can do that?

This functionality is not available as a ready-made function in pdfrw. However, it only takes a couple of hundred lines of code to do what you want, including teaching pdfrw to parse streams. For a reference implementation, please see pdfstreamparser.py, specifically the constructor of the PdfStream class, which includes exactly the options that you are looking for.
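To give a rough idea of what parsing a stream involves, here is a very small sketch using only the standard library. It is not the code from pdfstreamparser.py; the `TOKEN` pattern and the `extract_text_runs` function are names made up for this example. It assumes an already decompressed content stream and only understands the Tf (set font and size) and Tj (show text) operators, with simple literal strings that contain no nested parentheses. Escapes, TJ arrays, hex strings and inline images are all ignored.

```python
import re

# Tokens of a simplified content stream (assumption: no nested parentheses).
TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)      # literal string, e.g. (Hello)
  | /[^\s/\[\]()<>]+          # name, e.g. /F1
  | [+-]?\d*\.?\d+            # number, e.g. 12 or 0.5
  | [A-Za-z][A-Za-z*]*        # operator, e.g. BT, Tf, Tj
""", re.VERBOSE)


def extract_text_runs(stream: bytes):
    """Return a list of {font, size, text} dicts for each Tj in the stream."""
    operands, font, size, runs = [], None, None, []
    for match in TOKEN.finditer(stream):
        token = match.group(0)
        if token[:1].isalpha():
            # Operators come after their operands, so act on what was collected.
            if token == b"Tf" and len(operands) >= 2:
                font, size = operands[-2].decode("ascii"), float(operands[-1])
            elif token == b"Tj" and operands:
                text = operands[-1][1:-1].decode("latin-1")
                runs.append({"font": font, "size": size, "text": text})
            operands = []
        else:
            operands.append(token)
    return runs


sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj /F2 9 Tf (world) Tj ET"
runs = extract_text_runs(sample)
for run in runs:
    print(run)
```

Note that /F1 and /F2 are only resource names. The actual font family has to be looked up in the /Font entry of the page's /Resources dictionary, which this sketch does not do.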
