I have a few concerns:
1- When reading a PDF using PdfReader, we get a reader object which, when printed, shows quite detailed metadata. I can't see the text content in that metadata.
2- '/Parent': {...} shows up in the metadata, and I don't understand why the Ellipsis operator is used here.
3- I am trying to get the layers of a PDF, so that I can get the background and the text content along with font family, weight, dimensions, etc., as well as images, graphics, and everything else. Is there any way I can do that?
1- When reading a PDF using PdfReader, we get a reader object which, when printed, shows quite detailed metadata. I can't see the text content in that metadata.
If by "text content" you mean the text you see when a PDF page is rendered on screen, then that text, like everything else that gets rendered, is contained in the PDF dictionary streams. Printing an instance of PdfReader shows only the dictionary headers. This is because pdfrw deals only with the headers: it has rudimentary support for stream decompression and no support for stream parsing. For all of that I can recommend pdfrwx, which I am actively developing at the moment.
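To illustrate the header/stream split, here is a minimal sketch that uses only the standard library rather than pdfrw itself. The toy content stream and the hand-built header dictionary are assumptions made up for the example. The only PDF fact relied on is that a /FlateDecode stream is zlib-compressed data.

```python
import zlib

# Toy page content stream (made up for this example):
# select font F1 at 12 pt, move to (72, 720), show one string.
content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET"
compressed = zlib.compress(content)

# The dictionary header only describes the stream body;
# the page text never appears in it.
header = {"/Filter": "/FlateDecode", "/Length": len(compressed)}
print(header)

# The text becomes visible only after the stream body is decompressed.
decoded = zlib.decompress(compressed)
print(decoded.decode("latin-1"))
```

Printing a header like this one is roughly what you get from printing the reader object: a description of the stream, not its contents.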
2- '/Parent': {...} shows up in the metadata, and I don't understand why the Ellipsis operator is used here.
It's a good idea to recursively print only the objects that are "below" the object being printed; otherwise you will soon run into infinite recursion. Besides this, the __repr__() function also does not explicitly recurse into some branches that are too large to print, in order not to clutter the output. So the {...} you see is a placeholder for content that was not printed, not Python's Ellipsis object.
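The recursion problem can be sketched with plain dictionaries. This is not pdfrw's actual `__repr__()` code: `short_repr` is a hypothetical helper written for this example, and the two-object page tree is made up. It writes `{...}` whenever it meets a dictionary that is already being printed further up the current path, which is also what Python's built-in `repr()` does for self-referencing dicts.

```python
def short_repr(obj, _path=()):
    """Render nested dicts and lists, writing {...} for any dict
    that is already being printed on the current path."""
    if isinstance(obj, dict):
        if id(obj) in _path:
            return "{...}"  # back-reference: stop instead of recursing forever
        path = _path + (id(obj),)
        items = ", ".join(f"{k!r}: {short_repr(v, path)}" for k, v in obj.items())
        return "{" + items + "}"
    if isinstance(obj, list):
        return "[" + ", ".join(short_repr(v, _path) for v in obj) + "]"
    return repr(obj)


# A page points up to its parent, and the parent points back down to the page.
pages = {"/Type": "/Pages"}
page = {"/Type": "/Page", "/Parent": pages}
pages["/Kids"] = [page]

text = short_repr(page)
print(text)
```

Without the check on `_path`, the /Parent and /Kids links would send the function around the same two objects until Python raised a RecursionError.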
3- I am trying to get the layers of a PDF, so that I can get the background and the text content along with font family, weight, dimensions, etc., as well as images, graphics, and everything else. Is there any way I can do that?
This functionality is not available as a ready-made function in pdfrw. However, it takes only a couple of hundred lines of code to do what you want, including teaching pdfrw to parse streams. For a reference implementation, please see pdfstreamparser.py, specifically the PdfStream class constructor, which includes exactly the options that you are looking for.
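To give a rough idea of what parsing a stream involves, here is a deliberately tiny sketch. It is not the pdfstreamparser.py implementation: `TOKEN`, `NUMBER` and `text_runs` are hypothetical names, and the sample stream is made up. It assumes an already decompressed content stream that uses only names, numbers and simple literal strings, and it handles just two operators, Tf (select font and size) and Tj (show text).

```python
import re

# Tokens: a literal string "(...)", a name "/Name", or any other
# run of characters that are neither whitespace nor delimiters.
TOKEN = re.compile(rb"\((?:[^()\\]|\\.)*\)|/[^\s/()<>\[\]]+|[^\s/()<>\[\]]+")
NUMBER = re.compile(rb"[-+]?\d*\.?\d+")


def text_runs(stream: bytes):
    """Return (font resource name, size, text) for each Tj operator."""
    operands, font, size, runs = [], None, None, []
    for tok in TOKEN.findall(stream):
        # Strings, names and numbers are operands: collect them.
        if tok[:1] in (b"(", b"/") or NUMBER.fullmatch(tok):
            operands.append(tok)
            continue
        # Anything else is treated as an operator that consumes the operands.
        if tok == b"Tf" and len(operands) >= 2:
            font = operands[-2].decode("latin-1")
            size = float(operands[-1])
        elif tok == b"Tj" and operands:
            runs.append((font, size, operands[-1][1:-1].decode("latin-1")))
        operands = []
    return runs


stream = (b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj "
          b"/F2 9 Tf 0 -14 Td (small print) Tj ET")
runs = text_runs(stream)
for run in runs:
    print(run)
```

Note that /F1 and /F2 are only resource names. The font family and weight would have to be looked up in the page's /Resources font dictionary, and real streams also contain hex strings, arrays (the TJ operator), escapes and inline images, which this sketch ignores.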