Error with annotations #53

Open
adamestein opened this issue Mar 21, 2017 · 3 comments


@adamestein

Found an issue when upgrading from pdfquery 0.2.7 to 0.4.3. It looks like support for annotations was added in 0.3.0. This is what appears to be happening: in the _add_annots() method in pdfquery.py, an annotation object is found by pdfminer. _add_annots() retrieves this object and converts all of its information into strings (via obj_to_string()). When _add_annots() is called again, pdfminer returns a cached version of the annotation object, only this time all of its information has already been converted into strings by pdfquery. This leads to an error on line 649:

annot['URI'] = resolve1(annot['A'])['URI']

The first time through _add_annots(), resolve1(annot['A']) returns a dict with 'URI' as one of its keys. The second time through, annot['A'] is a string representation of that dict (converted by obj_to_string), so indexing it with 'URI' fails.
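To make the suspected sequence concrete, here is a minimal sketch that uses plain dicts instead of pdfminer objects. The function name `add_annots_sketch` and the `work_on_copy` flag are hypothetical, `str()` stands in for obj_to_string(), and plain indexing stands in for resolve1(); this is not pdfquery's actual code, only an illustration of how converting a cached dict in place could break a second pass, and how working on a copy might avoid it.

```python
def add_annots_sketch(annot, work_on_copy=False):
    """Simplified stand-in for the URI lookup plus string conversion."""
    if work_on_copy:
        # A shallow copy is enough here: only top-level values get replaced.
        annot = dict(annot)
    try:
        annot["URI"] = annot["A"]["URI"]
    except KeyError:
        pass
    # Convert every non-string value to a string (stand-in for obj_to_string).
    for key, value in list(annot.items()):
        if not isinstance(value, str):
            annot[key] = str(value)
    return annot


# Stand-in for the annotation dict that the parser returns from its cache.
cached = {"Subtype": "Link", "A": {"URI": "http://example.com"}}
add_annots_sketch(cached)  # first pass succeeds and mutates the cached dict

try:
    add_annots_sketch(cached)  # second pass: cached["A"] is now a str
except TypeError as exc:
    second_pass_error = exc
else:
    second_pass_error = None

# Working on a copy leaves the cached dict untouched, so repeated passes work.
fresh = {"Subtype": "Link", "A": {"URI": "http://example.com"}}
add_annots_sketch(fresh, work_on_copy=True)
add_annots_sketch(fresh, work_on_copy=True)
```

In this sketch the second call on `cached` raises TypeError because a string is indexed with 'URI', while the two calls on `fresh` both succeed because the original dict is never modified.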

I've attached a PDF file (annot.pdf) to show the problem. This file only has one line of text (a company's home page URL) which is being seen as an annotation.

This error has been found with:

  • pdfquery version 0.3.0, 0.4.x
  • pdfminer 20140328
  • python 2.7.1
  • Fedora Linux 23

If there's any other information that would help, let me know.

@jcushman
Owner

Do you have example code that reproduces this error? pdf.load() is working for me with your supplied file.

@adamestein
Author

I'm assuming you have a little script to send the file to pdf.load(). Could you attach that? That way, I can run exactly what you did with the same file. If it doesn't work for me, that could indicate something else is causing the issue. If it does work, I can trace the difference between what you did and what my code is doing. It's also possible that by removing the private information from the PDF file, I also removed whatever was causing the problem. It's been so long since I submitted this that I don't remember. I think I verified that I was still having the issue with the PDF I attached, but I can't remember for sure.

@jacksongs

I seem to be getting this error as well. PDF file here. I have also tested on this

  • pdfquery 0.4.3
  • pdfminer 20170720
  • python 3.6
  • OSX 10.12.6

Error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-6d31003dedab> in <module>()
     13     pdf = pdfquery.PDFQuery("../"+name)
     14     pdf.load()
---> 15     tree = pdf.get_tree()
     16     #tree.write("current.xml", pretty_print=True)
     17 

~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in get_tree(self, *page_numbers)
    485                 else:
    486                     pages = enumerate(self.get_layouts())
--> 487                 for n, page in pages:
    488                     page = self._xmlize(page)
    489                     page.set('page_index', obj_to_string(n))

~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in <genexpr>(.0)
    606     def get_layouts(self):
    607         """ Get list of PDFMiner Layout objects for each page. """
--> 608         return (self.get_layout(page) for page in self._cached_pages())
    609 
    610     def _cached_pages(self, target_page=-1):

~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in get_layout(self, page)
    601         self.interpreter.process_page(page)
    602         layout = self.device.get_result()
--> 603         layout = self._add_annots(layout, page.annots)
    604         return layout
    605 

~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in _add_annots(self, layout, annots)
    647                     annot = self._set_hwxy_attrs(annot)
    648                 try:
--> 649                     annot['URI'] = resolve1(annot['A'])['URI']
    650                 except KeyError:
    651                     pass

TypeError: string indices must be integers
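The traceback shows that the `except KeyError` on line 650 does not catch the TypeError raised when a string is indexed with 'URI'. As a rough sketch only (the helper name `uri_from_action` is hypothetical and this is not the maintainers' fix), a lookup that tolerates both a missing key and an already-stringified value might look like this:

```python
def uri_from_action(action):
    """Return the URI from an annotation action, or None if unavailable."""
    try:
        return action["URI"]
    except (KeyError, TypeError):
        # KeyError: the action dict has no 'URI' entry.
        # TypeError: the action is not a dict (e.g. it was already
        # converted to a string on an earlier pass).
        return None


print(uri_from_action({"URI": "http://example.com"}))
print(uri_from_action("{'URI': 'http://example.com'}"))
```

Catching TypeError this way would only hide the symptom; if the cached object really is being converted in place, as described at the top of this issue, avoiding that mutation is probably the more robust direction.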

jacksongs added a commit to jacksongs/pdfquery that referenced this issue Feb 5, 2018