Trying to run the instructions.ipynb got this error : XMLSyntaxError: Document is empty, line 1, column 1 (, line 1) #6

Open
parth-jr opened this issue Jul 6, 2023 · 2 comments

parth-jr commented Jul 6, 2023

XMLSyntaxError                            Traceback (most recent call last)
.\slicemypdf.py in _get_token_coordinates(self, draw_img, entry)
    538             parser = etree.XMLParser(recover=True)
--> 539             xml_doc = etree.fromstring(xml_doc, parser=parser)
    540 

src/lxml/etree.pyx in lxml.etree.fromstring()

src/lxml/parser.pxi in lxml.etree._parseMemoryDocument()

src/lxml/parser.pxi in lxml.etree._parseDoc()

src/lxml/parser.pxi in lxml.etree._BaseParser._parseUnicodeDoc()

src/lxml/parser.pxi in lxml.etree._ParserContext._handleParseResultDoc()

src/lxml/parser.pxi in lxml.etree._handleParseResult()

src/lxml/parser.pxi in lxml.etree._raiseParseError()

XMLSyntaxError: Document is empty, line 1, column 1 (, line 1)

During handling of the above exception, another exception occurred:
...
--> 879             raise Exception("Unable to locate coordinates for text! Provide a valid path to a text-based PDF with a single table")
    880         return coordinate_table, vertical_distance_list,\
    881             horizontal_distance_list, img, original

Exception: Unable to locate coordinates for text! Provide a valid path to a text-based PDF with a single table.

hjmok commented Sep 29, 2023

Got the same issue using their example PDF.


CappucciNOPE commented Jan 2, 2024

  1. I had to edit the slicemypdf.py file so that, when it constructs the command-line instructions in _create_coordinate_table and _create_coordinate_from_html_table, it automatically puts quotes around the filepath. Those functions look like this now:

    def _create_coordinate_table(self,
                                 pdf_text_path=settings["pdf_text_path"]):
        # Run pdftotext on the selected page and return what it prints to
        # stdout. The file path is wrapped in quotes so paths with spaces work.
        cmd = """{0} -bbox-layout -enc UTF-8 -f {1} -l {1} \"{2}\" -""".format(
            pdf_text_path, self.page, self.pdf_loc
        )
        a = delegator.run(cmd)
        #("a",a,a.out)
        return a.out

    def _create_coordinate_from_html_table(self,
                                           pdf_text_path=settings["pdf_html_path"]):
        # Run pdftohtml on the selected page, writing XML to output.xml.
        # The file path is wrapped in quotes so paths with spaces work.
        cmd = """{0} -xml -fontfullname -nodrm  -hidden  -i -f {1} -l {1} \"{2}\" output.xml""".format(
            pdf_text_path, self.page, self.pdf_loc
        )
        a = delegator.run(cmd)
        # Read the XML back and delete the temporary file
        # (cat and rm assume a Unix-like shell).
        b = delegator.run("cat output.xml")
        xml_op = b.out
        b = delegator.run("rm output.xml")
        return xml_op
    
  2. I'm also running it without Anaconda and had to do some funky stuff to import it from a filepath using importlib.util. You can find decent instructions here: https://www.delftstack.com/howto/python/import-python-file-from-path/. I used the first example.

  3. Because I am using slicemypdf without conda, I also had to edit the line where the settings.yml file is read and just included the full filepath to the settings.yml file. I also had to change the first two lines of the yml file so that they read as follows:
    pdf_text_path: "pdftotext"
    pdf_html_path: "pdftohtml"

  4. There is an issue with importing delegator: two different modules exist, delegator and delegator.py, and Python always wants to import the first. I used the import-from-filepath approach from point 2 to make sure slicemypdf.py only used the module from delegator.py, available here: https://github.com/amitt001/delegator.py.
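
For anyone who wants the quoting from point 1 without editing the format strings by hand, here is a rough sketch using only the standard library. It builds the pdftotext call as an argument list, so a path with spaces stays a single argument, and uses shlex.join when one command string is needed. The helper names and the sample path are made up for illustration and are not part of slicemypdf, and shlex quoting follows POSIX shell rules, so this is an assumption that may not carry over to cmd.exe on Windows.

```python
import shlex


def build_pdftotext_args(pdf_text_path, page, pdf_loc):
    """Return the pdftotext invocation as an argument list (hypothetical helper)."""
    # Each list element reaches the program as exactly one argument, so a
    # path containing spaces needs no manual quoting.
    return [pdf_text_path, "-bbox-layout", "-enc", "UTF-8",
            "-f", str(page), "-l", str(page), pdf_loc, "-"]


def build_pdftotext_cmd(pdf_text_path, page, pdf_loc):
    """Return the same invocation as a single POSIX-shell-safe string."""
    return shlex.join(build_pdftotext_args(pdf_text_path, page, pdf_loc))


# Hypothetical path with spaces, the case that plain string formatting breaks on.
args = build_pdftotext_args("pdftotext", 1, "my files/example table.pdf")
cmd = build_pdftotext_cmd("pdftotext", 1, "my files/example table.pdf")

# Splitting the string the way a POSIX shell would gives back the same arguments.
assert shlex.split(cmd) == args
```

The list form can be passed straight to subprocess.run(args, capture_output=True, text=True), which skips the shell entirely.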

After fixing all that, it worked like a charm!
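
In case the link in point 2 goes stale, this is roughly the importlib.util pattern for loading a module from an explicit file path. It is a sketch, not the exact code I used: import_from_path and the throwaway demo_module are hypothetical names, and the demo writes a temporary file only so the snippet runs on its own.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def import_from_path(module_name, file_path):
    """Load a Python source file as a module under the given name (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {module_name!r} from {file_path}")
    module = importlib.util.module_from_spec(spec)
    # Register the module first so later "import <module_name>" statements
    # find this copy in sys.modules instead of searching sys.path.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway module; in practice file_path would point at
# slicemypdf.py or at the delegator.py source file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_module.py"
    demo_path.write_text("GREETING = 'hello'\n")
    demo = import_from_path("demo_module", demo_path)

assert demo.GREETING == "hello"
```

Because the module is registered in sys.modules under the name you pass in, loading the delegator.py file under the name "delegator" before importing slicemypdf should make slicemypdf's own import pick up that copy, which is the effect described in point 4.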
