Trying to run the instructions.ipynb got this error : XMLSyntaxError: Document is empty, line 1, column 1 (, line 1) #6

parth-jr · 2023-07-06T12:44:46Z

XMLSyntaxError                            Traceback (most recent call last)
.\slicemypdf.py in _get_token_coordinates(self, draw_img, entry)
    538             parser = etree.XMLParser(recover=True)
--> 539             xml_doc = etree.fromstring(xml_doc, parser=parser)
    540 

src/lxml/etree.pyx in lxml.etree.fromstring()

src/lxml/parser.pxi in lxml.etree._parseMemoryDocument()

src/lxml/parser.pxi in lxml.etree._parseDoc()

src/lxml/parser.pxi in lxml.etree._BaseParser._parseUnicodeDoc()

src/lxml/parser.pxi in lxml.etree._ParserContext._handleParseResultDoc()

src/lxml/parser.pxi in lxml.etree._handleParseResult()

src/lxml/parser.pxi in lxml.etree._raiseParseError()

XMLSyntaxError: Document is empty, line 1, column 1 (, line 1)

During handling of the above exception, another exception occurred:
...
--> 879             raise Exception("Unable to locate coordinates for text! Provide a valid path to a text-based PDF with a single table")
    880         return coordinate_table, vertical_distance_list,\
    881             horizontal_distance_list, img, original

Exception: Unable to locate coordinates for text! Provide a valid path to a text-based PDF with a single table.


hjmok · 2023-09-29T18:23:52Z

Got the same issue using their example PDF.

CappucciNOPE · 2024-01-02T02:59:11Z

I had to edit the slicemypdf.py file so that, when it constructs the command-line instructions in `_create_coordinate_table` and `_create_coordinate_from_html_table`, it automatically puts quotes around the file path. Those functions look like this now:

```
def _create_coordinate_table(self,
                             pdf_text_path=settings["pdf_text_path"]):
    # Function to recursively parse the layout tree.
    # The PDF path ({2}) is now wrapped in double quotes.
    cmd = """{0} -bbox-layout -enc UTF-8 -f {1} -l {1} \"{2}\" -""".format(
        pdf_text_path, self.page, self.pdf_loc
    )
    a = delegator.run(cmd)
    return a.out

def _create_coordinate_from_html_table(self,
                                       pdf_text_path=settings["pdf_html_path"]):
    # Function to recursively parse the layout tree.
    # The PDF path ({2}) is now wrapped in double quotes.
    cmd = """{0} -xml -fontfullname -nodrm  -hidden  -i -f {1} -l {1} \"{2}\" output.xml""".format(
        pdf_text_path, self.page, self.pdf_loc
    )
    a = delegator.run(cmd)
    # Read the generated XML back, then delete the temporary file.
    b = delegator.run("cat output.xml")
    xml_op = b.out
    b = delegator.run("rm output.xml")
    return xml_op
```
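
As an alternative to hand-written quotes, here is a minimal sketch that lets the standard library do the escaping. It assumes the command is still handed to a POSIX shell as one string (the `cat` and `rm` calls above suggest that setup); `build_pdftotext_cmd` is a hypothetical helper, not part of slicemypdf, and the flags are copied from the function above.

```python
import shlex


def build_pdftotext_cmd(pdf_text_path, page, pdf_loc):
    """Hypothetical helper: build the pdftotext command with a shell-safe path."""
    # shlex.quote wraps the path in single quotes when it contains spaces or
    # other characters a POSIX shell would otherwise interpret.
    return "{0} -bbox-layout -enc UTF-8 -f {1} -l {1} {2} -".format(
        pdf_text_path, page, shlex.quote(pdf_loc)
    )


print(build_pdftotext_cmd("pdftotext", 1, "my files/report.pdf"))
```

Note that `shlex.quote` targets POSIX shells only, so this would not be the right tool for `cmd.exe` on Windows.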
I'm also running it without Anaconda and had to do some funky stuff to import it from a file path using `importlib.util`. You can find decent instructions here: https://www.delftstack.com/howto/python/import-python-file-from-path/. I used the first example.
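
For reference, importing a module from an explicit file path with `importlib.util` can be sketched roughly as follows. This is the standard-library pattern, not necessarily the exact code from the linked page; `import_from_path` is a hypothetical helper name and the path in the usage comment is a placeholder.

```python
import importlib.util
import sys


def import_from_path(module_name, file_path):
    """Load a module from an explicit file path instead of searching sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {module_name!r} from {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    # Register the module before executing it so later imports of the same
    # name resolve to this copy.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


# Usage (placeholder path):
# slicemypdf = import_from_path("slicemypdf", "/path/to/slicemypdf.py")
```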
Because I am using slicemypdf without conda, I also had to edit the line where the settings.yml file is read and hard-code the full file path to settings.yml there. I also had to change the first two lines of the yml file so that they read as follows:

    pdf_text_path: "pdftotext"
    pdf_html_path: "pdftohtml"

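
With bare command names like these, the two tools have to be findable on `PATH`. A quick way to check that before running the notebook, as a small sketch (`find_pdf_tools` is a hypothetical helper, not part of slicemypdf):

```python
import shutil


def find_pdf_tools(names=("pdftotext", "pdftohtml")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_pdf_tools().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, the command built by slicemypdf would presumably produce no output, which may be one way to end up with the "Document is empty" error.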
There is also a naming clash when importing delegator: two different modules exist, delegator and delegator.py, and Python always wants to import the first. I used the same import-from-file-path approach to make sure slicemypdf.py only uses the module from delegator.py, available here: https://github.com/amitt001/delegator.py.

After fixing all that, it worked like a charm!
