How does pdfquery determine the index? #66

Open
SalmonTT opened this issue Jun 13, 2018 · 0 comments

Comments

@SalmonTT

Amazon_CF.pdf

Amazon.txt
Hi jcushman!

I am a freshman from Hong Kong, and I am currently trying to find a way to read tables from a PDF and work with their data.

I ran the following code on the attached PDF and saved the results in the .txt file, which I have also attached.
import pdfquery

# Parse the PDF and dump the element tree as XML for inspection
pdf = pdfquery.PDFQuery('Amazon_CF.pdf')
pdf.load()
pdf.tree.write('test.xml', pretty_print=True)

My questions are:

  1. How is the index determined? It appears that the index order does not follow the line-by-line order of the page.
  2. Are there any methods to rearrange the index, preferably into line-by-line, left-to-right order?
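
To make the order I am after more concrete, here is a rough sketch of the sorting I have in mind. It is plain Python and not part of pdfquery: the `(x0, y0, x1, y1, text)` tuple layout, the `reading_order` name, and the `line_tol` tolerance are all my own assumptions for illustration. It assumes the usual PDF coordinate convention with the origin at the bottom-left, so a larger `y` means higher on the page.

```python
from typing import List, Tuple

# Hypothetical box layout for illustration: (x0, y0, x1, y1, text)
Box = Tuple[float, float, float, float, str]


def reading_order(boxes: List[Box], line_tol: float = 2.0) -> List[Box]:
    """Sort boxes top-to-bottom, then left-to-right within each line.

    Boxes whose top edges (y1) lie within `line_tol` of the first box
    of the current line are treated as belonging to that line.
    """
    # Top of the page first: larger y1 comes earlier.
    ordered = sorted(boxes, key=lambda b: -b[3])

    lines: List[List[Box]] = []
    for box in ordered:
        if lines and abs(lines[-1][0][3] - box[3]) <= line_tol:
            lines[-1].append(box)   # same visual line
        else:
            lines.append([box])     # start a new line

    # Within each line, order by the left edge (x0).
    return [b for line in lines for b in sorted(line, key=lambda b: b[0])]


# Made-up sample boxes, not taken from the attached PDF.
sample = [
    (300.0, 700.0, 350.0, 710.0, "1,000"),
    (50.0, 700.5, 200.0, 710.5, "Net income"),
    (50.0, 720.0, 200.0, 730.0, "Cash flow statement"),
]
texts = [b[4] for b in reading_order(sample)]
```

With something like this, the two cells of the same table row would come out next to each other even if their positions differ slightly; the tolerance value would probably need tuning per document.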

Hopefully my explanation is clear enough.
Any help would be greatly appreciated!

Cheers,
Simon

@SalmonTT changed the title from "How the pdfquery determine the index?" to "How does pdfquery determine the index?" on Jun 13, 2018