Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow conversion without image #26

Merged
merged 7 commits into from
Feb 3, 2025
Merged

Conversation

joewiz
Copy link
Contributor

@joewiz joewiz commented Jan 3, 2025

As proposed in #25, this PR adds a convert_file_without_image function, which can convert AWS-Textract-JSON to PAGE-XML without needing to read the original input image. This function takes the image’s dimensions as inputs, instead of reading these from the image.

The PR also adds “image-width” and “image-height” options to the CLI command, which triggers the use of the new function if they are supplied.

I lack python skills and cobbled this together, so pardon the lack of tests covering the new functionality.

I did run make test-api, which passed. (But make test-cli failed with the same error as before my changes ("No rule to make target OUT, needed by test-cli. Stop.")

I performed the following tests manually, confirming that the results were as expected:

Test 1: Using original features

The following command uses the original features of the utility:

textract2page --output-file page.xml tesseract.json source.jpg

... returned a 66k PAGE-XML file, with the following excerpt on line 8 referencing the input image file:

<pc:Page imageFilename="source.jpg" imageWidth="4593" imageHeight="7163">

Test 2: Using the new features via CLI

The following command uses the new features added in this PR:

textract2page --output-file page2.xml tesseract.json foo.bar --image-width 4593 --image-height 7163

... returned an identical file, except for the following excerpt on line 8, corresponding to the one above:

<pc:Page imageFilename="foo.bar" imageWidth="4593" imageHeight="7163">

I'd greatly appreciate any feedback!

- add convert_file_without_image function, which can convert AWS-Textract-JSON to PAGE-XML without reading the original input image; this function takes the image’s dimensions as inputs, instead of reading these from the image
- add optional “image-width” and “image-height” options to the CLI command, which triggers the use of the new function
textract2page/cli.py Outdated Show resolved Hide resolved
Copy link
Member

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks @joewiz, much appreciated!

Your implementation is already near perfect IMO – see below for some cosmetic requests. (If you don't want to work them in, let me know and I'll do it myself.)

textract2page/convert_aws.py Outdated Show resolved Hide resolved
textract2page/cli.py Outdated Show resolved Hide resolved
textract2page/cli.py Outdated Show resolved Hide resolved
textract2page/cli.py Show resolved Hide resolved
textract2page/convert_aws.py Show resolved Hide resolved
@bertsky
Copy link
Member

bertsky commented Jan 6, 2025

(But make test-cli failed with the same error as before my changes ("No rule to make target OUT, needed by test-cli. Stop.")

Strange! Sounds like your make does not know about target-specific variables, yet. Which version are you using?

@joewiz
Copy link
Contributor Author

joewiz commented Feb 2, 2025

@bertsky Thank you for the review and suggestions! I've incorporated all of them. I would be happy to clean up the PR with a rebase, or the PR could be squashed - either way works for me.

@bertsky
Copy link
Member

bertsky commented Feb 3, 2025

Excellent, many thanks! I have pushed some fixups and extended the tests. Will merge as soon as CI is happy.

@bertsky bertsky self-requested a review February 3, 2025 07:13
@bertsky bertsky merged commit 68a7da0 into slub:master Feb 3, 2025
3 checks passed
@joewiz joewiz deleted the without-image branch February 3, 2025 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants