new attempt at #173 (valid polygons, faster deskewing, various fixes) #192
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,11 @@ | |
__pycache__ | ||
sbb_newspapers_org_image/pylint.log | ||
models_eynollah* | ||
models_ocr* | ||
models_layout* | ||
default-2021-03-09 | ||
Review comment: TODO for me: Rename the binarization model to include the version as well. |
||
output.html | ||
/build | ||
/dist | ||
*.tif | ||
TAGS |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,55 @@ Versioned according to [Semantic Versioning](http://semver.org/). | |
|
||
## Unreleased | ||
|
||
Fixed: | ||
|
||
* continue processing when no columns detected but text regions exist | ||
* convert marginalia to main text if no main text is present | ||
* reset deskewing angle to 0° when text covers <30% image area and detected angle >45° | ||
* :fire: polygons: avoid invalid paths (use `Polygon.buffer()` instead of dilation etc.) | ||
* `return_boxes_of_images_by_order_of_reading_new`: avoid Numpy.dtype mismatch, simplify | ||
* `return_boxes_of_images_by_order_of_reading_new`: log any exceptions instead of ignoring | ||
* `filter_contours_without_textline_inside`: avoid removing from duplicate lists twice | ||
* `get_marginals`: exit early if no peaks found to avoid spurious overlap mask | ||
* `get_smallest_skew`: after shifting search range of rotation angle, use overall best result | ||
* Dockerfile: fix CUDA installation (cuDNN contested between Torch and TF due to extra OCR) | ||
* OCR: re-instate missing methods and fix `utils_ocr` function calls | ||
* mbreorder/enhancement CLIs: missing imports | ||
* :fire: writer: `SeparatorRegion` needs `SeparatorRegionType` (not `ImageRegionType`) | ||
f458e3e | ||
* tests: switch from `pytest-subtests` to `parametrize` so we can use `pytest-isolate` | ||
(so CUDA memory gets freed between tests if running on GPU) | ||
|
||
Added: | ||
* :fire: `layout` CLI: new option `--model_version` to override default choices | ||
* test coverage for OCR options in `layout` | ||
* test coverage for table detection in `layout` | ||
* CI linting with ruff | ||
|
||
Changed: | ||
|
||
* polygons: slightly widen for regions and lines, increase for separators | ||
* various refactorings, some code style and identifier improvements | ||
* deskewing/multiprocessing: switch back to ProcessPoolExecutor (faster), | ||
but use shared memory if necessary, and switch back from `loky` to stdlib, | ||
and shutdown in `del()` instead of `atexit` | ||
* :fire: OCR: switch CNN-RNN model to `20250930` version compatible with TF 2.12 on CPU, too | ||
* OCR: allow running `-tr` without `-fl`, too | ||
* :fire: writer: use `@type='heading'` instead of `'header'` for headings | ||
* :fire: performance gains via refactoring (simplification, less copy-code, vectorization, | ||
avoiding unused calculations, avoiding unnecessary 3-channel image operations) | ||
* :fire: heuristic reading order detection: many improvements | ||
- contour vs splitter box matching: | ||
* contour must be contained in box exactly instead of heuristics | ||
* make fallback center matching, center must be contained in box | ||
- original vs deskewed contour matching: | ||
* same min-area filter on both sides | ||
* similar area score in addition to center proximity | ||
* avoid duplicate and missing mappings by allowing N:M | ||
matches and splitting+joining where necessary | ||
* CI: update+improve model caching | ||
Review comment: Thanks, @vahidrezanezhad; something along those lines would be very helpful for #186. |
||
|
||
|
||
## [0.5.0] - 2025-09-26 | ||
|
||
Fixed: | ||
|
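The "contour vs splitter box matching" entry in the changelog above describes a two-step rule: a contour is assigned to a box only if it lies entirely inside it, and otherwise falls back to the box that contains its center. The following is a minimal sketch of that idea, not the PR's actual implementation. The function names, the `(x_min, y_min, x_max, y_max)` box layout, and the use of the bounding-box center as a stand-in for the contour center are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
# Hypothetical box layout: (x_min, y_min, x_max, y_max), edges inclusive.
Box = Tuple[float, float, float, float]


def bounding_box(contour: Sequence[Point]) -> Box:
    """Axis-aligned bounding box of a contour given as (x, y) points."""
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    return min(xs), min(ys), max(xs), max(ys)


def box_contains_box(outer: Box, inner: Box) -> bool:
    """True if `inner` lies completely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def box_contains_point(box: Box, point: Point) -> bool:
    """True if `point` lies within `box`."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def match_contour_to_box(contour: Sequence[Point],
                         boxes: List[Box]) -> Optional[int]:
    """Return the index of the matching box, or None if nothing matches."""
    extent = bounding_box(contour)
    # Step 1: exact containment of the whole contour in a box.
    for index, box in enumerate(boxes):
        if box_contains_box(box, extent):
            return index
    # Step 2 (fallback): the center must be contained in a box.
    # Assumption: the bounding-box center approximates the contour center.
    center = ((extent[0] + extent[2]) / 2, (extent[1] + extent[3]) / 2)
    for index, box in enumerate(boxes):
        if box_contains_point(box, center):
            return index
    return None


boxes = [(0, 0, 100, 100), (100, 0, 200, 100)]
inside = [(10, 10), (50, 10), (50, 40), (10, 40)]
straddling = [(90, 10), (150, 10), (150, 40), (90, 40)]
print(match_contour_to_box(inside, boxes))      # 0 (fully contained)
print(match_contour_to_box(straddling, boxes))  # 1 (center fallback)
```

The sketch returns the first matching box only; the N:M matching between original and deskewed contours mentioned in the same changelog entry is a separate step and is not modelled here.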
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,12 +13,18 @@ DOCKER ?= docker | |
#SEG_MODEL := https://github.com/qurator-spk/eynollah/releases/download/v0.3.0/models_eynollah.tar.gz | ||
#SEG_MODEL := https://github.com/qurator-spk/eynollah/releases/download/v0.3.1/models_eynollah.tar.gz | ||
SEG_MODEL := https://zenodo.org/records/17194824/files/models_layout_v0_5_0.tar.gz?download=1 | ||
SEG_MODELFILE = $(notdir $(patsubst %?download=1,%,$(SEG_MODEL))) | ||
SEG_MODELNAME = $(SEG_MODELFILE:%.tar.gz=%) | ||
|
||
BIN_MODEL := https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip | ||
Review comment: TODO for me: Replace with the zenodo location. |
||
BIN_MODELFILE = $(notdir $(BIN_MODEL)) | ||
BIN_MODELNAME := default-2021-03-09 | ||
|
||
OCR_MODEL := https://zenodo.org/records/17194824/files/models_ocr_v0_5_0.tar.gz?download=1 | ||
OCR_MODEL := https://zenodo.org/records/17236998/files/models_ocr_v0_5_1.tar.gz?download=1 | ||
OCR_MODELFILE = $(notdir $(patsubst %?download=1,%,$(OCR_MODEL))) | ||
OCR_MODELNAME = $(OCR_MODELFILE:%.tar.gz=%) | ||
|
||
PYTEST_ARGS ?= -vv | ||
PYTEST_ARGS ?= -vv --isolate | ||
|
||
# BEGIN-EVAL makefile-parser --make-help Makefile | ||
|
||
|
@@ -31,7 +37,8 @@ help: | |
@echo " install Install package with pip" | ||
@echo " install-dev Install editable with pip" | ||
@echo " deps-test Install test dependencies with pip" | ||
@echo " models Download and extract models to $(CURDIR)/models_layout_v0_5_0" | ||
@echo " models Download and extract models to $(CURDIR):" | ||
@echo " $(BIN_MODELNAME) $(SEG_MODELNAME) $(OCR_MODELNAME)" | ||
@echo " smoke-test Run simple CLI check" | ||
@echo " ocrd-test Run OCR-D CLI check" | ||
@echo " test Run unit tests" | ||
|
@@ -42,33 +49,32 @@ help: | |
@echo " PYTEST_ARGS pytest args for 'test' (Set to '-s' to see log output during test execution, '-vv' to see individual tests. [$(PYTEST_ARGS)]" | ||
@echo " SEG_MODEL URL of 'models' archive to download for segmentation 'test' [$(SEG_MODEL)]" | ||
@echo " BIN_MODEL URL of 'models' archive to download for binarization 'test' [$(BIN_MODEL)]" | ||
@echo " OCR_MODEL URL of 'models' archive to download for OCR 'test' [$(OCR_MODEL)]" | ||
@echo "" | ||
|
||
# END-EVAL | ||
|
||
|
||
# Download and extract models to $(PWD)/models_layout_v0_5_0 | ||
models: models_layout_v0_5_0 models_ocr_v0_5_0 default-2021-03-09 | ||
models: $(BIN_MODELNAME) $(SEG_MODELNAME) $(OCR_MODELNAME) | ||
|
||
models_layout_v0_5_0: models_layout_v0_5_0.tar.gz | ||
tar zxf models_layout_v0_5_0.tar.gz | ||
# do not download these files if we already have the directories | ||
.INTERMEDIATE: $(BIN_MODELFILE) $(SEG_MODELFILE) $(OCR_MODELFILE) | ||
|
||
models_layout_v0_5_0.tar.gz: | ||
$(BIN_MODELFILE): | ||
wget -O $@ $(BIN_MODEL) | ||
$(SEG_MODELFILE): | ||
wget -O $@ $(SEG_MODEL) | ||
|
||
models_ocr_v0_5_0: models_ocr_v0_5_0.tar.gz | ||
tar zxf models_ocr_v0_5_0.tar.gz | ||
|
||
models_ocr_v0_5_0.tar.gz: | ||
$(OCR_MODELFILE): | ||
wget -O $@ $(OCR_MODEL) | ||
|
||
default-2021-03-09: $(notdir $(BIN_MODEL)) | ||
unzip $(notdir $(BIN_MODEL)) | ||
$(BIN_MODELNAME): $(BIN_MODELFILE) | ||
mkdir $@ | ||
mv $(basename $(notdir $(BIN_MODEL))) $@ | ||
|
||
$(notdir $(BIN_MODEL)): | ||
wget $(BIN_MODEL) | ||
unzip -d $@ $< | ||
$(SEG_MODELNAME): $(SEG_MODELFILE) | ||
tar zxf $< | ||
$(OCR_MODELNAME): $(OCR_MODELFILE) | ||
tar zxf $< | ||
|
||
build: | ||
$(PIP) install build | ||
|
@@ -82,28 +88,34 @@ install: | |
install-dev: | ||
$(PIP) install -e .$(and $(EXTRAS),[$(EXTRAS)]) | ||
|
||
deps-test: models_layout_v0_5_0 | ||
ifeq (OCR,$(findstring OCR, $(EXTRAS))) | ||
deps-test: $(OCR_MODELNAME) | ||
endif | ||
deps-test: $(BIN_MODELNAME) $(SEG_MODELNAME) | ||
$(PIP) install -r requirements-test.txt | ||
ifeq (OCR,$(findstring OCR, $(EXTRAS))) | ||
ln -rs $(OCR_MODELNAME)/* $(SEG_MODELNAME)/ | ||
endif | ||
|
||
smoke-test: TMPDIR != mktemp -d | ||
smoke-test: tests/resources/kant_aufklaerung_1784_0020.tif | ||
# layout analysis: | ||
eynollah layout -i $< -o $(TMPDIR) -m $(CURDIR)/models_layout_v0_5_0 | ||
eynollah layout -i $< -o $(TMPDIR) -m $(CURDIR)/$(SEG_MODELNAME) | ||
fgrep -q http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 $(TMPDIR)/$(basename $(<F)).xml | ||
fgrep -c -e TextRegion -e ImageRegion -e SeparatorRegion $(TMPDIR)/$(basename $(<F)).xml | ||
# layout, directory mode (skip one, add one): | ||
eynollah layout -di $(<D) -o $(TMPDIR) -m $(CURDIR)/models_layout_v0_5_0 | ||
eynollah layout -di $(<D) -o $(TMPDIR) -m $(CURDIR)/$(SEG_MODELNAME) | ||
test -s $(TMPDIR)/euler_rechenkunst01_1738_0025.xml | ||
# mbreorder, directory mode (overwrite): | ||
eynollah machine-based-reading-order -di $(<D) -o $(TMPDIR) -m $(CURDIR)/models_layout_v0_5_0 | ||
eynollah machine-based-reading-order -di $(<D) -o $(TMPDIR) -m $(CURDIR)/$(SEG_MODELNAME) | ||
fgrep -q http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 $(TMPDIR)/$(basename $(<F)).xml | ||
fgrep -c -e RegionRefIndexed $(TMPDIR)/$(basename $(<F)).xml | ||
# binarize: | ||
eynollah binarization -m $(CURDIR)/default-2021-03-09 -i $< -o $(TMPDIR)/$(<F) | ||
eynollah binarization -m $(CURDIR)/$(BIN_MODELNAME) -i $< -o $(TMPDIR)/$(<F) | ||
test -s $(TMPDIR)/$(<F) | ||
@set -x; test "$$(identify -format '%w %h' $<)" = "$$(identify -format '%w %h' $(TMPDIR)/$(<F))" | ||
# enhance: | ||
eynollah enhancement -m $(CURDIR)/models_layout_v0_5_0 -sos -i $< -o $(TMPDIR) -O | ||
eynollah enhancement -m $(CURDIR)/$(SEG_MODELNAME) -sos -i $< -o $(TMPDIR) -O | ||
test -s $(TMPDIR)/$(<F) | ||
@set -x; test "$$(identify -format '%w %h' $<)" = "$$(identify -format '%w %h' $(TMPDIR)/$(<F))" | ||
$(RM) -r $(TMPDIR) | ||
|
@@ -114,18 +126,18 @@ ocrd-test: tests/resources/kant_aufklaerung_1784_0020.tif | |
cp $< $(TMPDIR) | ||
ocrd workspace -d $(TMPDIR) init | ||
ocrd workspace -d $(TMPDIR) add -G OCR-D-IMG -g PHYS_0020 -i OCR-D-IMG_0020 $(<F) | ||
ocrd-eynollah-segment -w $(TMPDIR) -I OCR-D-IMG -O OCR-D-SEG -P models $(CURDIR)/models_layout_v0_5_0 | ||
ocrd-eynollah-segment -w $(TMPDIR) -I OCR-D-IMG -O OCR-D-SEG -P models $(CURDIR)/$(SEG_MODELNAME) | ||
result=$$(ocrd workspace -d $(TMPDIR) find -G OCR-D-SEG); \ | ||
fgrep -q http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 $(TMPDIR)/$$result && \ | ||
fgrep -c -e TextRegion -e ImageRegion -e SeparatorRegion $(TMPDIR)/$$result | ||
ocrd-sbb-binarize -w $(TMPDIR) -I OCR-D-IMG -O OCR-D-BIN -P model $(CURDIR)/default-2021-03-09 | ||
ocrd-sbb-binarize -w $(TMPDIR) -I OCR-D-SEG -O OCR-D-SEG-BIN -P model $(CURDIR)/default-2021-03-09 -P operation_level region | ||
ocrd-sbb-binarize -w $(TMPDIR) -I OCR-D-IMG -O OCR-D-BIN -P model $(CURDIR)/$(BIN_MODELNAME) | ||
ocrd-sbb-binarize -w $(TMPDIR) -I OCR-D-SEG -O OCR-D-SEG-BIN -P model $(CURDIR)/$(BIN_MODELNAME) -P operation_level region | ||
$(RM) -r $(TMPDIR) | ||
|
||
# Run unit tests | ||
test: export MODELS_LAYOUT=$(CURDIR)/models_layout_v0_5_0 | ||
test: export MODELS_OCR=$(CURDIR)/models_ocr_v0_5_0 | ||
test: export MODELS_BIN=$(CURDIR)/default-2021-03-09 | ||
test: export MODELS_LAYOUT=$(CURDIR)/$(SEG_MODELNAME) | ||
test: export MODELS_OCR=$(CURDIR)/$(OCR_MODELNAME) | ||
test: export MODELS_BIN=$(CURDIR)/$(BIN_MODELNAME) | ||
test: | ||
$(PYTHON) -m pytest tests --durations=0 --continue-on-collection-errors $(PYTEST_ARGS) | ||
|
||
|
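The Makefile diff above derives each model's archive file name and directory name from its download URL, using `$(notdir $(patsubst %?download=1,%,...))` and the substitution reference `$(VAR:%.tar.gz=%)`. As a rough illustration of what those expressions evaluate to, here is a small Python sketch that mirrors them. The helper names `model_file` and `model_name` are hypothetical and exist only for this example; the URLs are the ones from the diff.

```python
from posixpath import basename


def model_file(url: str) -> str:
    """Mirror $(notdir $(patsubst %?download=1,%,URL))."""
    suffix = "?download=1"
    # patsubst treats only '%' as special, so '?download=1' is a literal suffix.
    if url.endswith(suffix):
        url = url[: -len(suffix)]
    # notdir keeps everything after the last slash.
    return basename(url)


def model_name(filename: str) -> str:
    """Mirror $(VAR:%.tar.gz=%): strip a trailing '.tar.gz' if present."""
    extension = ".tar.gz"
    if filename.endswith(extension):
        return filename[: -len(extension)]
    return filename


SEG_MODEL = "https://zenodo.org/records/17194824/files/models_layout_v0_5_0.tar.gz?download=1"
OCR_MODEL = "https://zenodo.org/records/17236998/files/models_ocr_v0_5_1.tar.gz?download=1"

print(model_file(SEG_MODEL))              # models_layout_v0_5_0.tar.gz
print(model_name(model_file(SEG_MODEL)))  # models_layout_v0_5_0
print(model_name(model_file(OCR_MODEL)))  # models_ocr_v0_5_1
```

Deriving the names this way means that bumping a model version only requires changing the URL variable; the targets, the `smoke-test` paths, and the exported `MODELS_*` variables follow automatically. The binarization model is the exception in the diff: its directory name `BIN_MODELNAME` is still set by hand.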
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
pytest | ||
pytest-subtests | ||
pytest-isolate | ||
coverage[toml] | ||
black |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,5 +5,4 @@ scikit-learn >= 0.23.2 | |
tensorflow < 2.13 | ||
numba <= 0.58.1 | ||
scikit-image | ||
loky | ||
biopython |