Commit b66b08e

Merge pull request #105 from swisstopo/LGVISIUM-102/LayerIdentifierColumn
LGVISIUM-102: common parent class "Sidebar" for LayerIdentifierColumn and DepthColumn
2 parents 2ffcdef + 474111a commit b66b08e

32 files changed, with 1613 additions and 1830 deletions.

README.md

Lines changed: 3 additions & 97 deletions
Original file line number | Diff line number | Diff line change
@@ -124,103 +124,9 @@ Use `boreholes-extract-all --help` to see all options for the extraction script.
124124

125125
4. **Check the results**
126126

127-
Once the script has finished running, you can check the results in the `data/output/draw` directory. The result is a `predictions.json` file as well as a png file for each page of each PDF in the specified input directory.
128-
129-
### Output Structure
130-
The `predictions.json` file contains the results of a data extraction process from PDF files. Each key in the JSON object is the name of a PDF file, and the value is a list of extracted items in a dictionary like object. The extracted items for now are the material descriptions in their correct order (given by their depths).
131-
132-
Example: predictions.json
133-
```json
134-
{
135-
"685256002-bp.pdf": { # file name
136-
"language": "de",
137-
"metadata": {
138-
"coordinates": null
139-
},
140-
"layers": [ # a layer corresponds to a material layer in the borehole profile
141-
{
142-
"material_description": { # all information about the complete description of the material of the layer
143-
"text": "grauer, siltig-sandiger Kies (Auffullung)",
144-
"rect": [
145-
232.78799438476562,
146-
130.18496704101562,
147-
525.6640014648438,
148-
153.54295349121094
149-
],
150-
"lines": [
151-
{
152-
"text": "grauer, siltig-sandiger Kies (Auffullung)",
153-
"rect": [
154-
232.78799438476562,
155-
130.18496704101562,
156-
525.6640014648438,
157-
153.54295349121094
158-
],
159-
"page": 1
160-
}
161-
],
162-
"page": 1
163-
},
164-
"depth_interval": { # information about the depth of the layer
165-
"start": null,
166-
"end": {
167-
"value": 0.4,
168-
"rect": [
169-
125.25399780273438,
170-
140.2349853515625,
171-
146.10398864746094,
172-
160.84498596191406
173-
],
174-
"page": 1
175-
}
176-
}
177-
},
178-
...
179-
],
180-
"depths_materials_column_pairs": [ # information about where on the pdf the information for material description as well as depths are taken.
181-
{
182-
"depth_column": {
183-
"rect": [
184-
119.05999755859375,
185-
140.2349853515625,
186-
146.8470001220703,
187-
1014.4009399414062
188-
],
189-
"entries": [
190-
{
191-
"value": 0.4,
192-
"rect": [
193-
125.25399780273438,
194-
140.2349853515625,
195-
146.10398864746094,
196-
160.84498596191406
197-
],
198-
"page": 1
199-
},
200-
{
201-
"value": 0.6,
202-
"rect": [
203-
125.21800231933594,
204-
153.8349609375,
205-
146.0679931640625,
206-
174.44496154785156
207-
],
208-
"page": 1
209-
},
210-
...
211-
]
212-
}
213-
}
214-
],
215-
"page_dimensions": [
216-
{
217-
"height": 1192.0999755859375,
218-
"width": 842.1500244140625
219-
}
220-
]
221-
},
222-
}
223-
```
127+
The script produces output in two different formats:
128+
- A file `data/output/predictions.json` that contains all extracted data in a machine-readable format. The structure of this file is documented in [README.predictions-json.md](README.predictions-json.md).
129+
- A PNG image of each processed PDF page in the `data/output/draw` directory, where the extracted data is highlighted.
224130

225131
# Developer Guidance
226132
## Project Structure

README.predictions-json.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
1+
# `predictions.json` output structure
2+
The `predictions.json` file contains the results of a data extraction process in a machine-readable format. By default, the file is written to `data/output/predictions.json`.
3+
4+
Each key in the JSON object is the name of a PDF file. The extracted data is listed as an object with the following keys:
5+
- `metadata`
6+
  - `elevation`: the detected elevation (if any) and the location in the PDF where it was extracted from.
7+
  - `coordinates`: the detected coordinates (if any) and the location in the PDF where they were extracted from.
8+
  - `language`: the language that was detected for the document.
9+
  - `page_dimensions`: the dimensions of each page in the PDF, measured in PDF points.
10+
- `layers`: a list of objects, where each object represents a layer of the borehole profile, using the following keys:
11+
  - `material_description`: the text of the material description, both as a single value and line by line, and the locations in the PDF where the text and the individual lines were extracted from.
12+
  - `depth_interval`: the measured depths of the upper and lower limits of the layer, and the locations in the PDF where they were extracted from.
13+
- `bounding_boxes`: a list of objects, one for each (part of a) borehole profile in the PDF, each listing bounding boxes that can be used for visualizations. Each object has the following keys:
14+
  - `sidebar_rect`: the area of the page that contains a "sidebar" (if any), which holds depths or other data displayed to the side of the material descriptions.
15+
  - `depth_column_entries`: a list of the locations of the entries in the depth column (if any).
16+
  - `material_description_rect`: the area of the page that contains all material descriptions.
17+
  - `page`: the number of the page of the PDF.
18+
- `groundwater`: a list of objects, one for each groundwater measurement that was extracted from the PDF. Each object has the following keys:
19+
  - `date`: the extracted date of the groundwater measurement (if any) as a string in YYYY-MM-DD format.
20+
  - `depth`: the measured depth (in m) of the groundwater measurement.
21+
  - `elevation`: the elevation (in m above sea level) of the groundwater measurement.
22+
  - `page` and `rect`: the location in the PDF where the groundwater measurement was extracted from.
23+
24+
All page numbers are counted starting at 1.
25+
26+
All bounding boxes are measured with PDF points as the unit, and with the top-left of the page as the origin.
27+
28+
## Example output
29+
```yaml
30+
{
31+
"B366.pdf": { # file name
32+
"metadata": {
33+
"elevation": {
34+
"elevation": 355.35,
35+
"page": 1,
36+
"rect": [27.49843978881836, 150.2817840576172, 159.42971801757812, 160.76754760742188]
37+
},
38+
"coordinates": {
39+
"E": 659490.0,
40+
"N": 257200.0,
41+
"rect": [28.263830184936523, 179.63882446289062, 150.3379364013672, 188.7487335205078],
42+
"page": 1
43+
},
44+
"language": "de",
45+
"page_dimensions": [
46+
{
47+
"width": 591.956787109375,
48+
"height": 1030.426025390625
49+
},
50+
{
51+
"width": 588.009521484375,
52+
"height": 792.114990234375
53+
}
54+
]
55+
},
56+
"layers": [
57+
{
58+
"material_description": {
59+
"text": "beiger, massig-dichter, stark dolomitisierter Kalk, mit Muschelresten",
60+
"lines": [
61+
{
62+
"text": "beiger, massig-dichter, stark",
63+
"page": 1,
64+
"rect": [258.5303039550781, 345.9997253417969, 379.9410705566406, 356.1011657714844]
65+
},
66+
{
67+
"text": "dolomitisierter Kalk, mit",
68+
"page": 1,
69+
"rect": [258.2362060546875, 354.4559326171875, 363.0706787109375, 364.295654296875]
70+
},
71+
{
72+
"text": "Muschelresten",
73+
"page": 1,
74+
"rect": [258.48748779296875, 363.6712341308594, 313.03204345703125, 371.3343505859375]
75+
}
76+
],
77+
"page": 1,
78+
"rect": [258.2362060546875, 345.9997253417969, 379.9410705566406, 371.3343505859375]
79+
},
80+
"depth_interval": {
81+
"start": {
82+
"value": 1.5,
83+
"rect": [200.63790893554688, 331.3035888671875, 207.83108520507812, 338.30450439453125]
84+
},
85+
"end": {
86+
"value": 6.0,
87+
"rect": [201.62551879882812, 374.30560302734375, 210.0361328125, 380.828857421875]
88+
}
89+
}
90+
},
91+
# ... (more layers)
92+
],
93+
"bounding_boxes": [
94+
{
95+
"sidebar_rect": [198.11251831054688, 321.8956298828125, 210.75906372070312, 702.2628173828125],
96+
"depth_column_entries": [
97+
[200.1201171875, 321.8956298828125, 208.59901428222656, 328.6802062988281],
98+
[200.63790893554688, 331.3035888671875, 207.83108520507812, 338.30450439453125],
99+
[201.62551879882812, 374.30560302734375, 210.0361328125, 380.828857421875],
100+
[199.86251831054688, 434.51556396484375, 210.10894775390625, 441.4538879394531],
101+
[198.11251831054688, 557.5472412109375, 210.35877990722656, 563.9244995117188],
102+
[198.28451538085938, 582.0216674804688, 209.76953125, 588.7603759765625],
103+
[198.7814178466797, 616.177001953125, 209.50042724609375, 622.502197265625],
104+
[198.6378173828125, 663.2830810546875, 210.75906372070312, 669.5428466796875],
105+
[198.26901245117188, 695.974609375, 209.12693786621094, 702.2628173828125]
106+
],
107+
"material_description_rect": [256.777099609375, 345.9997253417969, 392.46051025390625, 728.2700805664062],
108+
"page": 1
109+
},
110+
{
111+
"sidebar_rect": null,
112+
"depth_column_entries": [],
113+
"material_description_rect": [192.3216094970703, 337.677978515625, 291.1827392578125, 633.6331176757812],
114+
"page": 2
115+
}
116+
],
117+
"groundwater": [
118+
{
119+
"date": "1979-11-29",
120+
"depth": 19.28,
121+
"elevation": 336.07,
122+
"page": 1,
123+
"rect": [61.23963928222656, 489.3185119628906, 94.0096435546875, 513.6478881835938]
124+
}
125+
]
126+
}
127+
}
128+
```
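
As an illustration of how the structure documented in `README.predictions-json.md` might be consumed, the following is a minimal sketch that lists every layer with its depth interval. The embedded sample is a hypothetical, trimmed-down document (its second layer is invented for the example), and the helper name `summarize_layers` is not part of the repository. The sketch also assumes that `start` or `end` can be `null`, as in the example from the removed README section.

```python
import json
from typing import Optional

# Hypothetical, trimmed-down sample that follows the documented key layout.
# The second layer is invented to exercise the "missing start depth" case.
SAMPLE = """
{
  "B366.pdf": {
    "metadata": {"language": "de"},
    "layers": [
      {
        "material_description": {
          "text": "beiger, massig-dichter, stark dolomitisierter Kalk, mit Muschelresten"
        },
        "depth_interval": {"start": {"value": 1.5}, "end": {"value": 6.0}}
      },
      {
        "material_description": {"text": "Kies"},
        "depth_interval": {"start": null, "end": {"value": 0.4}}
      }
    ]
  }
}
"""

Row = tuple[str, Optional[float], Optional[float], str]


def summarize_layers(predictions: dict) -> list[Row]:
    """Return (file name, start depth, end depth, description) for every layer."""
    rows: list[Row] = []
    for file_name, file_data in predictions.items():
        for layer in file_data.get("layers", []):
            # Assumption: depth_interval, start and end may each be null.
            interval = layer.get("depth_interval") or {}
            start = (interval.get("start") or {}).get("value")
            end = (interval.get("end") or {}).get("value")
            rows.append((file_name, start, end, layer["material_description"]["text"]))
    return rows


rows = summarize_layers(json.loads(SAMPLE))
for file_name, start, end, text in rows:
    print(f"{file_name}: {start} to {end} m: {text}")
```

To run this against real output, replace `json.loads(SAMPLE)` with the parsed content of `data/output/predictions.json`.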

pyproject.toml

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ dependencies = [
1212
"boto3",
1313
"pandas",
1414
"levenshtein",
15-
"pathlib",
1615
"python-dotenv",
1716
"setuptools",
1817
"tqdm",

src/stratigraphy/annotations/draw.py

Lines changed: 10 additions & 14 deletions
@@ -8,8 +8,7 @@
88
import pandas as pd
99
from dotenv import load_dotenv
1010
from stratigraphy.data_extractor.data_extractor import FeatureOnPage
11-
from stratigraphy.depthcolumn.depthcolumn import DepthColumn
12-
from stratigraphy.depths_materials_column_pairs.depths_materials_column_pairs import DepthsMaterialsColumnPairs
11+
from stratigraphy.depths_materials_column_pairs.bounding_boxes import BoundingBoxes
1312
from stratigraphy.groundwater.groundwater_extraction import Groundwater
1413
from stratigraphy.layer.layer import Layer
1514
from stratigraphy.metadata.coordinate_extraction import Coordinate
@@ -55,7 +54,7 @@ def draw_predictions(
5554
for file_prediction in predictions.file_predictions_list:
5655
logger.info("Drawing predictions for file %s", file_prediction.file_name)
5756

58-
depths_materials_column_pairs = file_prediction.depths_materials_columns_pairs
57+
bounding_boxes = file_prediction.bounding_boxes
5958
coordinates = file_prediction.metadata.coordinates
6059
elevation = file_prediction.metadata.elevation
6160

@@ -98,7 +97,7 @@ def draw_predictions(
9897
draw_depth_columns_and_material_rect(
9998
shape,
10099
page.derotation_matrix,
101-
[pair for pair in depths_materials_column_pairs if pair.page == page_number],
100+
[bboxes for bboxes in bounding_boxes if bboxes.page == page_number],
102101
)
103102
draw_material_descriptions(
104103
shape,
@@ -245,7 +244,7 @@ def draw_material_descriptions(shape: fitz.Shape, derotation_matrix: fitz.Matrix
245244

246245

247246
def draw_depth_columns_and_material_rect(
248-
shape: fitz.Shape, derotation_matrix: fitz.Matrix, depths_materials_column_pairs: list[DepthsMaterialsColumnPairs]
247+
shape: fitz.Shape, derotation_matrix: fitz.Matrix, bounding_boxes: list[BoundingBoxes]
249248
):
250249
"""Draw depth columns as well as the material rects on a pdf page.
251250
@@ -257,25 +256,22 @@ def draw_depth_columns_and_material_rect(
257256
Args:
258257
shape (fitz.Shape): The shape object for drawing.
259258
derotation_matrix (fitz.Matrix): The derotation matrix of the page.
260-
depths_materials_column_pairs (list): List of depth column entries.
259+
bounding_boxes (list[BoundingBoxes]): List of bounding boxes for depth column and material descriptions.
261260
"""
262-
for pair in depths_materials_column_pairs:
263-
depth_column: DepthColumn = pair.depth_column
264-
material_description_rect = pair.material_description_rect
265-
266-
if depth_column: # Draw rectangle for depth columns
261+
for bboxes in bounding_boxes:
262+
if bboxes.sidebar_bbox: # Draw rectangle for depth columns
267263
shape.draw_rect(
268-
fitz.Rect(depth_column.rect()) * derotation_matrix,
264+
fitz.Rect(bboxes.sidebar_bbox.rect) * derotation_matrix,
269265
)
270266
shape.finish(color=fitz.utils.getColor("green"))
271-
for depth_column_entry in depth_column.entries: # Draw rectangle for depth column entries
267+
for depth_column_entry in bboxes.depth_column_entry_bboxes: # Draw rectangle for depth column entries
272268
shape.draw_rect(
273269
fitz.Rect(depth_column_entry.rect) * derotation_matrix,
274270
)
275271
shape.finish(color=fitz.utils.getColor("purple"))
276272

277273
shape.draw_rect( # Draw rectangle for material description column
278-
fitz.Rect(material_description_rect) * derotation_matrix,
274+
bboxes.material_description_bbox.rect * derotation_matrix,
279275
)
280276
shape.finish(color=fitz.utils.getColor("red"))
281277
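
The `draw.py` changes above only show how `BoundingBoxes` is used, not how it is defined. The following is a rough sketch of a data structure that would fit that usage and the `bounding_boxes` entries documented in `README.predictions-json.md`. The attribute names `sidebar_bbox`, `depth_column_entry_bboxes`, `material_description_bbox` and `page` are taken from the diff. Everything else is an assumption, including the `BoundingBox` helper class, the `to_json` methods, the `on_page` function and the use of plain tuples instead of `fitz.Rect`. The actual class in `stratigraphy.depths_materials_column_pairs.bounding_boxes` may look different.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a rectangle is (x0, y0, x1, y1) in PDF points, origin at the top-left.
Rect = tuple[float, float, float, float]


@dataclass
class BoundingBox:
    """Hypothetical wrapper around a single rectangle."""

    rect: Rect

    def to_json(self) -> list[float]:
        return list(self.rect)


@dataclass
class BoundingBoxes:
    """Hypothetical container mirroring the attributes accessed in draw.py."""

    sidebar_bbox: Optional[BoundingBox]
    depth_column_entry_bboxes: list[BoundingBox]
    material_description_bbox: BoundingBox
    page: int

    def to_json(self) -> dict:
        # Key names follow the `bounding_boxes` entries in predictions.json.
        return {
            "sidebar_rect": self.sidebar_bbox.to_json() if self.sidebar_bbox else None,
            "depth_column_entries": [entry.to_json() for entry in self.depth_column_entry_bboxes],
            "material_description_rect": self.material_description_bbox.to_json(),
            "page": self.page,
        }


def on_page(all_bboxes: list[BoundingBoxes], page_number: int) -> list[BoundingBoxes]:
    """Select the bounding boxes of one page, like the list comprehension in draw_predictions."""
    return [bboxes for bboxes in all_bboxes if bboxes.page == page_number]


# Values taken from the second `bounding_boxes` entry of the example output.
page_two = BoundingBoxes(
    sidebar_bbox=None,
    depth_column_entry_bboxes=[],
    material_description_bbox=BoundingBox(
        (192.3216094970703, 337.677978515625, 291.1827392578125, 633.6331176757812)
    ),
    page=2,
)
print(page_two.to_json())
```

An optional `sidebar_bbox` matches the `if bboxes.sidebar_bbox:` check in `draw_depth_columns_and_material_rect` and the `"sidebar_rect": null` entry in the example output.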
