@@ -1,5 +1,7 @@
<p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
+ <a href="https://github.com/ds4sd/docling">
+ <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
+ </a>
</p>

# Docling
@@ -11,7 +13,7 @@
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
- [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+ [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

@@ -49,7 +51,7 @@ The output of the above command will be written to `./scratch`.

### Adjust pipeline features

- **Control pipeline options**
+ #### Control pipeline options

You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
```python
@@ -62,16 +64,15 @@ doc_converter = DocumentConverter(
)
```

- **Control table extraction options**
+ #### Control table extraction options

You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.


```python
-
pipeline_options = PipelineOptions(do_table_structure=True)
- pipeline_options.table_structure_options.do_cell_matching = False  # Uses text cells predicted from table structure model
+ pipeline_options.table_structure_options.do_cell_matching = False  # uses text cells predicted from table structure model

doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,