docs: update chunking usage docs, minor reorg (#550)
Signed-off-by: Panos Vagenas <[email protected]>
vagenas authored Dec 10, 2024
1 parent a7df337 commit d0c9e8e
Showing 7 changed files with 31 additions and 26 deletions.
2 changes: 1 addition & 1 deletion docs/concepts/architecture.md
@@ -10,7 +10,7 @@ For each document format, the *document converter* knows which format-specific *

The *conversion result* contains the [*Docling document*](./docling_document.md), Docling's fundamental document representation.

Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a *chunker*.
Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a [*chunker*](./chunking.md).

For more details on Docling's architecture, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).

4 changes: 2 additions & 2 deletions docs/cli.md → docs/reference/cli.md
@@ -1,9 +1,9 @@
# CLI Reference
# CLI reference

This page provides documentation for our command line tools.

::: mkdocs-click
:module: docling.cli.main
:command: click_app
:prog_name: docling
:style: table
:style: table
File renamed without changes.
File renamed without changes.
File renamed without changes.
32 changes: 19 additions & 13 deletions docs/usage.md
@@ -22,9 +22,7 @@ A simple example would look like this:
docling https://arxiv.org/pdf/2206.01062
```

To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./cli.md).


To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).

### Advanced options

@@ -130,29 +128,37 @@ You can limit the CPU threads used by Docling by setting the environment variabl

## Chunking

You can perform a hierarchy-aware chunking of a Docling document as follows:
You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a
`HybridChunker`, as shown below (for more details check out
[this example](examples/hybrid_chunking.ipynb)):

```python
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker
from docling.chunking import HybridChunker

conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
doc = conv_res.document
chunks = list(HierarchicalChunker().chunk(doc))

print(chunks[30])
chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5") # set tokenizer as needed
chunk_iter = chunker.chunk(doc)
```

An example chunk would look like this:

```python
print(list(chunk_iter)[11])
# {
# "text": "Lately, new types of ML models for document-layout analysis have emerged [...]",
# "text": "In this paper, we present the DocLayNet dataset. [...]",
# "meta": {
# "doc_items": [{
# "self_ref": "#/texts/40",
# "self_ref": "#/texts/28",
# "label": "text",
# "prov": [{
# "page_no": 2,
# "bbox": {"l": 317.06, "t": 325.81, "r": 559.18, "b": 239.97, ...},
# }]
# }],
# "headings": ["2 RELATED WORK"],
# "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37, ...},
# }], ...,
# }, ...],
# "headings": ["1 INTRODUCTION"],
# }
# }
```
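
Once a chunk is in hand, it is often useful to pull out just the headings and page numbers it covers. The following is a minimal sketch under stated assumptions, not part of the Docling API: it assumes the chunk is available as a plain dictionary shaped like the abbreviated example output above, and `summarize_chunk` and `sample_chunk` are hypothetical names introduced only for illustration.

```python
from typing import Any


def summarize_chunk(chunk: dict[str, Any]) -> dict[str, Any]:
    """Collect the headings and page numbers referenced by one chunk dict.

    Hypothetical helper: assumes the dict layout of the example output
    ("text", "meta" -> "doc_items" -> "prov" -> "page_no", "meta" -> "headings").
    """
    meta = chunk.get("meta", {})
    # Gather the distinct page numbers across all provenance entries.
    pages = sorted(
        {
            prov["page_no"]
            for item in meta.get("doc_items", [])
            for prov in item.get("prov", [])
        }
    )
    return {
        "headings": meta.get("headings", []),
        "pages": pages,
        "n_chars": len(chunk.get("text", "")),
    }


# Hand-written sample mirroring the abbreviated example output; not real converter output.
sample_chunk = {
    "text": "In this paper, we present the DocLayNet dataset.",
    "meta": {
        "doc_items": [
            {
                "self_ref": "#/texts/28",
                "label": "text",
                "prov": [{"page_no": 2}],
            }
        ],
        "headings": ["1 INTRODUCTION"],
    },
}

summary = summarize_chunk(sample_chunk)
print(summary)
```

Working on a plain dictionary keeps the sketch independent of any particular Docling version; with real chunk objects, the same fields would need to be read from whatever representation the chunker returns.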
19 changes: 9 additions & 10 deletions mkdocs.yml
@@ -56,7 +56,6 @@ nav:
- "Docling": index.md
- Installation: installation.md
- Usage: usage.md
- CLI: cli.md
- FAQ: faq.md
- Docling v2: v2.md
- Concepts:
@@ -76,15 +75,12 @@ nav:
- "Table export": examples/export_tables.py
- "Multimodal export": examples/export_multimodal.py
- "Force full page OCR": examples/full_page_ocr.py
- Chunking:
- "Hybrid chunking": examples/hybrid_chunking.ipynb
- RAG / QA:
- "RAG with LlamaIndex 🦙": examples/rag_llamaindex.ipynb
- "RAG with LangChain 🦜🔗": examples/rag_langchain.ipynb
- "Hybrid RAG with Qdrant": examples/hybrid_rag_qdrant.ipynb
- Chunking:
- "Hybrid chunking": examples/hybrid_chunking.ipynb
# - Chunking: examples/chunking.md
# - CLI:
# - CLI: examples/cli.md
- Integrations:
- Integrations: integrations/index.md
- "🐝 Bee": integrations/bee.md
@@ -99,10 +95,13 @@ nav:
- "spaCy": integrations/spacy.md
- "txtai": integrations/txtai.md
# - "LangChain 🦜🔗": integrations/langchain.md
- API reference:
- Document Converter: api_reference/document_converter.md
- Pipeline options: api_reference/pipeline_options.md
- Docling Document: api_reference/docling_document.md
- Reference:
- Python API:
- Document Converter: reference/document_converter.md
- Pipeline options: reference/pipeline_options.md
- Docling Document: reference/docling_document.md
- CLI:
- CLI reference: reference/cli.md

markdown_extensions:
- pymdownx.superfences