A text preparation pipeline
Initially focused on extracting text from EPUBs for citation and stand-off annotation.
uv sync- activate venv
pengolodh title <book-id-or-path>
will either print the title (from both metadata and ncx) of the volume or, if the assertions are too strict, throw an exception.
pengolodh spine <book-id-or-path>
will either print the "spine" of the volume, or, if the assertions are too strict, throw an exception.
pengolodh extract-map <book-id-or-path> [<item-ref>] [<address>] [--recurse]
will give information about HTML elements in the EPUB.
If there is an item-ref then only that item (i.e. file) will be considered otherwise all items will be traversed.
If there is an address then only that element will be extracted otherwise the root will be extracted.
If there is a --recurse then information about the descendants will also be given.
The results are in tuple form if --recurse is used, otherwise they are in a dictionary.
Note that the name extract-map is historical and will likely change.
pengolodh text <book-id-or-path> <item-ref> [<address>]
will extract the plain text of the given item (or the specific address, if given)
pengolodh xml <book-id-or-path> <item-ref> [<address>]
will extract the XML of the given item (or the specific address, if given)
pengolodh tree <book-id-or-path> <item-ref> [<address>] [--depth <depth>]
will show the tree structure of the given item (or the specific address, if given) optionally up to the given depth.
pengolodh list-books
will list any books configured with ids (see under What is a book-id-or-path?)
$ pengolodh extract-map <book-id-or-path> chapter01
{'label': 'body.text#text', 'offset': 0, 'length': 170401, 'child_count': 1}
$ pengolodh extract-map <book-id-or-path> chapter01 1
{'label': 'div.chapter#chapter01', 'offset': 1, 'length': 170400, 'child_count': 4}
$ pengolodh extract-map <book-id-or-path> chapter01 1.3.2
{'label': 'h2.chapterTitle', 'offset': 7, 'length': 20, 'child_count': 1}
$ pengolodh extract-map <book-id-or-path> chapter01 1.3.2.1
{'label': 'span.bold', 'offset': 7, 'length': 20, 'child_count': 0}
$ pengolodh extract-map <book-id-or-path> chapter01 1.3.2 --recurse
('h2.chapterTitle', 7, 20, [('span.bold', 7, 20, [])])
$ pengolodh text <book-id-or-path> chapter01 1.3.2 will then give the extracted plain text.
This can either be the full path to an EPUB file, an unzipped EPUB, or a book identifier set in $XDG_CONFIG/pengolodh.toml as follows:
[books]
<book-id> = "<path-to-epub>"
...
An item-ref is an identifier for a particular HTML file in the EPUB given by the first column of the output of the spine command.
An address is a dot-separated path to a particular element in an HTML file. 5.1.3 would mean the third child or the first child of the fifth child of the root.