broken chapters in an ebook #289

vinniec · 2023-08-02T18:39:44Z

Hi, for the past couple of days I've been trying to write a small script that takes the text of an ebook one chapter at a time and reworks it (I'm writing a dictation script).
The first version of my script works, but I realized that the TOC doesn't match the number of .html files inside book.epub.
Basically there are 50 chapters in the TOC and 78 HTML pages.

When I open the book in an ebook reader (Librera) there is no noticeable problem: the TOC works, and the program even tells me how many (reflowed) pages remain until the end of the chapter, counting the text of chapters that are split across multiple related .html files in book.epub.
However, if I try to fetch the text with ebooklib by iterating over the items of the EpubBook and calling get_body_content() on each, I only get part of the text of any chapter that has been split into multiple .html files.

I don't know much about how an EPUB is supposed to be structured, but I assume my ebook was created incorrectly (I think calibre was used). Still, since the TOC has the right references, how could I merge the chapters that were fragmented into multiple .html files?
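One possible way to approach this kind of merge, as a minimal sketch rather than anything ebooklib provides: treat each TOC target as the start of a chapter and attach every following file in reading order to it until the next TOC target appears. The function name `group_files_by_toc` and the sample file names are hypothetical, and the sketch assumes that TOC hrefs and file names use the same relative paths.

```python
from collections import OrderedDict
from urllib.parse import urldefrag


def group_files_by_toc(spine_files, toc_hrefs):
    """Group files (in reading order) under the TOC entry that precedes them.

    spine_files: file names in reading order.
    toc_hrefs:   TOC targets in order, optionally with a #fragment.
    Returns an ordered mapping {toc_href: [file names]}; files that come
    before the first TOC target are collected under the key None.
    """
    starts = {}
    for href in toc_hrefs:
        file_name, _fragment = urldefrag(href)
        # If several TOC entries point into one file, the first one wins.
        starts.setdefault(file_name, href)

    groups = OrderedDict()
    current = None
    for file_name in spine_files:
        # A file that is a TOC target opens a new group; otherwise it
        # continues the group opened by the previous TOC target.
        current = starts.get(file_name, current)
        groups.setdefault(current, []).append(file_name)
    return groups


# Hypothetical file names, mimicking a chapter split across two files.
demo_groups = group_files_by_toc(
    ["cover.html", "c1.html", "c1_split_001.html", "c2.html"],
    ["c1.html", "c2.html#top"],
)
```

With ebooklib, the reading order could presumably be taken from `book.spine` (pairs of item id and linear flag after `epub.read_epub`), resolving each id with `book.get_item_with_id(...)` and reading its name with `get_name()`; the text of each group would then be the concatenation of `get_body_content()` for its files. Hrefs in the TOC may need path normalisation before they match item names, so treat that mapping as an assumption to verify on the actual book.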

vinniec · 2023-08-02T21:07:19Z

Okay, small update: I fixed the table of contents with calibre, but the problem remains. Fixing it by hand takes a lot of time, and it would be nice to be able to merge chapter text automatically according to the TOC.

I am writing to add one more thing, though: in another book I adjusted the TOC to add subchapters, and this caused an error in ebooklib, which I am sharing here:

piece of code

from ebooklib import epub

book = epub.read_epub("book.epub")
for a in book.toc:
    print(type(a))
    # Fails with AttributeError when the entry is a tuple (see output below)
    print(type(a.title), a.title)

output with error

....
<class 'ebooklib.epub.Link'>
<class 'str'> CAP TITLE
<class 'tuple'>
Traceback (most recent call last):
  File "<string>", line 11, in <module>
AttributeError: 'tuple' object has no attribute 'title'

You can see that, where an ebooklib.epub.Link object is expected, the entry is a tuple, which has no title attribute.
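A tuple in `book.toc` is how ebooklib represents a nested entry: a pair of a section object and a list of its children. A loop that handles both shapes could look like the sketch below. The helper name `iter_toc` is hypothetical, and the demo data uses `SimpleNamespace` stand-ins instead of real `Link` and `Section` objects so that it runs without an EPUB file.

```python
from types import SimpleNamespace


def iter_toc(entries, depth=0):
    """Yield (depth, title, href) for every TOC node, descending into nested sections."""
    for entry in entries:
        if isinstance(entry, (tuple, list)):
            # Nested entry: (section, children)
            section, children = entry
            yield depth, section.title, getattr(section, "href", "")
            yield from iter_toc(children, depth + 1)
        else:
            # Plain entry with a title and (usually) an href
            yield depth, entry.title, getattr(entry, "href", "")


# Stand-ins for ebooklib TOC objects (assumption: only .title and .href are needed).
demo_toc = [
    SimpleNamespace(title="Preface", href="preface.html"),
    (
        SimpleNamespace(title="Part I", href="part1.html"),
        [SimpleNamespace(title="Chapter 1", href="ch1.html")],
    ),
]
flat_toc = list(iter_toc(demo_toc))
```

On a real book the call would be `iter_toc(book.toc)`. The `getattr` fallback is there because some entries may not carry an `href` attribute.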

aerkalov · 2023-08-02T22:15:56Z

If it is not a problem, could you send a sample of your input and output files to my email [email protected] so I can double-check what is going on? I want to see how the TOC is organised in the original book.

vinniec · 2023-08-06T05:19:25Z

ok i re-sent you an email :)

Coniferish · 2023-08-08T15:41:04Z

@aerkalov I have a related issue/question (but let me know if I should open a separate issue):

I am trying to map the chapter/section info from the TOC to text elements for partitioning epub books. I've noticed a few things...

the TOC hrefs can refer to the same epub.EpubHtml item from book.items
some book.items are skipped by the TOC hrefs
the beginning and end of the content returned by epub.EpubHtml.get_content() don't line up with the beginning and end of chapters.

Looking at the epub.EpubNcx, I'm wondering if it's somehow possible to use the links to define the chapter boundaries, but I haven't been able to figure it out. Any help would be greatly appreciated!
This is the book I'm using for testing: https://www.gutenberg.org/cache/epub/60153/pg60153-images.html#CONTENTS
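One way the links might be used for boundaries, sketched with the standard library only: when several TOC hrefs point into the same file with different `#fragment` parts, find the start tag carrying each fragment id and cut the raw markup at those offsets. The names `AnchorLocator` and `split_at_fragments` are hypothetical. The cut is a plain string slice, so the pieces are not well-formed documents, and an anchor nested deep inside another element would give a rough boundary.

```python
from html.parser import HTMLParser


class AnchorLocator(HTMLParser):
    """Record the character offset of the first start tag bearing each wanted id/name."""

    def __init__(self, source, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.offsets = {}
        # getpos() reports (line, column) counting "\n" only, so index line starts the same way.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        anchor = attr_map.get("id") or attr_map.get("name")
        if anchor in self.wanted and anchor not in self.offsets:
            line, column = self.getpos()  # line is 1-based, column is 0-based
            self.offsets[anchor] = self.line_starts[line - 1] + column


def split_at_fragments(source, fragments):
    """Return {fragment: markup from its anchor up to the next anchor or end of file}."""
    locator = AnchorLocator(source, fragments)
    locator.feed(source)
    locator.close()
    cuts = sorted((offset, frag) for frag, offset in locator.offsets.items())
    pieces = {}
    for index, (offset, frag) in enumerate(cuts):
        end = cuts[index + 1][0] if index + 1 < len(cuts) else len(source)
        pieces[frag] = source[offset:end]
    return pieces


# Made-up markup with two chapter headings in one file.
demo_source = '<body><h2 id="c1">One</h2><p>a</p>\n<h2 id="c2">Two</h2><p>b</p></body>'
demo_pieces = split_at_fragments(demo_source, ["c1", "c2"])
```

The fragment names would come from the part of each TOC href after `#`, and `source` from decoding the bytes returned by `get_content()`. Fragments that are not found in the file are simply absent from the result.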

Coniferish mentioned this issue Aug 15, 2023

Improve epub metadata.section Unstructured-IO/unstructured#1126

Closed
