broken chapters in an ebook #289

Open
vinniec opened this issue Aug 2, 2023 · 4 comments

Comments

vinniec commented Aug 2, 2023

Hi, for the past couple of days I've been trying to write a small script that takes the text of an ebook one chapter at a time and reworks it (I'm writing a dictation script).
The first version of my script works, but I realized that the TOC doesn't match the number of HTML files inside book.epub.
Basically, there are 50 chapters in the TOC and 78 HTML pages.

When I open the book in an ebook reader (Librera) there is no noticeable problem: the TOC works, and the program even tells me how many (reflowed) pages are left until the end of the chapter, counting the text of chapters that are split across multiple related HTML files in book.epub.
However, if I try to fetch the text with ebooklib by iterating over the EpubBook and calling get_body_content(), I only get part of the text for any chapter that has been split across multiple HTML files.

I don't know much about how an EPUB is supposed to be structured, but I assume my ebook was created incorrectly (I think calibre was used). However, since the TOC has the right references, how could I merge the chapters that were fragmented into multiple HTML files?
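One possible way to merge the fragments, sketched here under assumptions: treat each TOC target as the start of a chapter and attach every following file in reading order to it until the next TOC target appears. The function name `group_spine_by_toc` and the file names are hypothetical; the inputs are plain lists so the idea can be tried without an EPUB at hand.

```python
from urllib.parse import urldefrag


def group_spine_by_toc(spine_files, toc_hrefs):
    """Assign every file to the most recent TOC entry at or before it.

    spine_files: file names in reading order.
    toc_hrefs:   TOC targets in order; a '#fragment' suffix is ignored
                 when matching against file names.
    Returns a list of (toc_href, [files]) pairs. Files that come before
    the first TOC target are grouped under None.
    """
    # Remember the first TOC href that points at each file.
    starts = {}
    for href in toc_hrefs:
        starts.setdefault(urldefrag(href)[0], href)

    groups = []
    for name in spine_files:
        if name in starts:
            # This file opens a new chapter.
            groups.append((starts[name], [name]))
        elif groups:
            # Continuation of the current chapter.
            groups[-1][1].append(name)
        else:
            # Front matter that no TOC entry points at.
            groups.append((None, [name]))
    return groups


# Hypothetical file names, for illustration only.
spine = ["cover.xhtml", "ch1.xhtml", "ch1_split_001.xhtml", "ch2.xhtml"]
toc = ["ch1.xhtml", "ch2.xhtml#start"]
for href, files in group_spine_by_toc(spine, toc):
    print(href, files)
```

With ebooklib, the file names could presumably be collected from `book.spine` via `book.get_item_with_id(idref).get_name()`, and the text of each group joined from `get_body_content()` of its files. Whether TOC hrefs and item names use the same relative paths depends on the book, so that matching step is an assumption to verify.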

Author

vinniec commented Aug 2, 2023

Okay, small update: I fixed the table of contents with calibre, but the problem remains. Fixing it by hand takes a lot of time, and it would be nice if the chapter text could be merged automatically according to the TOC.

I am writing to add one more thing, though: in another book I adjusted the TOC to add subchapters, and this produced an error in ebooklib, which I am sharing:

piece of code

from ebooklib import epub

book = epub.read_epub("book.epub")
# Print the type and title of every top-level TOC entry.
for a in book.toc:
    print(type(a))
    print(type(a.title), a.title)

output with error

....
<class 'ebooklib.epub.Link'>
<class 'str'> CAP TITLE
<class 'tuple'>
Traceback (most recent call last):
  File "<string>", line 11, in <module>
AttributeError: 'tuple' object has no attribute 'title'

You can see that an entry which would otherwise be an ebooklib.epub.Link is now a tuple, and a tuple has no title attribute.
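A minimal sketch of a workaround, assuming that a nested TOC entry is a pair of (section object, list of child entries), which is how the tuple in the traceback appears to be structured. The helper `iter_toc` is hypothetical, and the stand-in objects below replace real ebooklib entries so the snippet runs without an EPUB file.

```python
from types import SimpleNamespace


def iter_toc(entries, depth=0):
    """Yield (depth, title, href) for every TOC entry, including nested ones."""
    for entry in entries:
        if isinstance(entry, (tuple, list)):
            # Assumed shape: (section, [children]).
            head, children = entry
            yield depth, head.title, getattr(head, "href", "")
            yield from iter_toc(children, depth + 1)
        else:
            yield depth, entry.title, getattr(entry, "href", "")


# Stand-ins for ebooklib TOC objects (assumption: they expose .title and .href).
toc = [
    SimpleNamespace(title="Chapter 1", href="ch1.xhtml"),
    (
        SimpleNamespace(title="Chapter 2", href="ch2.xhtml"),
        [SimpleNamespace(title="Chapter 2.1", href="ch2.xhtml#s1")],
    ),
]

for depth, title, href in iter_toc(toc):
    print("  " * depth + title, href)
```

Passing `book.toc` instead of the stand-in list should walk the real table of contents, provided the entries have the shape assumed above.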

Owner

aerkalov commented Aug 2, 2023

If it is not a problem, could you send a sample of your input and output files to my email [email protected] so I can double-check what is going on there? I want to check how the TOC is organised in the original book.

Author

vinniec commented Aug 6, 2023

OK, I re-sent you an email :)

Coniferish commented Aug 8, 2023

@aerkalov I have a related issue/question (but let me know if I should open a separate issue):

I am trying to map the chapter/section info from the TOC to text elements for partitioning epub books. I've noticed a few things...

  1. several TOC hrefs can refer to the same epub.EpubHtml item in book.items
  2. some book.items are never referenced by any TOC href
  3. the beginning and end of the content returned by epub.EpubHtml.get_content() don't line up with the beginning and end of chapters.

Looking at the epub.EpubNcx, I'm wondering if it's somehow possible to use the links to define the chapter boundaries, but I haven't been able to figure it out. Any help would be greatly appreciated!
This is the book I'm using for testing: https://www.gutenberg.org/cache/epub/60153/pg60153-images.html#CONTENTS
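A rough sketch of how the three observations might fit together, assuming the TOC hrefs have the form `file#fragment`: splitting each href shows which entries share a file and which anchors inside that file mark the chapter starts. The function `anchors_by_file` and the sample entries are hypothetical.

```python
from urllib.parse import urldefrag


def anchors_by_file(toc_entries):
    """Group (title, href) pairs by target file.

    Returns {file_name: [(fragment, title), ...]} in TOC order. An empty
    fragment means the entry points at the start of the file.
    """
    result = {}
    for title, href in toc_entries:
        name, fragment = urldefrag(href)
        result.setdefault(name, []).append((fragment, title))
    return result


# Hypothetical TOC entries, for illustration only.
entries = [
    ("Contents", "part1.xhtml#CONTENTS"),
    ("Chapter I", "part1.xhtml#CHAPTER_I"),
    ("Chapter II", "part2.xhtml"),
]
for name, anchors in anchors_by_file(entries).items():
    print(name, anchors)
```

Files that never appear as a key would be the items skipped by the TOC, and a file with several fragments would need to be cut at the elements carrying those ids to recover chapter boundaries. That last step needs an HTML parser and is not shown here.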
