[Data Liberation] Add EPub to Blocks converter #2097

adamziel · 2024-12-17T14:11:45Z

Adds WP_EPub_Entity_Reader to parse EPub files into WordPress posts and post meta:

$reader = new WP_EPub_Entity_Reader(
	new WP_Zip_Filesystem(
		WP_Remote_File_Ranged_Reader::create(
			'https://github.com/IDPF/epub3-samples/releases/download/20230704/childrens-literature.epub'
		)
	)
);

foreach ( $reader as $entity ) {
	print_r( $entity );
}
// Prints three arrays representing WordPress posts with content represented as block markup.

A part of #1894

Implementation details

EPub files are ZIP archives containing content represented as XHTML. They may also include other assets, such as CSS, images, a table of contents, and metadata.

This PR glues together WP_Zip_Filesystem with WP_HTML_To_Blocks to find all the XHTML files in the zip and convert them to block markup.

Since XHTML uses XML syntax that cannot be parsed via WP_HTML_Processor, we use WP_XML_Processor instead. To support XHTML, this PR adds support for parsing simple <!DOCTYPE html> declarations in WP_XML_Processor.

This PR also enables swapping WP_HTML_Processor for WP_XML_Processor in WP_HTML_To_Blocks by adding a WP_XML_Processor::expects_closer() method. It doesn't have exactly the same semantics as the WP_HTML_Processor one, but it's close enough.
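To illustrate why the two expects_closer() methods cannot share exactly the same semantics, here is a minimal sketch. The functions sketch_html_expects_closer() and sketch_xml_expects_closer() are hypothetical and are not the methods from this PR. The void element list comes from the HTML standard; the sketch covers only void elements and ignores any other cases the real processors handle.

```php
<?php
// Hypothetical helpers for illustration only; not WP_HTML_Processor or WP_XML_Processor code.

/**
 * In HTML, void elements never have a closing tag, no matter how the tag is written.
 */
function sketch_html_expects_closer( string $tag_name ): bool {
	$void_elements = array(
		'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
		'link', 'meta', 'source', 'track', 'wbr',
	);
	return ! in_array( strtolower( $tag_name ), $void_elements, true );
}

/**
 * In XML, only the syntax matters: an empty-element tag such as <br/> is
 * complete, while any other opening tag must be matched by a closing tag.
 */
function sketch_xml_expects_closer( bool $is_empty_element_tag ): bool {
	return ! $is_empty_element_tag;
}

var_dump( sketch_html_expects_closer( 'br' ) ); // HTML: BR is a void element.
var_dump( sketch_html_expects_closer( 'p' ) );  // HTML: P is a regular element.
var_dump( sketch_xml_expects_closer( true ) );  // XML: tag written as <br/>.
var_dump( sketch_xml_expects_closer( false ) ); // XML: tag written as <br>.
```

The HTML answer depends on the tag name, while the XML answer depends only on whether the tag was written in empty-element form, which is why the two methods can be close but not identical.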

Remaining work

Right now, we're guessing the location of all the XHTML files. It works for the test example above, but to support all the EPub files out there, we'd need to:

Parse the META-INF/container.xml config file to get the root file path.
Parse the root file to extract the paths of the content XHTML files, and potentially metadata such as authors, titles, pages, etc.
Discuss mapping the EPub structure into WordPress entities. We have files, chapters, and content pages. What should one WordPress page represent once the import is finished?
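The first two steps above could look roughly like the sketch below. It is not this PR's implementation: it uses PHP's bundled DOM extension on in-memory strings instead of WP_XML_Processor and WP_Zip_Filesystem, and the names epub_sketch_find_root_file() and epub_sketch_list_content_files() are hypothetical. The element names and namespaces follow the EPUB container and package document formats; the sample file names are made up. Percent-encoded or "../" hrefs are not handled.

```php
<?php
// Sketch only: these helpers are hypothetical and not part of this PR.

/**
 * Returns the package document path declared in META-INF/container.xml,
 * or null when none can be found.
 */
function epub_sketch_find_root_file( string $container_xml ): ?string {
	$doc = new DOMDocument();
	if ( '' === $container_xml || ! @$doc->loadXML( $container_xml ) ) {
		return null;
	}
	$xpath = new DOMXPath( $doc );
	$xpath->registerNamespace( 'c', 'urn:oasis:names:tc:opendocument:xmlns:container' );
	$nodes = $xpath->query( '//c:rootfile/@full-path' );
	return $nodes->length > 0 ? $nodes->item( 0 )->nodeValue : null;
}

/**
 * Returns the XHTML content paths in spine (reading) order, resolved
 * relative to the directory that contains the package document.
 */
function epub_sketch_list_content_files( string $opf_xml, string $opf_path ): array {
	$doc = new DOMDocument();
	if ( '' === $opf_xml || ! @$doc->loadXML( $opf_xml ) ) {
		return array();
	}
	$xpath = new DOMXPath( $doc );
	$xpath->registerNamespace( 'opf', 'http://www.idpf.org/2007/opf' );

	// Map manifest item ids to hrefs, keeping only XHTML documents.
	$hrefs = array();
	foreach ( $xpath->query( '//opf:manifest/opf:item' ) as $item ) {
		if ( 'application/xhtml+xml' === $item->getAttribute( 'media-type' ) ) {
			$hrefs[ $item->getAttribute( 'id' ) ] = $item->getAttribute( 'href' );
		}
	}

	// The spine references manifest items by id, in reading order.
	$base  = dirname( $opf_path );
	$base  = ( '.' === $base ) ? '' : $base . '/';
	$paths = array();
	foreach ( $xpath->query( '//opf:spine/opf:itemref' ) as $itemref ) {
		$id = $itemref->getAttribute( 'idref' );
		if ( isset( $hrefs[ $id ] ) ) {
			$paths[] = $base . $hrefs[ $id ];
		}
	}
	return $paths;
}

// Made-up sample documents, trimmed to the relevant elements.
$container = '<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
	<rootfiles>
		<rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
	</rootfiles>
</container>';

$opf = '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
	<manifest>
		<item id="css" href="css/epub.css" media-type="text/css"/>
		<item id="ch1" href="xhtml/chapter-1.xhtml" media-type="application/xhtml+xml"/>
		<item id="cover" href="xhtml/cover.xhtml" media-type="application/xhtml+xml"/>
	</manifest>
	<spine>
		<itemref idref="cover"/>
		<itemref idref="ch1"/>
	</spine>
</package>';

$root_file = epub_sketch_find_root_file( $container );
print_r( $root_file );
print_r( epub_sketch_list_content_files( $opf, $root_file ) );
```

Reading the spine rather than scanning for *.xhtml files also yields the reading order, which may help with the entity mapping discussion in the third item.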

Open questions

Should we introduce a common WP_Markup_Processor interface to represent a subset of methods shared between the HTML processor and the XML processor?
Should we introduce class WP_XHTML_Processor extends WP_XML_Processor to align the semantics of expects_closer() and other overlapping methods?
Should we only consider the OEBPS and EPUB directories inside the epub file? Or can XHTML be stored under another path? How would we know?

cc @ellatrix @dmsnell @zaerl @brandonpayton @sirreal


sirreal · 2024-12-18T12:45:33Z

Should we introduce a common WP_Markup_Processor interface to represent a subset of methods shared between the HTML processor and the XML processor?

On the surface this seems nice: folks could then program against a common interface. I'd like to have a good understanding of what the common interface would be and how it would be used. An example might be the selectors work, which could be used to navigate documents through this common interface. Would the tag processor also implement this interface? Is the tag processor expected to be the base class for many of these other processors?

I also wonder if there are enough subtle differences that the common interface would be cumbersome without much tangible benefit. Again from the selectors work, when matching the ID selector, it needs to know if the document is in quirks mode to determine whether the match is case sensitive or insensitive. That's a purely HTML concept. Maybe XML documents are documents that are never in quirks mode, but it's something to think about.
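A small sketch of the difference described above, assuming a hypothetical helper sketch_id_selector_matches(); it is not taken from the selectors work. In HTML quirks mode, ID selectors are matched ASCII case-insensitively; otherwise, and for XML documents in this sketch, the comparison is exact.

```php
<?php
// Hypothetical helper for illustration only.

/**
 * Decides whether an ID selector such as #Main matches an element's id.
 * Callers for XML documents would always pass $quirks_mode = false.
 */
function sketch_id_selector_matches( string $selector_id, string $element_id, bool $quirks_mode ): bool {
	if ( $quirks_mode ) {
		// Case-insensitive comparison, used here to approximate ASCII case-insensitive matching.
		return 0 === strcasecmp( $selector_id, $element_id );
	}
	return $selector_id === $element_id;
}

var_dump( sketch_id_selector_matches( 'Main', 'main', true ) );  // Quirks-mode HTML document.
var_dump( sketch_id_selector_matches( 'Main', 'main', false ) ); // No-quirks HTML or XML document.
```

A shared interface would need some way to expose this flag, or to guarantee that it is always false for XML.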

adamziel · 2024-12-18T14:51:18Z

@sirreal all good points! Maybe we'd need a separate XHTML processor to integrate the selectors work, then, and the common interface would apply to XHTML and HTML, not XML and HTML.

sirreal · 2024-12-18T17:25:49Z

Selectors are a great interface for navigating trees, and they're a great fit for XML. I just wanted to mention a quirk (😉) I noticed with selector matching in the HTML Processor. I'd love for something like select and select_all to be part of the interface if we decide to implement one.

[Data Liberation] Add Epub importer

e01dec8

Description TBD

adamziel added [Type] Enhancement New feature or request [Aspect] Data Liberation labels Dec 17, 2024

adamziel added 2 commits December 17, 2024 15:51

Use WP_XML_Reader for EPubs, support simple DOCTYPE declarations in XML

d293c22

Parse EPubs as XHTML

58def6c

adamziel marked this pull request as ready for review December 17, 2024 15:17

This was referenced Dec 17, 2024

[Data Liberation] Expose experimental Markdown importer in the importWxr step #2080

Closed

Tracking Issue: Next-gen PHP Importers for Data Liberation #1894

Open

adamziel self-assigned this Dec 17, 2024
