[Data Liberation] Add HTML to Blocks converter #2095

Merged 13 commits on Dec 19, 2024
adamziel
Collaborator

@adamziel adamziel commented Dec 17, 2024

Adds a basic WP_HTML_To_Blocks class that accepts HTML and outputs block markup.

The converter is deliberately basic: it considers only the markup and ignores any visual changes introduced via CSS or JavaScript. This initial PR supports only a few core blocks, but the API can easily be extended to support more HTML elements and blocks.

To preserve visual fidelity between the original HTML page and the produced block markup, we'll need annotated HTML input produced by the Try WordPress browser extension. It would contain each element's colors, sizes, etc. We cannot possibly get all of that from just analyzing the HTML on the server without building a full-blown, browser-like HTML renderer in PHP, and I know I'm not building one.

A part of #1894

Example

$html = <<<HTML
<meta name="post_title" content="My first post">
<p>Hello <b>world</b>!</p>
HTML;

$converter = new WP_HTML_To_Blocks( $html );
$converter->convert();

var_dump( $converter->get_all_metadata() );
/*
 * array( 'post_title' => array( 'My first post' ) )
 */

var_dump( $converter->get_block_markup() );
/*
 * <!-- wp:paragraph -->
 * <p>Hello <b>world</b>!</p>
 * <!-- /wp:paragraph -->
 */
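
For readers who want a feel for what "markup-only" conversion means, here is a minimal, self-contained sketch. It is not the `WP_HTML_To_Blocks` implementation: it swaps in PHP's bundled DOM extension for `WP_HTML_Processor`, the function name `sketch_html_to_blocks` is hypothetical, and it only handles `<meta name content>` tags and `<p>` elements. Character encoding handling is omitted, so it assumes ASCII input.

```php
<?php
/**
 * Hypothetical sketch, NOT the WP_HTML_To_Blocks implementation.
 * Uses DOMDocument instead of WP_HTML_Processor and assumes ASCII input.
 */
function sketch_html_to_blocks( string $html ): array {
	$doc = new DOMDocument();

	// Collect parser warnings internally instead of emitting PHP warnings.
	$previous = libxml_use_internal_errors( true );
	$doc->loadHTML( $html );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	// Gather <meta name="..." content="..."> pairs, allowing repeated names.
	$metadata = array();
	foreach ( $doc->getElementsByTagName( 'meta' ) as $meta ) {
		if ( $meta->hasAttribute( 'name' ) && $meta->hasAttribute( 'content' ) ) {
			$metadata[ $meta->getAttribute( 'name' ) ][] = $meta->getAttribute( 'content' );
		}
	}

	// Wrap every <p> element in paragraph block delimiters.
	$blocks = array();
	foreach ( $doc->getElementsByTagName( 'p' ) as $p ) {
		$blocks[] = "<!-- wp:paragraph -->\n" . $doc->saveHTML( $p ) . "\n<!-- /wp:paragraph -->";
	}

	return array(
		'metadata'     => $metadata,
		'block_markup' => implode( "\n\n", $blocks ),
	);
}

// Usage with the same input as the example above.
$result = sketch_html_to_blocks(
	"<meta name=\"post_title\" content=\"My first post\">\n<p>Hello <b>world</b>!</p>"
);
echo $result['block_markup'], "\n";
```

Nothing in this sketch looks at styles or scripts, which is exactly the limitation described above: any color, size, or layout information has to arrive as part of the input markup.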

Caveats

I had to patch WP_HTML_Processor to stop bailing out on <meta> tags referencing the document charset. Ideally we'd patch WordPress core to stop bailing out when the charset is UTF-8.

Testing instructions

This PR mostly adds new code. Just confirm the unit tests pass in CI.

cc @brandonpayton @zaerl @sirreal @dmsnell @ellatrix

@@ -0,0 +1,70 @@
<?php

abstract class WP_Entity_Reader implements \Iterator {
Collaborator Author

@adamziel adamziel Dec 17, 2024

We don't need an abstraction yet in this PR, but I'm about to propose another one with more entity readers.
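
As a rough illustration of why an `\Iterator`-based abstraction is convenient for readers, here is a minimal sketch of an iterator over a fixed list of entities. The class name `Sketch_Array_Entity_Reader` is hypothetical, and it deliberately does not extend `WP_Entity_Reader`, since that class's abstract methods are not shown in this excerpt.

```php
<?php
/**
 * Hypothetical sketch: iterates over a pre-built list of entities.
 * It does not extend WP_Entity_Reader, whose contract is not shown here.
 */
final class Sketch_Array_Entity_Reader implements \Iterator {
	/** @var array List of entities, reindexed from zero. */
	private $entities;

	/** @var int Index of the current entity. */
	private $position = 0;

	public function __construct( array $entities ) {
		$this->entities = array_values( $entities );
	}

	#[\ReturnTypeWillChange]
	public function current() {
		// Returns null once the reader has moved past the last entity.
		return $this->entities[ $this->position ] ?? null;
	}

	#[\ReturnTypeWillChange]
	public function key() {
		return $this->position;
	}

	public function next(): void {
		++$this->position;
	}

	public function rewind(): void {
		$this->position = 0;
	}

	public function valid(): bool {
		return $this->position < count( $this->entities );
	}
}

// Usage: any consumer can loop over a reader without knowing its source.
$reader = new Sketch_Array_Entity_Reader(
	array(
		array( 'type' => 'post', 'data' => array( 'post_title' => 'My first post' ) ),
	)
);
foreach ( $reader as $index => $entity ) {
	echo $index, ': ', $entity['type'], "\n";
}
```

The point of the shared interface is that an importer can consume entities with a plain `foreach`, regardless of whether they come from HTML, another file format, or an in-memory list like this one.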

@ellatrix
Member

It would be good if this creates the same output as the JS (DOM-based) converter. If @dmsnell ends up creating a JS HTML API, we could potentially move the JS one to it. Could the HTML API somehow be language agnostic?

@adamziel
Collaborator Author

@ellatrix What do you mean by the DOM-based converter? Would this involve changes in this code? Or would this involve designing the DOM-based version to match functionality with the PHP-based version?

@adamziel adamziel merged commit ee3ce32 into trunk Dec 19, 2024
10 checks passed
@adamziel adamziel deleted the html-importer branch December 19, 2024 13:00