[Data Liberation] Expose experimental Markdown importer in the importWxr step #2080

adamziel · 2024-12-13T20:16:40Z

🚧 Work in progress, don't merge 🚧

Enables importing markdown and epub files via the importWxr step (to be renamed) when the data-liberation importer is enabled.

Here's the Blueprint you can use to import the "data basics" tutorial from the Gutenberg repo:

{
    "$schema": "https://playground.wordpress.net/blueprint-schema.json",
    "landingPage": "/adding-a-delete-button/",
    "features": {
        "networking": true
    },
    "steps": [
        {
            "step": "resetData"
        },
        {
            "step": "importWxr",
            "importer": "data-liberation",
            "phpImporterOptions": {
                "data_source": "markdown_directory",
                "source_site_url": "https://raw.githubusercontent.com/WordPress/gutenberg/HEAD/docs/how-to-guides/data-basics"
            },
            "importData": {
                "resource": "git:directory",
                "url": "https://github.com/WordPress/gutenberg.git",
                "ref": "HEAD",
                "path": "docs/how-to-guides/data-basics"
            }
        }
    ]
}

Requires WordPress/blueprints-library#121

Other code examples

Combining the new importer APIs is getting ridiculous. Here are two entity readers:

The first one sources posts, meta, etc. from XHTML files stored inside a remote .epub file
The second one sources posts, meta, etc. from markdown files in a local .zip file

We can mix and match data sources (local filesystem, remote), formats (e.g. md, xhtml, wxr), and containers (plain, .zip, git in the future).

$reader = WP_Directory_Tree_Entity_Reader::create(
    new WP_Zip_Filesystem(
        WP_Remote_File_Ranged_Reader::create( 
            'https://github.com/IDPF/epub3-samples/releases/download/20230704/childrens-literature.epub'
        )
    ),
    array (
        'root_dir' => '/EPUB',
        'first_post_id' => 1,
        'allowed_extensions' => array( 'html', 'xhtml' ),
        'index_file_patterns' => array( '#^index\.x?html$#' ),
        'markup_converter_factory' => function( $content ) {
            return new WP_HTML_To_Blocks( $content );
        },
    )
);

$reader = WP_Directory_Tree_Entity_Reader::create(
    new WP_Zip_Filesystem(
        WP_File_Reader::create( __DIR__ . '/../docs.zip' )
    ),
    array (
        'root_dir' => '/',
        'first_post_id' => 1,
        'allowed_extensions' => array( 'md' ),
        'index_file_patterns' => array( '#^index\.md$#' ),
        'markup_converter_factory' => function( $content ) {
            return new WP_Markdown_To_Blocks( $content );
        },
    )
);
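
To make the `allowed_extensions` and `index_file_patterns` options more concrete, here is a minimal, self-contained sketch of how a directory-tree reader might classify file names. It is an illustration only: `classify_file()` is a hypothetical helper, not part of the Data Liberation API, and the real reader may apply these options differently.

```php
<?php
/**
 * Hypothetical helper (an assumption, not the actual reader logic):
 * decides how a file name would be treated given the options used
 * in the examples above.
 *
 * @return string One of 'index', 'page', or 'skip'.
 */
function classify_file( string $file_name, array $options ): string {
    $extension = strtolower( pathinfo( $file_name, PATHINFO_EXTENSION ) );

    // Files with an extension outside the allow-list are ignored.
    if ( ! in_array( $extension, $options['allowed_extensions'], true ) ) {
        return 'skip';
    }

    // A file matching any index pattern represents its parent directory.
    foreach ( $options['index_file_patterns'] as $pattern ) {
        if ( 1 === preg_match( $pattern, $file_name ) ) {
            return 'index';
        }
    }

    return 'page';
}

$epub_options = array(
    'allowed_extensions'  => array( 'html', 'xhtml' ),
    'index_file_patterns' => array( '#^index\.x?html$#' ),
);

foreach ( array( 'index.xhtml', 'chapter-1.html', 'cover.png' ) as $name ) {
    echo $name . ' => ' . classify_file( $name, $epub_options ) . "\n";
}
```

With the EPUB options, `index.xhtml` is classified as an index file, `chapter-1.html` as a regular page, and `cover.png` is skipped.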

Remaining work

Confirm the WXR import still works both for the regular importer and the data liberation one
Add E2E coverage
Rewrite relative markdown URLs
Enable specifying additional URL mappings directly in the Blueprint
Review the code and make any architectural adjustments necessary
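
As a rough illustration of the "Rewrite relative markdown URLs" item, the sketch below resolves a relative link against a base such as `source_site_url`. The function name `resolve_relative_url()` is hypothetical and the logic is deliberately simplified (no ports, query strings, fragments, or protocol-relative URLs). It is not the importer's implementation.

```php
<?php
/**
 * Hypothetical helper (an assumption for illustration): resolves a relative
 * link found in a Markdown file against a base URL treated as a directory.
 */
function resolve_relative_url( string $base_url, string $relative ): string {
    // Absolute URLs are returned untouched.
    if ( 1 === preg_match( '#^[a-z][a-z0-9+.-]*://#i', $relative ) ) {
        return $relative;
    }

    $parts  = parse_url( $base_url );
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Root-relative links ignore the base path entirely.
    $base_path = '/' === substr( $relative, 0, 1 ) ? '' : rtrim( $parts['path'] ?? '', '/' );

    // Normalize "." and ".." segments.
    $segments = array();
    foreach ( explode( '/', $base_path . '/' . $relative ) as $segment ) {
        if ( '' === $segment || '.' === $segment ) {
            continue;
        }
        if ( '..' === $segment ) {
            array_pop( $segments );
            continue;
        }
        $segments[] = $segment;
    }

    return $origin . '/' . implode( '/', $segments );
}

$base = 'https://raw.githubusercontent.com/WordPress/gutenberg/HEAD/docs/how-to-guides/data-basics';

echo resolve_relative_url( $base, '../README.md' ) . "\n";
// https://raw.githubusercontent.com/WordPress/gutenberg/HEAD/docs/how-to-guides/README.md
```

A real implementation would also need to decide which rewritten links should point at imported posts rather than at the original source site, which is presumably where the Blueprint-level URL mappings come in.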

adamziel · 2024-12-16T23:59:53Z

This PR needs to be split into smaller parts before merging. The new vendor libraries will certainly become a separate PR, and the Epub and HTML importers probably will, too.

Adds a forked version of the markdown parsing libraries required by the upcoming Markdown importer. We need our own fork for PHP 7.2 compatibility. The downgrade process was performed semi-automatically via Rector.

This PR adds the following libraries:

* `league/commonmark`
* `webuni/front-matter`

There are no testing steps here. This PR only adds new code without modifying the existing one.

A part of:

* #2080
* #1894

Moves the Markdown importer to a `data-liberation-markdown` package so that it can be shipped as a separate `.phar` file and downloaded only when needed.

## Testing instructions

This only moves code around. To test, confirm the CI PHP unit tests keep working.

A part of:

* #2080
* #1894

Builds `data-liberation-markdown.phar.gz` (200KB) to enable downloading the Markdown importer only when needed instead of on every page load.

A part of:

* #2080
* #1894

## Testing instructions

Run `nx build playground-data-liberation-markdown` and confirm it finishes without errors. A smoke test of the built phar file is included in the build command.

adamziel · 2024-12-17T15:22:27Z

I'm going to close this PR. I've reorganized it as a series of smaller ones that we can discuss granularly:

After all the API changes, I'm no longer sure setting up the importer in blueprint.json in the way proposed in this PR will stand the test of time. Let's land all the plumbing from the above PRs and then discuss the public API in a dedicated discussion.

adamziel added the [Type] Enhancement, [Feature] Import Export, and [Aspect] Data Liberation labels Dec 13, 2024

adamziel mentioned this pull request Dec 13, 2024

Tracking Issue: Next-gen PHP Importers for Data Liberation #1894

Open

85 tasks

adamziel mentioned this pull request Dec 17, 2024

[Data Liberation] Add Markdown parsing libraries #2092

Merged

adamziel added 13 commits December 17, 2024 12:50

Add the missing Markdown package

e438475

Move byte readers to the Blueprints Library

5f09610

WP_HTML_To_Blocks converter

02c1757

Add WP_HTML_Entity_Reader to import data from HTML files

2e84294

Add first test for WP_Epub_Entity_Reader

fc20d5e

Test the epub entity reader using local file and a remote file

798bc77

Slightly improve unit tests

b34bea1

Document a potential HTML vs XHTML issue in the EPub Entity Reader

f4185c4

Lint

7618668

Use a generic WP_Directory_Tree_Entity_Reader instead of the specialized WP_Markdown_Directory_Tree_Reader

b918f9e

Document markup_converter_factory

8a1a60e

Remove namespaces, lint

4a31689

adamziel mentioned this pull request Dec 17, 2024

[Data Liberation] Move Markdown importer to a separate package #2093

Merged

adamziel mentioned this pull request Dec 17, 2024

[Data Liberation] Build markdown importer as phar #2094

Merged

adamziel force-pushed the expose-markdown-importer branch from f522d40 to 4a31689 Compare December 17, 2024 13:35

adamziel mentioned this pull request Dec 17, 2024

[Data Liberation] Refactor Entity Readers class diagram #2096

Open

adamziel closed this Dec 17, 2024
