[Data Liberation] WP_Stream_Importer with support for WXR and Markdown files #1982

adamziel · 2024-11-04T01:39:53Z

Motivation for the change, related issues

Adds WP_Stream_Importer – a generalized importer for arbitrary data. It comes with two data sources:

WP_WXR_Reader that streams entities from a WXR file
WP_Markdown_Directory_Tree_Reader that turns a markdown directory into page entities

WP_Stream_Importer

This is a draft of a re-entrant stream importer designed for importing very large datasets with minimal overhead. The few core ideas are:

Never insert a database record until all its dependencies are available.
(Almost) never post-process database data. For example, replace all the URLs upfront.
Never crash. Instead, tell the user what failures happened and ask them how to proceed (e.g. upload custom image).
Whenever the work is stopped, start the next run at that exact point.
Avoid per-record database lookups, e.g. don't run SELECT * FROM wp_posts WHERE guid = :guid
Clearly communicate progress (x out of y posts imported, x out of y images downloaded, 380MB of huge_file.zip downloaded).
Assume the import will take multiple requests and make everything re-entrant.
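
The "start the next run at that exact point" and "make everything re-entrant" ideas above can be sketched as a loop that advances a small cursor and hands it back to the caller for persistence. This is only an illustration under assumptions: `process_batch()`, the cursor array shape, and the batch size are hypothetical and not part of this PR.

```php
<?php
/**
 * Processes up to $max_items items starting at the saved cursor and returns
 * the updated cursor. Hypothetical helper, not part of this PR.
 */
function process_batch( array $items, array $cursor, int $max_items, callable $handle ): array {
	$processed = 0;
	while ( $cursor['index'] < count( $items ) && $processed < $max_items ) {
		$handle( $items[ $cursor['index'] ] );
		// Advance the cursor only after the item was fully handled, so an
		// interrupted run repeats at most the item it was working on.
		$cursor['index']++;
		$processed++;
	}
	$cursor['done'] = $cursor['index'] >= count( $items );
	return $cursor;
}

$items  = array( 'post-1', 'post-2', 'post-3', 'post-4', 'post-5' );
$seen   = array();
$handle = function ( $item ) use ( &$seen ) {
	$seen[] = $item;
};

// Simulate several short requests. The JSON round trip stands in for
// persisting the cursor between requests (e.g. in a file or a database row).
$cursor = array( 'index' => 0, 'done' => false );
$runs   = 0;
while ( ! $cursor['done'] ) {
	$cursor = json_decode( json_encode( process_batch( $items, $cursor, 2, $handle ) ), true );
	$runs++;
}
```

Keeping the cursor a plain, serializable array is the design choice that matters here: any request can pick it up without sharing in-memory state with the previous one.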

Entities

This is a generalized data importer, not a WXR importer. WXR is just one of the possible data sources. This design enables importing Markdown files, Blogger exports, Tumblr blogs etc. without having to rewrite that data as WXR.

The basic unit of data is an "entity" – a simple PHP array with post, tag, comment etc. data. Entities can be sourced from WXR and Markdown files – the relevant classes are described below.
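
To make "a simple PHP array" concrete, here is a sketch of what a few entities might look like, plus a helper that tallies them for progress reporting. The `type`/`data` keys and the field names are assumptions for illustration; the actual shape produced by the readers in this PR may differ.

```php
<?php
// Hypothetical entity shapes; the real keys used by the readers may differ.
$entities = array(
	array(
		'type' => 'post',
		'data' => array( 'post_title' => 'Hello', 'guid' => 'https://example.com/?p=1' ),
	),
	array(
		'type' => 'tag',
		'data' => array( 'slug' => 'news', 'name' => 'News' ),
	),
	array(
		'type' => 'comment',
		'data' => array( 'comment_content' => 'Nice post!', 'post_guid' => 'https://example.com/?p=1' ),
	),
);

/** Counts entities per type, e.g. to report "x out of y posts imported". */
function count_entities_by_type( iterable $entities ): array {
	$counts = array();
	foreach ( $entities as $entity ) {
		$counts[ $entity['type'] ] = ( $counts[ $entity['type'] ] ?? 0 ) + 1;
	}
	return $counts;
}

$counts = count_entities_by_type( $entities );
```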

Multiple passes

Every import will require multiple passes over the stream of entities to:

Perform topological sort to process the dependencies first
Frontload all static assets
Potentially retry failed downloads
Verify all the files have been downloaded before moving on to inserting posts
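
The dependency-ordering pass is not implemented in this PR. One possible approach for the post/parent case is sketched below: a breadth-first walk of the parent/child forest, which is a special case of topological sorting where each post has at most one parent. The `id` and `parent_id` keys and the function name are assumptions.

```php
<?php
/**
 * Orders post IDs so that every parent precedes its children.
 * IDs that cannot be reached from a top-level post (missing parent,
 * descendants of a missing parent, cycles) are reported as unresolved
 * so the caller can ask the user how to proceed.
 */
function sort_posts_parents_first( array $posts ): array {
	$by_id = array();
	foreach ( $posts as $post ) {
		$by_id[ $post['id'] ] = $post;
	}

	$queue    = array(); // IDs whose dependencies are already satisfied.
	$children = array(); // parent ID => list of child IDs.
	foreach ( $posts as $post ) {
		if ( null === $post['parent_id'] ) {
			$queue[] = $post['id'];
		} elseif ( isset( $by_id[ $post['parent_id'] ] ) ) {
			$children[ $post['parent_id'] ][] = $post['id'];
		}
	}

	$sorted = array();
	while ( $queue ) {
		$id       = array_shift( $queue );
		$sorted[] = $id;
		foreach ( $children[ $id ] ?? array() as $child_id ) {
			$queue[] = $child_id;
		}
	}

	$unresolved = array_values( array_diff( array_keys( $by_id ), $sorted ) );
	return array( 'sorted' => $sorted, 'unresolved' => $unresolved );
}

$order = sort_posts_parents_first(
	array(
		array( 'id' => 3, 'parent_id' => 2 ),
		array( 'id' => 2, 'parent_id' => 1 ),
		array( 'id' => 1, 'parent_id' => null ),
		array( 'id' => 985, 'parent_id' => 23 ), // Parent 23 is not in the stream.
	)
);
```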

User input

The proposed importer is not a single "start and forget" device. It could be configured as such, but by default it will require the user to review the process – sometimes multiple times. Here are a few examples of such touchpoints:

20 images failed to download. Do you want to provide alternative images? Or do you want to remove them from the site and remove any related <img> tags from the content? Because they are referenced in these posts: (list of posts)
Post number 984 already exists in the database. Do you want to overwrite it? Ignore it? Insert as a new one? Manually reconcile the conflict?
Post 985 has a parent_id 23, but there is no such parent. Do you want to set another parent? Or make it a top-level post? Or ignore it?

If a webhost would rather avoid asking the user all these questions, the future importer API may enable forcing each of these decisions.
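
No such API exists in this PR yet. As a rough sketch of what "forcing each of these decisions" could mean, the snippet below looks up a preset answer per failure type and falls back to asking the user. All names here (`resolve_failure()`, the failure types, the answers) are hypothetical.

```php
<?php
// Hypothetical preset answers keyed by failure type; none of these names exist in the PR.
$policy = array(
	'download_failed' => 'remove_references',
	'post_exists'     => 'skip',
	'missing_parent'  => 'make_top_level',
);

/**
 * Returns the preset decision for a failure type, or asks the user
 * through $ask_user when the host did not force a decision.
 */
function resolve_failure( string $failure_type, array $policy, ?callable $ask_user = null ): string {
	if ( isset( $policy[ $failure_type ] ) ) {
		return $policy[ $failure_type ];
	}
	if ( null !== $ask_user ) {
		return $ask_user( $failure_type );
	}
	throw new InvalidArgumentException( "No decision available for: $failure_type" );
}

$forced = resolve_failure( 'post_exists', $policy );
$asked  = resolve_failure(
	'unknown_author',
	$policy,
	function ( $failure_type ) {
		return 'ignore'; // Stand-in for a real prompt shown to the user.
	}
);
```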

WP_WXR_Reader

Streaming

The WXR reader supports the usual streaming interface with append_bytes(), is_paused_on_incomplete_input() et al.

It also comes with a new connect_upstream( $byte_source ) method that allows it to automatically pull new data chunks from a data source:

$wxr = new WP_WXR_Reader();
$wxr->connect_upstream(
	new WP_File_Reader(__DIR__ . '/tests/fixtures/wxr-simple.xml')
);
while($wxr->next_entity()) {
	$entity = $wxr->get_entity();
	// process
}

This way the consumer code never needs to worry about appending bytes, checking for EOF and such.
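
To illustrate the difference between pushing bytes with append_bytes() and pulling them through connect_upstream(), here is a toy reader that treats each newline-terminated line as one entity. It mirrors the method names mentioned above but is not WP_WXR_Reader; both `Toy_Line_Reader` and `Toy_Chunked_String_Source` (including its `next_chunk()` method) are made up for this sketch.

```php
<?php
/** Hypothetical byte source that hands out a string in fixed-size chunks. */
class Toy_Chunked_String_Source {
	private $bytes;
	private $chunk_size;
	private $offset = 0;

	public function __construct( string $bytes, int $chunk_size ) {
		$this->bytes      = $bytes;
		$this->chunk_size = $chunk_size;
	}

	/** Returns the next chunk, or null when the source is exhausted. */
	public function next_chunk(): ?string {
		if ( $this->offset >= strlen( $this->bytes ) ) {
			return null;
		}
		$chunk         = substr( $this->bytes, $this->offset, $this->chunk_size );
		$this->offset += strlen( $chunk );
		return $chunk;
	}
}

/** Toy reader that treats every newline-terminated line as one "entity". */
class Toy_Line_Reader {
	private $buffer = '';
	private $upstream;
	private $entity;
	private $paused = false;

	public function append_bytes( string $bytes ): void {
		$this->buffer .= $bytes;
	}

	public function connect_upstream( Toy_Chunked_String_Source $source ): void {
		$this->upstream = $source;
	}

	public function is_paused_on_incomplete_input(): bool {
		return $this->paused;
	}

	public function get_entity(): ?string {
		return $this->entity;
	}

	public function next_entity(): bool {
		while ( true ) {
			$newline_at = strpos( $this->buffer, "\n" );
			if ( false !== $newline_at ) {
				$this->entity = substr( $this->buffer, 0, $newline_at );
				$this->buffer = substr( $this->buffer, $newline_at + 1 );
				$this->paused = false;
				return true;
			}
			// Not enough bytes for a full entity: pull more if an upstream is connected.
			$chunk = $this->upstream ? $this->upstream->next_chunk() : null;
			if ( null === $chunk ) {
				$this->paused = true;
				return false;
			}
			$this->buffer .= $chunk;
		}
	}
}

$reader = new Toy_Line_Reader();
$reader->connect_upstream( new Toy_Chunked_String_Source( "post 1\npost 2\npost 3\n", 4 ) );
$entities = array();
while ( $reader->next_entity() ) {
	$entities[] = $reader->get_entity();
}
```

Without a connected upstream, the same `next_entity()` call returns false and reports a pause, and the caller resumes by calling `append_bytes()` with more data.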

This PR also ships a few byte sources. Shaping more than one helped me notice patterns and propose v1 of the interface:

WP_File_Reader – streams bytes from a local file
WP_GZ_File_Reader – streams bytes from a gzipped local file
WP_Remote_File_Reader – streams bytes over HTTPS
WP_Remote_File_Ranged_Reader – streams specific byte ranges over HTTPS
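
The interface of these byte sources is not spelled out here, so the following is only a guess at the general pattern: a reader that yields chunks and exposes its byte offset, which lets a later request reopen the file where the previous one stopped. `Toy_Resumable_File_Reader`, `next_chunk()`, and `tell()` are hypothetical names, not the WP_File_Reader API.

```php
<?php
/** Hypothetical resumable file reader; not the WP_File_Reader from this PR. */
class Toy_Resumable_File_Reader {
	private $handle;
	private $chunk_size;

	public function __construct( string $path, int $offset = 0, int $chunk_size = 8192 ) {
		$this->handle = fopen( $path, 'rb' );
		if ( false === $this->handle ) {
			throw new RuntimeException( "Cannot open $path" );
		}
		// Skip the bytes that an earlier run already consumed.
		fseek( $this->handle, $offset );
		$this->chunk_size = $chunk_size;
	}

	/** Returns the next chunk, or null at the end of the file. */
	public function next_chunk(): ?string {
		$chunk = fread( $this->handle, $this->chunk_size );
		return ( false === $chunk || '' === $chunk ) ? null : $chunk;
	}

	/** Returns the current byte offset, suitable for persisting between runs. */
	public function tell(): int {
		return ftell( $this->handle );
	}

	public function close(): void {
		fclose( $this->handle );
	}
}

$path = tempnam( sys_get_temp_dir(), 'toy' );
file_put_contents( $path, 'abcdefghij' );

// First "request": read one chunk, remember the offset, stop.
$first = new Toy_Resumable_File_Reader( $path, 0, 4 );
$head  = $first->next_chunk();
$saved = $first->tell();
$first->close();

// Second "request": resume from the saved offset and read the rest.
$second = new Toy_Resumable_File_Reader( $path, $saved, 4 );
$rest   = '';
while ( null !== ( $chunk = $second->next_chunk() ) ) {
	$rest .= $chunk;
}
$second->close();
unlink( $path );
```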

WP_Markdown_Directory_Tree_Reader

This class traverses a directory tree and transforms all the .md files into page entity objects that can be processed by WP_Entity_Importer:

$docs_root = __DIR__ . '/../../docs/site';
$docs_content_root = $docs_root . '/docs';
$entity_iterator_factory = function() use ($docs_content_root) {
    return new WP_Markdown_Directory_Tree_Reader(
        $docs_content_root,
        1000
    );
};
$markdown_importer = WP_Markdown_Importer::create(
    $entity_iterator_factory, [
        'source_site_url' => 'file://' . $docs_content_root,
        'local_markdown_assets_root' => $docs_root,
        'local_markdown_assets_url_prefix' => '@site/',
    ]
);
$markdown_importer->frontload_assets();
$markdown_importer->import_posts();
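
For readers unfamiliar with the traversal step, the sketch below collects every .md file under a root directory using PHP's standard SPL iterators. It only shows the directory walk; `find_markdown_files()` is a hypothetical helper and says nothing about how WP_Markdown_Directory_Tree_Reader orders or batches its results.

```php
<?php
/** Returns all .md files under $root, sorted, as paths relative to $root. */
function find_markdown_files( string $root ): array {
	$root     = rtrim( $root, '/' );
	$files    = array();
	$iterator = new RecursiveIteratorIterator(
		new RecursiveDirectoryIterator( $root, FilesystemIterator::SKIP_DOTS )
	);
	foreach ( $iterator as $file ) {
		if ( $file->isFile() && 'md' === strtolower( $file->getExtension() ) ) {
			$files[] = substr( $file->getPathname(), strlen( $root ) + 1 );
		}
	}
	sort( $files );
	return $files;
}

// Build a tiny docs tree in a temporary directory to try it out.
$root = sys_get_temp_dir() . '/toy-docs-' . uniqid();
mkdir( $root . '/guides', 0777, true );
file_put_contents( $root . '/index.md', "# Home\n" );
file_put_contents( $root . '/guides/setup.md', "# Setup\n" );
file_put_contents( $root . '/guides/logo.png', 'not markdown' );

$found = find_markdown_files( $root );
```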

WP_Markdown_To_Blocks

We don't just save raw Markdown data as post_content. Not at all!

This PR ships a WP_Markdown_To_Blocks class that:

Parses markdown data using the League\CommonMark library. It supports frontmatter and GitHub-flavored syntax such as tables, but it's also bulky and likely not PHP 7.2-compatible. For inclusion in WordPress core, we may need to roll out our own Markdown parser, or fork the League\CommonMark one and downgrade it to PHP 7.2.
Converts the document tree to block markup.
Sources the post title, order, slug etc. from frontmatter.
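
The three steps above can be pictured with a deliberately tiny stand-in. The function below handles only "key: value" frontmatter, "#" headings, and plain paragraphs, and it does not use League\CommonMark, so it is a sketch of the idea rather than of WP_Markdown_To_Blocks itself. The function name and the returned array keys are assumptions.

```php
<?php
/**
 * Toy converter: splits "key: value" frontmatter from the body and turns
 * "#" headings and plain paragraphs into block markup.
 */
function toy_markdown_to_blocks( string $markdown ): array {
	$frontmatter = array();
	if ( preg_match( '/\A---\n(.*?)\n---\n/s', $markdown, $m ) ) {
		foreach ( explode( "\n", $m[1] ) as $line ) {
			if ( false !== strpos( $line, ':' ) ) {
				list( $key, $value )         = explode( ':', $line, 2 );
				$frontmatter[ trim( $key ) ] = trim( $value );
			}
		}
		// Drop the frontmatter so only the document body is converted.
		$markdown = substr( $markdown, strlen( $m[0] ) );
	}

	$blocks = array();
	foreach ( preg_split( '/\n{2,}/', trim( $markdown ) ) as $chunk ) {
		if ( preg_match( '/\A(#{1,6}) +(.+)\z/', $chunk, $h ) ) {
			$level    = strlen( $h[1] );
			$blocks[] = sprintf(
				"<!-- wp:heading {\"level\":%d} -->\n<h%d class=\"wp-block-heading\">%s</h%d>\n<!-- /wp:heading -->",
				$level,
				$level,
				htmlspecialchars( $h[2] ),
				$level
			);
		} elseif ( '' !== $chunk ) {
			$blocks[] = sprintf(
				"<!-- wp:paragraph -->\n<p>%s</p>\n<!-- /wp:paragraph -->",
				htmlspecialchars( $chunk )
			);
		}
	}

	return array(
		'frontmatter'  => $frontmatter,
		'post_content' => implode( "\n\n", $blocks ),
	);
}

$result = toy_markdown_to_blocks(
	"---\ntitle: Contributing\nslug: contributing\n---\n# Welcome\n\nThanks for helping out.\n"
);
```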

Other stuff

This PR also:

Enhances the XML parser.
@php-wasm/compile – Adds more Asyncify functions to the PHP WASM Dockerfile
@wp-playground/cli – Buffers downloads to a .partial file, so a file whose download has failed is never mistaken for an already-cached one.
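
The .partial technique in the last item is language-agnostic. The actual change lives in the @wp-playground/cli package; the PHP sketch below only shows the general pattern (write to a temporary name, rename on success) and uses a hypothetical helper name.

```php
<?php
/**
 * Writes chunks to "$target.partial" and renames it to $target only after
 * every chunk was written. Hypothetical helper for illustration.
 */
function save_via_partial_file( iterable $chunks, string $target ): void {
	$partial = $target . '.partial';
	$handle  = fopen( $partial, 'wb' ); // 'wb' truncates any stale partial file.
	if ( false === $handle ) {
		throw new RuntimeException( "Cannot write $partial" );
	}
	try {
		foreach ( $chunks as $chunk ) {
			if ( false === fwrite( $handle, $chunk ) ) {
				throw new RuntimeException( "Write failed for $partial" );
			}
		}
	} finally {
		fclose( $handle );
	}
	// Reached only when no exception was thrown above.
	if ( ! rename( $partial, $target ) ) {
		throw new RuntimeException( "Cannot move $partial to $target" );
	}
}

$target = sys_get_temp_dir() . '/toy-download-' . uniqid() . '.zip';

// A download that breaks after the first chunk.
$failing_download = ( function () {
	yield 'first chunk';
	throw new RuntimeException( 'connection lost' );
} )();
try {
	save_via_partial_file( $failing_download, $target );
} catch ( RuntimeException $e ) {
	// The failed attempt left only the .partial file behind.
}
$exists_after_failure = file_exists( $target );

// A later, successful attempt starts over and produces the final file.
save_via_partial_file( array( 'first chunk', ' second chunk' ), $target );
$contents = file_get_contents( $target );
```

Because the final name only ever appears after a successful rename, a cache check on `$target` cannot see a half-written file.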

Follow-up work

... what else ... ?
Add event listeners / hooks for transforming frontmatter to post data. Importantly, the code should still work outside of WordPress.
Implement topological sort of entities before importing them
Go over @TODOs and implement them
Scrutinize the pause/resume workflow. Can we avoid exposing string indices? Can we easily feed downstream byte offset into upstream byte reader to later resume reading the file where the last WXR entity started?

Testing instructions

Confirm the CI tests pass. This code isn't actually used anywhere yet so there isn't a better way.

jasonbahl · 2024-11-04T18:58:58Z

packages/playground/data-liberation/docs-importer-test.php

+
+require_once __DIR__ . '/bootstrap.php';
+
+$reader = new WP_Serialized_Pages_Reader(__DIR__ . '/../../docs/site/docs');


Just want to make sure I'm following correctly.

Is the idea here that any project would implement their own instance of this WP_Serialized_Pages_Reader in order to customize how front-matter should be read from that specific project?

For example, in my WPGraphQL Docs, I have frontmatter like so:

title: Contributing
uri: `/docs/contributing`

So, following this example, I could map my specific front-matter in the .md files I'm importing to whatever field(s) I want them to map to when being imported as a WordPress post?

Or is this something that's always running when front-matter is detected in .md and there would be a different mechanism (add_filter, for example) to custom map front-matter keys/values to WordPress (wxr) keys/values?

adamziel · 2024-11-04

So, following this example, I could map my specific front-matter in the .md files I'm importing to whatever field(s) I want them to map to when being imported as a WordPress post?

Exactly! I think every project will have to handle its own frontmatter. I'm not aware of any unified schema for frontmatter metadata and I've seen a few different variations.

adamziel · 2024-11-04

Although I like add_filter() maybe even more – it makes extensibility easier

jasonbahl · 2024-11-04

Ya, I was thinking there could be some documented defaults.

i.e.

| Frontmatter | WXR |
| --- | --- |
| title | post_title |
| date | post_date |
| status | post_status |
| ... | ... |

Then if folks use those documented defaults in frontmatter already, things will "just work", but if they need to customize the mappings they could do so via a custom php snippet added in "steps" or a custom plugin loaded or whatever 🤔

Here's a (pseudo) example:

add_filter( 'wp_playground_map_front_matter_to_wxr', function( $wxr, $unfiltered_frontmatter ) {
	// Do some logic based on the frontmatter key and map it to WXR.
	if ( isset( $unfiltered_frontmatter['something'] ) ) {
		$wxr['wp:some_meta_key'] = $unfiltered_frontmatter['something'];
	}
	return $wxr;
}, 10, 2 ); // Priority 10, 2 accepted arguments, so the callback receives both values.

adamziel · 2024-11-04

I like that! I'd just keep the entire WXR pipeline completely separate and treat Markdown -> WordPress independently. In this scenario, we'd map the frontmatter keys directly to wp_insert_post keys. So there would be a WXR reader, Markdown reader using that filter, and a single unified Importer accepting inputs from these and other readers.
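
Putting the two suggestions together (documented defaults plus a filter, mapped directly to wp_insert_post keys), a WordPress-independent sketch could look like the following. The plain list of callbacks stands in for add_filter() so the snippet runs outside WordPress; `toy_map_frontmatter()`, the constant, and the `uri` rule are hypothetical.

```php
<?php
// Documented defaults from the discussion above, expressed as wp_insert_post() keys.
const TOY_DEFAULT_FRONTMATTER_MAP = array(
	'title'  => 'post_title',
	'date'   => 'post_date',
	'status' => 'post_status',
);

/**
 * Maps frontmatter to post data using the defaults, then lets optional
 * project-specific callbacks adjust the result (a stand-in for add_filter()).
 */
function toy_map_frontmatter( array $frontmatter, array $filters = array() ): array {
	$post = array();
	foreach ( TOY_DEFAULT_FRONTMATTER_MAP as $from => $to ) {
		if ( isset( $frontmatter[ $from ] ) ) {
			$post[ $to ] = $frontmatter[ $from ];
		}
	}
	foreach ( $filters as $filter ) {
		$post = $filter( $post, $frontmatter );
	}
	return $post;
}

// Project-specific rule: derive the slug from a "uri" frontmatter key.
$uri_to_slug = function ( array $post, array $frontmatter ): array {
	if ( isset( $frontmatter['uri'] ) ) {
		$post['post_name'] = basename( $frontmatter['uri'] );
	}
	return $post;
};

$post = toy_map_frontmatter(
	array( 'title' => 'Contributing', 'uri' => '/docs/contributing' ),
	array( $uri_to_slug )
);
```

Projects that already use the default keys would get a working import with no callbacks at all, and only custom keys such as `uri` need extra code.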

adamziel added 6 commits November 4, 2024 02:37

Add Markdown -> Blocks reader

4f1bb29

Write Playground CLI errors to stdout, don't reuse partial downloads

65728f3

Adjust WXR reader data shape

9588634

Add WP_Entity_Importer class

51e7900

Add test scripts

83f35f1

Lint

b5bdbb3

adamziel mentioned this pull request Nov 4, 2024

[Data Liberation] EBNF processor #1981

Open

adamziel added 2 commits November 4, 2024 17:35

Experiment: Source markdown files from a nested directory tree

141cdfa

Parse frontmatter, try importing all the doc pages into WordPress using the same importer as for WXR

c46d973

jasonbahl reviewed Nov 4, 2024

packages/playground/data-liberation/src/WP_Directory_Reader.php (Outdated)

jasonbahl reviewed Nov 4, 2024

adamziel mentioned this pull request Nov 5, 2024

Tracking Issue: Next-gen PHP Importers for Data Liberation #1894

Open

85 tasks

adamziel added 18 commits November 5, 2024 20:01

Decently working markdown importing

48ee5a0

Assign GUIDs

58150bb

Use local path as GUID

ddf2406

WP_Markdown_Directory_Tree_Reader to load markdown from a directory tree

f787f6a

Rename Markdown HTML API

f9ba8cb

Move specialized APIs to their own subfolders

cd80834

Move imported entity data to a dedicated WP_Imported_Entity class

b855aef

Explore imperative connection of byte stream to wxr reader

0f2f428

Rewrite URLs in the imported content

3878998

Prototype entity import with image downloading and URL rewriting

40f0028

Document how two passes are needed

37cdb76

Add a comment to do two passes on markdown

6dd046f

Fetch attachments when importing a markdown file

75427b3

Remove unused code

acce0b4

Further document the two-pass approach

560252c

Support multiple URLs in wp_rewrite_urls

5fb3073

Two pass Markdown import, move URL rewriting logic to WP_Block_Markup_Url_Processor

26652a9

Create attachments when importing markdown files

1c74f5a

adamziel added 20 commits November 15, 2024 01:42

Harmonize WXR import logic with Markdown import

a9205e1

Frontload assets during WXR import

8087f3e

Support importing parent IDs

d2c71d2

First stab at a common abstraction for streaming content importers

d01be29

Sort out relative vs absolute vs base URL nuances to get the common WXR/Markdown importer abstraction to work

e5b79cc

Remove addressed todos, expand docstrings

7ac2133

Rename "from_stream" and "from_strig" to "create_for_streaming" and "create_from_string"

82bd95e

Move Stream_Importer and Markdown_Importer to separate files

81c47f7

Simplify the implode() call

e6b713b

Adjust docstrings

d0cf271

Use empty list of kses attributes in tests

a376f05

Update docstrings

94ea43d

Update docstrings

97da630

Adjust type in docs

e55106e

Explore re-entrancy – implement more byte readers to see what patterns emerge

5238736

Replace StreamChain with much simpler connect_upstream() method.

bf4b53d

WXRReader is now a real pull reader – it automatically pulls data from the upstream byte reader, whether it's a local file, gzipped file, or a remote HTTP resource.

Restore functional URL rewriting

9c01981

Add inline comment to resume()

87b5d28

Lint

b977619

Exclude WXR_Importer.php from phpcs

ac62590

adamziel marked this pull request as ready for review November 18, 2024 14:25

adamziel changed the title ~~[Data Liberation] WXR importer, Markdown reader~~ [Data Liberation] WP_Stream_Importer with support for WXR and Markdown files Nov 18, 2024

Use the correct case in require statement in bootstrap.php

9d68196

adamziel merged commit 9aeb038 into trunk Nov 18, 2024
9 of 10 checks passed

adamziel deleted the wxr-importer branch November 18, 2024 16:07
