[Data Liberation] WP_Stream_Importer with support for WXR and Markdown files #1982

Merged: 47 commits into trunk from wxr-importer on Nov 18, 2024

@adamziel adamziel commented Nov 4, 2024

Motivation for the change, related issues

Adds WP_Stream_Importer – a generalized importer for arbitrary data. It comes with two data sources:

  • WP_WXR_Reader that streams entities from a WXR file
  • WP_Markdown_Directory_Tree_Reader that turns a markdown directory into page entities

WP_Stream_Importer

This is a draft of a re-entrant stream importer designed for importing very large datasets with minimal overhead. The core ideas are:

  • Never insert a database record until all its dependencies are available.
  • (Almost) never post-process database data. For example, replace all the URLs upfront.
  • Never crash. Instead, tell the user what failures happened and ask them how to proceed (e.g. upload a custom image).
  • Whenever the work is stopped, start the next run at that exact point.
  • Avoid per-record database lookups, e.g. don't run SELECT * FROM wp_posts WHERE guid = :guid
  • Clearly communicate progress (x out of y posts imported, x out of y images downloaded, 380MB of huge_file.zip downloaded).
  • Assume the import will take multiple requests and make everything re-entrant.
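
To make the re-entrancy idea concrete, here is a minimal sketch of a batch loop that resumes from a stored cursor. Everything in it is hypothetical: import_batch(), the JSON-encoded state, and the in-memory array of entities are assumptions for illustration only. The importer in this PR streams entities from a reader rather than indexing into an array.

```php
<?php
// Hypothetical sketch; none of these names come from the PR.

/**
 * Processes up to $budget entities starting at $cursor and returns the
 * cursor to resume from on the next run.
 */
function import_batch( array $entities, int $cursor, int $budget, callable $process ): int {
	$end = min( count( $entities ), $cursor + $budget );
	for ( $i = $cursor; $i < $end; $i++ ) {
		$process( $entities[ $i ] );
	}
	return $end;
}

$entities = array( 'post-1', 'post-2', 'post-3', 'post-4', 'post-5' );
$imported = array();
$process  = function ( $entity ) use ( &$imported ) {
	$imported[] = $entity;
};

// Each loop iteration stands in for a separate request. The cursor is
// round-tripped through JSON, as it would be if persisted between requests.
$state = json_encode( array( 'cursor' => 0 ) );
do {
	$cursor = json_decode( $state, true )['cursor'];
	$cursor = import_batch( $entities, $cursor, 2, $process );
	$state  = json_encode( array( 'cursor' => $cursor ) );
	printf( "%d out of %d entities imported\n", $cursor, count( $entities ) );
} while ( $cursor < count( $entities ) );
```

Because the only state carried between runs is the cursor, stopping after any batch and starting again later picks up at the same entity.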

Entities

This is a generalized data importer, not a WXR importer. WXR is just one of the possible data sources. This design enables importing markdown files, Blogger exports, Tumblr blogs etc. without having to rewrite that data as WXR.

The basic unit of data is an "entity" – a simple PHP array with post, tag, comment etc. data. Entities can be sourced from WXR and Markdown files – the relevant classes are described below.

Multiple passes

Every import will require multiple passes over the stream of entities to:

  • Perform topological sort to process the dependencies first
  • Frontload all static assets
  • Potentially retry failed downloads
  • Verify all the files have been downloaded before moving on to inserting posts
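
As an illustration of the topological sort pass, the sketch below orders entities so that every parent comes before its children. It is not the PR's implementation (the sort is still listed as follow-up work). The function name, the 'id' and 'parent' array keys, and the choice to treat entities with a missing parent as roots are all assumptions made for this example.

```php
<?php
// Hypothetical sketch; the entity shape ('id', 'parent') is an assumption.

/**
 * Orders entities so that every parent precedes its children.
 * Entities whose parent is not in the batch are treated as roots.
 */
function sort_parents_first( array $entities ): array {
	$by_id = array();
	foreach ( $entities as $entity ) {
		$by_id[ $entity['id'] ] = $entity;
	}

	// Build the parent => children index and seed the queue with roots.
	$children = array();
	$queue    = array();
	foreach ( $entities as $entity ) {
		$parent = $entity['parent'] ?? 0;
		if ( $parent && isset( $by_id[ $parent ] ) ) {
			$children[ $parent ][] = $entity['id'];
		} else {
			$queue[] = $entity['id'];
		}
	}

	// Breadth-first walk: an entity is emitted only after its parent.
	$sorted = array();
	while ( $queue ) {
		$id       = array_shift( $queue );
		$sorted[] = $by_id[ $id ];
		foreach ( $children[ $id ] ?? array() as $child_id ) {
			$queue[] = $child_id;
		}
	}
	return $sorted;
}

$posts = array(
	array( 'id' => 3, 'parent' => 2 ),
	array( 'id' => 2, 'parent' => 1 ),
	array( 'id' => 1, 'parent' => 0 ),
	array( 'id' => 4, 'parent' => 99 ), // Parent 99 is not in the batch.
);
$sorted_ids = array_column( sort_parents_first( $posts ), 'id' );
```

Entities caught in a parent cycle never reach the queue, so a result shorter than the input signals a cycle that would need separate handling.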

User input

The proposed importer is not a single "start and forget" device. It could be configured as such, but by default it will require the user to review the process – sometimes multiple times. Here are a few examples of such touchpoints:

  • 20 images failed to download. Do you want to provide alternative images? Or do you want to remove them from the site and remove any related <img> tags from the content? They are referenced in these posts: (list of posts)
  • Post number 984 already exists in the database. Do you want to overwrite it? Ignore it? Insert as a new one? Manually reconcile the conflict?
  • Post 985 has a parent_id 23, but there is no such parent. Do you want to set another parent? Or make it a top-level post? Or ignore it?

If a webhost would rather avoid asking the user all these questions, the future importer API may enable forcing each of these decisions.

WP_WXR_Reader

Streaming

The WXR reader supports the usual streaming interface with append_bytes(), is_paused_on_incomplete_input() et al.

It also comes with a new connect_upstream( $byte_source ) method that allows it to automatically pull new data chunks from a data source:

$wxr = new WP_WXR_Reader();
$wxr->connect_upstream(
	new WP_File_Reader(__DIR__ . '/tests/fixtures/wxr-simple.xml')
);
while($wxr->next_entity()) {
	$entity = $wxr->get_entity();
	// process
}

This way the consumer code never needs to worry about appending bytes, checking for EOF and such.

This PR also ships a few byte sources. Shaping more than one helped me notice patterns and propose v1 of the interface:

  • WP_File_Reader – streams bytes from a local file
  • WP_GZ_File_Reader – streams bytes from a gzipped local file
  • WP_Remote_File_Reader – streams bytes over HTTPS
  • WP_Remote_File_Ranged_Reader – streams specific byte ranges over HTTPS

WP_Markdown_Directory_Tree_Reader

This class traverses a directory tree and transforms all the .md files into page entity objects that can be processed by WP_Entity_Importer:

$docs_root = __DIR__ . '/../../docs/site';
$docs_content_root = $docs_root . '/docs';
$entity_iterator_factory = function() use ($docs_content_root) {
    return new WP_Markdown_Directory_Tree_Reader(
        $docs_content_root,
        1000
    );
};
$markdown_importer = WP_Markdown_Importer::create(
    $entity_iterator_factory, [
        'source_site_url' => 'file://' . $docs_content_root,
        'local_markdown_assets_root' => $docs_root,
        'local_markdown_assets_url_prefix' => '@site/',
    ]
);
$markdown_importer->frontload_assets();
$markdown_importer->import_posts();

WP_Markdown_To_Blocks

We don't just save raw Markdown data as post_content. Not at all!

This PR ships a WP_Markdown_To_Blocks class that:

  • Parses markdown data using the League\CommonMark library. It supports frontmatter and GitHub-flavored syntax such as tables, but it's also bulky and likely not PHP 7.2-compatible. For inclusion in WordPress core, we may need to roll out our own Markdown parser, or fork the League\CommonMark one and downgrade it to PHP 7.2.
  • Converts the document tree to block markup.
  • Sources the post title, order, slug etc. from frontmatter.

Other stuff

This PR also:

  • Enhances the XML parser.
  • @php-wasm/compile – Adds more Asyncify functions to the PHP WASM Dockerfile
  • @wp-playground/cli – buffers the downloads to a .partial file to avoid assuming the file is already cached in case the download has failed.
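
The .partial buffering idea can be sketched as follows. This is not the @wp-playground/cli code (that package is not written in PHP); it is a hypothetical PHP illustration of the same pattern, where fetch_to_file() and its callback are made-up names. The body is written to a temporary .partial path and only renamed to the final path once the write succeeds, so an existing target file always means a complete download.

```php
<?php
// Hypothetical sketch of the ".partial" pattern; not the CLI's actual code.

/**
 * Writes a download to "$target.partial" and renames it to $target only
 * when $write_body reports success.
 */
function fetch_to_file( string $target, callable $write_body ): bool {
	$partial = $target . '.partial';
	$handle  = fopen( $partial, 'wb' );
	if ( false === $handle ) {
		return false;
	}
	try {
		$ok = (bool) $write_body( $handle );
	} catch ( Throwable $e ) {
		$ok = false;
	}
	fclose( $handle );
	if ( ! $ok ) {
		unlink( $partial ); // Discard the incomplete file.
		return false;
	}
	return rename( $partial, $target );
}

$target = sys_get_temp_dir() . '/sketch-download-' . uniqid() . '.zip';

// A failed download leaves no file at the final path.
$failed_result = fetch_to_file( $target, function ( $handle ) {
	fwrite( $handle, 'half' );
	return false;
} );
$exists_after_failure = file_exists( $target );

// A successful download is moved into place.
$success_result = fetch_to_file( $target, function ( $handle ) {
	fwrite( $handle, 'complete' );
	return true;
} );
$contents = file_get_contents( $target );
unlink( $target );
```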

Follow-up work

  • ... what else ... ?
  • Add event listeners / hooks for transforming frontmatter to post data. Importantly, the code should still work outside of WordPress.
  • Implement topological sort of entities before importing them
  • Go over @TODOs and implement them
  • Scrutinize the pause/resume workflow. Can we avoid exposing string indices? Can we easily feed downstream byte offset into upstream byte reader to later resume reading the file where the last WXR entity started?

Testing instructions

Confirm the CI tests pass. This code isn't actually used anywhere yet so there isn't a better way.


require_once __DIR__ . '/bootstrap.php';

$reader = new WP_Serialized_Pages_Reader(__DIR__ . '/../../docs/site/docs');

Just want to make sure I'm following correctly.

Is the idea here that any project would implement their own instance of this WP_Serialized_Pages_Reader in order to customize how front-matter should be read from that specific project?

For example, in my WPGraphQL Docs, I have frontmatter like so:

title: Contributing
uri: `/docs/contributing`

So, following this example, I could map my specific front-matter in the .md files I'm importing to whatever field(s) I want them to map to when being imported as a WordPress post?

Or is this something that's always running when front-matter is detected in .md and there would be a different mechanism (add_filter, for example) to custom map front-matter keys/values to WordPress (wxr) keys/values?

Collaborator Author

> So, following this example, I could map my specific front-matter in the .md files I'm importing to whatever field(s) I want them to map to when being imported as a WordPress post?

Exactly! I think every project will have to handle its own frontmatter. I'm not aware of any unified schema for frontmatter metadata and I've seen a few different variations.

Collaborator Author

Although I like add_filter() maybe even more – it makes extensibility easier

@jasonbahl jasonbahl Nov 4, 2024

Ya, I was thinking there could be some documented defaults.

i.e.

Frontmatter    WXR
title          post_title
date           post_date
status         post_status
...            ...

Then if folks use those documented defaults in frontmatter already, things will "just work", but if they need to customize the mappings they could do so via a custom php snippet added in "steps" or a custom plugin loaded or whatever 🤔

Here's a (pseudo) example:

add_filter( 'wp_playground_map_front_matter_to_wxr', function( $wxr, $unfiltered_frontmatter ) {
  // Map a project-specific frontmatter key to a WXR meta key.
  if ( isset( $unfiltered_frontmatter['something'] ) ) {
    $wxr['wp:some_meta_key'] = $unfiltered_frontmatter['something'];
  }
  return $wxr;
}, 10, 2 ); // Priority 10; the callback accepts 2 arguments.

Collaborator Author

@adamziel adamziel Nov 4, 2024

I like that! I'd just keep the entire WXR pipeline completely separate and treat Markdown -> WordPress independently. In this scenario, we'd map the frontmatter keys directly to wp_insert_post keys. So there would be a WXR reader, Markdown reader using that filter, and a single unified Importer accepting inputs from these and other readers.
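
A rough sketch of what mapping frontmatter keys directly to wp_insert_post keys could look like, with documented defaults plus per-project overrides. The function name map_frontmatter_to_post() and the override mechanism are hypothetical and nothing here is part of this PR. The target keys (post_title, post_date, post_status, post_name) are real wp_insert_post arguments, and the code runs outside of WordPress.

```php
<?php
// Hypothetical sketch; the function name and override mechanism are assumptions.

/**
 * Maps frontmatter keys to wp_insert_post() keys. Project-specific
 * $overrides take precedence over the defaults; unmapped keys are dropped.
 */
function map_frontmatter_to_post( array $frontmatter, array $overrides = array() ): array {
	$key_map = $overrides + array(
		'title'  => 'post_title',
		'date'   => 'post_date',
		'status' => 'post_status',
	);

	$post = array();
	foreach ( $frontmatter as $key => $value ) {
		if ( isset( $key_map[ $key ] ) ) {
			$post[ $key_map[ $key ] ] = $value;
		}
	}
	return $post;
}

$post = map_frontmatter_to_post(
	array(
		'title'   => 'Contributing',
		'slug'    => 'contributing',
		'sidebar' => 'docs', // No mapping, so this key is dropped.
	),
	array( 'slug' => 'post_name' )
);
```

Inside WordPress, the same array of overrides could be supplied through a filter, while projects that already use the default keys would need no configuration.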

@adamziel adamziel marked this pull request as ready for review November 18, 2024 14:25
@adamziel adamziel changed the title [Data Liberation] WXR importer, Markdown reader [Data Liberation] WP_Stream_Importer with support for WXR and Markdown files Nov 18, 2024
@adamziel adamziel merged commit 9aeb038 into trunk Nov 18, 2024
9 of 10 checks passed
@adamziel adamziel deleted the wxr-importer branch November 18, 2024 16:07