[Data Liberation] WP_Stream_Importer with support for WXR and Markdown files (#1982)

Adds `WP_Stream_Importer` – a generalized importer for arbitrary data.
It comes with two data sources:

* `WP_WXR_Reader` that streams entities from a WXR file
* `WP_Markdown_Directory_Tree_Reader` that turns a markdown directory
into `page` entities

### WP_Stream_Importer

This is a draft of a re-entrant stream importer designed for importing
very large datasets with minimal overhead. The few core ideas are:

* Never insert a database record until all its dependencies are
available.
* (Almost) never post-process database data. For example, rewrite all
the URLs upfront, before the records are inserted.
* Never crash. Instead, tell the user what failures happened and ask
them how to proceed (e.g. upload a custom image).
* Whenever the work is stopped, start the next run at that exact point.
* Avoid per-record database lookups, e.g. don't run `SELECT * FROM
wp_posts WHERE guid = :guid`
* Clearly communicate progress (x out of y posts imported, x out of y
images downloaded, 380MB of `huge_file.zip` downloaded).
* Assume the import will take multiple requests and make everything
re-entrant.
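
The re-entrancy idea above can be illustrated with a minimal sketch. It is not the `WP_Stream_Importer` code: `run_import_step()`, the cursor array and its keys are hypothetical names, and the loop only simulates separate requests. The point is that each step returns a serializable cursor, so the next run starts exactly where the previous one stopped.

```php
<?php
/**
 * Hypothetical sketch: process a bounded slice of work, then return a
 * cursor that the next request can pass back in to resume.
 */
function run_import_step( array $entities, array $cursor, int $max_items ): array {
	$index     = $cursor['next_index'] ?? 0;
	$processed = 0;

	while ( $index < count( $entities ) && $processed < $max_items ) {
		// A real importer would insert the entity into the database here.
		$cursor['imported'][] = $entities[ $index ]['id'];
		++$index;
		++$processed;
	}

	$cursor['next_index'] = $index;
	$cursor['finished']   = $index >= count( $entities );
	return $cursor;
}

$entities = array(
	array( 'id' => 1 ),
	array( 'id' => 2 ),
	array( 'id' => 3 ),
);

// Each loop iteration stands in for a separate HTTP request.
$cursor = array();
do {
	$cursor = run_import_step( $entities, $cursor, 2 );
	// Between requests the cursor would be stored as a string, e.g. JSON.
	$cursor = json_decode( json_encode( $cursor ), true );
	printf( "%d out of %d entities imported\n", $cursor['next_index'], count( $entities ) );
} while ( ! $cursor['finished'] );
```

With a budget of two items per step, the three entities above take two simulated requests, and the progress line doubles as the "x out of y" report.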

#### Entities

This is a generalized data importer, not a WXR importer. WXR is just one
of the possible data sources. This design enables importing Markdown
files, Blogger exports, Tumblr blogs etc. without having to rewrite that
data as WXR.

The basic unit of data is an "entity" – a simple PHP array with post,
tag, comment etc. data. Entities can be sourced from WXR and Markdown
files – the relevant classes are described below.
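
To make the idea concrete, here is a rough sketch of what such entities could look like. The `type` and `data` keys and all field names are assumptions made for illustration, not the shape this PR necessarily uses.

```php
<?php
// Hypothetical entity shapes; the keys and field names are illustrative only.
$entities = array(
	array(
		'type' => 'post',
		'data' => array( 'post_title' => 'Hello', 'post_content' => '<p>Hi</p>' ),
	),
	array(
		'type' => 'tag',
		'data' => array( 'name' => 'news' ),
	),
	array(
		'type' => 'comment',
		'data' => array( 'comment_content' => 'Nice post' ),
	),
);

// A consumer can dispatch on the entity type without knowing whether
// the entity came from WXR, Markdown, or another source.
$counts = array();
foreach ( $entities as $entity ) {
	$counts[ $entity['type'] ] = ( $counts[ $entity['type'] ] ?? 0 ) + 1;
}
print_r( $counts );
```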

#### Multiple passes

Every import will require multiple passes over the stream of entities
to:

* Perform topological sort to process the dependencies first
* Frontload all static assets
* Potentially retry failed downloads
* Verify all the files have been downloaded before moving on to
inserting posts
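
As an illustration of the first pass, the sketch below orders entities so that parents come before their children. It is a naive repeated-pass variant of topological ordering, not the algorithm this PR implements (that is still listed as follow-up work), and the `id` / `parent_id` keys and the function name are hypothetical.

```php
<?php
/**
 * Hypothetical sketch: emit an entity only once its parent has been
 * emitted. Entities whose parent never shows up are reported separately.
 */
function example_sort_parents_first( array $entities ): array {
	$pending = array();
	foreach ( $entities as $entity ) {
		$pending[ $entity['id'] ] = $entity;
	}

	$sorted = array();
	do {
		$progress = false;
		// foreach iterates over a copy, so unsetting from $pending is safe.
		foreach ( $pending as $id => $entity ) {
			$parent_id = $entity['parent_id'];
			if ( null === $parent_id || isset( $sorted[ $parent_id ] ) ) {
				$sorted[ $id ] = $entity;
				unset( $pending[ $id ] );
				$progress = true;
			}
		}
	} while ( $progress && $pending );

	return array(
		'sorted'     => array_values( $sorted ),
		'unresolved' => array_values( $pending ),
	);
}

$result = example_sort_parents_first(
	array(
		array( 'id' => 3, 'parent_id' => 2 ),
		array( 'id' => 2, 'parent_id' => 1 ),
		array( 'id' => 1, 'parent_id' => null ),
		array( 'id' => 985, 'parent_id' => 23 ), // No entity with id 23 exists.
	)
);
print_r( array_column( $result['sorted'], 'id' ) );
print_r( array_column( $result['unresolved'], 'id' ) );
```

The unresolved list is where a missing-parent question for the user would come from. A production version would likely need a linear-time algorithm and a way to avoid holding every entity in memory.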

#### User input

The proposed importer is not a single "start and forget" device. It
could be configured as such, but by default it will require the user to
review the process – sometimes multiple times. Here are a few examples
of such touchpoints:

* 20 images failed to download. They are referenced in these posts:
(list of posts). Do you want to provide alternative images? Or do you
want to remove them from the site and remove any related `<img>` tags
from the content?
* Post number 984 already exists in the database. Do you want to
overwrite it? Ignore it? Insert as a new one? Manually reconcile the
conflict?
* Post 985 has a `parent_id` 23, but there is no such parent. Do you
want to set another parent? Or make it a top-level post? Or ignore it?

If a webhost would rather avoid asking the user all these questions, the
future importer API may enable forcing each of these decisions.
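
One way such forced decisions could look is sketched below. No such API exists in this PR; the function, the issue names and the decision names are all made up for illustration.

```php
<?php
/**
 * Hypothetical sketch: a host supplies fixed answers up front so the
 * importer does not have to pause and ask the user.
 */
function example_resolve_issue( string $issue, array $forced_decisions ): string {
	// Without a forced decision, the import would pause for user input.
	return $forced_decisions[ $issue ] ?? 'ask_user';
}

$forced_decisions = array(
	'image_download_failed' => 'remove_img_tags',
	'post_already_exists'   => 'ignore',
);

echo example_resolve_issue( 'post_already_exists', $forced_decisions ), "\n"; // prints "ignore"
echo example_resolve_issue( 'missing_parent', $forced_decisions ), "\n";      // prints "ask_user"
```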

### WP_WXR_Reader

#### Streaming

The WXR reader supports the usual streaming interface with
`append_bytes()`, `is_paused_on_incomplete_input()` et al.

It also comes with a new `connect_upstream( $byte_source )` method that
allows it to automatically pull new data chunks from a data source:

```php
$wxr = new WP_WXR_Reader();
$wxr->connect_upstream(
	new WP_File_Reader(__DIR__ . '/tests/fixtures/wxr-simple.xml')
);
while($wxr->next_entity()) {
	$entity = $wxr->get_entity();
	// process
}
```

This way the consumer code never needs to worry about appending bytes,
checking for EOF and such.

This PR also ships a few byte sources. Shaping more than one helped me
notice patterns and propose v1 of the interface:

* `WP_File_Reader` – streams bytes from a local file
* `WP_GZ_File_Reader` – streams bytes from a gzipped local file
* `WP_Remote_File_Reader` – streams bytes over HTTPS
* `WP_Remote_File_Ranged_Reader` – streams specific byte ranges over
HTTPS
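
The pattern these byte sources share can be sketched as a small pull-based interface. The interface below is an assumption for illustration and may differ from the one shipped in this PR; `Example_Byte_Source` and `Example_String_Byte_Source` are hypothetical names.

```php
<?php
// Hypothetical interface; not necessarily the one proposed in this PR.
interface Example_Byte_Source {
	// Advances to the next chunk. Returns false once the source is exhausted.
	public function next_bytes(): bool;

	// Returns the chunk loaded by the last successful next_bytes() call.
	public function get_bytes(): string;
}

// An in-memory source that hands out a string in fixed-size chunks.
class Example_String_Byte_Source implements Example_Byte_Source {
	private $data;
	private $chunk_size;
	private $offset = 0;
	private $chunk  = '';

	public function __construct( string $data, int $chunk_size ) {
		$this->data       = $data;
		$this->chunk_size = $chunk_size;
	}

	public function next_bytes(): bool {
		if ( $this->offset >= strlen( $this->data ) ) {
			$this->chunk = '';
			return false;
		}
		$this->chunk   = substr( $this->data, $this->offset, $this->chunk_size );
		$this->offset += strlen( $this->chunk );
		return true;
	}

	public function get_bytes(): string {
		return $this->chunk;
	}
}

// The consumer pulls chunks without caring where the bytes come from.
$source   = new Example_String_Byte_Source( '<rss><channel></channel></rss>', 8 );
$received = '';
while ( $source->next_bytes() ) {
	$received .= $source->get_bytes();
}
echo $received, "\n";
```

A file, gzip, or HTTPS source would implement the same two methods, which is what lets a reader pull data from any of them interchangeably.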

### WP_Markdown_Directory_Tree_Reader

This class traverses a directory tree and transforms all the `.md` files
into `page` entity objects that can be processed by
`WP_Entity_Importer`:

```php
$docs_root = __DIR__ . '/../../docs/site';
$docs_content_root = $docs_root . '/docs';
$entity_iterator_factory = function() use ($docs_content_root) {
    return new WP_Markdown_Directory_Tree_Reader(
        $docs_content_root,
        1000
    );
};
$markdown_importer = WP_Markdown_Importer::create(
    $entity_iterator_factory, [
        'source_site_url' => 'file://' . $docs_content_root,
        'local_markdown_assets_root' => $docs_root,
        'local_markdown_assets_url_prefix' => '@site/',
    ]
);
$markdown_importer->frontload_assets();
$markdown_importer->import_posts();
```

#### WP_Markdown_To_Blocks

We don't just save raw Markdown data as `post_content`. Not at all!

This PR ships a `WP_Markdown_To_Blocks` class that:

* Parses markdown data using the `League\CommonMark` library. It
supports frontmatter and GitHub-flavored syntax such as tables, but it's
also bulky and likely not PHP 7.2-compatible. For inclusion in WordPress
core, we may need to roll out our own Markdown parser, or fork the
`League\CommonMark` one and downgrade it to PHP 7.2.
* Converts the document tree to block markup.
* Sources the post title, order, slug etc. from frontmatter.
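
To show what "block markup" means here, the following sketch converts a tiny subset of Markdown by hand. It does not use `League\CommonMark`, ignores frontmatter, and handles only `#` headings and plain paragraphs; the function name is hypothetical and this is not how `WP_Markdown_To_Blocks` is implemented.

```php
<?php
/**
 * Hypothetical sketch: turn blank-line-separated Markdown chunks into
 * heading and paragraph blocks. Inline syntax is left untouched.
 */
function example_markdown_to_blocks( string $markdown ): string {
	$blocks = array();
	foreach ( preg_split( '/\R{2,}/', trim( $markdown ) ) as $chunk ) {
		$chunk = trim( $chunk );
		if ( '' === $chunk ) {
			continue;
		}
		if ( preg_match( '/^(#{1,6})[ \t]+(.+)$/', $chunk, $matches ) ) {
			$level    = strlen( $matches[1] );
			$blocks[] = sprintf(
				"<!-- wp:heading {\"level\":%d} -->\n<h%d class=\"wp-block-heading\">%s</h%d>\n<!-- /wp:heading -->",
				$level,
				$level,
				htmlspecialchars( $matches[2] ),
				$level
			);
		} else {
			// Collapse soft line breaks inside a paragraph into spaces.
			$text     = htmlspecialchars( preg_replace( '/\s+/', ' ', $chunk ) );
			$blocks[] = "<!-- wp:paragraph -->\n<p>" . $text . "</p>\n<!-- /wp:paragraph -->";
		}
	}
	return implode( "\n\n", $blocks );
}

$markup = example_markdown_to_blocks( "# Title\n\nHello\nworld" );
echo $markup, "\n";
```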

## Other stuff

This PR also:

* Enhances the XML parser.
* `@php-wasm/compile` – Adds more Asyncify functions to the PHP WASM
Dockerfile
* `@wp-playground/cli` – buffers downloads to a `.partial` file so that
a failed download is never mistaken for an already cached file.
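
The `.partial` pattern is sketched below in PHP purely for illustration. The actual change lives in the `@wp-playground/cli` package and is not written in PHP, and `example_save_download()` is a hypothetical name.

```php
<?php
/**
 * Hypothetical sketch: write to "<target>.partial" and rename it only
 * after every chunk was written, so an interrupted download never
 * leaves behind a file that looks like a complete cache entry.
 */
function example_save_download( iterable $chunks, string $target_path ): void {
	$partial_path = $target_path . '.partial';
	$handle       = fopen( $partial_path, 'wb' );
	if ( false === $handle ) {
		throw new RuntimeException( "Cannot open $partial_path for writing" );
	}

	try {
		foreach ( $chunks as $chunk ) {
			if ( false === fwrite( $handle, $chunk ) ) {
				throw new RuntimeException( "Cannot write to $partial_path" );
			}
		}
	} finally {
		fclose( $handle );
	}

	// Reached only when all chunks were written without an exception.
	if ( ! rename( $partial_path, $target_path ) ) {
		throw new RuntimeException( "Cannot move $partial_path to $target_path" );
	}
}

$target = sys_get_temp_dir() . '/example-download-' . uniqid() . '.zip';
example_save_download( array( 'abc', 'def' ), $target );
echo file_get_contents( $target ), "\n";
```

A cache check then only needs to test for the final path, because that path appears only after a complete write.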

## Follow-up work

- [ ] ... what else ... ?
- [ ] [Add event listeners / hooks for transforming frontmatter to post
data](#1982 (comment)).
Importantly, the code should still work outside of WordPress.
- [ ] Implement topological sort of entities before importing them
- [ ] Go over `@TODO`s and implement them
- [ ] Scrutinize the pause/resume workflow. Can we avoid exposing string
indices? Can we easily feed downstream byte offset into upstream byte
reader to later resume reading the file where the last WXR entity
started?

## Testing instructions

Confirm the CI tests pass. This code isn't actually used anywhere yet so
there isn't a better way.
adamziel authored Nov 18, 2024
1 parent 857b091 commit 9aeb038
Showing 55 changed files with 13,985 additions and 8,848 deletions.
3 changes: 3 additions & 0 deletions packages/php-wasm/compile/php/Dockerfile
@@ -657,6 +657,7 @@ export ASYNCIFY_ONLY=$'"rc_dtor_func",\
"_call_user_function_impl",\
"_mysqlnd_run_command",\
"_php_stream_copy_to_mem",\
"_php_stream_copy_to_stream_ex",\
"_php_stream_eof",\
"_php_stream_fill_read_buffer",\
"_php_stream_free",\
@@ -666,6 +667,7 @@ export ASYNCIFY_ONLY=$'"rc_dtor_func",\
"_php_stream_set_option",\
"_php_stream_write",\
"_php_stream_xport_create",\
"zif_stream_socket_enable_crypto",\
"do_cli",\
"do_cli_server",\
"execute_ex",\
@@ -905,6 +907,7 @@ export ASYNCIFY_ONLY=$'"rc_dtor_func",\
"php_zend_stream_closer",\
"zend_fetch_class_by_name",\
"zend_file_handle_dtor",\
"zend_free_extra_named_params",\
"zend_include_or_eval",\
"zend_llist_del_element",\
"zend_lookup_class_ex",\
Binary file modified packages/php-wasm/node/asyncify/8_3_0/php_8_3.wasm
Binary file not shown.