Tracking Issue: Next-gen PHP Importers for Data Liberation #1894

adamziel · 2024-10-14T17:57:35Z

Next Gen importers

This issue tracks the work related to Data Liberation Phase 2: Importing and Exporting Structured Data, that is:

Parsers
Importers
User and developer tools.

WordPress needs parsers. Not just any parsers, but parsers that are streaming, re-entrant, fast, standards-compliant, and tested against a large body of possible inputs. A seemingly simple task such as moving a post to another website requires rewriting the URLs in that post, downloading the assets, and handling network failures. More complex tasks, such as importing a WXR file or transferring an entire site, are even more demanding.

WordPress also needs importers. Not just any importers, but importers that can handle large quantities of data from a multitude of data formats, are extensible, and can proceed even when they encounter an error in the middle of the process. The WP_Stream_Importer class explored in this project is designed to fulfill these goals – see specific PRs below.

Finally, WordPress needs user and developer tools to use these importers. Not just any tools, but tools that work on the web, in CLI, in the Playground, guide the user with useful progress updates, and provide useful recovery paths when the inevitable errors occur. The work tracked here focuses on a wp-admin page, but the PHP software components are designed for easy reuse outside of wp-admin.

Tracking – ongoing Issues and PRs

Parsing

Exporting

Create v1 all-encompassing WordPress export for Assembler results #2055

Importing

Data formats

Reliability

[Data Liberation] Anomaly Testing #2019

UI

Beautiful design for the admin page
[Data Liberation] "Fetch from a different URL" button for failed media downloads, Interactivity API support #2040

Other

Move wp_kses_uri_attributes filter to import start/end #2047
Extension points for plugin-provided URL treatment, e.g. base64_decode specific block attributes before rewriting the URLs
Streaming SQL import and export
Streaming ZIP import and export
Per-row version control (like @dmsnell's vector clock idea from https://core.trac.wordpress.org/ticket/60375)
Test with 300GB XML file
PHP dependency management – should we ship all the PHP classes in this repo? Or publish independent plugins for others to start adapting in their work – but with no BC guarantees?
Move data-liberation WP-CLI command to separate class #2025
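The "extension points for plugin-provided URL treatment" item above could look roughly like the sketch below. Everything here is an assumption made for illustration: `toy_rewrite_encoded_attribute()` is a hypothetical name, the decode/encode callables stand in for whatever a plugin would register, and the naive `str_replace()` stands in for the real URL rewriter.

```php
<?php
/**
 * Hypothetical sketch of a plugin-provided URL treatment hook.
 * The attribute is decoded, rewritten, then re-encoded the same way.
 *
 * @param string   $value    Encoded attribute value as found in the block markup.
 * @param string   $from_url URL prefix to replace.
 * @param string   $to_url   Replacement URL prefix.
 * @param callable $decode   Plugin-provided decoder, e.g. 'base64_decode'.
 * @param callable $encode   Plugin-provided encoder, e.g. 'base64_encode'.
 */
function toy_rewrite_encoded_attribute( string $value, string $from_url, string $to_url, callable $decode, callable $encode ): string {
	$decoded = $decode( $value );
	// Naive placeholder: a real importer would use a proper URL rewriter here.
	$rewritten = str_replace( $from_url, $to_url, $decoded );
	return $encode( $rewritten );
}
```

A plugin that stores base64-encoded JSON in a block attribute would pass `'base64_decode'` and `'base64_encode'` as the two callables.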

Related resources

Next phases: Future Data Liberation roadmap

Note

The ideas below are the next phases of the project. They stretch far beyond the medium-term importers work tracked in this issue and only live here to paint the big picture.

A part of #1894. Follows up on #1893.

This PR brings in a few more PHP APIs that were initially explored outside of Playground so that they can be incubated in Playground. See the linked descriptions for more details about each API:

* XML Processor from WordPress/wordpress-develop#6713
* Stream chain from adamziel/wxr-normalize#1
* A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR files

## Testing instructions

* Confirm the PHPUnit tests pass in CI
* Confirm the test suite looks reasonable
* That's it for now! It's all new code that's not actually used anywhere in Playground yet. I just want to merge it to keep iterating and improving.

A part of #1894.

Adds https://github.com/WordPress/blueprints-library as a git submodule to the data-liberation package to enable easy code reuse between the projects. I'm not yet sure, but perhaps moving all the PHP libraries to the blueprints-library would make sense? TBD.

No testing instructions. This is just a new submodule. No code changes are involved.

…essor (#1960)

Merge `WP_XML_Tag_Processor` and `WP_XML_Processor` into a single `WP_XML_Processor` class. This reduces abstractions, enables keeping more properties as private, and simplifies the code.

Related to #1894 and WordPress/wordpress-develop#6713

## Testing instructions

Confirm the CI tests pass.

brandonpayton · 2024-11-01T03:54:15Z

@adamziel I think this may have been accidentally closed when #1960 was merged because it was "Related to" this one. There are a good number of tasks left unfinished, and this closing looks automated rather than intentional.

I'll reopen, and you can close again if it was intentional.

adamziel · 2024-11-02T14:20:09Z

Let's also review Automattic's VIP WXR importer for going from WXR reading to importing:

https://github.com/search?q=repo%3AAutomattic%2Fvip-go-mu-plugins%20wxr&type=code

This PR introduces the `WP_WXR_Reader` class for parsing WordPress eXtended RSS (WXR) files, along with supporting improvements to the XML processing infrastructure.

**Note: `WP_WXR_Reader` is just a reader. It won't actually import the data into WordPress** – that part is coming soon.

A part of #1894

## Motivation

There is no WordPress importer that checks all these boxes:

* Supports 100GB+ WXR files without running out of memory
* Can pause and resume along the way
* Can resume even after a fatal error
* Can run without libxml and mbstring
* Is really fast

`WP_WXR_Reader` is a step in that direction. It cannot pause and resume yet, but the next few PRs will add that feature.

## Implementation

`WP_WXR_Reader` uses the `WP_XML_Processor` to find XML tags representing meaningful WordPress entities. The reader knows the WXR schema and only looks for relevant elements. For example, it knows that posts are stored in `rss > channel > item` and comments are stored in `rss > channel > item > wp:comment`.

The `$wxr->next_entity()` method stream-parses the next entity from the WXR document and exposes it to the API consumer via `$wxr->get_entity_type()` and `$wxr->get_entity_data()`. The next call to `$wxr->next_entity()` remembers where the parsing has stopped and parses the next entity after that point.

```php
$fp = fopen('my-wxr-file.xml', 'r');

$wxr_reader = WP_WXR_Reader::from_stream();
while(true) {
    if($wxr_reader->next_entity()) {
        switch ( $wxr_reader->get_entity_type() ) {
            case 'post':
                // ... process post ...
                break;

            case 'comment':
                // ... process comment ...
                break;

            case 'site_option':
                // ... process site option ...
                break;

            // ... process other entity types ...
        }
        continue;
    }

    // Next entity not found – we ran out of data to process.
    // Let's feed another chunk of bytes to the reader.

    if(feof($fp)) {
        break;
    }

    $chunk = fread($fp, 8192);
    if(false === $chunk) {
        // Reading failed: tell the reader that no more bytes are coming.
        $wxr_reader->input_finished();
        continue;
    }
    $wxr_reader->append_bytes($chunk);
}
```

Similarly to `WP_XML_Processor`, the `WP_WXR_Reader` enters a paused state when it doesn't have enough XML bytes to parse the entire entity.

The _next_entity() -> fread -> break_ usage pattern may seem a bit tedious. This is expected. Even if the WXR parsing part of the `WP_WXR_Reader` offers a high-level API, working with byte streams requires reasoning on a much lower level. The `StreamChain` class shipped in this repository will make the API consumption easier with its transformation-oriented API for chaining data processors.

### Supported WordPress entities

* posts – sourced from `<item>` tags
* comments – sourced from `<wp:comment>` tags
* comment meta – sourced from `<wp:commentmeta>` tags
* users – sourced from `<wp:author>` tags
* post meta – sourced from `<wp:postmeta>` tags
* terms – sourced from `<wp:term>` tags
* tags – sourced from `<wp:tag>` tags
* categories – sourced from `<wp:category>` tags

## Caveats

### Extensibility

`WP_WXR_Reader` ignores any XML elements it doesn't recognize. The WXR format is extensible, so in the future the reader may start supporting registration of custom handlers for unknown tags.

### Nested entities intertwined with data

`WP_WXR_Reader` flushes the current entity whenever another entity starts. The upside is simplicity and a tiny memory footprint. The downside is that it's possible to craft a WXR document where some information would be lost. For example:

```xml
<rss>
	<channel>
		<item>
			<title>Page with comments</title>
			<link>http://wpthemetestdata.wordpress.com/about/page-with-comments/</link>
			<wp:postmeta>
				<wp:meta_key>_wp_page_template</wp:meta_key>
				<wp:meta_value><![CDATA[default]]></wp:meta_value>
			</wp:postmeta>
			<wp:post_id>146</wp:post_id>
		</item>
	</channel>
</rss>
```

`WP_WXR_Reader` would accumulate post data until the `<wp:postmeta>` tag. Then it would emit a `post` entity and accumulate the meta information until the `</wp:postmeta>` closer. Then it would advance to `<wp:post_id>` and **ignore it**.

This was not a problem in any of the `.wxr` files I saw. Still, it is important to note this limitation. It is possible there is a `.wxr` generator somewhere out there that intertwines post fields with post meta and comments. If this ever comes up, we could:

* Emit the `post` entity first, then all the nested entities, and then emit a special `post_update` entity.
* Do multiple passes over the WXR file – one for each level of nesting, e.g. 1. Insert posts, 2. Insert comments, 3. Insert comment meta

Buffering all the post meta and comments seems like a bad idea – there might be gigabytes of data.

## Future Plans

The next phase will add pause/resume functionality to handle timeout scenarios:

- Save parser state after each entity or every `n` entities to speed it up. Then also save the `n` for a quick rewind after resuming.
- Resume parsing from saved state.

## Testing Instructions

Read the tests and ponder whether they make sense. Confirm the PHPUnit test suite passes on CI. The test suite includes coverage for various WXR formats and streaming behaviors.
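The "paused state" described in this PR can be illustrated with a toy reader. This is only a sketch under loud assumptions: `Toy_Chunked_Reader` and `toy_read_all()` are hypothetical, entities are newline-terminated records rather than XML elements, and only the method names `append_bytes()`, `input_finished()`, and `next_entity()` are borrowed from the example above. It is not how `WP_WXR_Reader` is implemented.

```php
<?php
/**
 * Hypothetical toy reader, NOT WP_WXR_Reader. An "entity" here is one
 * newline-terminated record instead of an XML element.
 */
final class Toy_Chunked_Reader {
	private string $buffer   = '';
	private bool $finished   = false;
	private ?string $entity  = null;

	public function append_bytes( string $bytes ): void {
		$this->buffer .= $bytes;
	}

	public function input_finished(): void {
		$this->finished = true;
	}

	/** Returns false ("paused") when the buffer holds no complete record yet. */
	public function next_entity(): bool {
		$end = strpos( $this->buffer, "\n" );
		if ( false === $end ) {
			// A trailing record without a newline is complete only once the input has ended.
			if ( ! $this->finished || '' === $this->buffer ) {
				return false;
			}
			$end = strlen( $this->buffer );
		}
		$this->entity = substr( $this->buffer, 0, $end );
		// Drop the consumed bytes so the buffer never grows beyond the unparsed tail.
		$this->buffer = (string) substr( $this->buffer, $end + 1 );
		return true;
	}

	public function get_entity(): ?string {
		return $this->entity;
	}
}

/** Feeds $input to the reader in $chunk_size-byte pieces and collects every record. */
function toy_read_all( string $input, int $chunk_size ): array {
	$reader      = new Toy_Chunked_Reader();
	$offset      = 0;
	$input_ended = false;
	$records     = array();
	while ( true ) {
		if ( $reader->next_entity() ) {
			$records[] = $reader->get_entity();
			continue;
		}
		if ( $input_ended ) {
			break;
		}
		if ( $offset >= strlen( $input ) ) {
			// No more bytes: let the reader flush whatever is left, then stop.
			$reader->input_finished();
			$input_ended = true;
			continue;
		}
		$reader->append_bytes( substr( $input, $offset, $chunk_size ) );
		$offset += $chunk_size;
	}
	return $records;
}
```

The chunk size should not affect the result: a record split across two chunks simply makes `next_entity()` return false until the rest of it arrives.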

Adds wp-admin support for incrementally importing data from WXR files:

![CleanShot 2024-11-27 at 19 07 23@2x](https://github.com/user-attachments/assets/401158e4-d499-45f1-b1d2-2054edeb8326)

This is a part of #1894

## Implementation details

There can be one active import session at any given time. It is started by uploading a WXR file, specifying the URL, and can be extended to any number of data sources. Once a session is created, the admin page shows the current import progress.

This PR adds a `WP_Import_Session` model class to store the progress information and the current import cursor.

Given an active import session, the admin page shows the current stage and the number of imported entities, accompanied by a "Continue Importing" button. When pressed, it calls `WP_Stream_Importer::next_step()` one or more times to perform a small unit of work. After each call, we collect the progress information from `WP_Stream_Importer` – be it the number of downloaded asset bytes, the number of inserted database records, the current importing cursor, etc.

`next_step()` returns true when some progress was made, even if that was a failed image download attempt. It returns false when it reaches the end of the current importing stage, at which point the `advance_to_next_stage()` method must be called.

After each `next_step()` or `advance_to_next_stage()` call, `WP_Stream_Importer::get_reentrancy_cursor()` returns a string that can be used to create a new importer that will resume from the exact same place. The cursor means _we got this far_, not _we got this far and no further_. The record the cursor points to may have already been processed. In the upcoming PRs we'll need to either point to the next entity, or invent idempotent import semantics where processing the same record twice leads to the same outcome as processing it once.
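To make the cursor caveat concrete, here is a minimal sketch of "idempotent import semantics". `Toy_Stepwise_Importer` is hypothetical and is not `WP_Stream_Importer`: it mirrors only the `next_step()` / `get_reentrancy_cursor()` contract, leaves import stages out, uses an array in place of the database, and invents its own JSON cursor format.

```php
<?php
/**
 * Hypothetical toy importer, NOT WP_Stream_Importer. It mirrors only the
 * next_step() / get_reentrancy_cursor() contract; import stages are left out.
 */
final class Toy_Stepwise_Importer {
	/** @var array[] Records to import, each with 'id' and 'title'. */
	private array $records;
	/** @var int Index of the record the next step will process. */
	private int $position = 0;
	/** @var int Index of the last record touched, or -1 if none. */
	private int $last_touched = -1;
	/** @var array Stands in for the database, keyed by source ID. */
	public array $db;

	public function __construct( array $records, array $db = array(), ?string $cursor = null ) {
		$this->records = $records;
		$this->db      = $db;
		if ( null !== $cursor ) {
			$state = json_decode( $cursor, true );
			// "We got this far": the pointed-to record may or may not have been
			// stored before the interruption, so resume by processing it again.
			$this->last_touched = (int) $state['last_touched'];
			$this->position     = max( 0, $this->last_touched );
		}
	}

	/** Returns true when a record was processed, false when none are left. */
	public function next_step(): bool {
		if ( $this->position >= count( $this->records ) ) {
			return false;
		}
		$record = $this->records[ $this->position ];
		// Idempotent write: keyed by source ID, so a replay overwrites instead of duplicating.
		$this->db[ $record['id'] ] = $record['title'];
		$this->last_touched        = $this->position;
		++$this->position;
		return true;
	}

	public function get_reentrancy_cursor(): string {
		return json_encode( array( 'last_touched' => $this->last_touched ) );
	}
}
```

Because the write is keyed by the source ID, resuming from a cursor that points at an already-stored record costs one redundant step but leaves the stored data unchanged.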
### Resource Budgets

This PR starts exploring resource budgets by introducing a soft time limit and a minimum number of files downloaded during a single frontloading session. We don't support partial downloads or resuming them yet, so we can't settle for downloading less than one file. On the next attempt we'd just discard the result and likely download less than one file again, meaning we would never get past the frontloading step.

## Testing instructions

1. `cd packages/playground/data-liberation/tests/import`
2. `bash run.sh`
3. Go to wp-admin
4. Go to the Data Liberation page
5. Upload the a11y xml file from the WXR test set shipped in `packages/playground/data-liberation/tests/wxr/a11y-unit-test-data.xml`
6. Click through all the import steps
7. Confirm the assets are downloaded as expected and that, eventually, every click of the "continue" button imports one more entity
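The soft time limit combined with a minimum number of files, as described under Resource Budgets, might be sketched as follows. `toy_frontload_session()` and its parameters are hypothetical names chosen for this illustration, and each download is reduced to an opaque callable; the PR's actual budget code may look quite different.

```php
<?php
/**
 * Hypothetical sketch of a soft time budget with a guaranteed minimum of work.
 *
 * @param callable[]    $downloads  Each callable performs one whole download.
 * @param float         $soft_limit Seconds after which no new download is started.
 * @param int           $min_files  Downloads to complete even when over the limit.
 * @param callable|null $clock      Returns the current time in seconds; injectable for tests.
 * @return int Number of downloads completed in this session.
 */
function toy_frontload_session( array $downloads, float $soft_limit, int $min_files = 1, ?callable $clock = null ): int {
	$clock = $clock ?? static function (): float {
		return microtime( true );
	};
	$started   = $clock();
	$completed = 0;
	foreach ( $downloads as $download ) {
		$over_budget = ( $clock() - $started ) >= $soft_limit;
		// The limit is "soft": it is checked only between downloads, and it is
		// ignored until the minimum number of files has been fetched.
		if ( $over_budget && $completed >= $min_files ) {
			break;
		}
		$download();
		++$completed;
	}
	return $completed;
}
```

With `$min_files` set to at least 1, every session finishes at least one file, which is what keeps the frontloading stage from stalling forever on a slow connection.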

…a downloads, Interactivity API support (#2040)

## Description

Ships user-driven import error handling and makes the import UI more useful by automatically refreshing the progress details.

### User-driven error handling

When a remote asset cannot be downloaded, most importers either stop or ignore the error. This PR adds a user interaction to make an explicit decision about what should happen next – do we ignore the missing asset? Do we use another file instead?

https://github.com/user-attachments/assets/cea48258-b644-434c-9fb2-1b890c4d86d7

### Auto-Refreshing Import Status

This PR also re-expresses the entire data liberation wp-admin page using the Interactivity API, and auto-refreshes the progress:

https://github.com/user-attachments/assets/e093268b-5deb-4bc2-a1d2-e2bb1148e153

A part of #1894

## Technical overview

### User-driven error handling

During the frontloading stage, the `WP_Stream_Importer` exposes all the frontloaded entities to the API consumer. The consumer then creates a post of type `frontloading_placeholder` with an initial status `awaiting_download` for each asset, and updates it with progress information and status (success, failure, skipped) as the import progresses.

The frontloading stage is not finished until all the frontloaded assets have been processed with a non-error outcome. There are a few ways to recover from errors:

* Retry the download – `WP_Stream_Importer` now retries the failed asset URLs (via `WP_Retry_Frontloading_Iterator`) before moving on to entities provided by the usual entity source such as a WXR file.
* Change the downloaded URL – done by the user on the wp-admin page
* Choose to skip the download – done by the user on the wp-admin page

Sometimes we don't want to require user interactions, e.g. when running the `importWxr` Blueprint step. In those scenarios, we could choose a default error outcome, e.g. "skip failed downloads".
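The "finished only when every asset has a non-error outcome" rule and the "skip failed downloads" default can be sketched as below. The function names, the array shape, and every status string other than `awaiting_download` are assumptions made for this example; the real placeholders are posts of type `frontloading_placeholder`.

```php
<?php
// Only 'awaiting_download' is named in the description above; the other
// status strings are assumptions made for this sketch.
const TOY_STATUS_AWAITING = 'awaiting_download';
const TOY_STATUS_SUCCESS  = 'success';
const TOY_STATUS_ERROR    = 'error';
const TOY_STATUS_SKIPPED  = 'skipped';

/**
 * The frontloading stage is finished only when every asset was either
 * downloaded or explicitly skipped.
 *
 * @param array $placeholders Map of asset URL => array( 'status' => string ).
 */
function toy_frontloading_finished( array $placeholders ): bool {
	foreach ( $placeholders as $placeholder ) {
		if ( ! in_array( $placeholder['status'], array( TOY_STATUS_SUCCESS, TOY_STATUS_SKIPPED ), true ) ) {
			return false;
		}
	}
	return true;
}

/**
 * Non-interactive default outcome: turn every failed download into a skip
 * so that the stage can finish without user input.
 */
function toy_skip_failed_downloads( array $placeholders ): array {
	foreach ( $placeholders as $url => $placeholder ) {
		if ( TOY_STATUS_ERROR === $placeholder['status'] ) {
			$placeholders[ $url ]['status'] = TOY_STATUS_SKIPPED;
		}
	}
	return $placeholders;
}
```

Assets still in `awaiting_download` keep the stage open even after the default is applied, since skipping only addresses downloads that were attempted and failed.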
### Auto-refreshing admin page

Two `fetch()` requests run in an infinite loop:

* One updates the JavaScript interactivity store with the latest import state from the server
* One runs the next import step

### Other changes

* Adds `php_userstreamop_read` to the Asyncify list – it crashed the importer in `@wp-playground/cli` running in Bun.

## Follow-up work

* Pretty UI transitions. Right now it's all sudden and jerky. We need progress bars, smooth animations, clear visual causality.
* Prevent running the same import in two concurrent requests. This is a serial importer not designed for parallelization.
* Run each import step in a transaction – either it all worked and we can commit the changes and an updated cursor, or it didn't work and we roll back the last step. Ideally we'll never see a scenario where an entity was processed, but a crash happened before storing the updated cursor and the next run reprocesses the same entity.

## Testing instructions

* Go to the data liberation admin page
* Upload a WXR export file
* Confirm the import processes automatically and doesn't error out
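The "run each import step in a transaction" follow-up can be pictured with a small sketch. It is an assumption-heavy illustration: `toy_run_step_in_transaction()` is a hypothetical name, and an array copy stands in for a database transaction, which is presumably what a real implementation would use.

```php
<?php
/**
 * Hypothetical sketch: apply one import step and its cursor update atomically.
 * $state stands in for the database; copying the array plays the role that a
 * real database transaction would.
 *
 * @param array    $state Array with 'records' (array) and 'cursor' (int) keys.
 * @param callable $step  Receives the working copy by reference and may throw.
 * @return array The new state on success, or the untouched original on failure.
 */
function toy_run_step_in_transaction( array $state, callable $step ): array {
	// PHP arrays are copied by value, so $state stays intact while we work.
	$working_copy = $state;
	try {
		$step( $working_copy );
		// The cursor advances in the same "transaction" as the data change.
		++$working_copy['cursor'];
	} catch ( Throwable $e ) {
		// Roll back: neither the partial data nor a new cursor is kept.
		return $state;
	}
	return $working_copy;
}
```

Either both the data and the cursor move forward or neither does, so a crash mid-step cannot leave a processed entity behind a stale cursor.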

…2058)

## Description

Adds the Data Liberation WXR importer as an option in the `importWxr` step. The new importer is turned on by including the `"importer": "data-liberation"` option:

```json
{
	"steps": [
		{
			"step": "importWxr",
			"file": {
				"resource": "url",
				"url": "https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml"
			},
			"importer": "data-liberation"
		}
	]
}
```

When the `importer` option is missing or set to "default," nothing changes in the behavior of the step and it continues using the https://github.com/humanmade/WordPress-Importer importer.

The new importer:

* Rewrites links in the imported content
* Downloads assets through Playground's CORS proxy
* Parallelizes the downloads
* Communicates progress

This PR is a part of #1894

## Implementation details

This `importWxr` step fetches and includes the `data-liberation-core.phar` file. The phar file is built with [Box](https://box-project.github.io/box/configuration/) and contains the importer library with its dependencies, which is a subset of the Data Liberation library, a subset of the Blueprints library, and a few vendor libraries.

This, unfortunately, means that any changes in the PHP files require rebuilding the .phar file. Here's how you can do it:

```bash
nx build:phar playground-data-liberation
```

You can also build the entire Data Liberation package as a WordPress plugin complete with a wp-admin page:

```bash
nx build:plugin playground-data-liberation
```

Both commands will output the built files to `packages/playground/data-liberation/dist`.

The progress updates are a first-class feature of the new importer. The updated `importer` step receives them in real-time via a `post_message_to_js()` call running after every import step. Then, it passes them on to the progress bar UI.
### Other changes

* **TLS traffic now goes through the CORS proxy.** Since the new importer uses `AsyncHTTP\Client` which deals with raw sockets, Playground's [TLS-based network bridge](#1926) runs the outbound traffic through a CORS proxy. Technically, `TCPOverFetchWebsocket` gets the `corsProxy` URL passed to the `playground.boot()` call.
* A few composer dependencies were forked, downgraded to PHP 7.2 using Rector, and bundled with this PR to keep the Data Liberation importer working.

## Remaining work

- [x] PHP 7.2 compatibility. Done by forking and Rector-downgrading dependencies that were incompatible with PHP 7.2.
- [x] Report the importer's progress on the overall Blueprint progress bar
- [x] Enqueue the data liberation plugin files for downloading at the blueprint compilation stage
- [x] Don't eagerly rewrite attachments URLs in `WP_Stream_Importer`. Exposing this information to the API consumer requires an explicit decision. Do we rewrite it? Or do we ignore it?
- [x] Fix the TLS errors at the intersection of Playground network transport and the async HTTP client library
- [x] Separate the markdown importer and its dependencies (md parser, frontmatter parser, Symfony libraries) from the core plugin
- [x] Ship the importer and its tree-shaken deps (URL parser) as a minified zip/phar

## Follow-up work

- [ ] Reconsider the `WP_Import_Session` API – do we need so many verbosely named methods? Can we achieve the same outcomes with fewer methods?
- [ ] Investigate why there's a significant delay before media downloads start on PHP 7.2 – 7.4. It's likely a PHP.wasm issue.
## Testing instructions

* Default importer – [Open this link](http://localhost:5400/website-server/#{%20%22plugins%22:%20[],%20%22steps%22:%20[%20{%20%22step%22:%20%22importWxr%22,%20%22file%22:%20{%20%22resource%22:%20%22url%22,%20%22url%22:%20%22https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml%22%20}%20}%20],%20%22preferredVersions%22:%20{%20%22php%22:%20%228.3%22,%20%22wp%22:%20%226.7%22%20},%20%22features%22:%20{%20%22networking%22:%20true%20},%20%22login%22:%20true%20}) and confirm it does what the current `importWxr` step does, that is, it stays at "Importing content" for a moment, fails to fetch media files (CORS issues in network tools), but inserts posts and pages.
* Data Liberation – [Open this link](http://localhost:5400/website-server/#{%20%22plugins%22:%20[],%20%22steps%22:%20[%20{%20%22step%22:%20%22importWxr%22,%20%22importer%22:%20%22data-liberation%22,%20%22file%22:%20{%20%22resource%22:%20%22url%22,%20%22url%22:%20%22https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml%22%20}%20}%20],%20%22preferredVersions%22:%20{%20%22php%22:%20%228.3%22,%20%22wp%22:%20%226.7%22%20},%20%22features%22:%20{%20%22networking%22:%20true%20},%20%22login%22:%20true%20}), confirm the import progress is visible and that the content and media indeed get imported:

![CleanShot 2024-12-08 at 14 54 49@2x](https://github.com/user-attachments/assets/a7da3244-a10f-43d2-8e94-43d305220a7e)

## Related issues

* #1211
* #2012
* #1477
* #1250
* #1780

Adds a forked version of the markdown parsing libraries required by the upcoming Markdown importer. We need our own fork for PHP 7.2 compatibility. The downgrade process was performed semi-automatically via Rector.

This PR adds the following libraries:

* `league/commonmark`
* `webuni/front-matter`

There are no testing steps here. This PR only adds new code without modifying the existing one.

A part of:

* #2080
* #1894

Moves the Markdown importer to a `data-liberation-markdown` package so that it can be shipped as a separate `.phar` file and downloaded only when needed.

## Testing instructions

This only moves code around. To test, confirm the CI PHP unit tests keep working.

A part of:

* #2080
* #1894

Builds data-liberation-markdown.phar.gz (200KB) to enable downloading the Markdown importer only when needed instead of on every page load.

A part of:

* #2080
* #1894

## Testing instructions

Run `nx build playground-data-liberation-markdown`, confirm it finished without errors. A smoke test of the built phar file is included in the build command.

Adds a basic WP_HTML_To_Blocks class that accepts HTML and outputs block markup. It only considers the markup and won't consider any visual changes introduced via CSS or JavaScript.

A part of #1894

## Example

```php
$html = <<<HTML
<meta name="post_title" content="My first post">
Hello world!
HTML;

$converter = new WP_HTML_To_Blocks( $html );
$converter->convert();

var_dump( $converter->get_all_metadata() );
/*
 * array( 'post_title' => array( 'My first post' ) )
 */

var_dump( $converter->get_block_markup() );
/*
 *
 * Hello world!
 *
 */
```

## Testing instructions

This PR mostly adds new code. Just confirm the unit tests pass in CI.

@brandonpayton

Adds a basic `WP_HTML_To_Blocks` class that accepts HTML and outputs block markup.

It's a very basic converter. It only considers the markup and won't consider any visual changes introduced via CSS or JavaScript. Only a few core blocks are supported in this initial PR. The API can easily support more HTML elements and blocks.

To preserve visual fidelity between the original HTML page and the produced block markup, we'll need an annotated HTML input produced by the [Try WordPress](https://github.com/WordPress/try-wordpress/) browser extension. It would contain each element's colors, sizes, etc. We cannot possibly get all of that from just analyzing the HTML on the server without building a full-blown, browser-like HTML renderer in PHP, and I know I'm not building one.

A part of #1894

## Example

```php
$html = <<<HTML
<meta name="post_title" content="My first post">
Hello world!
HTML;

$converter = new WP_HTML_To_Blocks( $html );
$converter->convert();

var_dump( $converter->get_all_metadata() );
/*
 * array( 'post_title' => array( 'My first post' ) )
 */

var_dump( $converter->get_block_markup() );
/*
 *
 * Hello world!
 *
 */
```

## Caveats

I had to patch WP_HTML_Processor to stop bailing out on `<meta>` tags referencing the document charset. Ideally we'd patch WordPress core to stop bailing out when the charset is UTF-8.

## Testing instructions

This PR mostly adds new code. Just confirm the unit tests pass in CI.

cc @brandonpayton @zaerl @sirreal @dmsnell @ellatrix
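For readers unfamiliar with block markup, the sketch below shows the general shape of such a conversion. It is a deliberately naive, regex-based toy and not how `WP_HTML_To_Blocks` works: `toy_html_to_blocks()` is a hypothetical function that only understands `<meta name="…" content="…">` and `<p>` elements, and its return shape is an assumption made for this example.

```php
<?php
/**
 * Hypothetical toy converter, NOT WP_HTML_To_Blocks. It is regex-based and
 * handles only <meta name="…" content="…"> and <p> elements.
 *
 * @return array Array with 'metadata' and 'block_markup' keys.
 */
function toy_html_to_blocks( string $html ): array {
	$metadata = array();
	preg_match_all( '/<meta\s+name="([^"]*)"\s+content="([^"]*)"\s*\/?>/i', $html, $metas, PREG_SET_ORDER );
	foreach ( $metas as $meta ) {
		// Values are collected in arrays because a meta name may repeat.
		$metadata[ $meta[1] ][] = html_entity_decode( $meta[2], ENT_QUOTES );
	}

	$blocks = array();
	preg_match_all( '/<p\b[^>]*>(.*?)<\/p>/is', $html, $paragraphs, PREG_SET_ORDER );
	foreach ( $paragraphs as $paragraph ) {
		// A paragraph block is the <p> element wrapped in block comment delimiters.
		$blocks[] = "<!-- wp:paragraph -->\n<p>" . trim( $paragraph[1] ) . "</p>\n<!-- /wp:paragraph -->";
	}

	return array(
		'metadata'     => $metadata,
		'block_markup' => implode( "\n\n", $blocks ),
	);
}
```

A real converter needs a spec-compliant HTML parser rather than regular expressions, which is why the PR builds on `WP_HTML_Processor`.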

adamziel added the [Aspect] Data Liberation, [Type] Project, and [Type] Tracking labels Oct 14, 2024

adamziel mentioned this issue Oct 14, 2024

[Data Liberation] wp_rewrite_urls() #1893

Merged

8 tasks

bgrgicak moved this from Inbox to In progress in Playground Board Oct 15, 2024

adamziel moved this from In progress to Project: In Progress in Playground Board Oct 16, 2024

adamziel added this to the Data Liberation: URL Rewriting milestone Oct 25, 2024

adamziel mentioned this issue Oct 28, 2024

[Data Liberation] Add XML API, Stream API, WXR URL Rewriter API #1952

Merged

adamziel mentioned this issue Oct 29, 2024

Adam's list of Data Liberation wishes and ideas #1957

Open

adamziel mentioned this issue Oct 30, 2024

[Data Liberation] Add blueprints-library as a submodule #1967

Merged

adamziel mentioned this issue Oct 31, 2024

[Data Liberation] Merge both XML processors into a single WP_XML_Processor #1960

Merged

adamziel linked a pull request Oct 31, 2024 that will close this issue

[Data Liberation] Merge both XML processors into a single WP_XML_Processor #1960

Merged

adamziel closed this as completed in #1960 Oct 31, 2024

brandonpayton reopened this Nov 1, 2024

adamziel mentioned this issue Nov 2, 2024

[Data Liberation] WP_WXR_Reader #1972

Merged

bgrgicak mentioned this issue Nov 7, 2024

Explore interplay between Blueprints and Assembler WordPress/blueprints-library#117

Open

This was referenced Nov 18, 2024

[Data Liberation] Re-entrant gzip decoder #2002

Open

[Data Liberation] Re-entrant WP_Stream_Importer #2004

Merged

adamziel mentioned this issue Nov 28, 2024

[Data Liberation] WP_Stream_Importer: User-driven incremental import #2013

Merged

adamziel mentioned this issue Dec 1, 2024

[Data Liberation] "Fetch from a different URL" button for failed media downloads, Interactivity API support #2040

Merged

adamziel changed the title ~~[Data Liberation] Tracking issue~~ [Data Liberation] Next Gen Importers Tracking issue Dec 4, 2024

adamziel changed the title ~~[Data Liberation] Next Gen Importers Tracking issue~~ [Data Liberation] Next Gen Importers Tracking Issue Dec 4, 2024

adamziel changed the title ~~[Data Liberation] Next Gen Importers Tracking Issue~~ [Data Liberation] Tracking Issue: Next Gen Importers Dec 4, 2024

adamziel changed the title ~~[Data Liberation] Tracking Issue: Next Gen Importers~~ Tracking Issue: Next Gen Importers for Data Liberation Dec 4, 2024

adamziel removed this from Playground Board Dec 4, 2024

adamziel changed the title ~~Tracking Issue: Next Gen Importers for Data Liberation~~ Tracking Issue: Next-gen PHP Importers for Data Liberation Dec 4, 2024

This was referenced Dec 5, 2024

Kickoff Data Liberation: Let's Build WordPress-first Data Migration Tools #1888

Merged

[Blueprints] Support Data Liberation importer in the importWxr step #2058

Merged

Tracking Issue: WordPress to WordPress migrations WordPress/data-liberation#73

Closed

adamziel mentioned this issue Dec 17, 2024

[Data Liberation] Add Markdown parsing libraries #2092

Merged

adamziel mentioned this issue Dec 17, 2024

[Data Liberation] Move Markdown importer to a separate package #2093

Merged

adamziel mentioned this issue Dec 17, 2024

[Data Liberation] Build markdown importer as phar #2094

Merged

This was referenced Dec 17, 2024

[Data Liberation] Add HTML to Blocks converter #2095

Merged

[Data Liberation] Add EPub to Blocks converter #2097

Open

Move wp_kses_uri_attributes filter to import start/end #2047

Open

adamziel mentioned this issue Dec 19, 2024

StreamChain: An API for streams-processing data (e.g. HTTP → ZIP → XML → HTML) adamziel/wxr-normalize#1

Closed
