Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data Liberation] Topological sorter, entities remapping and add missing imports #2030

Draft
wants to merge 61 commits into
base: trunk
Choose a base branch
from

Conversation

zaerl
Copy link
Collaborator

@zaerl zaerl commented Nov 26, 2024

You can find a discussion here with a more detailed explanation: #2090

Motivation for the change, related issues

Topological Sorting of WXR entities before starting the import to ensure parent posts are imported before child posts. Processing WXR in a topological order may require an index with all offsets and lengths of items in the WXR.

Implementation details

Entities preloading/loading/mapping

The topological sort happens during a STAGE_TOPOLOGICAL_SORT phase runt before the STAGE_FRONTLOAD_ASSETS.

This PR removes the WP_Entity_Importer::mapping and WP_Entity_Importer::exists arrays and all memory preload. It is slow and memory-consuming, with thousands of entries and not support sessions. It adds a new table with import IDs and mapped IDs. During the first phase, it is prefilled. During the WP_Entity_Importer import it maps the imported IDs by using the wxr_importer_* filters and actions.

New WP-CLI script

This PR also introduces the new CLI script and moves the logger there. moved to #2104

New unit tests

  1. Added all the 130 WordPress core unit tests. Updated the WXRs to latest version.
  2. Added a PlaygroundTestCase base class that cleanup the WordPress database after a test, as _delete_all_data does
    Moved to [Data Liberation] Add support for terms meta and new unit tests #2105

Missing imports and hierarchies

Added support for:

  1. Posts hierarchy
  2. Term hierarchy
  3. Term meta
  4. Comment meta

New PHPUnit filter

This PR adds a PHPUNIT_FILTER constant to packages/playground/data-liberation/tests/import/blueprint-import.json. If the value is not falsy it will be passed to PHPUnit when calling npx nx run playground-data-liberation:test:wp-phpunit. So, for example, you can set "PHPUNIT_FILTER": "WPRewriteUrlsTests". It will be the same as running phpunit --filter WPRewriteUrlsTests.
Moved to #2105

Testing Instructions (or ideally a Blueprint)

Unit tests

npx nx run playground-data-liberation:test:wp-phpunit

# Or only tests that do not need a WordPress environment
cd packages/playground/data-liberation
./vendor/bin/phpunit

Many of the tests can be run only in a real WordPress environment, that's why wp-phpunit exists.

Data test

  1. Spin Playground
  2. Import whatever .xml you find in packages/playground/data-liberation/tests/wxr/*
  3. In the above discussion, you can use a plugin to create an XML with thousands of entries if you want.

About to add and pass all the tests found here https://github.com/WordPress/wordpress-importer/tree/master/phpunit/tests.

New script

wp data-liberation import test.xml

or:

cd packages/playground/data-liberation/bin/import
bash import-wxr.sh a-folder-with-xmls

It accepts a file, a URL, or a folder. Run wp help data-liberation import to see all options. moved to #2104

@zaerl zaerl force-pushed the add/topological-sort branch from c51d9c4 to 7778714 Compare November 26, 2024 13:52
@zaerl zaerl self-assigned this Nov 26, 2024
@zaerl zaerl force-pushed the add/topological-sort branch 2 times, most recently from 85c850b to 78f1cb3 Compare November 29, 2024 11:19
@zaerl zaerl force-pushed the add/topological-sort branch from 78f1cb3 to 5af5722 Compare November 29, 2024 13:02
* Quicksort performs badly on already sorted arrays, O(n^2) is the worst case.
* Let's consider using a different sorting algorithm.
*/
uksort( $elements, $sort_callback );
Copy link
Collaborator

@adamziel adamziel Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires fitting all $elements into memory. This may be fine for v1, since every $element is relatively small, but we'll hit the limits of this approach sooner than later. Is it possible to perform topological sorting with at a reasonable speed without holding everything in memory? If not, how much RAM would this need to process one of these huge VIP 1TB exports?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can save the data in a custom DB table (Jetpack have a custom wp_jetpack_sync_queue table) instead of a simple array in memory. I didn't disturb the database in this first phase, but we can. Reproducing this in a table and using the custom sort here is straightforward. At the end of collecting the byte offsets, it's a matter of ALTER TABLE wxr_sorting ORDER BY custom_sorting_like_the_one_above, and we are done, even with our new reentrancy model.

When it finds a row in this table, the streamer can jump to the "correct" position (the one it should load before the current one) and proceed to the next one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely! If it's straightforward, let's include it in v1 – it seems like we can ship something that's on the right track with relatively little effort. It would also enforce shaping the API to stream the sorted list instead of loading it into memory, which would have implications for the reentrancy cursor. It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.

Agree. It's reasonable, let's proceed this way. 👍

@@ -93,7 +93,7 @@ public function __construct( $options = array() ) {
$this->mapping['term_id'] = array();
$this->requires_remapping = $empty_types;
$this->exists = $empty_types;
$this->logger = new Logger();
$this->logger = isset( $options['logger'] ) ? $options['logger'] : new WP_Logger();
Copy link
Collaborator

@adamziel adamziel Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not sure we need a logger beyond _doing_it_wrong(). All the progress information are now exposed as numbers and the API consumer may implements its own logging using any technique.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, thanks.

Comment on lines 230 to 236
if ( true === $this->topological_sort_next_entity( $count ) ) {
return true;
}

// We indexed all the entities. Now sort them topologically.
$this->topological_sorter->sort_topologically();
$this->topological_sorter = null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reads confusing to me:

  1. Sort every entity we encounter and short circuit while we have entities
  2. Once we run out of entities to sort, sort them

What would be another way to call these operations?

/**
* The base name of the table.
*/
const TABLE_NAME = 'data_liberation_index';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called index, but it's not used in the index_next_entities() stage, only in the topological sort stage. How about data_liberation_topological_index or something to that effect?

* Run in the 'plugins_loaded' action.
*/
public static function load() {
if ( self::DB_VERSION !== (int) get_site_option( self::OPTION_NAME ) ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about the distinction between load() and activate() – could we either have one or document them a bit more to explain the usage?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The activation/deactivation is when a user clicks "activate" and "deactivate" link. The load action is usually used if you need to change schema:

The best practices are here:

  1. Installation, activation steps: https://developer.wordpress.org/plugins/plugin-basics/activation-deactivation-hooks/
  2. Custom tables: https://learn.wordpress.org/lesson/custom-database-tables/
  3. Schema change: https://developer.wordpress.org/plugins/creating-tables-with-plugins/

Names can be misleading. Let me add more docs.

$table_name = self::get_table_name();

// Create the table if it doesn't exist.
// @TODO: remove this custom SQLite declaration after first phase of unit tests is done.
Copy link
Collaborator

@adamziel adamziel Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the CREATE TABLE statement doesn't work with the legacy SQLite integration we're using now? cc @JanJakes this could be useful for testing

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamziel @zaerl The MySQL query actually seems to work fine with the current translator.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JanJakes it is, thanks! Removed the special SQL.

/**
* Run by register_deactivation_hook.
*/
public static function deactivate() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be any advantage to running this when the plugin is uninstalled and not just deactivated?

Copy link
Collaborator Author

@zaerl zaerl Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read this here https://learn.wordpress.org/lesson/custom-database-tables/:

If the users of your plugin do not need the data in this table if they deactivate the plugin, you could trigger this on the plugin deactivation hook.

register_deactivation_hook( __FILE__, 'wp_learn_delete_table' );
However, if the data in that table is important, and your users might want to keep it, even if the plugin is deactivated, you could delete the table using one of the two uninstall methods available to plugins.

Our data is not "important" in the strict sense of the term. It can be recreated again; it's just bulky. So, removing it when the user deactivates the plugin is okay. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good 👍

* @param int $byte_offset The byte offset of the category.
* @param array $data The category data.
*/
public function map_category( $byte_offset, $data ) {
Copy link
Collaborator

@adamziel adamziel Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map makes me think of map/reduce and a mapper callback. How about insert_category or add_category or index_category? Ditto for the other map_ method

'element_id' => (string) $data['term_id'],
'parent_id' => $category_parent,
'byte_offset' => $byte_offset,
// Items with a parent has at least a sort order of 2.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's document this more. What's the logic behind the sort_order of 1, 2 etc? Should it be more for deeply nested categories or not?

*
* @var int
*/
protected $orphan_post_counter = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to restore this (or any other) value after stopping and resuming the import?

*
* @return int|bool The byte offset of the category, or false if the category is not found.
*/
public function get_category_byte_offset( $session_id, $slug ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many methods can we make private? Every public method makes a public API and gets covered by the WordPress indefinite BC policy upon merge to WordPress core.


return $wpdb->get_var(
$wpdb->prepare(
'SELECT byte_offset FROM %i WHERE element_id = %s AND element_type = %d AND session_id = %d LIMIT 1',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to fetch these in chunks of 100 or 1000 to avoid hitting the database every time? Also, noting the docstring says "and remove it from the list" while the method doesn't perform the removal.


// MySQL version - update sort_order using a subquery
return $wpdb->query(
$wpdb->prepare(
Copy link
Collaborator

@adamziel adamziel Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How well does this perform on large datasets? Running a single, table-wide UPDATE that would potentially affect a few hundred million rows is making me anxious, although that's just a gut feeling.

@zaerl zaerl force-pushed the add/topological-sort branch from 41a2007 to 69606a4 Compare December 18, 2024 10:20
@zaerl
Copy link
Collaborator Author

zaerl commented Dec 18, 2024

Fixing post-rebase failing unit tests.

@zaerl
Copy link
Collaborator Author

zaerl commented Dec 19, 2024

@adamziel I cherry-picked those parts not strictly related to the topo-sort in these two PRs:

  1. [Data Liberation] New "wp data-liberation" WP-CLI script #2104
  2. [Data Liberation] Add support for terms meta and new unit tests #2105

I would like to see some fixes and missing parts merged now that our discussion is going on. 👍

@@ -108,6 +113,9 @@ public function __construct( $options = array() ) {
'default_author' => null,
)
);

WP_Topological_Sorter::activate();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implicitly runs a bunch of database operations in the object constructor. If we create 10 objects, it will run 10 times. If it fails, we don't have a good way of reporting that. What do you think about running it in a plugin-related hook instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has already run for the first time during the activations. I can remove it from here.

if ( array_key_exists( 'session_id', $options ) ) {
$this->set_session( $options['session_id'] );
} else {
$active_session = WP_Import_Session::get_active();
Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we're not using an WP_Import_Session object for this import run? Note nothing in WP_Entity_Importer depends on direct access to WP_Import_Session.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@@ -108,6 +113,9 @@ public function __construct( $options = array() ) {
'default_author' => null,
)
);

WP_Topological_Sorter::activate();
$this->topological_sorter = new WP_Topological_Sorter( $this->options );
Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this object only exist for the duration of the stages where it's used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used even later during the import stage.

@@ -257,6 +267,7 @@ public function import_user( $data ) {
* @param array $userdata Raw data imported for the user.
*/
do_action( 'wxr_importer_processed_user', $user_id, $userdata );
// $this->topological_sorter->map_entity( 'user', $userdata, $user_id );
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be uncommented? Or is it a placeholder for later? Then let's document inline how will we know to uncomment it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is commented out because I'm not sure what we do with wp_insert_user. If we will reassign, rewrite or possibly remap the user of the wxr with the preexisting one, like in other importers.

@@ -267,13 +278,13 @@ public function import_term( $data ) {
* @param array $meta Meta data.
*/
$data = apply_filters( 'wxr_importer_pre_process_term', $data );
$data = $this->topological_sorter->get_mapped_entity( 'term', $data );
Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this line without any additional context, and I'm confused about what's happening here – we use a sorter, but instead of sorting, we map. Let's come up with a set of terms that meaningfully reflect that entire idea space. Ideally, this line would communicate something to a new person without requiring additional work, such as diving into the source code of that method.

Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, could we only source specific values from the sorter instead of blanket-replacing the entire $data array? This leads to a section in the code below where we just accept the parent value from $data instead of explicitly using the mapped parent value. It's an easy way for unprocessed user input to slip through if we miss even a single check in a single code path. We can reduce the likelihood of running into that entire class of errors by just saying "sorry, $data['parent'] won't come through, we use $mapped_parent instead".

if ( isset( $this->mapping['term'][ $parent_id ] ) ) {
// Map the parent term, or mark it as one we need to fix
if ( $parent ) {
// TODO: add parent mapping and remapping
Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO still relevant? Or can this entire section be cleaned up and those commented code parts removed? Topological sorting should remove the need for remapping, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old comment can be cleaned up, yes.

@@ -314,9 +327,30 @@ public function import_term( $data ) {

// Wipe the parent for now
$data['parent'] = 0;
}*/
$parent_term = term_exists( $parent, $data['taxonomy'] );
Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would running a SELECT before every insert be a show-stopper for importing, say, 100M records to an already large database? Or would that be manageable? cc @smithjw1 for VIP thoughts

$data['parent'] = $parent_term['term_id'];
} else {
// It can happens that the parent term is not imported yet in manually created WXR files.
$parent_term = wp_insert_term( $parent, $data['taxonomy'] );
Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the parent's parent is also missing? In other words, imagine we have a hierarchy of 10 nested parents and we're to process a term at the very bottom of that hierarchy?

Here's a visual representation:

Root Parent
    │
    ├── Parent Level 1
    │       │
    │       ├── Parent Level 2
    │       │       │
    │       │       ├── Parent Level 3
    │       │       │       │
    │       │       │       ├── Parent Level 4
    │       │       │       │       │
    │       │       │       │       ├── Parent Level 5
    │       │       │       │       │       │
    │       │       │       │       │       ├── Parent Level 6
    │       │       │       │       │       │       │
    │       │       │       │       │       │       ├── Parent Level 7
    │       │       │       │       │       │       │       │
    │       │       │       │       │       │       │       ├── Parent Level 8
    │       │       │       │       │       │       │       │       │
    │       │       │       │       │       │       │       │       ├── Parent Level 9
    │       │       │       │       │       │       │       │       │       │
    │       │       │       │       │       │       │       │       │       └── Our term

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why would a parent be missing if we've already sorted everything topologically?

Comment on lines +193 to +201
/**
* Map an entity to the index. If $id is provided, it will be used to map the entity.
*
* @param string $entity_type The type of the entity.
* @param array $data The data to map.
* @param int|null $id The ID of the entity.
* @param int|null $additional_id The additional ID of the entity.
*/
public function map_entity( $entity_type, $data, $id = null, $additional_id = null ) {
Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method name and docstring don't explain much about this method. Let's include a thorough description of what this method does. It might take quite a few paragraphs of text and even be longer than the method itself – it's fine, and it's well worth it in the long run.

The role model for documenting mission-critical and non-trivial code is the WP_HTML_Tag_Processor class. All the mental models are explained in depth and with examples. It discusses paths not taken, failure modes, and design decisions.

Here's just a few questions that came up for me as I read this: What's "the index"? Where does it live? What is its layout? What is it used for? How do we retrieve from it? Why an $id is needed? What is the relevance of $additional_id? What if one of them is missing? Can this method fail, and how do we know?

@adamziel
Copy link
Collaborator

@zaerl can we move all the WP_Entity_Importer changes to another PR and focus this one strictly on the topological sort? I'm having some trouble reviewing it as it touches so many concepts. Let's assume we're importing into a blank site for the sake of topo sorting. We don't need to solve ID remapping to reorder the entity stream.


// Map the parent ID if the entity has one.
switch ( $entity_type ) {
// @TODO: missing comment parent ID.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would it take to support comment parent ID?

break;
case 'post':
if ( 'post' === $data['post_type'] || 'page' === $data['post_type'] ) {
if ( array_key_exists( 'post_parent', $data ) && '0' !== $data['post_parent'] ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're dealing with raw user input here so we may also need to check if it's an int etc

}

// The entity has been imported, so we can use the ID.
if ( $id ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for calling a topological sorter that is meant to sort the entity stream before we start importing, after an entity is imported?

Copy link
Collaborator

@adamziel adamziel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is it doubling as an ID mapper? While reusing the database makes sense, I'm not sure if topological sorter class is the best place for the ID mapping logic. It might take us some discussion to find the right place. Let's remove anything related to mapping IDs for now and focus just on pre-import sorting.

@adamziel
Copy link
Collaborator

adamziel commented Dec 19, 2024

I've finished the review and I'm very confused. It seems like it's an ID mapper that doesn't actually sort the entity stream topologically. Am I missing something?

@zaerl
Copy link
Collaborator Author

zaerl commented Dec 19, 2024

@zaerl can we move all the WP_Entity_Importer changes to another PR and focus this one strictly on the topological sort?

Already done:

  1. [Data Liberation] Add support for terms meta and new unit tests #2105
  2. [Data Liberation] New "wp data-liberation" WP-CLI script #2104

@zaerl
Copy link
Collaborator Author

zaerl commented Dec 19, 2024

I've finished the review and I'm very confused. It seems like it's an ID mapper that doesn't actually sort the entity stream topologically. Am I missing something?

The initial memory-array sorting and CTE were shelved because they had problems with massive data sets. I've mentioned that in the discussion of design choice 8. The sort order is not used, only the remapping of the IDs and the other changes that have been moved to the other two PRs.

I will try something else now that I've read your ideas on the discussion and tell you what I will do before implementing the new code. We can take this on draft. Thanks.

@zaerl zaerl marked this pull request as draft December 19, 2024 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Needs review
Development

Successfully merging this pull request may close these issues.

3 participants