[Data Liberation] IDs mapping, remapping, hierarchy #2090

zaerl · 2024-12-17T08:30:44Z

zaerl
Dec 17, 2024
Maintainer

In a well-formed and not manually crafted WXR, this is the structure we expect to read:

metadata*
<wp:author></wp:author>* ¹
<wp:category>*
<wp:tag>*
<item>*

Root-level entities are not guaranteed to be in this strict order, and categories or terms are hierarchically in order with each other. ²

The autoincrement ID generation on the source site (the site that exported data) is also not guaranteed to generate the same IDs as the target site. There are efforts in the WordPress core importer to maintain the same structure using the import_id field that suggests wp_insert_post to prefer that ID. ³

About data integrity, similarity and deduplication

In my PR #2030, I am investigating how to keep track of creating new entities in the target system (the site that imports data) to have a 1:1 structure between the two.

In the core importer, these are the phases:

The WXR is parsed, and all the content is saved in authors, posts, ... tags associative arrays, and processed_* arrays are created as well
Categories are looped
1. Preexisting categories are skipped (term_exists())
2. Categories parent is set if exists (category_exists())
3. All the term meta are added
Tags are looped. Same as categories, without the parent check
Terms are looped. Same as categories
Posts are looped
1. If a post with the same title, date, and type is found, the post is skipped (post_exists()). Unless the wp_import_existing_post filter returns zero
2. If the post's parent has not been processed, the post is moved to a post_orphans array
3. Post terms are imported like steps 2, 3, 4. If found
4. Comments, comment meta, and post meta are imported in the same way
A backfill of parents (the one missing at 5.2) is performed as well

This means the system will have all the entities of the WXR in memory for a brief moment. A +1M posts WXR with all the content will stress the RAM. After all the phases, the memory is freed from imported entities.

Note

Design choice 1: do not use in-memory associative arrays

Running a WXR import multiple times will yield the same results. Once created, the entities are not modified. Both WP_Entity_Importer and the core importer will skip existing entities.

Ideally, the importer should be able to avoid deduplicating data. Categories, tags, and terms are not deduplicated. In Data Liberation trunk categories' hierarchy is not imported; in my new PR yes:

Note

Design choice 2: automatically create categories parent if a category with that slug as a slug and as the name does not exist once a category with that parent is about to be created. It does not happen with core-exported XMLs, but it can

Note

Design choice 3: Category import does update existing categories. This retrofill parent created on the fly and updates categories already existing, such as uncategorized (Uncategorized), the default category that often is translated into sites with a language different from English.

About how to keep track of the entities that have been created and potential remapping

The current implementation keeps track of the entities that have been created and potential remapping using in-memory arrays as WP_Entity_Importer::mapping, similar to the core importer WP_Import::posts arrays. This is a perfect file for small imports. Arrays of integers do not take much space in Zend.

Le'ts see a raw test to see how much memory an array of 1M integers takes in an M3 MAX PHP 8.2.25 Zend v4.2.25:

<?php

// Measure memory before
$memory_before = memory_get_usage();

// Create an array of 1 million integers
$arr = array();
for ( $i = 0; $i < 1000000; ++$i ) {
    $arr[] = $i;
}

// Measure memory after
$memory_after = memory_get_usage();

// Calculate and display memory usage
$memory_used = $memory_after - $memory_before;
echo 'Memory before: ' . number_format( $memory_before) . " bytes\n";
echo 'Memory used: ' . number_format( $memory_used / 1024 / 1024, 2 ) . " MB\n";
echo 'Memory after: ' . number_format($memory_after) . " bytes\n";

// Wait forever
$line = fgets( fopen( 'php://stdin', 'r' ) );
fclose( $handle );

Results:

Memory before: 398,304 bytes
Memory used: 16.02 MB
Memory after: 17,191,984 bytes

These are raw numbers, but they give a good idea of the memory usage. The standard memory_limit is 128MB, so it is easy to chew enough RAM if you start saving in that array a number and all the contents of a post that can be of an arbitrary length.

The keep track of the entities is made for two reasons:

To avoid creating the same entity twice, if a manually crafted WXR has duplicate entities
To avoid remapping the same entity twice, if a manually crafted WXR has duplicate entities

Note

Design choice 4: In my PR, I removed the in-memory arrays and replaced them with a database table. This is a more robust solution, and can be linked to the session ID of the import. Each row maps the minimum information.

The table has these columns:

id
session_id (see the sessions we use for saving preprocessed imports)
entity_type (comment, comment_meta, post, post_meta, term, term_meta)
entity_id (the ID of the entity in the source site)
mapped_id (the ID of the entity in the target site)
parent_id (the ID of the parent entity)
additional_id (the ID of the additional entity, if needed)
byte_offset (the byte offset of the entity in the WXR file)
sort_order (the sort order of the entity in the WXR file)

During the XML parsing, the entities are inserted in the table with the original ID and the byte offset inside the WXR file. Once the entities are imported, the mapped_id column is updated with the new ID of the entity in the target site.

Note

Design choice 5: have a pre-import step to fill the database with IDs. From my test this add ~20% of computing time to all the phases that do not write in the database or download files (frontload_assets and import_entities). The difference can be noticed only with millions of rows; otherwise, it is a matter of seconds; a running time is usually a couple of orders of magnitude below the download, wp_insert_* steps.

Parsing, but not importing, a file with a million entities is a quick operation. Use this plugin I've created to generate an XML at 10k at a time if you are curious. https://gist.github.com/zaerl/44dad0cd465751702d03eb58f01386e7

So, at the start of the import phase, all the original IDs are already saved in the database. When the importer is about to import an entity, know if the parent, or whatever other entity, has already been imported or if it just exists in the XML somewhere.

Also this will add support for resuming the process. All rows have the session ID attached and can be read/modified and deleted once done. In-memory arrays are lost when the process restarts and must be filled again.

About the remapping

Important rule: remapping IDs should never happen. But it can happen if the sites are entirely different, the target is not a brand new one, one of the two sites deleted posts etc. What is a remapping? The target site has an auto-increment ID generation that does not match the source site. So, an entity with parent X in the source site will have a different ID in the target site. You should replace the parent ID with the new one; otherwise, it will use a different parent.

Where are the IDs saved? That is the problem. IDs in WordPress are saved in well-known places in the database. But they can be in:

Custom options with serialize()d data
Custom fields
Post meta
Term meta
Comment meta
Etc.

How do I find the IDs? That is the problem. The importer needs to find out where the IDs are saved. We know where the standard one is, but we can only guess for the others.

Example: the foo plugin saves an option with this content: array( 'post_id' => 10 ) (a:1:{s:7:"post_id";i:10;). If the ID 10 is different and need to be remapped this data will become obsolete and the plugin broken.

Note

Design choice 6: ignore such structures. Well-written plugins should never use direct reference by ID, but always by slug to prevent this. We can fix well-known plugins, but not worth the effort.

About not remapping

Not remapping is a good thing. It means the importer does not need to guess where the IDs are saved. It can directly use the IDs. This is doable with sites that are made to be imported, such as two sites that do not perform deletion, reset the autoincrement IDs, and where the two sets are overlapping or disjointed. In ( $A \cap B$ ), our post_exists() will prevent multiple imports from continue adding data; just skip the overlapping items and add the ones of the source data.

Note

Design choice 7: do not offer the possibility of not remapping now. But do make it the only way of importing in the near future. A-là git
Git does ask you what you want to do with the files changed in the upstream branch that are changed in your local one, and we must do this as well. Imagine a brand-new site. If you don't do anything with that site you will have an hello-world post. That post will likely be modified in the source site, adding a new title, new slug etc. We should ask the user: what you want to do with this (post with the same ID)? Is it ok to rewrite it?

Hierarchy of entities

A category can be created before its parent, as well as a post. In the worst cases, all posts with ID x can refer to a post with ID N - x, where N is the number of posts in the source XML. During the investigation, I tried various things. I needed to change strategies once I approached the numbers that are not "supported" by preexisting importers:

A recursive sorter for in-memory tree arrays. But not using memory structures now
uksort with in-memory arrays. But not using memory structures now. Also with +1M potentially sorted arrays it is very slow. That's the way Quick Sort works
A CTE: perfectly fine with relatively small tables. It needs to be faster with millions or rows and not working very well with PHP outdated SQLite.

Note

Design choice 8: do not perform a DB-level sort. The sort_order field will be kept there but not used. We will import all the entities and add the parent_id only if already mapped to avoid adding more complexity to the PR.

Summary:

On-the-fly categories, meta, tags
Categories and tags modification
Remove all in-memory arrays
New table for keeping track of mapped IDs
Keep the user in-memory array untouched but change it later when we will show the user a source site user -> target site user UI
New sort phase
Do not try not to see if there is a way to skip remapping with the current set of data or inform the user

In an ideal world, an import should add new stuff and update a pre-existing entity by asking what to do if it changes both sides with the possibility of saving the target version (as in git stash). And clean up all the categories/tags/meta/etc. if the target site has them but not the source one, it is the user's will.

Addendum: what Unison does

Unison is pretty smart and has two rules:

Handle file changes on both sides of replication
Conflicts (the same file changed on both sides) are displayed and can be resolved manually, optionally creating backups of changed files.

adamziel · 2024-12-18T15:04:10Z

adamziel
Dec 18, 2024
Maintainer

Thank you so much for starting this discussion @zaerl! I'll comment with specific thoughts as I read through it

1 reply

brandonpayton Dec 19, 2024
Maintainer

Thank you, @zaerl!

adamziel · 2024-12-18T15:06:41Z

adamziel
Dec 18, 2024
Maintainer

In the core importer, these are the phases:

I'd say we can freely diverge from any choices the core importer makes. Let's aim for sensible outcomes, even if they differ from the original behavior. I'm worried that trying to satisfy all the constraints imposed by the legacy system would complicate the design.

1 reply

zaerl Dec 18, 2024
Maintainer Author

Yes, I just mentioned the core importer because it was modified over the years by tweaking it in the field with multiple strange inputs.

adamziel · 2024-12-18T15:06:44Z

adamziel
Dec 18, 2024
Maintainer

There are efforts in the WordPress core importer to maintain the same structure using the import_id field that suggests wp_insert_post to prefer that ID. 3

Whenever the imported post has the same ID as an existing post, we can either decide:

It's a new post. Then we'd need to give it a different ID. This is useful when importing new content into an existing site.
It's a newer version of an existing post. Then we'd need to overwrite the existing post. This is useful for synchronizing updates from one site to another as some big publishers do.

The new importing system needs to support both use-cases.

It would be easy to get stuck in the weeds here so I propose starting with a simple definition of "overwriting", e.g. upserting all the post meta defined in WXR. We could deal with complex deltas later on once as we build the site transfer protocol.

5 replies

zaerl Dec 18, 2024
Maintainer Author

e.g. upserting all the post meta defined in WXR

We skip the insert/update if post_exists now: https://github.com/WordPress/wordpress-playground/blob/trunk/packages/playground/data-liberation/src/import/WP_Entity_Importer.php#L499. We should update it when it exists? The WXR version will always be the final one, which is what should happen. The core importer does not update the post (like data liberation does), and there have always been head scratches.

brandonpayton Dec 19, 2024
Maintainer

There are efforts in the WordPress core importer to maintain the same structure using the import_id field that suggests wp_insert_post to prefer that ID.

Whenever the imported post has the same ID as an existing post, we can either decide:

It's a new post. ...

It's a newer version of an existing post. ...

The new importing system needs to support both use-cases.

When I read this, my gut reaction was that we don't want some middle ground where IDs may or may not be preserved. We need a more certain way of matching entities than auto-incrementing integers which will always tend to conflict when there are two versions of a site that have diverged. Auto-incrementing integers will always haunt us in conflict resolution.

Would it be good to move toward a world where every entity has a real UUID that could be used for matching?

adamziel Dec 19, 2024
Maintainer

The WXR version will always be the final one, which is what should happen.

Should is an opinion. I want opinions to be inputs, not hardcoded properties. Different users will have different shoulds. Here's an example:

This content export has a post with ID=2:

	<item>
		<title>About The Tests</title>
		<wp:post_id>2</wp:post_id>
	</item>

I may want to import it:

On a site where I imported that file already
On a blank site
On a site that already has a different post with ID=2. The latest post has ID=1922

In each scenario, I'd expect a different outcome:

Overwrite the old post
Create a new post with ID=2
Create a new post with ID=1923

But that's me. Another person might have different expectations:

Keep my existing post with ID=2 and only import the delta
Create a new post with ID=2
Overwrite my existing post with ID=2

The importer just doesn't know. These decisions need to come from the user in one way or another. We may also build opinionated tools on top of this importer that always overwrites the data. Either way, the low-level importing plumbing needs to be flexible and capable of supporting different outcomes.

zaerl Dec 19, 2024
Maintainer Author

It's ok to ask the user what to do.

<item>
  <title>About The Tests</title>
  <link>http://wpthemetestdata.wordpress.com/about/</link>
  <pubDate>Mon, 26 Jul 2010 02:40:01 +0000</pubDate>
  ...
</item>

"There is already a post called X (and another 100k); what do you want to do with this?"

Replace it
Keep it as it is
Unfortunately, the AUTONCREMENT ID part is just an annoying part for us 🙂. I think the average user does not care if the ID is different, just that the <link> field is identical.

adamziel Dec 19, 2024
Maintainer

I think the average user does not care if the ID is different, just that the field is identical.

+1 to that

adamziel · 2024-12-18T15:09:26Z

adamziel
Dec 18, 2024
Maintainer

Deign choice 1: do not use in-memory associative arrays

+1 👍. While most imports will be relatively small, this assumption will enable importing 1TB export files and even continuous import of infinite streams of data. Live site-to-site sync would be one such stream.

0 replies

adamziel · 2024-12-18T15:10:56Z

adamziel
Dec 18, 2024
Maintainer

Ideally, the importer should be able to avoid deduplicating data.

What's the use-case for de-duplication? And you mean pre-processing the data, or duplicating at the DB level with upserts and such?

My gut says what you say – let's avoid deduplication entirely. Deduplication is complex. I'd stick to garbage in, garbage out. If someone needs to duplicate the imported records, they'll need to clean their data before bringing it in.

4 replies

zaerl Dec 18, 2024
Maintainer Author

What's the use-case for de-duplication? And you mean pre-processing the data, or duplicating at the DB level with upserts and such?

Right now, the importer does not duplicate anything. It just adds something else. Running it infinite times after the first will leave the system in the same state. We should let the user decide between two sets intersection like:

All the data of the target, update the same entities, all the data of the source
Update the same entities, only the data of the source, and remove all other

brandonpayton Dec 19, 2024
Maintainer

How much might we be able to change core to make entity transfers easier to reason about?

A couple of things come to mind:

Adding real entity UUIDs
Tracking versions of the state, maybe just logging UUIDs of deleted entities

With Git, we have this kind of information in version history. But if all we can do with WP is compare sets of entities, we can reason about the entities in the intersection but not about the rest. Those may be completely new entities or entities that were deleted in one site but not the other. With very large data sets like this, a user may be overwhelmed by choice if asked to resolve individual conflicts, and if they are asked to make a single decision for all conflicts, it seems likely desired entities will be lost (depending on whether the entity transfer relationship is multi-directional or just single-direction).

adamziel Dec 19, 2024
Maintainer

How much might we be able to change core to make entity transfers easier to reason about?

I'm all for changing WordPress core. Let's incubate those ideas in the data liberation plugin and once they're mature enough, I'll support and help merge the required core changes. I'm sure @dmsnell (core committer) and @sirreal (on track to be one) will help, too.

adamziel Dec 19, 2024
Maintainer

I'll also refer to @dmsnell's vector clock idea outlined below this comment. Here's the gist of it:

adamziel · 2024-12-18T15:21:24Z

adamziel
Dec 18, 2024
Maintainer

Design choice 2: automatically create categories parent if a category with that slug as a slug and as the name does not exist once a category with that parent is about to be created

What would be an example of that? Do you mean a WXR file such as this one?

<!-- There is no wp:category with `scripting-languages` slug -->
<wp:category>
	<wp:term_id>8</wp:term_id>
	<wp:category_nicename>javascript</wp:category_nicename>
	<wp:category_parent>scripting-languages</wp:category_parent>
	<wp:cat_name><![CDATA[JavaScript]]></wp:cat_name>
</wp:category>

If yes, this scenario is similar to not being able to download an attachment. I wouldn't force any opinionated actions here – that's what the existing tools do. Instead, I'd expose this information to the API consumer. "Hey, the data is incomplete, what do you want to do?".

We could have a few handlers, such as create_missing_parent_category(), remap_as_toplevel(), or set_parent_to_existing_category( $category_id ), and every API consumer would figure out what to do. In wp-admin, we may stop the import and ask the user to make a decision. In Blueprints, we may create the missing category by default. A publisher synchronizing data may have a go-to set of existing categories to remap to.

Note we already do that for all the asset frontloading errors. The system won't just skip the download or pull in a placeholder image – it will leave that decision up to the runtime. In wp-admin, the user will get a few buttons. In Blueprints, we'd skip the downloads.

8 replies

adamziel Dec 19, 2024
Maintainer

Assuming the following input data:

category 3 (child of 1)
category 2 (child of 1)
category 1 (root)

I think @brandonpayton meant scanning till the end, noting we can't insert categories 2 or 3, then inserting category 1 and doing another pass. That seems like an isomorphic operation to doing the topological sort, just with different resource constraints. We'd trade storage (a stored index) for computations (filtering the categories each time).

zaerl Dec 19, 2024
Maintainer Author

etc.
category 4 (child of 3)
category 3 (child of 2)
category 2 (child of 1)
category 1 (root)

This is what I mean by edge-case. It's not ok when you have a big list of categories.

brandonpayton Dec 19, 2024
Maintainer

Thanks for elaborating and teaching me. :)

The target DB will be inconsistent for too much time. Actively creating a parent and updating it later will build the hierarchies from the beginning.

Maybe the current insert-on-demand approach is workable and good, but if we find downsides and consider moving to multiple passes, I wonder whether it would make sense to do the work in a separate, dedicated table and then apply it all at once to the actual WP table.

adamziel Dec 19, 2024
Maintainer

I wonder whether it would make sense to do the work in a separate, dedicated table and then apply it all at once to the actual WP table.

for small datasets, sure. For 1TB databases, that might be a problem, even if only because we’re doubling the MySQL storage

brandonpayton Dec 20, 2024
Maintainer

I wonder whether it would make sense to do the work in a separate, dedicated table and then apply it all at once to the actual WP table.

for small datasets, sure. For 1TB databases, that might be a problem, even if only because we’re doubling the MySQL storage

It is still probably not a great idea, but I meant to consider buffering work on incomplete entities (like hierarchical terms in this example), not necessarily keeping "working tables" for the entire database. 😅

adamziel · 2024-12-18T15:26:01Z

adamziel
Dec 18, 2024
Maintainer

as uncategorized (Uncategorized), the default category that often is translated into sites with a language different from English.

Eventually we may need a decision point in the API such as "should the default category be renamed?" We should be fine, though, to just treat all the uncategorized posts as uncategorized regardless of the default category name. It should be easy enough to backtrack once this comes up.

0 replies

adamziel · 2024-12-18T15:29:30Z

adamziel
Dec 18, 2024
Maintainer

In my PR, I removed the in-memory arrays and replaced them with a database table.

I really like this choice. It turns a memory constraint into a disk space and CPU constraint, making reentrancy possible.

Perhaps we could reuse that table as the vector clock eventually. By adding a version and/or updated_at columns we could start tracking changes immediately after each entity is imported. This would tell us what's the delta between the import and the live site at the end of the operation. If we did the same thing for the exports, we'd get a clear list of database rows that need to be re-exported after the first pass. cc @brandonpayton @dmsnell @sirreal

0 replies

adamziel · 2024-12-18T15:38:26Z

adamziel
Dec 18, 2024
Maintainer

During the XML parsing, the entities are inserted in the table with the original ID and the byte offset inside the WXR file. Once the entities are imported, the mapped_id column is updated with the new ID of the entity in the target site.

Note we need to store a string-based reentrancy_cursor, not an int-based byte_offset for two reasons:

When parsing XML, the cursor contains the current open element stack (e.g. channel > item > wp:comments)
XML is just of the possible data sources. Other sources might not even be byte-based, e.g. imagine importing markdown notes from a local directory.

5 replies

brandonpayton Dec 19, 2024
Maintainer

XML is just of the possible data sources. Other sources might not even be byte-based, e.g. imagine importing markdown notes from a local directory.

For arbitrary re-entrancy to work, it seems each export will need to guarantee deterministic entity ordering.

adamziel Dec 19, 2024
Maintainer

And also the importer will need fault tolerance when some resources become unavailable mid-import.

adamziel Dec 19, 2024
Maintainer

On some level, this is like the REPEATABLE READ database isolation level:

The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed by concurrent transactions during the transaction's execution.

I wonder if the exporter, once it figures out the delta, could chunk it and save it as a series of local .sqlite (or XML or whatever) files to freeze them in time. Or at least the first meaningful chunk. Then, once those local deterministic chunks are synchronized, a new delta would be computed.

brandonpayton Dec 19, 2024
Maintainer

I wonder if the exporter, once it figures out the delta, could chunk it and save it as a series of local .sqlite (or XML or whatever) files to freeze them in time. Or at least the first meaningful chunk. Then, once those local deterministic chunks are synchronized, a new delta would be computed.

Interesting idea!

If the exporting WP instance is using MySQL and if there are no natural advantages to preferring SQLite, maybe such snapshot chunks could just be stored in MySQL. That way, the a snapshot table could be populated via SQL query without the data having to pass through PHP at all.

adamziel Dec 19, 2024
Maintainer

Oh yes, lovely idea! A “buffer zone” to enable consistency

adamziel · 2024-12-18T15:43:28Z

adamziel
Dec 18, 2024
Maintainer

Design choice 5: have a pre-import step to fill the database with IDs. From my test this add ~20% of computing time to all the phases that do not write in the database or download files (frontload_assets and import_entities). The difference can be noticed only with millions of rows; otherwise, it is a matter of seconds; a running time is usually a couple of orders of magnitude below the download, wp_insert_* steps.

Can you elaborate on this? I'm a bit confused. Inserting empty records seems to be complicating things instead of making them simpler:

Why pre-fill the database instead of inserting records as we go through the data stream?
How would we prevent WordPress from, say, listing a million placeholder pages in WP admin?
What if these placeholders are changed or deleted between the pre-import insertion and a second pass where they're populated with data?
Say I'm importing a post with ID=3 into a site where all IDs from 1 to 100 are already taken. What would be inserted into the database?
How would we cross-reference these placeholder database records with the entity defined in the WXR file?

Also this will add support for resuming the process.

WP_Stream_Importer already supports resuming the process without inserting placeholders. What am I missing?

1 reply

zaerl Dec 18, 2024
Maintainer Author

The PR does not prefill the wp_posts/terms/etc tables with nothing. I means that it prefills the wp_data_liberation_map table. The one used internally with the IDs taken from the XML during the STAGE_TOPOLOGICAL_SORT. The database is touched only by WP_Entity_Importer.

The prefill $wpdb->insert a new row. But if a real ID (the one mapped in the target site) is passed $wpdb->update is called.

But it's a pass that can be removed without problems.

adamziel · 2024-12-18T15:47:34Z

adamziel
Dec 18, 2024
Maintainer

Important rule: remapping IDs should never happen

That's only true when importing a WXR into a site that was wiped clean and has no content at all. There are a lot of WXRs out there with low IDs, e.g., post id=2, and they commonly conflict even with the default WordPress content.

2 replies

zaerl Dec 18, 2024
Maintainer Author

This is also true if we ask the user if it's ok to wipe the site before, by removing test posts and categories, resetting the AUTOINCREMENT, and others. Having to remap thousands of entities because of a single post or a missing ID can be avoided by just asking. Having a quick check of the preexisting systems is a good idea for me.

adamziel Dec 19, 2024
Maintainer

Some users will want to wipe the site, and some won't. I'd be wary of offering such destructive choices lightly – at the WordPress scale, we'd likely see hundreds or thousands of people who accidentally trash their entire site. In either case, the importer must account for both users wiping their site and users not wiping their site.

adamziel · 2024-12-18T15:49:17Z

adamziel
Dec 18, 2024
Maintainer

General question: What parts of the process would be simplified if we had a globally unique ID/content hash for each entity?

2 replies

zaerl Dec 18, 2024
Maintainer Author

I need to think about this. But right now, the ID+Parent ID pair is okay to track everything.

brandonpayton Dec 19, 2024
Maintainer

Ah, I was thinking the same thing when commenting earlier in the discussion.

Integer IDs are inherently problematic for syncing because they are based on entity insertion order rather than on the entities themselves. As a maybe terrible analogy, my name will remain "Brandon" even if we later discover a long-lost sibling who was born before me.

adamziel · 2024-12-18T15:53:01Z

adamziel
Dec 18, 2024
Maintainer

Design choice 6: ignore such structures. Well-written plugins should never use direct reference by ID, but always by slug to prevent this. We can fix well-known plugins, but not worth the effort.

I would love to ignore them! But a harsh reality is that we cannot take an easy way out. There's plenty of plugins storing IDs in JSON encoded content, serialized PHP arrays etc. in site options.

We can't rely on naive str_replace – we'd just break the data.

What we can do, though, is:

Support the "overwrite existing content" importing mode that just shoehorn the incoming data into the existing database records with those same IDs. In this case, we don't have to worry about re-mapping at all. That limits the problem to the "append new content" importing mode.
Unpack, rewrite, and re-pack IDs and URLs in all the well-known serialized data structures used by WordPress core
Provide hooks and filters for the plugin authors to migrate their own microformats. For example, WP Bakery encodes some block attributes in base64, and there are asset URLs encoded inside. Without an extension point to decode/re-encode that data, the importer won't be able to fetch those files.
Use a characteristic ID space for the exported data, e.g. large ints starting at 123456000000, and look for those sequences in all the microformats that don't have an explicit handler provided by the plugin. It won't help with any legacy WXR files, but that's fine.

VersionPress has some prior art on mapping database fields and microformats, and there are also some good specific examples of microformatted data in the URL rewriting discussion.

Design choice 7: do not offer the possibility of not remapping now.

Yeah, remapping is an entire rabbit hole. Let's tackle the topo sort, pausing, resuming etc. first and once that works well then let's team-tag remapping. However, let's keep discussing and aligning here to make sure we account for the eventual remapping facilities in the overall design.

3 replies

zaerl Dec 18, 2024
Maintainer Author

Unpack, rewrite, and re-pack IDs and URLs in all the well-known serialized data structures used by WordPress core

AFAIK, WordPress core does not save serialized data with IDs by default, which we need to change later. In general, it doesn't serialize() many things. But I need to investigate.

adamziel Dec 19, 2024
Maintainer

That would certainly simplify things! I wonder if that's also true for block attributes.

brandonpayton Dec 19, 2024
Maintainer

As an aside, serialize() output is not 100% guaranteed to be compatible between PHP versions. Here's a weird case we encountered every so when running search-replace for domain name changes on WP Cloud:

"PHP 8.1 fatals when attempting to unserialize empty mysqli_result from PHP 8.0"
php/php-src#10893

adamziel · 2024-12-18T16:05:28Z

adamziel
Dec 18, 2024
Maintainer

Design choice 8: do not perform a DB-level sort. The sort_order field will be kept there but not used. We will import all the entities and add the parent_id only if already mapped to avoid adding more complexity to the PR.

Would this be solved by choosing a sparse enough sort_order space? E.g. 64 bit or 128 bit numbers? Instead of sorting via ORDER BY ... OFFSET ... LIMIT ... we could use the same cursor technique large sites use for pagination: SELECT * FROM posts WHERE ID > :last_on_page LIMIT 20. And if that's still not enough, we could research other ways huge sites such as reddit or Tumblr approach this problem.

In general, this is similar to a big data pagination problem – I wonder if we can use similar techniques to deal with it. If not, then perhaps the placeholders approach os for the best, but I'd like to avoid it if we can.

1 reply

zaerl Dec 18, 2024
Maintainer Author

I tried different CTEs; it's an "open" problem. But it was unstable when you used more than one million entries with SQLite in primis.

That's why I now use the table to save the mapped IDs and use them when new entries are inserted. It should ideally be a "jump here and import this and after that jump here instead of reading one element after another in the XML" kind of thing.

If not, then perhaps the placeholders approach os for the best, but I'd like to avoid it if we can.

I have specified above, keep in mind that I do not create any placeholders in the database. For example the parent of a post is updated to the new real one when the child is imported.

brandonpayton · 2024-12-19T04:56:07Z

brandonpayton
Dec 19, 2024
Maintainer

In trunk categories' hierarchy is not imported; in my new PR yes:

For my own understanding, is there a reason why core devs would intentionally not import category hierarchy?

2 replies

zaerl Dec 19, 2024
Maintainer Author

For my own understanding, is there a reason why core devs would intentionally not import category hierarchy?

Sorry, I meant "In (Data Liberation) trunk categories etc." Fixed

brandonpayton Dec 19, 2024
Maintainer

Sorry, I meant "In (Data Liberation) trunk categories etc." Fixed

ah, great 😄 thanks!

brandonpayton · 2024-12-19T05:08:39Z

brandonpayton
Dec 19, 2024
Maintainer

The difference can be noticed only with millions of rows; otherwise, it is a matter of seconds; a running time is usually a couple of orders of magnitude below the download, wp_insert_* steps.

Naive question:
Why does an import have to insert with WP functions? Can it ever just use direct DB access? Is the reason we need the WP insert abstraction because the state of the imported site needs to be merged with existing site state?

3 replies

adamziel Dec 19, 2024
Maintainer

Very good question! I can only think of one reason to use wp_functions: hooks and filters. That's another rabbit hole. Direct DB access is tempting, I'd just worry about, say, cache plugins not kicking in for the new content. Maybe that's a good thing, though?

zaerl Dec 19, 2024
Maintainer Author

The direct $wpdb->insert is not the best idea. The wp_insert_X functions do perform some critical things under the hood.

The WP_IMPORTING "deactivates" everything that should not happen during imports, such as sending emails, notifications, etc (core blocks _publish_post_hook). We should add it, and WP-core should obey this in more places than now. Example: disabling the X calls to clean_post_cache and performing only one at the end. This can speed up things.

adamziel Dec 19, 2024
Maintainer

The wp_insert_X functions do perform some critical things under the hood.

What are those things?

brandonpayton · 2024-12-19T05:09:31Z

brandonpayton
Dec 19, 2024
Maintainer

have a pre-import step to fill the database with IDs.
…
So, at the start of the import phase, all the original IDs are already saved in the database. When the importer is about to import an entity, know if the parent, or whatever other entity, has already been imported or if it just exists in the XML somewhere.

Would there be any benefit to a similar lookup table being included with the export itself so that preprocessing can just be paid once at export time?

1 reply

adamziel Dec 19, 2024
Maintainer

Interesting idea @brandonpayton! It sounds tempting, although I can see a few problems with it:

You'd lose the ability to tweak the exported files in a code editor easily
We'd still need to support those legacy export files that come without the lookup table

However, for a new and binary format that people couldn't edit outside of WordPress / Playground, that would work. I'm not sure how useful would such a format be, though.

brandonpayton · 2024-12-19T05:10:07Z

brandonpayton
Dec 19, 2024
Maintainer

Important rule: remapping IDs should never happen, and it only occurs for well-formed WXR.

Why does remapping only occur for well-formed WXR?

1 reply

zaerl Dec 19, 2024
Maintainer Author

Why does remapping only occur for well-formed WXR?

Grammarly is too pedantic 🙂. It can happen for whatever file. Removed it. Thanks.

brandonpayton · 2024-12-19T05:11:49Z

brandonpayton
Dec 19, 2024
Maintainer

Where are the IDs saved? That is the problem. IDs in WordPress are saved in well-known places in the database. But they can be in:
1. Custom options with `serialize()`d data
2. Custom fields
3. Post meta
4. Term meta
5. Comment meta
6. Etc.
How do I find the IDs? That is the problem. The importer needs to find out where the IDs are saved. We know where the standard one is, but we can only guess for the others.

Is this a place where WP hooks could be offered so plugins can customize how these data structures are handled during export and import?

2 replies

zaerl Dec 19, 2024
Maintainer Author

The importer already has wxr_importer_pre_process_* and wxr_importer_processed_* filters/hooks that are perfect for this. 👍

brandonpayton Dec 19, 2024
Maintainer

The importer already has wxr_importer_pre_process_* and wxr_importer_processed_* filters/hooks that are perfect for this. 👍

Nice! Thanks

brandonpayton · 2024-12-19T05:13:19Z

brandonpayton
Dec 19, 2024
Maintainer

Design choice 6: ignore such structures. Well-written plugins should never use direct reference by ID, but always by slug to prevent this. We can fix well-known plugins, but not worth the effort.

We can place the burden on plugins, but there will always be sites with not-as-well-written plugins. IMO, some of WP’s beauty is how unconstrained it is, but this poses problems when we want to constrain site state enough to understand and transfer it effectively. Maybe we need a way for plugins to export secondary entities for their custom data structures. For example, a site builder plugin might store different DB references and its data structure in post meta, and it might be helpful if we give the plugin an opportunity to say what an export and import of that structure should look like.

0 replies

brandonpayton · 2024-12-19T05:14:18Z

brandonpayton
Dec 19, 2024
Maintainer

Design choice 8: do not perform a DB-level sort. The sort_order field will be kept there but not used.

What is the sort_order field kept for? A future iteration?

1 reply

zaerl Dec 19, 2024
Maintainer Author

What is the sort_order field kept for? A future iteration?

Yes.

brandonpayton · 2024-12-19T05:15:29Z

brandonpayton
Dec 19, 2024
Maintainer

Summary:
...
6. New sort phase

What is the new sort phase? After reading this post once, I would have guessed there is no sort phase.

1 reply

zaerl Dec 19, 2024
Maintainer Author

The STAGE_TOPOLOGICAL_SORT phase, which right now indexes all the entities, despite the name. 🙂

zaerl · 2024-12-31T09:28:18Z

zaerl
Dec 31, 2024
Maintainer Author

I thought of a new strategy that works with big files (+1M) and does not have too much impact on preexisting architecture. I made a raw example in my local branch and would like to share it before adding the details.

Key concepts:

Have a self::STAGE_TOPOLOGICAL_SORT where I calculate the sort order at inserting and not later. It stores the various cursors of entities that must be loaded first like parent posts
The sorter is no longer a separate class but a special kind of WP_WXR_Reader (WP_WXR_Sorted_Reader), which changes only the read_next_entity function of the parent class. That class is loaded using the factory found in create_for_*
The special read_next_entity does set the cursor to the part of the file that needs to be read at the very beginning; once read, those parts that class behave like the standard one following the stream.

In this way the :

WP_Entity_Importer is not changed. It just gets the entities like before through import_entity()
The impact on WP_Stream_Importer is minimal as well

The algorithm:

sort_it( entity, cursor ) {
    // Get the existing entity, if any.
    existing_entity = get_entity( new_entity.id );

    new_entity = [
        entity_id: null,
        parent_id: null,
        cursor: cursor, // <--- This is the cursor that will be used jump in the file.
        sort_order: 1,
    ];

    // If the entity has a parent, we need to check it.
    if ( it_is_an_entity_that_can_have_a_parent_like_a_page_or_a_category( entity ) ) {
        // Check if the parent exists.
        existing_parent = get_entity( entity.parent_id );

        if ( ! existing_parent ) {
            new_parent = [
                entity_id: parent_id,
                cursor: null,
                sort_order: 2, // The parent has at least a sort order of 2.
            ]

            add_to_table( new_parent )
        }
    }

    if ( empty( existing_entity ) ) {
        // Insert the entity if it doesn't exist.
        new_entity.entity_id = entity.id; // 'id' can be comment_id, post_id, term_id, etc.
        add_to_table( new_entity )

        return;
    }

    // The entity exists, so we need to update the sort order. Check if it has a child.
    // Hierarchy is 1 to N in WP, so we need to get the first child.
    first_child = get_the_first_element_with_parent_id( new_entity.parent_id )

    // We found a child, so we need to update the sort order with a new sort order.
    if ( first_child ) {
        // Increase the sort order by 1.
        existing_entity.sort_order = first_child.sort_order + 1;

        // Update the entity if it already exists.
        update_in_table( new_entity );
    }
)

The entities are now sorted and can be accessed using SELECT cursor FROM table ORDER BY sort_order DESC, id ASC. All the entities with sort_order equal to 1 can be removed once the ones with higher order have been imported.

Additional detail: the double ORDER BY is needed because the SQL standard does not specify what has to happen when the keys on an ORDER BY are the same. I tested SQLite and MariaDB, and sometimes they return different results. Having it this way does guarantee that exact order. id is an autoincrement value and will always be different.

0 replies

[Data Liberation] IDs mapping, remapping, hierarchy #2090

zaerl Dec 17, 2024 Maintainer

About data integrity, similarity and deduplication

About how to keep track of the entities that have been created and potential remapping

About the remapping

About not remapping

Hierarchy of entities

Addendum: what Unison does

Footnotes

Replies: 23 comments · 44 replies

adamziel Dec 18, 2024 Maintainer

brandonpayton Dec 19, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

zaerl Dec 18, 2024 Maintainer Author

adamziel Dec 18, 2024 Maintainer

zaerl Dec 18, 2024 Maintainer Author

brandonpayton Dec 19, 2024 Maintainer

adamziel Dec 19, 2024 Maintainer

zaerl Dec 19, 2024 Maintainer Author

adamziel Dec 19, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

zaerl Dec 18, 2024 Maintainer Author

brandonpayton Dec 19, 2024 Maintainer

adamziel Dec 19, 2024 Maintainer

adamziel Dec 19, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

adamziel Dec 19, 2024 Maintainer

zaerl Dec 19, 2024 Maintainer Author

brandonpayton Dec 19, 2024 Maintainer

adamziel Dec 19, 2024 Maintainer

brandonpayton Dec 20, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

brandonpayton Dec 19, 2024 Maintainer

adamziel Dec 19, 2024 Maintainer

adamziel Dec 19, 2024 Maintainer

brandonpayton Dec 19, 2024 Maintainer

adamziel Dec 19, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

zaerl Dec 18, 2024 Maintainer Author

adamziel Dec 18, 2024 Maintainer

zaerl Dec 18, 2024 Maintainer Author

adamziel Dec 19, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

zaerl Dec 18, 2024 Maintainer Author

brandonpayton Dec 19, 2024 Maintainer

adamziel Dec 18, 2024 Maintainer

zaerl Dec 18, 2024 Maintainer Author

adamziel Dec 19, 2024 Maintainer

brandonpayton Dec 19, 2024 Maintainer

zaerl
Dec 17, 2024
Maintainer

Replies: 23 comments 44 replies

adamziel
Dec 18, 2024
Maintainer

brandonpayton Dec 19, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

zaerl Dec 18, 2024
Maintainer Author

adamziel
Dec 18, 2024
Maintainer

zaerl Dec 18, 2024
Maintainer Author

brandonpayton Dec 19, 2024
Maintainer

adamziel Dec 19, 2024
Maintainer

zaerl Dec 19, 2024
Maintainer Author

adamziel Dec 19, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

zaerl Dec 18, 2024
Maintainer Author

brandonpayton Dec 19, 2024
Maintainer

adamziel Dec 19, 2024
Maintainer

adamziel Dec 19, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

adamziel Dec 19, 2024
Maintainer

zaerl Dec 19, 2024
Maintainer Author

brandonpayton Dec 19, 2024
Maintainer

adamziel Dec 19, 2024
Maintainer

brandonpayton Dec 20, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

brandonpayton Dec 19, 2024
Maintainer

adamziel Dec 19, 2024
Maintainer

adamziel Dec 19, 2024
Maintainer

brandonpayton Dec 19, 2024
Maintainer

adamziel Dec 19, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

zaerl Dec 18, 2024
Maintainer Author

adamziel
Dec 18, 2024
Maintainer

zaerl Dec 18, 2024
Maintainer Author

adamziel Dec 19, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

zaerl Dec 18, 2024
Maintainer Author

brandonpayton Dec 19, 2024
Maintainer

adamziel
Dec 18, 2024
Maintainer

zaerl Dec 18, 2024
Maintainer Author

adamziel Dec 19, 2024
Maintainer

brandonpayton Dec 19, 2024
Maintainer