Explore a WP entity export iterator #2107

Draft: wants to merge 10 commits into base: trunk


brandonpayton
Member

Motivation for the change, related issues

For data liberation, we want an API for streaming WP entities from a site. Then we can export WP entities to multiple targets without having to solve the same entity traversal problems in each exporter.

Related to #2106

Implementation details

TBD

Testing Instructions (or ideally a Blueprint)

TBD

@brandonpayton brandonpayton added [Type] Exploration An exploration that may or may not result in mergable code [Aspect] Data Liberation labels Dec 20, 2024
@brandonpayton brandonpayton self-assigned this Dec 20, 2024
@brandonpayton
Member Author

I've been reviewing WP tables and data structures to see what seems to make sense. This is a very rough outline that I plan to fill in and explore more tomorrow.

So far, the rough approach is to iterate over various entity iterators, most of which will be iterating over table rows with a condition like WHERE ID > previous_entity_ID ORDER BY ID. For the sake of performance we may select in chunks but still relay one entity at a time through the interface.
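A minimal sketch of that chunked iteration, written in Python against SQLite purely for illustration (the exploration itself targets WordPress and PHP). The function name `iter_table_rows`, its parameters, and the default chunk size are assumptions, not anything from the PR:

```python
import sqlite3
from typing import Iterator


def iter_table_rows(
    conn: sqlite3.Connection,
    table: str,
    id_column: str,
    chunk_size: int = 100,
    after_id: int = 0,
) -> Iterator[sqlite3.Row]:
    """Yield rows one at a time while selecting them in chunks.

    Assumes conn.row_factory is sqlite3.Row so columns can be read by name.
    `table` and `id_column` are interpolated into the SQL, so they must be
    trusted constants, never user input.
    """
    last_id = after_id
    while True:
        rows = conn.execute(
            f"SELECT * FROM {table} "
            f"WHERE {id_column} > ? ORDER BY {id_column} LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        for row in rows:
            yield row
        # Remember where this chunk ended so the next query resumes after it.
        last_id = rows[-1][id_column]
```

Because the cursor is just the last seen ID, an interrupted export could in principle resume by passing that ID back in as `after_id`.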

One possibly controversial direction so far is that I am thinking it may make sense to directly convey terms, taxonomy, and term relationships separately rather than modeling categories and tags as first-class entities. Intuitively, it seems less fraught to just convey what is there and leave meaning-making to API consumers. Do you have any thoughts on this, @adamziel and @zaerl?

@brandonpayton
Member Author

Note: Some of the inline TODOs share some of my thinking.

@zaerl
Collaborator

zaerl commented Dec 20, 2024

> One possibly controversial direction so far is that I am thinking it may make sense to directly convey terms, taxonomy, and term relationships separately rather than modeling categories and tags as first-class entities

This is okay, and it's not controversial at all. But remember that the tables may contain artifacts and that some special rules apply (for example, a post is sticky if is_sticky( $post->ID ) returns true). The standard WXR exporter, export_wp, uses get_{categories|etc} functions that clean away data using the *_exists family of functions. The wp_term_relationships table exists only because terms have many-to-many relationships with posts, and it can contain "relationships" that no longer exist or that refer to entities that no longer exist.
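To illustrate the stale-relationship point, here is a rough sketch in Python and SQLite (not WordPress code) that lists rows in wp_term_relationships whose post or term taxonomy row is missing. The helper name `find_orphaned_relationships` is hypothetical, and the sketch assumes every object_id refers to a post, which is a simplification:

```python
import sqlite3


def find_orphaned_relationships(conn: sqlite3.Connection) -> list:
    """Return (object_id, term_taxonomy_id) pairs that point at missing rows.

    Assumption: object_id always refers to wp_posts.ID. A relationship is
    reported when either side of the many-to-many link no longer exists.
    """
    return conn.execute(
        """
        SELECT tr.object_id, tr.term_taxonomy_id
        FROM wp_term_relationships AS tr
        LEFT JOIN wp_posts AS p
               ON p.ID = tr.object_id
        LEFT JOIN wp_term_taxonomy AS tt
               ON tt.term_taxonomy_id = tr.term_taxonomy_id
        WHERE p.ID IS NULL OR tt.term_taxonomy_id IS NULL
        ORDER BY tr.object_id, tr.term_taxonomy_id
        """
    ).fetchall()
```

A row-copying exporter would relay these rows as they are, whereas an exporter that mirrors core would be expected to filter them out.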

We should have two base XML exporters here and start with the simpler one.

  1. Having an exporter that copies the database by looping the rows, creating the XML one after another. Fast and not memory-hungry for obvious reasons.
  2. Another that reproduces 1:1 what core does. A WXR file created by the core exporter is guaranteed to have a specific structure: site options, terms ordered by hierarchy, and then all the items.

> For the sake of performance we may select in chunks but still relay one entity at a time through the interface.

Reading posts in steps of twenty is okay. For every post, you should read the following:

  1. The post meta(s)
  2. The comments
    • The comment meta(s)
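The traversal order above could be sketched roughly as follows, again in Python and SQLite for illustration only. The generator name `iter_post_entities`, the entity kind labels, and the selected columns are assumptions; the table and key column names follow the standard WordPress schema:

```python
import sqlite3
from typing import Iterator, Tuple


def iter_post_entities(
    conn: sqlite3.Connection, chunk_size: int = 20
) -> Iterator[Tuple[str, tuple]]:
    """Yield ("post" | "post_meta" | "comment" | "comment_meta", row) pairs.

    Posts are selected in chunks of `chunk_size`. Each post is followed by
    its meta rows, then by each of its comments, and each comment is
    followed by its own meta rows.
    """
    last_id = 0
    while True:
        posts = conn.execute(
            "SELECT ID, post_title FROM wp_posts "
            "WHERE ID > ? ORDER BY ID LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not posts:
            return
        for post in posts:
            post_id = post[0]
            yield ("post", post)
            for meta in conn.execute(
                "SELECT meta_id, meta_key, meta_value FROM wp_postmeta "
                "WHERE post_id = ? ORDER BY meta_id",
                (post_id,),
            ).fetchall():
                yield ("post_meta", meta)
            for comment in conn.execute(
                "SELECT comment_ID, comment_content FROM wp_comments "
                "WHERE comment_post_ID = ? ORDER BY comment_ID",
                (post_id,),
            ).fetchall():
                yield ("comment", comment)
                for comment_meta in conn.execute(
                    "SELECT meta_id, meta_key, meta_value FROM wp_commentmeta "
                    "WHERE comment_id = ? ORDER BY meta_id",
                    (comment[0],),
                ).fetchall():
                    yield ("comment_meta", comment_meta)
        # Resume the next chunk after the last post of this one.
        last_id = posts[-1][0]
```

This issues several small queries per post, which keeps memory flat; batching the meta and comment lookups per chunk would be a possible optimization, at the cost of more bookkeeping.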

@brandonpayton
Member Author

Thank you for your feedback on this, @zaerl! It's helpful.

> Having an exporter that copies the database by looping the rows, creating the XML one after another. Fast and not memory-hungry for obvious reasons.

I am working on this first. Currently, there is just a dumb iteration over database rows starting with terms tables, but it's going to have to be a bit smarter than that (I think). And the above provides a good context / test case to see whether this is headed in a reasonable direction.

@zaerl
Collaborator

zaerl commented Dec 23, 2024

> And the above provides a good context / test case to see whether this is headed in a reasonable direction.

Having both cases can be a good thing for us. The core export_wp function does, for example, use the can_export arg to check post types. The core function's code is good! It does a lot of low-level queries, and the terms are the only entities fetched with get_* style functions, so that categories can be put in order with no child going before its parent. But this is no longer a problem with the importer.
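For reference, the "no child before its parent" ordering can be sketched as below. This is an illustrative Python version, not core's actual code; the function name `order_parents_first` and the handling of terms whose parent is missing (treated as top-level) are assumptions:

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def order_parents_first(terms: Iterable[Tuple[int, int]]) -> List[int]:
    """Order term IDs so that every parent appears before its children.

    `terms` holds (term_id, parent_id) pairs, where parent_id 0 means
    "no parent". Terms whose parent is not in the input are treated as
    top-level. Terms caught in a parent cycle are never reached and are
    therefore omitted from the result.
    """
    terms = list(terms)
    known = {term_id for term_id, _ in terms}
    children = defaultdict(list)
    roots = []
    for term_id, parent_id in terms:
        if parent_id == 0 or parent_id not in known:
            roots.append(term_id)
        else:
            children[parent_id].append(term_id)

    ordered = []
    # Depth-first walk with an explicit stack; reversing keeps input order.
    stack = list(reversed(roots))
    while stack:
        term_id = stack.pop()
        ordered.append(term_id)
        stack.extend(reversed(children[term_id]))
    return ordered
```

A row-by-row exporter can skip this step entirely if, as noted above, the importer no longer depends on the order.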

Starting with the DB SQL queries is perfectly fine, and you made the right choice. 👍
