[Blueprints] Support Data Liberation importer in the importWxr step (#2058)

## Description

Adds the Data Liberation WXR importer as an option in the `importWxr`
step. Enable the new importer by setting the
`"importer": "data-liberation"` option:

```json
{
  "steps": [
    {
      "step": "importWxr",
      "file": {
        "resource": "url",
        "url": "https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml"
      },
      "importer": "data-liberation"
    }
  ]
}
```

When the `importer` option is missing or set to `"default"`, the step
behaves exactly as before and keeps using the
https://github.com/humanmade/WordPress-Importer importer.
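
The fallback rule above can be sketched in TypeScript. The type and function names here (`WxrImporter`, `ImportWxrStepSketch`, `resolveImporter`) are hypothetical and only illustrate the described behavior; they are not taken from the Blueprint schema in this PR.

```typescript
// Hypothetical types for illustration; the real Blueprint schema may differ.
type WxrImporter = 'default' | 'data-liberation';

interface ImportWxrStepSketch {
	step: 'importWxr';
	file: { resource: 'url'; url: string };
	// Optional: omitting it is meant to behave like "default".
	importer?: WxrImporter;
}

// A missing `importer` is treated the same as an explicit "default".
function resolveImporter(step: ImportWxrStepSketch): WxrImporter {
	return step.importer ?? 'default';
}
```

Under this reading, existing Blueprints that never mention `importer` keep their current behavior without any changes.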

The new importer:

* Rewrites links in the imported content
* Downloads assets through Playground's CORS proxy
* Parallelizes the downloads
* Communicates progress

This PR is a part of
#1894

## Implementation details

The `importWxr` step fetches and includes the
`data-liberation-core.phar` file. The phar file is built with
[Box](https://box-project.github.io/box/configuration/) and bundles the
importer library with its dependencies. Together, that amounts to a
subset of the Data Liberation library, a subset of the Blueprints
library, and a few vendor libraries.

This, unfortunately, means that any changes in the PHP files require
rebuilding the .phar file. Here's how you can do it:

```bash
nx build:phar playground-data-liberation
```

You can also build the entire Data Liberation package as a WordPress
plugin complete with a wp-admin page:

```bash
nx build:plugin playground-data-liberation
```

Both commands output the built files to
`packages/playground/data-liberation/dist`.

Progress updates are a first-class feature of the new importer. The
updated `importWxr` step receives them in real time via a
`post_message_to_js()` call that runs after every import step, then
passes them on to the progress bar UI.
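
A minimal sketch of that flow, under stated assumptions: `MessageHub` is a stand-in for the PHP runtime's listener registry (it mirrors the `onMessage()` change in this commit, which now returns an unsubscribe function, although the real one is async), and the `import-progress` message shape is hypothetical. The actual payload sent by the importer is not shown in this excerpt.

```typescript
type MessageListener = (data: string) => void;

// Stand-in for the PHP runtime: a listener registry whose onMessage()
// returns an unsubscribe function, similar to the diff in this commit.
class MessageHub {
	private listeners: MessageListener[] = [];

	onMessage(listener: MessageListener): () => void {
		this.listeners.push(listener);
		return () => {
			this.listeners = this.listeners.filter((l) => l !== listener);
		};
	}

	// Simulates PHP calling post_message_to_js() with a string payload.
	emit(data: string): void {
		for (const listener of this.listeners) {
			listener(data);
		}
	}
}

// Hypothetical message shape, assumed for this sketch only.
interface ImportProgressMessage {
	type: 'import-progress';
	done: number;
	total: number;
}

// Forwards progress messages to a callback as a fraction between 0 and 1.
// Returns the unsubscribe function so the step can stop listening when done.
function trackImportProgress(
	hub: MessageHub,
	onProgress: (fraction: number) => void
): () => void {
	return hub.onMessage((data) => {
		let message: Partial<ImportProgressMessage> | null;
		try {
			message = JSON.parse(data);
		} catch {
			return; // Not JSON, so not a progress message.
		}
		if (
			message?.type !== 'import-progress' ||
			typeof message.done !== 'number' ||
			typeof message.total !== 'number' ||
			message.total <= 0
		) {
			return;
		}
		onProgress(Math.min(1, message.done / message.total));
	});
}
```

Returning the unsubscribe function lets the step detach its listener once the import finishes, so later messages from PHP do not keep updating a completed progress bar.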

### Other changes

* **TLS traffic now goes through the CORS proxy.** Since the new
importer uses `AsyncHTTP\Client`, which deals with raw sockets,
Playground's [TLS-based network bridge](#1926) now runs the outbound
traffic through a CORS proxy. Technically, `TCPOverFetchWebsocket`
gets the `corsProxy` URL passed to the `playground.boot()` call.
* A few composer dependencies were forked, downgraded to PHP 7.2 using
Rector, and bundled with this PR to keep the Data Liberation importer
working.
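
The proxying itself comes down to prefixing the target URL with the proxy URL, as `fetchRawResponseBytes()` does in the diff below. Here is a minimal string-level sketch; `viaCorsProxy` is a hypothetical helper name, and the proxy URL in the usage note is the one set in this commit's CI workflow.

```typescript
// Hypothetical helper mirroring the URL construction in the diff:
//   new Request(`${corsProxyUrl}${request.url}`, request)
// When no proxy URL is configured, the target URL is used unchanged.
function viaCorsProxy(targetUrl: string, corsProxyUrl?: string): string {
	return corsProxyUrl ? `${corsProxyUrl}${targetUrl}` : targetUrl;
}
```

For example, `viaCorsProxy('https://example.com/image.png', 'http://127.0.0.1:5263/cors-proxy.php?')` yields `http://127.0.0.1:5263/cors-proxy.php?https://example.com/image.png`. Note the proxy URL is expected to end with `?` so that plain concatenation produces a valid query string.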

## Remaining work

- [x] PHP 7.2 compatibility. Done by forking and Rector-downgrading
dependencies that were incompatible with PHP 7.2.
- [x] Report the importer's progress on the overall Blueprint progress
bar
- [x] Enqueue the data liberation plugin files for downloading at the
blueprint compilation stage
- [x] Don't eagerly rewrite attachments URLs in `WP_Stream_Importer`.
Exposing this information to the API consumer requires an explicit
decision. Do we rewrite it? Or do we ignore it?
- [x] Fix the TLS errors at the intersection of Playground network
transport and the async HTTP client library
- [x] Separate the markdown importer and its dependencies (md parser,
frontmatter parser, Symfony libraries) from the core plugin
- [x] Ship the importer and its tree-shaken deps (URL parser) as a
minified zip/phar

## Follow-up work

- [ ] Reconsider the `WP_Import_Session` API – do we need so many
verbosely named methods? Can we achieve the same outcomes with fewer
methods?
- [ ] Investigate why there's a significant delay before media downloads
start on PHP 7.2 – 7.4. It's likely a PHP.wasm issue.

## Testing instructions

* Default importer – [Open this
link](http://localhost:5400/website-server/#{%20%22plugins%22:%20[],%20%22steps%22:%20[%20{%20%22step%22:%20%22importWxr%22,%20%22file%22:%20{%20%22resource%22:%20%22url%22,%20%22url%22:%20%22https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml%22%20}%20}%20],%20%22preferredVersions%22:%20{%20%22php%22:%20%228.3%22,%20%22wp%22:%20%226.7%22%20},%20%22features%22:%20{%20%22networking%22:%20true%20},%20%22login%22:%20true%20})
and confirm it does what the current `importWxr` step does, that is, it
stays at "Importing content" for a moment, fails to fetch media files
(CORS errors show up in the network tools), but inserts posts and pages.
* Data Liberation – [Open this
link](http://localhost:5400/website-server/#{%20%22plugins%22:%20[],%20%22steps%22:%20[%20{%20%22step%22:%20%22importWxr%22,%20%22importer%22:%20%22data-liberation%22,%20%22file%22:%20{%20%22resource%22:%20%22url%22,%20%22url%22:%20%22https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml%22%20}%20}%20],%20%22preferredVersions%22:%20{%20%22php%22:%20%228.3%22,%20%22wp%22:%20%226.7%22%20},%20%22features%22:%20{%20%22networking%22:%20true%20},%20%22login%22:%20true%20}),
confirm the import progress is visible and that the content and media
indeed get imported:

![CleanShot 2024-12-08 at 14 54
49@2x](https://github.com/user-attachments/assets/a7da3244-a10f-43d2-8e94-43d305220a7e)
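
The test links above carry the Blueprint as URL-encoded JSON in the URL fragment. A small sketch of building such a link follows; `blueprintLink` is a hypothetical helper, and it assumes the fragment is URI-decoded before being parsed as JSON, which the links above suggest but this excerpt does not confirm.

```typescript
// Hypothetical helper: appends a Blueprint to a Playground URL as an
// encoded JSON fragment. encodeURI() leaves ":" and "/" readable, like the
// links above, but also leaves "#" unescaped, so values containing "#"
// would need extra care.
function blueprintLink(baseUrl: string, blueprint: object): string {
	return `${baseUrl}#${encodeURI(JSON.stringify(blueprint))}`;
}

const dataLiberationLink = blueprintLink(
	'http://localhost:5400/website-server/',
	{
		steps: [
			{
				step: 'importWxr',
				importer: 'data-liberation',
				file: {
					resource: 'url',
					url: 'https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml',
				},
			},
		],
		preferredVersions: { php: '8.3', wp: '6.7' },
		features: { networking: true },
		login: true,
	}
);
```

Dropping the `importer` property from the step gives the equivalent link for the default importer.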

## Related issues

* #1211 
* #2012 
* #1477 
* #1250 
* #1780
adamziel authored Dec 11, 2024
1 parent 23ffd14 commit 2191e22
Showing 300 changed files with 38,836 additions and 3,456 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -103,7 +103,7 @@ jobs:
- name: Install Playwright Browsers
run: sudo npx playwright install --with-deps
- name: Prepare app deploy and offline mode
run: npx nx e2e:playwright:prepare-app-deploy-and-offline-mode playground-website
run: CORS_PROXY_URL=http://127.0.0.1:5263/cors-proxy.php? npx nx e2e:playwright:prepare-app-deploy-and-offline-mode playground-website
- name: Zip dist
run: zip -r dist.zip dist
- name: Upload dist
4 changes: 2 additions & 2 deletions packages/php-wasm/universal/src/lib/php-worker.ts
@@ -229,8 +229,8 @@ export class PHPWorker implements LimitedPHPApi {
}

/** @inheritDoc @php-wasm/universal!/PHP.onMessage */
onMessage(listener: MessageListener): void {
_private.get(this)!.php!.onMessage(listener);
onMessage(listener: MessageListener) {
return _private.get(this)!.php!.onMessage(listener);
}

/** @inheritDoc @php-wasm/universal!/PHP.defineConstant */
5 changes: 5 additions & 0 deletions packages/php-wasm/universal/src/lib/php.ts
@@ -160,6 +160,11 @@ export class PHP implements Disposable {
*/
onMessage(listener: MessageListener) {
this.#messageListeners.push(listener);
return async () => {
this.#messageListeners = this.#messageListeners.filter(
(l) => l !== listener
);
};
}

async setSpawnHandler(handler: SpawnHandler | string) {
56 changes: 46 additions & 10 deletions packages/php-wasm/web/src/lib/tcp-over-fetch-websocket.ts
@@ -45,6 +45,7 @@ import { ContentTypes } from './tls/1_2/types';

export type TCPOverFetchOptions = {
CAroot: GeneratedCertificate;
corsProxyUrl?: string;
};

/**
@@ -67,6 +68,7 @@ export const tcpOverFetchWebsocket = (tcpOptions: TCPOverFetchOptions) => {
constructor(url: string, wsOptions: string[]) {
super(url, wsOptions, {
CAroot: tcpOptions.CAroot,
corsProxyUrl: tcpOptions.corsProxyUrl,
});
}
};
@@ -85,6 +87,7 @@ export interface TCPOverFetchWebsocketOptions {
* clientDownstream stream and tracking the closure of that stream.
*/
outputType?: 'messages' | 'stream';
corsProxyUrl?: string;
}

export class TCPOverFetchWebsocket {
@@ -101,6 +104,7 @@ export class TCPOverFetchWebsocket {
port = 0;
listeners = new Map<string, any>();
CAroot?: GeneratedCertificate;
corsProxyUrl?: string;

clientUpstream = new TransformStream();
clientUpstreamWriter = this.clientUpstream.writable.getWriter();
@@ -111,13 +115,18 @@ export class TCPOverFetchWebsocket {
constructor(
public url: string,
public options: string[],
{ CAroot, outputType = 'messages' }: TCPOverFetchWebsocketOptions = {}
{
CAroot,
corsProxyUrl,
outputType = 'messages',
}: TCPOverFetchWebsocketOptions = {}
) {
const wsUrl = new URL(url);
this.host = wsUrl.searchParams.get('host')!;
this.port = parseInt(wsUrl.searchParams.get('port')!, 10);
this.binaryType = 'arraybuffer';

this.corsProxyUrl = corsProxyUrl;
this.CAroot = CAroot;
if (outputType === 'messages') {
this.clientDownstream.readable
@@ -307,9 +316,10 @@ export class TCPOverFetchWebsocket {
'https'
);
try {
await RawBytesFetch.fetchRawResponseBytes(request).pipeTo(
tlsConnection.serverEnd.downstream.writable
);
await RawBytesFetch.fetchRawResponseBytes(
request,
this.corsProxyUrl
).pipeTo(tlsConnection.serverEnd.downstream.writable);
} catch (e) {
// Ignore errors from fetch()
// They are handled in the constructor
@@ -327,9 +337,10 @@ export class TCPOverFetchWebsocket {
'http'
);
try {
await RawBytesFetch.fetchRawResponseBytes(request).pipeTo(
this.clientDownstream.writable
);
await RawBytesFetch.fetchRawResponseBytes(
request,
this.corsProxyUrl
).pipeTo(this.clientDownstream.writable);
} catch (e) {
// Ignore errors from fetch()
// They are handled in the constructor
@@ -409,7 +420,11 @@ class RawBytesFetch {
/**
* Streams a HTTP response including the status line and headers.
*/
static fetchRawResponseBytes(request: Request) {
static fetchRawResponseBytes(request: Request, corsProxyUrl?: string) {
const targetRequest = corsProxyUrl
? new Request(`${corsProxyUrl}${request.url}`, request)
: request;

// This initially used a TransformStream and piped the response
// body to the writable side of the TransformStream.
//
@@ -419,13 +434,34 @@ class RawBytesFetch {
async start(controller) {
let response: Response;
try {
response = await fetch(request);
controller.enqueue(RawBytesFetch.headersAsBytes(response));
response = await fetch(targetRequest);
} catch (error) {
/**
* Pretend we've got a 400 Bad Request response whenever
* the fetch() call fails.
*
* Just propagating an error and closing a WebSocket does
* not make PHP aware the socket closed abruptly. This means
* the AsyncHttp\Client will keep polling the socket indefinitely
* until the request times out. This isn't perfect, as we want
* to close the socket as soon as possible to avoid, e.g., 10 seconds
* of unnecessary waiting for the timeout.
*
* The root cause is unknown and likely related to the low-level
* implementation of polling file descriptors. The following
* workaround is far from ideal, but it must suffice until we
* have a platform-level resolution.
*/
controller.enqueue(
new TextEncoder().encode(
'HTTP/1.1 400 Bad Request\r\nContent-Length: 0\r\n\r\n'
)
);
controller.error(error);
return;
}

controller.enqueue(RawBytesFetch.headersAsBytes(response));
const reader = response.body?.getReader();
if (!reader) {
controller.close();