Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid partial munmap memory leak #1189

Merged

Conversation

brandonpayton
Copy link
Member

@brandonpayton brandonpayton commented Apr 3, 2024

What is this PR doing?

This PR works around a PHP memory leak that can quickly lead to an OOM condition.

What problem is it solving?

This PR fixes #1128 -- an OOM condition due to memory leak.

How is the problem addressed?

This PR adds custom PHP memory storage handlers to stop PHP from attempting to partial unmap memory via the munmap() function. Emscripten allocators do not support using munmap() to partially unmap memory. They fail for partial unmaps, and when attempts fail, memory is leaked because PHP thinks the memory is free while Emscripten libs consider it allocated.

The custom memory storage handlers take over responsibility for allocating and freeing aligned memory chunks. The handlers use posix_memalign() and free() instead of mmap() and munmap(), and this appears to resolve the problem.

The handlers are added as part of a PHP extension, which is not perfect because allocation can happen before the extension is initialized. But the amount of allocation before the workaround is applied should be quite small.

Testing Instructions

Before applying this patch:
Run this test script and verify that it OOMs.

After applying this patch:
Rerun the test script and verify that it completes after 1000 interations.

Remaining work

  • Compare performance with php-wasm builds prior to this patch and reconsider if 20% slower.
  • Find out whether it is sufficient to add the workaround during the PHP module init phase rather than during each request init
  • Address minor TODO comments in wasm_memory_storage.c
  • See if we can run the test script as part of tests under packages/php-wasm/node/src/test

@brandonpayton brandonpayton requested a review from a team April 3, 2024 02:55
@brandonpayton brandonpayton self-assigned this Apr 3, 2024
@brandonpayton
Copy link
Member Author

I rebased locally and am rebuilding php-wasm for node and web.

@brandonpayton
Copy link
Member Author

Updated the PR description to note that we should add the test script to the php-wasm node tests.

@brandonpayton brandonpayton force-pushed the avoid-partial-munmap-memory-leak branch from 6f56aaa to 5620ed2 Compare April 3, 2024 06:09
@brandonpayton
Copy link
Member Author

Please hold off testing this. Something is off here with the latest builds. The wasm_memory_storage extension is not present.

I verified this earlier before rebuilding for all PHP versions, but perhaps I did not test what I thought I tested. This should be a straightforward fix, but it's unfortunate others cannot yet try the fix for themselves. Planning to look into this first-thing in the morning.

@brandonpayton
Copy link
Member Author

Found the issue:
Somehow the Dockerfile updates in this PR were missing the COPY instruction that I added earlier today for copying in the wasm_memory_storage extension into php-src.

I'm not yet sure what might have happened for that to be omitted. Instead of building all PHP versions again tonight, I'll go ahead and build just light PHP 7.0 and PHP 8.3 for the web.

One odd thing I noticed when testing after rebasing:
The OOM test script output is no longer directed to the browser when there is an OOM. Instead there is just an error in the dev tools console with the OOM message. This seems undesirable as a user may prefer some output indicating breakage than hidden output in the case of breakage. Will make a note to confirm that behavior tomorrow.

@brandonpayton brandonpayton force-pushed the avoid-partial-munmap-memory-leak branch from 5620ed2 to 2a500a2 Compare April 3, 2024 07:01
@brandonpayton
Copy link
Member Author

brandonpayton commented Apr 3, 2024

If others are interested, the build is fixed, and this is now ready for exploratory testing.

I've included just the light php-wasm builds for the web for PHP 7.0 and PHP 8.3 for now.

Copy link
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all very nice. it would be great to see some links to the PHP documentation on custom allocators, so that someone could come into this file, review the spec, and see if the code is right.

e.g. a docblock above each function saying "This is the XYZ [link to custom allocator definition of XYZ"

@adamziel
Copy link
Collaborator

adamziel commented Apr 3, 2024

I've played with performance measurements.

Methodology:

  • Close all other Playground tabs and open a single new tab
  • Open devtools before opening the page
  • Go directly to a local Playground URL that loads the respective PHP version and WP 6.5
  • Paste the code snippet below into devtools
  • Don't touch the computer until the test is finished. In particular, don't interact with the page and don't switch to another browser tab.

These may seem like insignificant details, but Chrome does very different things under the hood with and without devtools (different WASM runtimes) and also if the existing tab was used before (Playwright developers struggled with intermittent failures when reusing tabs).

async function measureTime() {
    console.time("ExecutionTime"); // Start the timer
    for(let i = 0; i < 100; i++) {
        await playground.request({ url: '/' });
        if ((i+1) % 10 === 0) { // Log progress every 10 requests
            console.log(`Progress: Completed ${i+1} out of 100 requests`);
        }
    }
    console.timeEnd("ExecutionTime"); // End the timer and print the total time taken
}
measureTime();

Here are the results of 4 test runs on each version:

  • PHP 8.3 from this PR: 29 765 ms, 29 112ms, 29 126 ms, 29 180ms
  • PHP 8.3 from trunk: 29 701 ms, 29 334ms, 29 383ms, 29 317ms

My testing methodology requires some manual work and I won't spend the time to collect 30 or 500 samples to plot a distribution and calculate p-values. We could automate that with a Node.js build, but at this point I'm quite happy to assume we're not looking at a shattering performance decline. I've captured this comment in the project wiki for any future performance discussions.

For now – let's assume the performance is alright here and proceed with code review as usual 👍

Even if there's a much slower code branch somewhere, I'd rather fix five more crashes than obsess about all the performance implications – we'll focus on the speed more if and when that matters.

Thank you so much for all your work on this @brandonpayton!

@@ -0,0 +1,87 @@
/* wasm_memory_storage extension for PHP */
Copy link
Collaborator

@adamziel adamziel Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also link to this PR and the related issue to give future readers more context

@brandonpayton brandonpayton added [Type] Bug An existing feature does not function as intended [Priority] High [Feature] PHP.wasm [Type] Reliability Playground uptime, reliability, not crashing labels Apr 3, 2024
@brandonpayton
Copy link
Member Author

brandonpayton commented Apr 3, 2024

I've played with performance measurements.

Methodology:
[snip]
These may seem like insignificant details, but Chrome does very different things under the hood with and without devtools (different WASM runtimes) and also if the existing tab was used before (Playwright developers struggled with intermittent failures when reusing tabs).

@adamziel Thanks for testing and for sharing the clear methodology.

I ran some tests but with less care about creating a new tab and opening dev tools first. I didn't remember about console.time() and didn't know about playground.request(). Here's the script I used (with the current scope URL substituted):

async function* test100() {
    for (let i = 0; i < 100; ++i) {
        yield await fetch( '/scope:0.7238152673090998/' ).then( r => r.text() );
    }
}

let start = performance.now();
console.log( 'Start:', start );
for await ( let t of test100() ) {}
let end = performance.now();
console.log( 'End:', end );
console.log( 'Total:', end - start );

There isn't a reason that the response text was needed, but originally, I was thinking about logging part of it.

Without this PR, the total time was about 8.97 seconds.
With this PR, the total time was about 7.87 seconds.

I prefer your more careful tests but wanted to share this sample.

@brandonpayton brandonpayton force-pushed the avoid-partial-munmap-memory-leak branch from 63ed729 to 8ca10b4 Compare April 3, 2024 22:20
@brandonpayton
Copy link
Member Author

brandonpayton commented Apr 3, 2024

@dmsnell, I added more documentation, including links to inline examples of custom storage functions, but so far, I have not been able to find related specs or external documentation to link to. Please feel free to ask for more if you think something is missing.

@adamziel, I was able to move the workaround assignment into module init. Zend memory manager does do some cleanup during request shutdown, but it only creates its heap during module startup.

It is possible other modules may have allocated memory before the workaround is added, and we should give those a chance to free the memory in the same way it was allocated. Fortunately, when PHP chooses to run module shutdown, it appears to do so in reverse order which gives us a chance to remove the workaround before those initial allocations are freed. So I added a module shutdown function to cleanup after ourselves.

With the exception of adding a test for the OOM case, this PR is ready for another review.

@adamziel
Copy link
Collaborator

adamziel commented Apr 4, 2024

This looks great @brandonpayton! I'm happy to merge as soon as all the PHP versions are rebuilt.

Should we now be strict about using emalloc and efree in php_wasm.c? Or is malloc and free still fine?

@brandonpayton
Copy link
Member Author

I added a test for this fix that runs for all supported PHP versions:
2e37037

All these tests fail with an OOM error before rebuilding php-wasm with the fix. After the fix, the tests pass.

This is ready for another review.

@brandonpayton
Copy link
Member Author

brandonpayton commented Apr 4, 2024

This looks great @brandonpayton! I'm happy to merge as soon as all the PHP versions are rebuilt.

@adamziel, sounds good! I will rebuild the PHP versions but leave for merging tomorrow, in case there is some auto-deploy to playground.wordpress.net and there are unexpected problems during the night.

Should we now be strict about using emalloc and efree in php_wasm.c? Or is malloc and free still fine?

At a high level, I guess emalloc and efree should be used if you want PHP's memory manager involved. But based on my current understanding, it should be safe to use either pair, as long as we don't mix them and attempt to efree() something allocated with malloc().

@brandonpayton
Copy link
Member Author

Note: I am stepping away for the night but will leave a build command running that should commit and push when complete:

npm run "recompile:php:node" && npm run "recompile:php:web" && git add -u && git commit -m "Add php-wasm builds" && git push --force-with-lease

@brandonpayton brandonpayton force-pushed the avoid-partial-munmap-memory-leak branch from 2e37037 to c60fc58 Compare April 4, 2024 04:04
Copy link
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, and thanks for the comments. Too bad there's not more clarity in PHP around creating these custom internals, but the links to a versioned source code file are about as perfect as I suppose we can get.

Looking forward to how this goes. It will make many people happy.

@adamziel adamziel merged commit b38ae63 into WordPress:trunk Apr 4, 2024
5 checks passed
adamziel added a commit that referenced this pull request Apr 4, 2024
The test was removed in #1192 due to intermittent failures like these:

```
FAIL  src/lib/steps/site-data.spec.ts > Blueprint step setSiteOptions() > should set the site option
```

They seem related to #1189, but all the other tests pass and the
Playground website works so I wonder what's going on there. Is the CI
runner just running out of memory? But then it shouldn't be able to
allocate the memory segment in the first place. Weird!
adamziel added a commit that referenced this pull request Apr 4, 2024
adamziel added a commit that referenced this pull request Apr 4, 2024
Reverts #1189

Unfortunately, the current implementation triggers "memory access out of
bounds" errors in places they weren't happening before. See
#1194 for more
context.
brandonpayton added a commit to brandonpayton/wordpress-playground that referenced this pull request Apr 4, 2024
Adds custom PHP memory storage handlers to stop PHP from
attempting to partial unmap memory via the `munmap()` function.
Emscripten allocators do not support using `munmap()` to partially unmap
memory. They fail for partial unmaps, and when attempts fail, memory is
leaked because PHP thinks the memory is free while Emscripten libs
consider it allocated.

The custom memory storage handlers take over responsibility for
allocating and freeing aligned memory chunks. The handlers use
`posix_memalign()` and `free()` instead of `mmap()` and `munmap()`, and
this appears to resolve the problem.

The handlers are added as part of a PHP extension, which is not perfect
because allocation can happen before the extension is initialized. But
the amount of allocation before the workaround is applied should be
quite small.

## Testing Instructions

Before applying this patch:
Run [this test
script](https://gist.github.com/brandonpayton/934d96d76b96557e94ad6983aaffa09f)
and verify that it OOMs.

After applying this patch:
Rerun the test script and verify that it completes after 1000
interations.

## Remaining work

- [x] Compare performance with php-wasm builds prior to this patch and
reconsider if 20% slower.
- [x] Find out whether it is sufficient to add the workaround during the
PHP module init phase rather than during each request init
- [x] Address minor TODO comments in wasm_memory_storage.c
- [x] See if we can run the test script as part of tests under
`packages/php-wasm/node/src/test`
@brandonpayton brandonpayton deleted the avoid-partial-munmap-memory-leak branch April 11, 2024 03:51
adamziel pushed a commit that referenced this pull request Apr 11, 2024
This PR attempts to fix the memory leak reported in #1128 and is an
iteration on PR #1189 which had problems with "memory out of bounds"
errors.

## What problem is it solving?

It stops PHP from leaking memory throughout the runtime of a script and
hopefully stops memory out of bounds errors by zeroing all memory given
to PHP.

## How is the problem addressed?

- By avoiding mmap()/munmap() which have incomplete implementations in
Emscripten
- By using posix_memalign() to allocate memory instead and manually
zeroing the memory before handing it to PHP

## Testing Instructions

- Observe CI test results
- Use `npm run dev` and exercise Playground locally in the browser
adamziel added a commit that referenced this pull request Apr 11, 2024
The test was removed in #1192 due to intermittent failures like these:

```
FAIL  src/lib/steps/site-data.spec.ts > Blueprint step setSiteOptions() > should set the site option
```

They seem related to #1189, but all the other tests pass and the
Playground website works so I wonder what's going on there. Is the CI
runner just running out of memory? But then it shouldn't be able to
allocate the memory segment in the first place. Weird!
adamziel added a commit that referenced this pull request Apr 11, 2024
The test was removed in #1192 due to intermittent failures like these:

```
FAIL  src/lib/steps/site-data.spec.ts > Blueprint step setSiteOptions() > should set the site option
```

![CleanShot 2024-04-04 at 13 09
25@2x](https://github.com/WordPress/wordpress-playground/assets/205419/fe68c5ff-80a7-4966-82a8-fbfcc7e9a101)

They seem related to #1189, but:

* I can't reproduce it locally
* The test actually passed in #1189 and only started failing in #1192
* All the other tests pass and the Playground website works so I wonder
what's going on there.
* The stack trace mentions network activity, but that call shouldn't
involve any. It's almost as if the trace was coming from
`php-networking.spec.ts`, but those tests actually pass.

Is the CI runner just running out of memory? But then it shouldn't be
able to allocate the memory segment in the first place. Weird!

CC @brandonpayton for insights
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] PHP.wasm [Priority] High [Type] Bug An existing feature does not function as intended [Type] Reliability Playground uptime, reliability, not crashing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PHP memory leak
3 participants