
Reduce copies when reading files in pyio, match behavior of _io #129005

Open · cmaloney opened this issue Jan 18, 2025 · 5 comments
Labels
performance (Performance or resource usage) · stdlib (Python modules in the Lib dir) · type-feature (A feature request or enhancement)

Comments

cmaloney (Contributor) commented Jan 18, 2025

Feature or enhancement

Proposal:

Currently _pyio uses ~2x as much memory to read all data from a file compared to _io. This is because it makes more than one copy of the data.

Details from a test_largefile run

$ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
== CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
== Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
== Python build: debug
== cwd: <$HOME>/python/build/build/test_python_worker_32392
== CPU count: 32
== encodings: locale=UTF-8 FS=utf-8
== resources: all

Using random seed: 1740056613
0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
0:00:00 load avg: 0.53 [1/1] test_largefile
test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
ok
test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
 ... process data size: 4.3G
 ... process data size: 4.7G
ok

----------------------------------------------------------------------
Ran 2 tests in 3.711s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 3.7 sec
Total tests: run=2 (filtered)
Total test files: run=1/1 (filtered)
Result: SUCCESS

Plan:

  1. Switch to os.readv() (later revised to os.readinto()) to do a readinto like the C _Py_read used by _io does; os.read() can't take a buffer to fill. This aligns behavior between _io.FileIO.readall and _pyio.FileIO.readall. os.readv works well today and takes a caller-allocated buffer rather than needing a new os API; readv(2) mirrors the behavior and errors of read(2), so this should keep the same end behavior. (A sketch follows this list.)
  2. Update _pyio.BufferedIO to not force a copy of the buffer in readall when its internal buffer is empty. Currently it always slices its internal buffer and then appends the result of _pyio.FileIO.readall to it.
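
To make the direction concrete, here is a minimal sketch of a readall built on a caller-allocated buffer (illustrative names and a simplified growth strategy; written against os.readinto(), which this thread ends up adding, and os.readv(fd, [view]) would be the equivalent call today):

```python
import os

def readall_sketch(fd, size_estimate=0):
    """Read fd to EOF into one growing bytearray, no per-chunk bytes objects."""
    bufsize = max(size_estimate, 8192)
    result = bytearray(bufsize)
    bytes_read = 0
    while True:
        if bytes_read >= bufsize:
            # Buffer full but EOF not seen: grow it and keep reading.
            bufsize *= 2
            result.extend(bytes(bufsize - len(result)))
        # The temporary memoryview slice lets the OS write directly into
        # result at the right offset; it is released once the call returns.
        n = os.readinto(fd, memoryview(result)[bytes_read:])
        if n == 0:  # EOF
            break
        bytes_read += n
    del result[bytes_read:]   # trim the unused tail
    return bytes(result)      # one final copy to an immutable bytes
```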

For iterating, I'm using a small tracemalloc script to find where copies are:

from _pyio import open

import tracemalloc

with open("README.rst", 'rb') as file:
    # Start tracing after open() so only the read() allocations are captured.
    tracemalloc.start()
    data = file.read()
    snap = tracemalloc.take_snapshot()

# Report allocation sites to spot where the copies happen.
stats = snap.statistics('lineno')
for stat in stats:
    print(stat)

Loose Ends

  • os.readv seems to be well supported but is currently guarded by a configure check. I'd like to just make pyio require readv, but can add conditional code if needed. If making readv non-optional in general is feasible, I'm happy to work on that.
    • os.readv is not supported on WASI, so conditional code is needed there; a rough sketch follows.
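
A rough sketch of the conditional shape this implies (hypothetical helper name _readinto; falls back to os.read() where os.readv() is missing, e.g. on WASI):

```python
import os

if hasattr(os, "readv"):
    def _readinto(fd, buffer):
        # readv(2) with a single buffer mirrors read(2)'s behavior and errors.
        return os.readv(fd, [buffer])
else:
    def _readinto(fd, buffer):
        # WASI and similar platforms: fall back to os.read, paying one
        # extra allocation and copy on this path.
        data = os.read(fd, len(buffer))
        buffer[:len(data)] = data
        return len(data)
```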

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

@cmaloney cmaloney added the type-feature (A feature request or enhancement) label Jan 18, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 18, 2025
`os.read` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readv`. `self.read()` was doing the
closed and readable checks, so move those into `readinto`.
cmaloney (Contributor, Author) commented Jan 22, 2025

I took a tangent and looked at the code complexity of adding os.readinto vs. the conditionals around os.readv not existing on some platforms, plus needing to pass a sequence of buffers to os.readv.

  1. _pyio Using os.readv with conditional: https://github.com/python/cpython/pull/129006/files
  2. Adding os.readinto, _pyio using it: cmaloney@52c6013

Happy to work on either course; slight preference for os.readinto, since adding that one function removes a lot of new conditional code while making these cases more efficient. Code-wise, with the Buffer Protocol, the code to me gets quite a bit easier to follow (see the comparison below). Not sure it's worth another public os API to maintain, though.

I know moving to a full PR for os.readinto would need tests and news; for now I'm just focusing on "adding os.readinto" vs. "using os.readv".
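
For comparison, the two call shapes side by side (a sketch; os.readinto here is the proposed function from the branch above, not something os had at the time):

```python
import os

fd = os.open("README.rst", os.O_RDONLY)
buf = bytearray(16384)

# os.readv takes a *sequence* of buffers, so a plain readinto needs a
# one-element list (plus a hasattr check, since readv is optional).
n = os.readv(fd, [buf])

# The proposed os.readinto takes any writable bytes-like object directly.
n = os.readinto(fd, buf)

os.close(fd)
```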

cc: @tomasr8 , @vstinner , @gpshead (reviewers where os.read vs. _Py_read / readinto has come up).

vstinner (Member) commented Jan 22, 2025

Can you please open a separate issue for os.readinto()? This function looks useful and simple.

@picnixz picnixz added the performance (Performance or resource usage) and stdlib (Python modules in the Lib dir) labels Jan 23, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 26, 2025
`os.read` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readinto`.

`os.readinto` uses `PyObject_GetBuffer` to make sure the passed-in
buffer is writable and bytes-like; drop the manual check.
vstinner pushed a commit that referenced this issue Jan 28, 2025
`os.read()` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readinto()`.

`os.readinto()` uses `PyObject_GetBuffer()` to make sure the passed-in
buffer is writable and bytes-like; drop the manual check.
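
One consequence of the PyObject_GetBuffer-based validation, illustrated (assuming a Python where os.readinto() has landed): read-only objects are rejected by the buffer request itself, so no manual type check is needed.

```python
import os

fd = os.open("README.rst", os.O_RDONLY)
try:
    os.readinto(fd, bytearray(8))      # writable bytes-like object: accepted
    try:
        os.readinto(fd, b"\x00" * 8)   # read-only bytes: rejected up front
    except (TypeError, BufferError):
        print("read-only buffer rejected")
finally:
    os.close(fd)
```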
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
Both now use a pre-allocated buffer of length `bufsize`, fill it using
a readinto, and have matching "expand buffer" logic.

On my machine this takes:

`./python -m test -M8g -uall test_largefile -m test_large_read -v`
from ~3.7 seconds to ~3.3 seconds
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
Slicing buf and appending chunk would always result in a copy. Commonly
in a readall there is no already-read data in buf, and the amount of
data read may be large, so the copy is expensive.
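
The shape of that change, as a hedged sketch (the _read_buf/_read_pos names follow _pyio.BufferedReader, but this is illustrative, not the exact patch):

```python
def readall(self):
    # Old: self._read_buf[self._read_pos:] + self.raw.readall() slices and
    # concatenates unconditionally, copying even when nothing is buffered.
    buffered = self._read_buf[self._read_pos:]
    rest = self.raw.readall()
    if not buffered:
        return rest               # common case: no buffered data, no copy
    if rest is None:
        return bytes(buffered)
    return bytes(buffered) + rest
```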
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
This aligns the memory usage between _pyio and _io. Both use the same amount
of memory.
cmaloney (Contributor, Author) commented

The full set of changes on my Linux machine (debug build) reduces ./python -m test -M8g -uall test_largefile -m test_large_read -v from ~3.8 seconds to ~2.4 seconds, with peak memory usage of 2.3GB (identical memory usage for the C and Python implementations).

vstinner pushed a commit that referenced this issue Jan 30, 2025
Both now use a pre-allocated buffer of length `bufsize`, fill it using
a readinto(), and have matching "expand buffer" logic.

On my machine this takes:

`./python -m test -M8g -uall test_largefile -m test_large_read -v`

from ~3.7 seconds to ~3.4 seconds.
vstinner pushed a commit that referenced this issue Jan 30, 2025
Slicing buf and appending chunk would always result in a copy. Commonly
in a readall() there is no already read data in buf, and the amount of
data read may be large, so the copy is expensive.
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 31, 2025
This aligns the memory usage between _pyio and _io. Both now use the
same amount of memory when reading a file.
SunderSingh27 added a commit to SunderSingh27/cpython that referenced this issue Jan 31, 2025
@SunderSingh27 SunderSingh27 mentioned this issue Jan 31, 2025
cmaloney (Contributor, Author) commented Jan 31, 2025

The _pyio changes seem to have broken a number of buildbots; currently investigating:

ERROR: test_nonblock_pipe_write_bigbuf (test.test_io.PyMiscIOTest.test_nonblock_pipe_write_bigbuf)
ERROR: test_nonblock_pipe_write_smallbuf (test.test_io.PyMiscIOTest.test_nonblock_pipe_write_smallbuf)

https://buildbot.python.org/#/builders/338/builds/7968/steps/6/logs/stdio

(Failed on the latest commit in main, 10ee2d9, but I think it relates to these changes.)

cc: @vstinner

cmaloney (Contributor, Author) commented Feb 1, 2025

I think I found the issue in #129458. I tried using `result[bytes_read:bufsize] = b'\0'` to append `bufsize - bytes_read` null bytes to the bytearray, but that isn't what slice assignment does:

>>> a = bytearray()
>>> a[0:5] = b'\0'
>>> a
bytearray(b'\x00')
>>> a[5:16] = b'\01'
>>> a
bytearray(b'\x00\x01')
>>> len(a)
2

(Pipes in the test are a case where we don't have a stat.st_size estimate and can't seek to figure out the size, so reads always go through the resizing code.) Not sure why that commit worked at first but later stopped working... Guessing the system needs to be somewhat loaded to hit it.

Validated by adding an assert locally, which fails every time:

                result[bytes_read:bufsize] = b'\0'
                assert len(result) == bufsize, f"Should have expanded in size. {len(result)=}, {bufsize=}"
$ make && ./python -m test -uall test_io -vvv -m 'test_nonblock*'
<...>
AssertionError: Should have expanded in size. len(result)=8193, bufsize=16640

I don't see a clear way to resize the bytearray without making a whole array of null bytes which get copied into it... The tests just use the C API to resize: https://github.com/python/cpython/blob/main/Lib/test/test_capi/test_bytearray.py#L134-L160

Would adding a .resize() member be reasonable here? It's probably possible to do this efficiently with slicing, but not in a way I know how...

cc: @vstinner

cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 1, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 1, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 1, 2025
Move to a linear slice append with an iterator which has a length hint.
This is more expensive than PyByteArray_Resize, but I think as efficient
as can get without a new bytearray Python API to resize.

The previous code didn't append as I had intended:

```python
>>> a = bytearray()
>>> a[0:5] = b'\0'
>>> a
bytearray(b'\x00')
>>> a[5:16] = b'\01'
>>> a
bytearray(b'\x00\x01')
>>> len(a)
2
```
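
Concretely, the length-hint approach that commit describes presumably looks like this (a sketch, not the exact patch): itertools.repeat(0, n) yields exactly n zeros and reports a __length_hint__, so the slice assignment can grow the bytearray in one linear pass without first materializing a throwaway bytes object.

```python
import itertools

result = bytearray(b"data read so far")
bytes_read = len(result)
bufsize = 64

# Extends result to exactly bufsize bytes, zero-filling the new tail.
result[bytes_read:bufsize] = itertools.repeat(0, bufsize - bytes_read)
assert len(result) == bufsize
```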
cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 2, 2025