Performance issue and suggested fix #92

Open
HaukurPall opened this issue Jun 27, 2024 · 0 comments · May be fixed by #93

HaukurPall commented Jun 27, 2024

Long story short

I was using this library and it was extremely slow: reading a ~100 MB JSONL file took about an hour in my environment. I want to read the file line by line, yielding one line at a time so I can decode the JSON and process it, without loading everything into memory at once.

Expected behavior

I expected the read to take no more than a few seconds.

Actual behavior

It took around an hour.

Steps to reproduce

Just run either of these functions on any JSONL file of similar size.

import json
from typing import AsyncIterable

import aiofile


async def read_jsonl_file(file_like, chunk_size: int = 4192) -> AsyncIterable[dict[str, str]]:
    """Read an uploaded JSONL file and yield each line as a dictionary."""
    async with aiofile.AIOFile(file_like) as aio_file:
        async for line in aiofile.LineReader(aio_file=aio_file, chunk_size=chunk_size):
            if line.strip():  # Skip empty lines
                yield json.loads(line)


async def read_jsonl_file_direct_interface(file_like) -> AsyncIterable[dict[str, str]]:
    """Read an uploaded JSONL file and yield each line as a dictionary."""
    async with aiofile.async_open(file_like) as aio_file:
        async for line in aio_file:
            if line.strip():  # Skip empty lines
                yield json.loads(line)
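
To reproduce, a minimal driver along these lines will do (the path is a placeholder, and the functions above are assumed to be importable):

import asyncio


async def main() -> None:
    # "data.jsonl" is a placeholder; any JSONL file of similar size works.
    async for _record in read_jsonl_file_direct_interface("data.jsonl"):
        pass  # Drain the generator so we measure only the read path


asyncio.run(main())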

Suggested fix

I looked through the code and found a bad pattern repeated in many places. Roughly (sketched below):

  1. Do a system call to read a chunk.
  2. Check whether that chunk contains a separator; if there is no newline, append the chunk to a buffer and continue.
  3. Otherwise, return the line up to the separator and keep the remainder.
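
As a hypothetical sketch (the shape of the pattern, not the library's literal code), it looks like this:

async def slow_lines(reader, linesep: bytes = b"\n"):
    # Hypothetical illustration of the pattern above; `reader` stands in for
    # the underlying chunk reader.
    buffer = b""
    while True:
        chunk = await reader.read_chunk()  # step 1: a system call on every iteration
        if not chunk:
            if buffer:
                yield buffer
            return
        buffer += chunk  # step 2: no separator yet, keep buffering
        if linesep in buffer:  # step 3: emit one line, keep the remainder
            line, _, buffer = buffer.partition(linesep)
            yield line + linesep
            # Any further separators already sitting in `buffer` wait for
            # another read_chunk() round trip before they are emitted.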

This pattern overlooks the fact that a chunk may contain more than one separator: even when several complete lines are already sitting in the buffer, the code still performs another read before returning the next one. For testing, I changed the LineReader implementation:

from io import BytesIO, StringIO
from typing import cast


async def fixed_readline(self) -> str | bytes:
    # Drop-in replacement for LineReader.readline: only performs a read when
    # the buffer holds no complete line.
    self._buffer = cast(StringIO | BytesIO, self._buffer)
    while True:
        self._buffer.seek(0)
        line = self._buffer.readline()
        if line and line.endswith(self.linesep):
            # A complete line is already buffered: return it and keep the tail
            tail = self._buffer.read()
            self._buffer.seek(0)
            self._buffer.truncate(0)
            self._buffer.write(tail)
            return line
        # No complete line in the buffer, read more data
        chunk = await self._LineReader__reader.read_chunk()
        if not chunk:
            # No more data to read, return any remaining content in the buffer
            self._buffer.seek(0)
            remaining_content = self._buffer.read()
            # Clear the buffer so we don't return the same content again or leak memory
            self._buffer.truncate(0)
            return remaining_content
        # We have more data, append it to the buffer and retry in the next iteration
        self._buffer.seek(0, 2)  # Seek to the end of the buffer
        self._buffer.write(chunk)
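
To try this without forking, the method can be patched onto the class for a quick test. The module path below is my assumption about where LineReader lives; adjust as needed:

import aiofile.utils

# Benchmark-only monkeypatch, not a proposed API. fixed_readline accesses the
# name-mangled _LineReader__reader attribute explicitly, so it works when
# assigned from outside the class.
aiofile.utils.LineReader.readline = fixed_readline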

I ran a speed test on my file, and this fix really improves the performance:

Reading the file without async:
no_async 0.39684295654296875 seconds

Using the aiofiles library:
aiofiles 2.119969367980957 seconds

Either of the functions above, with the fix applied:
manual_interface 0.5424532890319824 seconds
direct_interface 0.5431270599365234 seconds
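
For reference, a wall-clock harness along these lines produces that output format (the helper is illustrative, not part of any library):

import asyncio
import time


async def timed(name: str, agen) -> None:
    # Drain an async generator and print the elapsed wall-clock time.
    start = time.perf_counter()
    async for _ in agen:
        pass
    print(name, time.perf_counter() - start, "seconds")


asyncio.run(timed("manual_interface", read_jsonl_file("data.jsonl")))
asyncio.run(timed("direct_interface", read_jsonl_file_direct_interface("data.jsonl")))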

The crux is that the fixed code does not always read more data into memory, so it makes far fewer system calls. Always reading more data into memory becomes even worse over time, since every call then has to rescan an ever-growing tail.

I hope this pattern gets adopted, as it makes the library genuinely usable in modern async Python environments.
