Tracking issue for core: improve performance of IO backends #684

Open · 14 tasks

LtdJorge opened this issue Jan 14, 2025 · 3 comments

@LtdJorge
Contributor

LtdJorge commented Jan 14, 2025

This is a tracking issue for changes to the IO backends to improve their performance. Additions, suggestions and improvements welcome!

Rationale

Motivation and my current thoughts

I have been testing the unix and io_uring backends against each other, and the unix backend consistently has lower latency. Of course, my testing is pretty limited, since it only used the testing database file with the current benches. I got the same results for both backends on a SATA SSD and on NVMe, so I'm fairly sure the overhead of the io_uring calls currently outweighs the potential performance gains. Fortunately, I think there is a lot of room for improvement, including registering the files to the ring, registering buffers (this is pretty complex with the current Limbo codebase, but I'm working on something), turning the remaining syscalls into io_uring opcodes, improving O_DIRECT alignment sizes, and so much more. Many of these changes should also translate into better performance for the other backends.
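To make the file-registration idea concrete, here is a minimal sketch using the io-uring crate: register the fd once, then have SQEs refer to it by index. The buffer size, user_data value and error handling are illustrative, not Limbo's actual backend code.

```rust
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn read_with_registered_fd(path: &str) -> std::io::Result<()> {
    let file = std::fs::File::open(path)?;
    let mut ring = IoUring::new(8)?;

    // Register the fd once; later SQEs refer to it by index via types::Fixed,
    // skipping the per-request fd lookup/refcount work in the kernel.
    ring.submitter().register_files(&[file.as_raw_fd()])?;

    let mut buf = vec![0u8; 4096];
    let read = opcode::Read::new(types::Fixed(0), buf.as_mut_ptr(), buf.len() as u32)
        .offset(0)
        .build()
        .user_data(0x42);

    unsafe { ring.submission().push(&read).expect("submission queue full") };
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("completion missing");
    println!("read {} bytes", cqe.result());
    Ok(())
}
```

Registered buffers (with ReadFixed/WriteFixed opcodes) would follow the same pattern, but they require the buffers to stay pinned for the lifetime of the registration, which is where it gets complicated with the current codebase.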

However, before turning to implementation, better observability and benchmarking are needed. I will open an issue for each in a moment and edit this. Right now, the Limbo CLI uses log with env_logger, and core has criterion with pprof-rs. However, the set of benches and the current log points are not very exhaustive. Since Limbo is in heavy development, this is fine, but I think improving the situation now would benefit even the development process, as there is frequently a need to debug some behavior or to compare the performance of implementations.
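As a rough sketch of what wiring pprof-rs into a criterion bench looks like (the bench name and target are hypothetical, not existing Limbo benches):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use pprof::criterion::{Output, PProfProfiler};

// Hypothetical benchmark target; the body would call into the code under test.
fn bench_read_page(c: &mut Criterion) {
    c.bench_function("read_page", |b| {
        b.iter(|| {
            // exercise the IO path here
        })
    });
}

criterion_group! {
    name = benches;
    // Sample at 100 Hz and emit a flamegraph alongside criterion's report.
    config = Criterion::default().with_profiler(PProfProfiler::new(100, Output::Flamegraph(None)));
    targets = bench_read_page
}
criterion_main!(benches);
```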

For IO testing, we also need tests that stress high concurrency. I will have new hardware in a few days, where I can work on this better than on my personal system, which is tuned for responsiveness.

The performance improvements don't have to be isolated to the IO backends; it's just that I have worked mostly on those and have a deeper understanding of them. If anyone wants to add other parts of the system, I welcome the additions to this issue and will rename it.

Steps

An initial list of steps, in rough order

  • Align Node bindings to the completion model with an asynchronous Node API (Improve I/O efficiency with WebAssembly on Node #271)
  • Switch to the IOCP/IORing API on Windows (Switch to IOCP API on Windows for asynchronous I/O #41)
  • io_uring: register the file descriptors to the ring
  • Use logical logging? (Logical logging? #2)
  • Finish support for TPC-H benchmarking (OLAP) (TPC-H support #36)
  • Add support for TPC-E benchmarking (OLTP) (TPC-E support #685)
  • Consider adding PGO by default (Profile-Guided Optimization (PGO) benchmark report #78)
  • Support many in-flight IO requests. Right now, my understanding of the VDBE is pretty lacking, so I'm not sure how Limbo multiplexes IO requests, if at all, or whether it depends on the application setting up multiple threads.
  • io_uring: register buffers to the ring. This feature will depend on all of the following.
  • Make decisions around aligned allocations at runtime, when running natively, to optimize Direct I/O for the block device(s) used (see the sketch after this list). Goes hand in hand with the next item.
  • Make allocations explicit and optimize the BufferData size
  • Implement page readahead (Page readahead #203). I have some ideas for this, but it requires rewrites to a few components in VDBE/storage to provide hints to the IO layer; it might need help from Pere, Jussi, etc.
  • Implement a per-connection memory allocator (Feature: per-connection memory allocator configuration #523). This is very interesting and not that difficult, but sadly allocator-api2 doesn't have Rc and Arc support yet. I have asked about it here.
  • Re-architect the storage layer. Remove the page cache? [1][2][3]
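Regarding the aligned-allocation item above, a minimal sketch of what runtime alignment discovery could look like; the sysfs path, device name and sizes are illustrative, and a real implementation would first resolve the block device backing the database file:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

// Reads the device's logical block size from sysfs, falling back to 4096.
fn logical_block_size(dev: &str) -> usize {
    std::fs::read_to_string(format!("/sys/block/{dev}/queue/logical_block_size"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(4096)
}

// O_DIRECT requires the buffer address, length and file offset to all be
// multiples of the device's logical block size.
fn aligned_buffer(len: usize, align: usize) -> (*mut u8, Layout) {
    let layout = Layout::from_size_align(len, align).expect("bad layout");
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null());
    (ptr, layout)
}

fn main() {
    let align = logical_block_size("nvme0n1"); // illustrative device name
    let (ptr, layout) = aligned_buffer(64 * 1024, align);
    // ... issue O_DIRECT reads/writes into `ptr` here ...
    unsafe { dealloc(ptr, layout) };
}
```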

Issues that require or would be affected by this issue

Footnotes

  1. Swizzling (LeanStore)

  2. Umbra

  3. Virtual-Memory Assisted Buffer Management (LeanStore)

@PThorpe92
Contributor

Some great stuff here! ❤️

I also made some observations when redesigning the io_uring module a couple of weeks ago, and although it wasn't the direction the project wanted to go (it involved batching all writev SQEs into one syscall submission during an SQ overflow event), I walked away with some good insight into areas that could be improved. Pretty much all of which you have listed here already :)

  • io_uring: register FD's and buffers to the ring.
  • optimize Direct I/O for the block device(s) used.
  • optimize BufferData size

I am also gaining more context about the rest of the codebase daily, but it's great to have somewhere to note observations or open questions around IO performance 👍 I'll be sure to keep up with this thread.

Note: for reference, I believe that #570 is indeed fixed on main at the moment.

@LtdJorge
Contributor Author

I also made some observations when re-designing the io_uring module a couple weeks ago, and although it wasn't the direction the project wanted to go (involved batching all writev sqe's into 1 syscall submission during an event of SQ overflow), I walked away with some good insight on areas that could be improved.

Yup, I had a look at your PR a few days ago. If you come up with any other improvements, feel free to add them.

@PThorpe92
Contributor

I think we should evaluate the efficiency, or just the general use, of vectored IO operations with the io_uring backend. I personally still think there is a scenario where we can group contiguous writes and submit them as a single syscall. For anything larger than one page, at least, there seems to be a pretty big opportunity for a perf increase.
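To sketch what that grouping could look like (the types, names and fixed page size below are illustrative, not Limbo's actual structures):

```rust
const PAGE_SIZE: u64 = 4096;

struct PendingWrite {
    offset: u64,   // absolute file offset, page aligned
    data: Vec<u8>, // exactly one page of payload
}

/// Groups pending writes into runs of pages that are contiguous on disk, so
/// each run can be submitted as a single writev (or io_uring Writev SQE) with
/// one iovec per page, instead of one syscall/SQE per page.
fn coalesce(mut writes: Vec<PendingWrite>) -> Vec<Vec<PendingWrite>> {
    writes.sort_by_key(|w| w.offset);
    let mut groups: Vec<Vec<PendingWrite>> = Vec::new();
    for w in writes {
        match groups.last_mut() {
            // The new write starts exactly where the current run ends.
            Some(group) if group.last().unwrap().offset + PAGE_SIZE == w.offset => {
                group.push(w);
            }
            // Gap (or first write): start a new run.
            _ => groups.push(vec![w]),
        }
    }
    groups
}
```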
