This repo contains notebooks that experiment with data-processing tools that scale far better than Pandas. The notebooks explore these technologies and evaluate whether they fit Pipit's use case; if they do, they will eventually be integrated natively into Pipit to enable scalable analysis of traces.
There is already a tool called Modin that exposes a Pandas DataFrame-like frontend on top of various compute backends (such as Ray and Dask). Spark offers similar support through its pandas API on Spark (formerly Koalas), alongside its own DataFrame API.
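As a rough illustration of that drop-in style, the sketch below uses Modin in place of Pandas. It assumes Modin is installed with a Ray or Dask backend, and the file `trace.csv` and its columns are made up for the example, not a fixed Pipit schema.

```python
# Minimal sketch: a pandas-like frontend backed by a scalable engine.
# Assumes Modin is installed with a Ray or Dask backend; "trace.csv" is a
# hypothetical events file containing the columns used below.

import modin.pandas as pd  # same API surface as `import pandas as pd`

events = pd.read_csv("trace.csv")

# Familiar pandas operations, executed by the distributed backend.
summary = events.groupby("Name")["Timestamp (ns)"].agg(["count", "mean"])
print(summary.head())
```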
One important consideration is that these tools are highly optimized for column-wise operations, using techniques like vectorization (SIMD) and columnar compression. OLAP systems in general are built around this columnar access pattern.
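As a small illustration of why column-wise work is the fast path, the sketch below computes a derived column with a single vectorized expression instead of a per-row Python loop. The column names are illustrative only.

```python
# A whole-column expression is evaluated with vectorized (SIMD-friendly)
# kernels over contiguous memory, with no per-row Python overhead.
# Column names are made up for this example.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Timestamp (ns)": np.sort(np.random.randint(0, 10**9, size=1_000_000)),
    "Process": np.random.randint(0, 64, size=1_000_000),
})

# One vectorized pass over the timestamp column.
df["Inter-arrival (ns)"] = df["Timestamp (ns)"].diff()
```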
Unfortunately, many interesting algorithms require a row-by-row traversal, which on a distributed OLAP system means materializing the relevant subset of rows, an expensive operation.
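For contrast, here is a sketch of the kind of stateful, row-by-row traversal meant here: matching Enter/Leave events with a per-process stack. The column names and the helper function are assumptions for illustration, not Pipit's actual implementation, and the per-row state (the stack) is what makes this hard to express as a purely columnar operation.

```python
# Illustrative row-by-row traversal over a trace DataFrame. On a distributed
# OLAP backend, the relevant rows would first have to be materialized in order.
# `events` is assumed to have "Timestamp (ns)", "Event Type", and "Process" columns.

from collections import defaultdict

def match_enter_leave(events):
    stacks = defaultdict(list)   # process id -> stack of pending Enter row indices
    matches = []                 # (enter_index, leave_index) pairs
    for idx, row in events.sort_values("Timestamp (ns)").iterrows():
        if row["Event Type"] == "Enter":
            stacks[row["Process"]].append(idx)
        elif row["Event Type"] == "Leave" and stacks[row["Process"]]:
            matches.append((stacks[row["Process"]].pop(), idx))
    return matches
```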
One simple tradeoff we can make for maximum performance is redundancy: if we store the data both row-wise and column-wise, we can support both column-based operations (for time-series style analyses) and row-based operations (like the lateness algorithm linked above).
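A minimal sketch of that redundancy idea is below. `TraceStore` and its methods are hypothetical, not part of Pipit; the obvious cost of this design is roughly doubling the memory footprint in exchange for having a fast layout for each access pattern.

```python
# Sketch: keep the same trace in two layouts and pick the layout per operation.
# `TraceStore` is a hypothetical class, not part of Pipit.

import pandas as pd

class TraceStore:
    def __init__(self, df: pd.DataFrame):
        self.columnar = df                          # column-wise: fast aggregations
        self.rows = df.to_dict(orient="records")    # row-wise: fast ordered traversal

    def column_op(self):
        # e.g. a time-series style aggregation over whole columns
        return self.columnar.groupby("Process")["Timestamp (ns)"].max()

    def row_scan(self, visit):
        # e.g. a stateful pass for algorithms that need row order
        for record in self.rows:
            visit(record)
```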
Pipit repository: https://github.com/hpcgroup/pipit
Pipit paper: https://arxiv.org/pdf/2306.11177
SC23 poster: https://github.com/hsirkar/pdfs/blob/main/sc23-pipit-poster.pptx.pdf