This repo contains notebooks that experiment with data-processing tools that scale far better than Pandas. The notebooks explore these technologies and evaluate whether they fit Pipit's use case; if they do, they will eventually be integrated natively into Pipit to enable scalable analysis of traces.
There is already a tool called Modin that exposes a Pandas DataFrame-like frontend on top of various compute backends (such as Ray and Dask). Spark offers similar support through its pandas API on Spark (formerly Koalas), alongside its own DataFrame API.
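As a rough illustration of that drop-in style, the sketch below uses Modin in place of Pandas. It assumes Modin is installed with a Ray or Dask backend, and the file `trace.csv` and its columns are made up for the example, not a fixed Pipit schema.

```python
# Minimal sketch: a pandas-like frontend backed by a scalable engine.
# Assumes Modin is installed with a Ray or Dask backend; "trace.csv" is a
# hypothetical events file containing the columns used below.

import modin.pandas as pd  # same API surface as `import pandas as pd`

events = pd.read_csv("trace.csv")

# Familiar pandas operations, executed by the distributed backend.
summary = events.groupby("Name")["Timestamp (ns)"].agg(["count", "mean"])
print(summary.head())
```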
One important consideration is that these tools are highly optimized for column-wise operations, using techniques like vectorization (SIMD) and columnar compression. OLAP systems in general are built around this columnar access pattern.
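As a small illustration of why column-wise work is the fast path, the sketch below computes a derived column with a single vectorized expression instead of a per-row Python loop. The column names are illustrative only.

```python
# A whole-column expression is evaluated with vectorized (SIMD-friendly)
# kernels over contiguous memory, with no per-row Python overhead.
# Column names are made up for this example.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Timestamp (ns)": np.sort(np.random.randint(0, 10**9, size=1_000_000)),
    "Process": np.random.randint(0, 64, size=1_000_000),
})

# One vectorized pass over the timestamp column.
df["Inter-arrival (ns)"] = df["Timestamp (ns)"].diff()
```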
Unfortunately, many interesting algorithms require a row-by-row traversal, which on a distributed OLAP system means materializing the relevant subset of rows, an expensive operation.
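For contrast, here is a sketch of the kind of stateful, row-by-row traversal meant here: matching Enter/Leave events with a per-process stack. The column names and the helper function are assumptions for illustration, not Pipit's actual implementation, and the per-row state (the stack) is what makes this hard to express as a purely columnar operation.

```python
# Illustrative row-by-row traversal over a trace DataFrame. On a distributed
# OLAP backend, the relevant rows would first have to be materialized in order.
# `events` is assumed to have "Timestamp (ns)", "Event Type", and "Process" columns.

from collections import defaultdict

def match_enter_leave(events):
    stacks = defaultdict(list)   # process id -> stack of pending Enter row indices
    matches = []                 # (enter_index, leave_index) pairs
    for idx, row in events.sort_values("Timestamp (ns)").iterrows():
        if row["Event Type"] == "Enter":
            stacks[row["Process"]].append(idx)
        elif row["Event Type"] == "Leave" and stacks[row["Process"]]:
            matches.append((stacks[row["Process"]].pop(), idx))
    return matches
```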
One simple tradeoff we can make for maximum performance is redundancy: if we store the data both row-wise and column-wise, we can support both column-based operations (for time-series style analyses) and row-based operations (like the lateness algorithm linked above).
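A minimal sketch of that redundancy idea is below. `TraceStore` and its methods are hypothetical, not part of Pipit; the obvious cost of this design is roughly doubling the memory footprint in exchange for having a fast layout for each access pattern.

```python
# Sketch: keep the same trace in two layouts and pick the layout per operation.
# `TraceStore` is a hypothetical class, not part of Pipit.

import pandas as pd

class TraceStore:
    def __init__(self, df: pd.DataFrame):
        self.columnar = df                          # column-wise: fast aggregations
        self.rows = df.to_dict(orient="records")    # row-wise: fast ordered traversal

    def column_op(self):
        # e.g. a time-series style aggregation over whole columns
        return self.columnar.groupby("Process")["Timestamp (ns)"].max()

    def row_scan(self, visit):
        # e.g. a stateful pass for algorithms that need row order
        for record in self.rows:
            visit(record)
```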
Pipit repository: https://github.com/hpcgroup/pipit
Pipit paper: https://arxiv.org/pdf/2306.11177
SC23 poster: https://github.com/hsirkar/pdfs/blob/main/sc23-pipit-poster.pptx.pdf