Intersect skewed sparse iterators by galloping #524

remysucre · 2022-05-08T16:21:19Z

Currently intersection is done by advancing each iterator by 1 every loop iteration. This can be slow when the intersected iterators are skewed. For example, intersecting a sparse dimension with N entries with another one with a single entry at the last position will take N steps. In such cases intersecting by galloping can bring the cost down to logarithm time. The following shows the galloping code that intersects 2 sparse vectors:

  while (ix < px1_end && iy < py1_end) {
    int32_t ix0 = x1_crd[ix];
    int32_t iy0 = y1_crd[iy];
    int32_t i = TACO_MAX(ix0,iy0);
    if (ix0 == i && iy0 == i) {
      q_val += x_vals[ix] * y_vals[iy];
      ix++;
      iy++;
    }
    ix = taco_gallop(x1_crd, ix, x1_pos[1], i);
    iy = taco_gallop(y1_crd, iy, y1_pos[1], i);
  }

src/lower/lowerer_impl_imperative.cpp

rohany

I took an initial pass here, and left some comments.

As you suggested in the email, we should make galloping a scheduling directive. To do this, I suggest the following: Look at transformations.cpp to see the style that other scheduling transformations are written in. This likely does not need to be as involved as other transformations. I imagine a directive .setMergeStrategy(IndexVar) that takes an enum MergeStrategy, where for now MergeStrategy can either be something like TwoFingerMerge or GallopingMerge. This transformation can then just rewrite the referenced Forall in the input IndexStmt to tag the ForallNode with this iteration strategy.

Additionally, I think galloping should be applied by default when users apply windowing through an index set iterator, similarly to how in numpy users can index arrays with other arrays to reference only a subset of the columns. The relevant code that I imagine you needing to change is at

taco/src/lower/lowerer_impl_imperative.cpp

Line 1641 in d0654a8

    
           for (auto& iter : filter(iterators, [](Iterator it) { return it.hasIndexSet(); })) {

, where the index set and the accessed level should be galloped.

FInally, I've allowed your PR to run all of the tests, so you can see what you're failing. We'll need the tests you have disabled to also start passing. You will likely have to also add the gallop function definition to the CUDA code generator, but I don't think that there is anything more sophisticated that will need to be changed there.

src/codegen/codegen_c.cpp

src/lower/lowerer_impl_imperative.cpp

apps/tensor_times_vector/tensor_times_vector.cpp

remysucre

Addressed some low level comments; will address the rest after implementing scheduling.

src/lower/lowerer_impl_imperative.cpp

rohany · 2022-05-10T21:01:55Z

@remysucre it looks like something in the test suite is hanging with your changes. It'll be best to run it locally to debug further.

remysucre · 2022-05-10T21:17:27Z

@remysucre it looks like something in the test suite is hanging with your changes. It'll be best to run it locally to debug further.

Ah yes, please disable the tests for now. The changes break them until I implement scheduling.

remysucre · 2022-05-11T19:47:25Z

Still WIP (so some tests may hang), I will mark the PR as ready for review once I'm done.

remysucre · 2022-05-12T06:43:14Z

I still need to add more tests, but currently some simple examples work and all existing tests pass on my machine. I probably won't touch the code until next week, so please feel free to make a pass. Thanks!

rohany

This is looking really good! Thanks for your attention to detail, especially in a new codebase. I've requested mostly nits and verification of preconditions at the time of transformation application rather than during lowering.

Once you add some tests we'll be good to merge this, since the full suite is passing CI right now.

include/taco/lower/lowerer_impl_imperative.h

src/codegen/codegen_c.cpp

src/index_notation/transformations.cpp

tools/taco.cpp

src/lower/lowerer_impl_imperative.cpp

include/taco/taco_tensor_t.h

remysucre · 2022-05-18T23:49:53Z

I addressed the comments and added some tests. I'm a bit out of my depth for how galloping should interact with other scheduling directives (or even how the existing directives work), so I just pushed a few cases where the tests passed.

rohany

Thanks for your work remy, the code looks great! I don't think galloping has too many interactions with other scheduling directives and should mostly "just work" with the other directives. If problems come up we should be able to fix them.

Before I merge this, please squash all of your commits into a single one.

This commit addes a new scheduling directive called mergeby. This directive specifies if the iterators of a given variable is merged by Two Finger merge or Galloping. The default strategy is Two Finger merge which is the same as the old behavior. Galloping merges the iterators with exponential search, which can be more efficient if the iterator sizes are skewed.

remysucre · 2022-05-19T00:42:15Z

Thank you for your help! I've squashed the commits.

rohany · 2022-05-19T16:30:06Z

Thanks for your work remy! Please reach out if you have questions / need guidance around performing benchmarks etc.

remysucre commented May 8, 2022

View reviewed changes

src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved

src/lower/lowerer_impl_imperative.cpp Show resolved Hide resolved

rohany requested changes May 8, 2022

View reviewed changes

remysucre commented May 10, 2022

View reviewed changes

src/lower/lowerer_impl_imperative.cpp Show resolved Hide resolved

src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved

src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved

rohany requested changes May 12, 2022

View reviewed changes

remysucre force-pushed the master branch from 491517a to a9a3d42 Compare May 18, 2022 17:01

remysucre commented May 18, 2022

View reviewed changes

include/taco/taco_tensor_t.h Outdated Show resolved Hide resolved

remysucre marked this pull request as ready for review May 18, 2022 23:46

rohany approved these changes May 19, 2022

View reviewed changes

remysucre force-pushed the master branch from ca545a7 to 7a05d63 Compare May 19, 2022 00:40

rohany merged commit 44847b7 into tensor-compiler:master May 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intersect skewed sparse iterators by galloping #524

Intersect skewed sparse iterators by galloping #524

remysucre commented May 8, 2022

rohany left a comment

remysucre left a comment

rohany commented May 10, 2022

remysucre commented May 10, 2022

remysucre commented May 11, 2022

remysucre commented May 12, 2022

rohany left a comment

remysucre commented May 18, 2022

rohany left a comment

remysucre commented May 19, 2022

rohany commented May 19, 2022

Intersect skewed sparse iterators by galloping #524

Intersect skewed sparse iterators by galloping #524

Conversation

remysucre commented May 8, 2022

rohany left a comment

Choose a reason for hiding this comment

remysucre left a comment

Choose a reason for hiding this comment

rohany commented May 10, 2022

remysucre commented May 10, 2022

remysucre commented May 11, 2022

remysucre commented May 12, 2022

rohany left a comment

Choose a reason for hiding this comment

remysucre commented May 18, 2022

rohany left a comment

Choose a reason for hiding this comment

remysucre commented May 19, 2022

rohany commented May 19, 2022