Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intersect skewed sparse iterators by galloping #524

Merged
merged 1 commit into from
May 19, 2022

Conversation

remysucre
Copy link
Contributor

Currently intersection is done by advancing each iterator by 1 every loop iteration. This can be slow when the intersected iterators are skewed. For example, intersecting a sparse dimension with N entries with another one with a single entry at the last position will take N steps. In such cases intersecting by galloping can bring the cost down to logarithm time. The following shows the galloping code that intersects 2 sparse vectors:

  while (ix < px1_end && iy < py1_end) {
    int32_t ix0 = x1_crd[ix];
    int32_t iy0 = y1_crd[iy];
    int32_t i = TACO_MAX(ix0,iy0);
    if (ix0 == i && iy0 == i) {
      q_val += x_vals[ix] * y_vals[iy];
      ix++;
      iy++;
    }
    ix = taco_gallop(x1_crd, ix, x1_pos[1], i);
    iy = taco_gallop(y1_crd, iy, y1_pos[1], i);
  }

Copy link
Contributor

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took an initial pass here, and left some comments.

As you suggested in the email, we should make galloping a scheduling directive. To do this, I suggest the following: Look at transformations.cpp to see the style that other scheduling transformations are written in. This likely does not need to be as involved as other transformations. I imagine a directive .setMergeStrategy(IndexVar) that takes an enum MergeStrategy, where for now MergeStrategy can either be something like TwoFingerMerge or GallopingMerge. This transformation can then just rewrite the referenced Forall in the input IndexStmt to tag the ForallNode with this iteration strategy.

Additionally, I think galloping should be applied by default when users apply windowing through an index set iterator, similarly to how in numpy users can index arrays with other arrays to reference only a subset of the columns. The relevant code that I imagine you needing to change is at

for (auto& iter : filter(iterators, [](Iterator it) { return it.hasIndexSet(); })) {
, where the index set and the accessed level should be galloped.

FInally, I've allowed your PR to run all of the tests, so you can see what you're failing. We'll need the tests you have disabled to also start passing. You will likely have to also add the gallop function definition to the CUDA code generator, but I don't think that there is anything more sophisticated that will need to be changed there.

src/codegen/codegen_c.cpp Outdated Show resolved Hide resolved
src/codegen/codegen_c.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
apps/tensor_times_vector/tensor_times_vector.cpp Outdated Show resolved Hide resolved
Copy link
Contributor Author

@remysucre remysucre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed some low level comments; will address the rest after implementing scheduling.

src/lower/lowerer_impl_imperative.cpp Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
@rohany
Copy link
Contributor

rohany commented May 10, 2022

@remysucre it looks like something in the test suite is hanging with your changes. It'll be best to run it locally to debug further.

@remysucre
Copy link
Contributor Author

@remysucre it looks like something in the test suite is hanging with your changes. It'll be best to run it locally to debug further.

Ah yes, please disable the tests for now. The changes break them until I implement scheduling.

@remysucre
Copy link
Contributor Author

Still WIP (so some tests may hang), I will mark the PR as ready for review once I'm done.

@remysucre
Copy link
Contributor Author

I still need to add more tests, but currently some simple examples work and all existing tests pass on my machine. I probably won't touch the code until next week, so please feel free to make a pass. Thanks!

Copy link
Contributor

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good! Thanks for your attention to detail, especially in a new codebase. I've requested mostly nits and verification of preconditions at the time of transformation application rather than during lowering.

Once you add some tests we'll be good to merge this, since the full suite is passing CI right now.

include/taco/lower/lowerer_impl_imperative.h Show resolved Hide resolved
include/taco/lower/lowerer_impl_imperative.h Show resolved Hide resolved
src/codegen/codegen_c.cpp Outdated Show resolved Hide resolved
src/index_notation/transformations.cpp Outdated Show resolved Hide resolved
src/index_notation/transformations.cpp Show resolved Hide resolved
tools/taco.cpp Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
src/lower/lowerer_impl_imperative.cpp Outdated Show resolved Hide resolved
@remysucre remysucre marked this pull request as ready for review May 18, 2022 23:46
@remysucre
Copy link
Contributor Author

I addressed the comments and added some tests. I'm a bit out of my depth for how galloping should interact with other scheduling directives (or even how the existing directives work), so I just pushed a few cases where the tests passed.

Copy link
Contributor

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work remy, the code looks great! I don't think galloping has too many interactions with other scheduling directives and should mostly "just work" with the other directives. If problems come up we should be able to fix them.

Before I merge this, please squash all of your commits into a single one.

This commit addes a new scheduling directive called mergeby. This directive specifies if the iterators of a given variable is merged by Two Finger merge or Galloping. The default strategy is Two Finger merge which is the same as the old behavior. Galloping merges the iterators with exponential search, which can be more efficient if the iterator sizes are skewed.
@remysucre
Copy link
Contributor Author

Thank you for your help! I've squashed the commits.

@rohany rohany merged commit 44847b7 into tensor-compiler:master May 19, 2022
@rohany
Copy link
Contributor

rohany commented May 19, 2022

Thanks for your work remy! Please reach out if you have questions / need guidance around performing benchmarks etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants