
[VL] Vanilla Spark broadcast exchange + R2C is slow sometimes #5136

Closed
zhztheplayer opened this issue Mar 27, 2024 · 4 comments
Labels
bug Something isn't working triage

Comments

@zhztheplayer
Member

zhztheplayer commented Mar 27, 2024

Backend

VL (Velox)

Bug description

Vanilla Spark's broadcast exchange followed by R2C is sometimes slow because the code that converts vanilla Spark's hashed relation into Gluten's format sometimes produces duplicated rows.

The fix will be incorporated into #5058 (at commit 96c3fc7), since it can be tested by the ACBO changes.

@zhztheplayer zhztheplayer added bug Something isn't working triage labels Mar 27, 2024
@zhztheplayer zhztheplayer changed the title [VL] Vanilla Spark broadcast exchange + C2R is slow sometimes [VL] Vanilla Spark broadcast exchange + R2C is slow sometimes Mar 27, 2024
@ulysses-you
Contributor

Thank you @zhztheplayer. It's a good point: columnar broadcast would broadcast the original binary data, but vanilla Spark would broadcast the hash relation. So I think this issue is a common case even when there is no R2C.

Is it possible to create a new PR for this issue?

@zhztheplayer
Member Author

I don't have dedicated UTs for it, so it was incorporated into the other PR.

Still, I can open one for it if you think it's needed: #5141.

The change has already been tested, so I will proceed to merge once the code style check passes, if that's OK with you.

@zhztheplayer
Member Author

zhztheplayer commented Mar 27, 2024

The major issue I have found is that the flatMap approach causes UnsafeHashedRelation to produce duplicated rows in my case (TPC-DS q14a with the current version of ACBO), while the map approach causes LongHashedRelation to lose rows (TPC-DS q2).

The following fix (the same as in #5141) works, but I haven't dug into it deeply enough to find the root cause of the inconsistency (maybe related to keyIsUnique? I am not sure).

  private def reconstructRows(relation: HashedRelation): Iterator[InternalRow] = {
    // LongHashedRelation and UnsafeHashedRelation don't appear to follow the same
    // criteria when values are retrieved from them; the internals of this code
    // path should be reviewed.
    relation match {
      // Unique keys: each key maps to exactly one row, so getValue is sufficient.
      case relation: LongHashedRelation if relation.keyIsUnique =>
        relation.keys().map(k => relation.getValue(k))
      // Non-unique keys: get(k) returns every row for the key, so flatMap is
      // needed to avoid losing rows.
      case relation: LongHashedRelation if !relation.keyIsUnique =>
        relation.keys().flatMap(k => relation.get(k))
      // UnsafeHashedRelation: iterate value-with-key-index entries to cover
      // every stored row exactly once.
      case other => other.valuesWithKeyIndex().map(_.getValue)
    }
  }
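The duplicate/lost-row behavior above comes down to how many rows each key maps to. A minimal sketch using plain Scala collections as a hypothetical stand-in for Spark's HashedRelation (not Gluten code) shows why a single-value lookup per key drops rows when a key is non-unique, while expanding every match per key keeps them all:

```scala
// Hypothetical stand-in for a hashed relation: a key may map to several rows.
val relation: Map[Long, Seq[String]] =
  Map(1L -> Seq("a"), 2L -> Seq("b1", "b2"))

// getValue-style lookup: one row per key. Loses "b2" because key 2 is non-unique.
val perKeySingle: Seq[String] =
  relation.keys.toSeq.sorted.map(k => relation(k).head)

// get-style lookup with flatMap: expands all rows for each key. Nothing is lost.
val allRows: Seq[String] =
  relation.keys.toSeq.sorted.flatMap(k => relation(k))
```

This is only an illustration of the uniqueness pitfall; the real HashedRelation API additionally distinguishes keyIsUnique, which is why the fix above branches on it.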

@zhztheplayer
Member Author

Fixed in #5141. I assume we can close this now.
