-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use faster multi-contains in rlike
regex rewrite
#11810
Use faster multi-contains in rlike
regex rewrite
#11810
Conversation
Signed-off-by: Haoyang Li <[email protected]>
rlike
regex rewrite
Signed-off-by: Haoyang Li <[email protected]>
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is leaking intermediate columns when foldLeft is used.
} | ||
withResource(boolCvs) { _ => | ||
val falseCv = withResource(Scalar.fromBool(false)) { falseScalar => | ||
ColumnVector.fromScalar(falseScalar, input.getRowCount.toInt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't reduceLeft
do what you want and not require creating a boolean column? Also how to do you close the intermediate values when running foldLeft, or reduceLeft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't reduceLeft do what you want and not require creating a boolean column?
For the lambda (l, r) => l or r)
in the reduce
, l
can either be the first value which protected with the outer withResource
or an intermediate value that needs to add a withResource
, so I failed to find an unified way to write the lambda. Updated to a little tricky way with foldLeft to save a boolean column.
Also how to do you close the intermediate values when running foldLeft, or reduceLeft.
They are closed by the withResource
in (l, r) => withResource(l) { _ => l.or(r)}
, where l
is always an intermediate value.
Could you redo your speedup numbers? Mixing % and x speedup factors is confusing. Also the 2.31x number appears to be wrong. That or the times are wrong because 6,399 ms / 3,765 ms is 1.70x speedup (70%) not 2.31x (131%). |
Signed-off-by: Haoyang Li <[email protected]>
Thanks, updated in the description. |
} | ||
withResource(boolCvs.tail) { _ => | ||
// boolCvs.head and intermediate values are closed within the withResource in the lambda | ||
boolCvs.tail.foldLeft(boolCvs.head)((l, r) => withResource(l) { _ => l.or(r)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still has not closed item.
val a = Array(1, 2, 5)
a.tail.foldLeft(a.head)((acc, c) => {println(s"acc is $acc, current is $c"); acc + c})
outputs:
acc is 1, current is 2
acc is 3, current is 5
Your approach closed 3 times, but we need to close 4 times(including the intermediate result)
Items need to close: 1, 2, 5 and 3(intermediate result)
Do the following chang:
withResource(boolCvs.tail) { _ =>
// boolCvs.head and intermediate values are closed within the withResource in the lambda
boolCvs.tail.foldLeft(boolCvs.head)((l, r) => withResource(l) { _ => l.or(r)})
==>>
boolCvs.tail.foldLeft(boolCvs.head)((acc, c) =>
withResource(acc) { _ =>
withResource(c) {_ =>
acc.or(c)
}
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just do a for loop. We are trying really hard to do something fancy when we can just make it simple.
var ret: ColumnVector = null
withResource(boolCvs) { _ =>
boolCvs.indicies.foreach { i =>
if (ret == null) {
ret = boolCvs[i]
boolCvs[i] = null
} else {
val tmp = ret.or(boolCvs[i])
ret.close()
ret = tmp
boolCvs[i].close()
boolCvs[i] = null
}
}
}
Once we have that working, then you can play around with ways to make it cleaner and/or faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my previous comment
Haoyang's method is right, I ignored that withResource(boolCvs.tail)
will close multiple times if boolCvs.size > 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just do a for loop. We are trying really hard to do something fancy when we can just make it simple.
var ret: ColumnVector = null withResource(boolCvs) { _ => boolCvs.indicies.foreach { i => if (ret == null) { ret = boolCvs[i] boolCvs[i] = null } else { val tmp = ret.or(boolCvs[i]) ret.close() ret = tmp boolCvs[i].close() boolCvs[i] = null } } }
Once we have that working, then you can play around with ways to make it cleaner and/or faster.
Talked with @res-life offline, we think the following method is correct:
withResource(boolCvs.tail) { _ =>
boolCvs.tail.foldLeft(boolCvs.head)((l, r) => withResource(l) { _ => l.or(r)})
}
because all elements in boolCvs.tail
can be closed in outer withResource
, boolCvs.head
and intermediate results are closed in inner withResource
.
But your approach looks better because it can close items in boolCvs
earlier after use, so the memory usage is less. Updated to this method with some changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok updated again, this should be cleaner and close values as early as the foreach
method:
closeOnExcept(boolCvs.tail) { _ =>
boolCvs.tail.foldLeft(boolCvs.head) {
(l, r) => withResource(l) { _ =>
withResource(r) { _ =>
l.or(r)
}
}
}
}
please take a look.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Could you try using AST to do the ORs? An alternative would be to have a separate multi-contains that does the or automatically. |
withResource(acc) { _ => | ||
withResource(containsSearch) { _ => | ||
acc.or(containsSearch) | ||
closeOnExcept(boolCvs.tail) { _ => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work. We will get double frees if there is an exception. boolCvs.tail
is not updated when the or happens. That is why I wrote the code that I did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a complex safeMap
so that we can have it be safe and at the same time make the resulting code simple.
spark-rapids/sql-plugin/src/main/scala-2.12/com/nvidia/spark/rapids/implicits.scala
Lines 155 to 238 in 561068c
/** | |
* safeMap: safeMap implementation that is leveraged by other type-specific implicits. | |
* | |
* safeMap has the added safety net that as you produce AutoCloseable values they are | |
* tracked, and if an exception were to occur within the maps's body, it will make every | |
* attempt to close each produced value. | |
* | |
* Note: safeMap will close in case of errors, without any knowledge of whether it should | |
* or not. | |
* Use safeMap only in these circumstances if `fn` increases the reference count, | |
* producing an AutoCloseable, and nothing else is tracking these references: | |
* a) seq.safeMap(x => {...; x.incRefCount; x}) | |
* b) seq.safeMap(x => GpuColumnVector.from(...)) | |
* | |
* Usage of safeMap chained with other maps is a bit confusing: | |
* | |
* seq.map(GpuColumnVector.from).safeMap(couldThrow) | |
* | |
* Will close the column vectors produced from couldThrow up until the time where safeMap | |
* throws. | |
* | |
* The correct pattern of usage in cases like this is: | |
* | |
* val closeTheseLater = seq.safeMap(GpuColumnVector.from) | |
* closeTheseLater.safeMap{ x => | |
* var success = false | |
* try { | |
* val res = couldThrow(x.incRefCount()) | |
* success = true | |
* res // return a ref count of 2 | |
* } finally { | |
* if (!success) { | |
* // in case of an error, we close x as part of normal error handling | |
* // the exception will be caught by the safeMap, and it will close all | |
* // AutoCloseables produced before x | |
* // - Sequence looks like: [2, 2, 2, ..., 2] + x, which has also has a refcount of 2 | |
* x.close() // x now has a ref count of 1, the rest of the sequence has 2s | |
* } | |
* } | |
* } // safeMap cleaned, and now everything has 1s for ref counts (as they were before) | |
* | |
* closeTheseLater.safeClose() // go from 1 to 0 in all things inside closeTheseLater | |
* | |
* @param in the Seq[A] to map on | |
* @param fn a function that takes A, and produces B (a subclass of AutoCloseable) | |
* @tparam A the type of the elements in Seq | |
* @tparam B the type of the elements produced in the safeMap (should be subclasses of | |
* AutoCloseable) | |
* @tparam Repr the type of the input collection (needed by builder) | |
* @tparam That the type of the output collection (needed by builder) | |
* @return a sequence of B, in the success case | |
*/ | |
protected def safeMap[B <: AutoCloseable, That]( | |
in: collection.SeqLike[A, Repr], | |
fn: A => B) | |
(implicit bf: CanBuildFrom[Repr, B, That]): That = { | |
def builder: mutable.Builder[B, That] = { | |
val b = bf(in.asInstanceOf[Repr]) | |
b.sizeHint(in) | |
b | |
} | |
val b = builder | |
for (x <- in) { | |
var success = false | |
try { | |
b += fn(x) | |
success = true | |
} finally { | |
if (!success) { | |
val res = b.result() // can be a SeqLike or an Array | |
res match { | |
// B is erased at this point, even if ClassTag is used | |
// @ unchecked suppresses a warning that the type of B | |
// was eliminated due to erasure. That said B is AutoCloseble | |
// and SeqLike[AutoCloseable, _] is defined | |
case b: collection.SeqLike[B @ unchecked, _] => b.safeClose() | |
case a: Array[AutoCloseable] => a.safeClose() | |
} | |
} | |
} | |
} | |
b.result() | |
} | |
} |
Could we just add a safeReduce
into implicits.scala instead of trying so hard to make this small piece of code simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind it is just a helper function safeReduceAndClose
. We don't need to jump through hoops to make the code simple if inherently what we want to do is not simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work. We will get double frees if there is an exception.
boolCvs.tail
is not updated when the or happens. That is why I wrote the code that I did.
Yes you are right, I missed that point and thought an outer withResource
would lead to double closes. Update back to withResource
for now.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Thanks @revans2, tried AST and it looks to work with another 1x speed up in the operator level: Updated the performance results too. Not sure if I coded it right, please take another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks good to me.
build |
Closes #11729
We have
GpuMultipleContains
to check if the input column contains any of the strings in a list, for regex rewrite. After that, we made a performance improvement for multiple string contains, which provides a similar function and runs fast.This PR leverages methods in
GpuMultiContains
toGpuMultipleContains
, and renames it toGpuContainsAny
. So regex rewrite can also get benefit from the multiple string contains performance improvement.Performance results on a 1000000 string dataset of 5000~10000 length, averaged three times: