-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a few more stage level metrics #11821
add a few more stage level metrics #11821
Conversation
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @binmahone. The new metrics seem useful. I left some minor comments.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala
Outdated
Show resolved
Hide resolved
@@ -316,7 +319,8 @@ private final class SemaphoreTaskInfo(val stageId: Int, val taskAttemptId: Long) | |||
if (hasSemaphore) { | |||
semaphore.release(numPermits) | |||
hasSemaphore = false | |||
lastHeld = System.currentTimeMillis() | |||
lastReleased = System.nanoTime() | |||
GpuTaskMetrics.get.addGpuTime(lastReleased - lastAcquired) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call it semaphoreTime
instead? Because what we measure is how long the task has held the semaphore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
semaphoreTime could also be ambiguous because it could either mean "time acquiring semaphore" or "time having the semaphore". In fact, if you take a looks at the equation: "total_core_time = total_gpu_time + total_acquire_gpu_time + other_time_spent_on_cpu", it is somehow intuitive to call it "gpuTime"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can call it semaphoreHoldingTime
if it's not clear. The reason I prefer semaphore
to gpu
is semaphore
is less misleading. We may not always use the gpu for the whole time while holding the semaphore.
Though, I realized that the naming cannot solve the ambiguity problem completely. I would suggest adding some comment here that explains exactly how this metric is measured. The naming will be less important once documentation is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to semaphoreHoldingTime. since semaphoreHoldingTime is quite self explainatory, I skipped the comment
sql-plugin/src/main/scala/com/nvidia/spark/rapids/PrioritySemaphore.scala
Show resolved
Hide resolved
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/PrioritySemaphore.scala
Show resolved
Hide resolved
|
||
import org.apache.spark.sql.rapids.GpuTaskMetrics | ||
|
||
class PrioritySemaphore[T](val maxPermits: Int, val priorityForNonStarted: T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is there any case where we want to use a different value of priorityForNonStarted
than GpuSemaphore.DEFAULT_PRIORITY
? If not, we can probably just use GpuSemaphore.DEFAULT_PRIORITY
directly in this class instead of passing it to the constructor. The reason I prefer using the static variable directly is that this constructor parameter seems to tell me that it can be other values than GpuSemaphore.DEFAULT_PRIORITY
in some cases, which I'm not sure if it's true. This is more for code readability and easy to change later if needed, so I'm OK with the current code as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the major reason for val priorityForNonStarted: T
is that PrioritySemaphore has a template class T
(I have no idea why template class is needed here.)
@@ -316,7 +319,8 @@ private final class SemaphoreTaskInfo(val stageId: Int, val taskAttemptId: Long) | |||
if (hasSemaphore) { | |||
semaphore.release(numPermits) | |||
hasSemaphore = false | |||
lastHeld = System.currentTimeMillis() | |||
lastReleased = System.nanoTime() | |||
GpuTaskMetrics.get.addGpuTime(lastReleased - lastAcquired) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can call it semaphoreHoldingTime
if it's not clear. The reason I prefer semaphore
to gpu
is semaphore
is less misleading. We may not always use the gpu for the whole time while holding the semaphore.
Though, I realized that the naming cannot solve the ambiguity problem completely. I would suggest adding some comment here that explains exactly how this metric is measured. The naming will be less important once documentation is done.
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @binmahone for addressing my comments. LGTM!
…s_on_gpu_contention
build |
This PR closes #11820 by adding the three metrics pointed by green arrow: