Skip to content

[native] Differentiate number of drivers and number of splits in stats reporting #24671

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 16, 2025

Conversation

anandamideShakyan
Copy link
Contributor

@anandamideShakyan anandamideShakyan commented Mar 4, 2025

Description

Presto Java reports number of drivers to number of splits in Presto UI because split and driver are 1 to 1 mapping relationship. This is not true in Prestissimo where 1 driver handles many splits. We have to extend the API (protocol::TaskStats and Presto coordinator UI) to have separate splits and drivers information.
This PR updates TaskStats, QueryStats, and BasicQueryStats to separately report driver and split stats:

  1. Presto UI currently assumes a 1:1 mapping between splits and drivers, which is incorrect for Prestissimo, where 1 driver handles multiple splits.
  2. This change extends the API (protocol::TaskStats) and updates the Presto Coordinator UI to properly display both split and driver statistics separately.
  3. The changes are propagated through StageExecutionStats, QueryStats, and BasicQueryStats to ensure consistency in reporting.

Motivation and Context

  • Separates driver and split reporting in Presto UI for both java and Prestissimo worker. The driver and split stats are identical in case of java worker due to 1:1 mapping.
  • Improves query monitoring and resource tracking by providing accurate execution details.
  • Helps in better debugging and performance optimizations by exposing correct metrics.
  • Addresses #23758

Impact

API Change:
The API now includes separate fields for totalSplits, queuedSplits, runningSplits, and completedSplits in the TaskStats, StageExecutionStats, and QueryStats classes. These additional fields are ultimately passed to QueryStateInfo, which serves as the data structure used to send infromation for each query to the UI.

UI Changes:
UI components displaying query execution stats will now show separate splits and drivers information separately. You can see in the screenshot below. I do need to ask the code owners about the new icons that need to be used to display split and driver stats, right now I am just reusing the same icons for both.

Java Worker:
Screenshot 2025-03-05 at 7 56 55 PM

Prestissimo Worker:
Screenshot 2025-03-05 at 8 01 54 PM

Future Work in UI:

  1. Decide on new icons to represent drivers and splits stats separately
  2. In QueryDetails and QueryOverview components we are displaying blockedDrivers in place of blockedSplits as there is no blockedSplits field in velox TaskStats, if we want to display blockedSplits then we need to expand TaskStats struct in velox.

Latest update in UI:

Screenshot 2025-04-04 at 12 23 15 PM Screenshot 2025-04-04 at 12 23 23 PM

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Mar 4, 2025
@prestodb-ci prestodb-ci requested review from a team, bibith4 and nmahadevuni and removed request for a team March 4, 2025 12:53
@anandamideShakyan anandamideShakyan force-pushed the shakyan-split branch 2 times, most recently from 776838c to 4c23ac5 Compare March 4, 2025 13:10
@anandamideShakyan anandamideShakyan changed the title [native] Differentiate number of drivers and number of splits in stats reporting (Part-1) [native] Differentiate number of drivers and number of splits in stats reporting Mar 5, 2025
@steveburnett
Copy link
Contributor

Nit: In Motivation and Context, suggest changing

- Addresses [link](https://github.com/prestodb/presto/issues/23758)

to

- Addresses [#23758](https://github.com/prestodb/presto/issues/23758)

@yhwang
Copy link
Member

yhwang commented Mar 6, 2025

some thoughts for the UI:

  • For backward compatibility, if there is no split information, don't use the new split fields and only show one row using driver data.
  • I am okay with two rows using the same icons for completed, running, and queued numbers. Ideally, adding an icon before each row to represent split and driver would be good. However, the tooltip info should be good enough for now.

@nmahadevuni
Copy link
Member

@anandamideShakyan Please check the failed tests. You can run the same on your local machine and then if not related, rerun the tests in CI.

@steveburnett steveburnett removed their request for review March 13, 2025 21:57
@anandamideShakyan
Copy link
Contributor Author

anandamideShakyan commented Mar 13, 2025

@nmahadevuni Some refactoring was needed in TestPlanPrinter.java afterwards the tests are running successfully on my local machine. I have pushed these changes here.
In the CI environment presto-iceberg test failure is caused due to the function not completing its execution before the timeout. As for the prestocpp-linux-build-for-test it is failing because it is trying to read from a non existent file /tmp/this-should-not-exist/velox--missing-broadcast-file.bin, this is to do with velox. I want to trigger a rerun for the second one, how can I rerun a test in CI, do I need to have any access?

@anandamideShakyan
Copy link
Contributor Author

Hi @kewang1024, Could you please elaborate on how does the scheduler uses queuedPartitionedDrivers and runningPartitionedDrivers and where does this comparison happens. In PrestoTask.cpp, as far as I can see that there are no changes done to these two:

info.taskStatus.queuedPartitionedDrivers =
      veloxTaskStats.numQueuedTableScanSplits;
  info.taskStatus.runningPartitionedDrivers =
      veloxTaskStats.numRunningTableScanSplits;

My changes are only limited to runningDrivers, totalDrivers, queuedDrivers and completedDrivers, rest are being passed just like before. So changes are only done to TaskStats, TaskStatus class remains the same as before.

@kewang1024
Copy link
Collaborator

@anandamideShakyan I see, then runningPartitionedDrivers and runningDrivers are different, then we should be safe.

I think eventually runningDrivers behavior will change for native worker, we should make sure the upstream doesn't have hard dependency on it and doesn't depend on it in critical path

@kewang1024
Copy link
Collaborator

@anandamideShakyan : Can we name the new fields with native instead of new... So queuedNativeDrivers, runningNativeDrivers, completedNativeDrivers, totalNativeDrivers. Sorry about the renaming...the usage of new is a bit vague.

Rest of it is fine.

I have a quick question, I understand why we want to use new, because it's backward-compatible and after we can switch to new then we can deprecate the old one

But if we switch to the name native, then we would not be able to deprecate queueDrivers, runningDrivers, completedDrivers etc, is that correct? Or I might have missed something @aditi-pandit

@aditi-pandit
Copy link
Contributor

aditi-pandit commented Jun 17, 2025

@kewang1024 : Didn't entirely follow the question. We do intend to deprecate.. I sort of felt new didn't really represent what the stat was for... Native seemed more understandable.

Also, related to
I think eventually runningDrivers behavior will change for native worker, we should make sure the upstream doesn't have hard dependency on it and doesn't depend on it in critical path

do you have any suggestions on how this can be tested ?

At IBM our native clusters are new, so we have fewer migration problems than you'll.

@kewang1024
Copy link
Collaborator

    private final int totalDrivers;
    private final int queuedDrivers;
    private final int runningDrivers;
    private final int completedDrivers;
    private final int totalNativeDrivers;
    private final int queuedNativeDrivers;
    private final int runningNativeDrivers;
    private final int completedNativeDrivers;

If those would exist, then one would represent java, another would represent cpp, which would be confusing and bug-prone. Ideally we should only have one set of variables to represent both

@aditi-pandit could you elaborate a bit more for the intension of creating new ones?

@aditi-pandit
Copy link
Contributor

@kewang1024 : I was given to understand that Meta could advance worker and co-ordinator at different times, so the new fields were introduced for backwards compatiblity. The flow would be as follows...The worker is advanced first, the new fields are unused and the old ones are the same. Then the co-ordinator is moved and we start using the new fields ignoring the old. Then we will do another PR to remove the old fields.

@kewang1024
Copy link
Collaborator

@aditi-pandit we advance worker and coordinator at different times only when worker is native

so what would happen if native worker is updated first, coordinator is not?
native worker starts to populate new values for

private final int totalDrivers;
private final int queuedDrivers;
private final int runningDrivers;
private final int completedDrivers;

native worker starts to populate new fields

int totalSplits,
int queuedSplits,
int runningSplits,
int completedSplits,

but both should be ok, what might I have miss here?

@aditi-pandit
Copy link
Contributor

@kewang1024 : We want to make native worker change (with new fields) a no-op (instead of showing new driver values without new split values). Everything continues to work the same. All the new values (distinguishing between splits and drivers) are correctly populated in UI when the co-ordinator is updated. Post any changes your side for the new values we can remove the old ones.

@aditi-pandit
Copy link
Contributor

@kewang1024 : Wanted to check about this code post our discussions in the Presto Native Worker working group. Do you have any more reservations ?

@kewang1024
Copy link
Collaborator

kewang1024 commented Jul 2, 2025

@aditi-pandit I think the shortest path makes sense, as long as we address the naming then we should be ok

private final int totalNativeDrivers;
private final int queuedNativeDrivers;
private final int runningNativeDrivers;
private final int completedNativeDrivers;

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @anandamideShakyan. Appreciate the iterations over the code.

@aditi-pandit
Copy link
Contributor

@kewang1024 : Please take a look and approve the merge.

@amitkdutta

@kewang1024 kewang1024 self-requested a review July 14, 2025 18:07
Copy link
Collaborator

@kewang1024 kewang1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making the changes!

@aditi-pandit
Copy link
Contributor

@tdcmeehan @yingsu00 : Please can you help with the review.

@tdcmeehan
Copy link
Contributor

So just to be clear, we will follow up to remove the additional fields after some period of time?

@aditi-pandit
Copy link
Contributor

@tdcmeehan : Yes.

@aditi-pandit aditi-pandit merged commit 7760500 into prestodb:master Jul 16, 2025
171 of 172 checks passed
@aditi-pandit
Copy link
Contributor

@anandamideShakyan : This got missed in the PR review, but please merge commits for Presto PRs in logical sets in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:IBM PR from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[native] Differentiate number of drivers and number of splits in stats reporting