Skip to content

Conversation

@DanielZhangQD
Copy link
Collaborator

@DanielZhangQD DanielZhangQD commented Oct 10, 2025

Improve performance of sky api status -a

  • Add --limit option and retrieve 50 requests by default
  • Reduce the db query for user name from O(N) to O(1)
  • Retrieve the required fields only that are used by the client

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@DanielZhangQD
Copy link
Collaborator Author

DanielZhangQD commented Oct 10, 2025

Function test:

  1. New server + new client
sky api status
sky api status -a -v
sky api status -a -l none
sky api status f6ce5c8b-1881-4d3a-a359-e24d3c0b439e
  1. New server + old client
sky api status
sky api status -a -v
sky api status xxxx-1881-4d3a-a359-e24d3c0b439e
  1. Old server + new client
sky api status
sky api status -a -v
sky api status -a -l none
sky api status f6ce5c8b-1881-4d3a-a359-e24d3c0b439e

Performance test - with 100000 requests in DB:

  1. Data fetched, new client 33MB vs old client 128 MB, 74% reduced
  2. Request duration
Commands New server + new client Old server + old client Result
multitime -n 5 sky api status > /dev/null 0.691 0.772 No visible difference
multitime -n 5 sky api status -a -v > /dev/null 0.969(50 requests) 19.081(all requests)  
multitime -n 5 sky api status -a -l none > /dev/null 8.747(all requests) 19.081(all requests, old client does not support -l, take the duration of -a -v) Duration reduced about 54%
multitime -n 5 sky api status a3282df8-45d6-4ed2-8e97-0bd29f2d6981 > /dev/null 0.697 0.828 No visible difference
image
  1. Resource consumption
    Total memory reduced about 33%, worker memory reduced about from 55%~65%
    Old version server: dan-api-server-56c4f448dc-6h925
    New version server: dan-api-server-6787c474c-5ptl5, load running from about 09:40
image image

@DanielZhangQD
Copy link
Collaborator Author

DanielZhangQD commented Oct 10, 2025

/quicktest-core
Passed - https://buildkite.com/skypilot-1/quicktest-core/builds/1530#_

@DanielZhangQD DanielZhangQD marked this pull request as ready for review October 10, 2025 11:30
@DanielZhangQD DanielZhangQD requested review from Michaelvll and aylei and removed request for Michaelvll October 10, 2025 11:30
@DanielZhangQD DanielZhangQD marked this pull request as draft October 10, 2025 13:22
@SeungjinYang SeungjinYang self-requested a review October 10, 2025 18:31
@DanielZhangQD
Copy link
Collaborator Author

/quicktest-core

@DanielZhangQD DanielZhangQD marked this pull request as ready for review October 11, 2025 02:14
Copy link
Collaborator

@SeungjinYang SeungjinYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic LGTM, just a few UX related comments

Comment on lines +310 to +312
all_users = global_user_state.get_all_users()
all_users_map = {user.id: user.name for user in all_users}
for request in requests:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, could we first get all usernames that appear in the list of requests at least once and query on just those users? For teams with large users we'd expect only a small subset of them to be running active requests at a given time, and also because we did historically have some bug related to user duplication that can make all_users list particularly large.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mainly for sky status -a, where there would be many request records, like hundreds of thousands or more. In this case, it would take more time to go through all the request records and probably get almost all the users. So I would prefer to keep the current logic here.

Comment on lines 6178 to 6185
@click.option(
'--limit',
'-l',
default=_NUM_REQUESTS_TO_SHOW,
type=int,
required=False,
help=(f'Number of requests to show, default is {_NUM_REQUESTS_TO_SHOW},'
f' set to 0 to show all requests.'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we merge this PR with all of the support for limits except defaulting to _NUM_REQUESTS_TO_SHOW (i.e. keep default as is, showing everything)? I'm just worried some people might be relying on this behavior (i.e. sky api status | grep <request-id>) for whatever use case they might have.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cases like sky api status | grep <request-id> should be fine to show all requests since the number of running requests is limited. But I'm a bit concerned that sky api status -a | grep <request_id>, which has unlimited historical requests.

I'm thinking of have different handling of -l with -a and no-a cases, but this seems dirty.

To keep things simple, how about we keep the -l limit and print a warning to the stderr if the number of requests exceeds the default -l limit to hint user use pagination or sky status <request_id>?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use the default limitation, -l none or -l all to show all requests, and added a message to notify the user for more requests.

# pylint: disable=redefined-builtin
all_status: bool = False
all_status: bool = False,
request_limit: int = 0,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this value be an Optional[int] where None = return everything, instead of 0 = return everything.
Also I think naming this field just limit should be fine

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

"""The request body for the API request status endpoint."""
request_ids: Optional[List[str]] = None
all_status: bool = False
request_limit: int = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here

Suggested change
request_limit: int = 0
limit: Optional[int] = None

where API defaults to sending back everything

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 352 to 354
# Valid empty values for pickled fields (base64-encoded pickled None)
# base64.b64encode(pickle.dumps(None)).decode('utf-8')
empty_pickled_value = 'gAROLg=='
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel a bit hacky, can we abstract this a bit or make it a magic constant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

empty_pickled_value = 'gAROLg=='

# Required fields in RequestPayload
if 'request_id' not in fields:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we loop over the table definition instead of duplicate it here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different fields are handled differently here, we need to know the exact field name.

Comment on lines 184 to 186
$ # List finished and ongoing requests, default to show 50 requests
$ # add `--limit 0` to show all finished and ongoing requests
$ sky api status -a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we support pagination?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, the CLI needs to use cursor-based pagination, which would require more refactoring work, we can support this later if necessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Traced in #7586

@DanielZhangQD
Copy link
Collaborator Author

/quicktest-core

@DanielZhangQD
Copy link
Collaborator Author

/smoke-test --kubernetes -k test_api_server_start_stop
/smoke-test --kubernetes -k test_requests_scheduling

@DanielZhangQD
Copy link
Collaborator Author

/smoke-test --kubernetes -k test_api_server_start_stop
/smoke-test --kubernetes -k test_requests_scheduling
/quicktest-core

Copy link
Collaborator

@aylei aylei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @DanielZhangQD !

@DanielZhangQD DanielZhangQD merged commit 9f28ca4 into master Oct 11, 2025
22 checks passed
@DanielZhangQD DanielZhangQD deleted the 3521 branch October 11, 2025 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants