[qob] cancel stage if any partitions fail #14803

Closed
grohli wants to merge 17 commits

Conversation

@grohli (Contributor) commented Jan 27, 2025

Change Description

Fixes #14597
Sets `cancel_after_n_failures = 1`, which cancels the whole job group as soon as one partition fails.

Update from a stale PR, found here.
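
For context, the client-side behaviour looks like the sketch below, assuming `hailtop.batch.Batch`'s `cancel_after_n_failures` parameter (this PR wires the same setting into the job groups that Query-on-Batch creates):

```python
import hailtop.batch as hb

# Minimal sketch; billing project and bucket names are placeholders.
backend = hb.ServiceBackend(billing_project='my-project',
                            remote_tmpdir='gs://my-bucket/tmp')
b = hb.Batch(backend=backend, name='demo', cancel_after_n_failures=1)

for i in range(4):
    j = b.new_job(name=f'partition-{i}')
    # One failing job cancels everything still pending in the group.
    j.command('sleep 30' if i else 'exit 1')

b.run()
```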

Security Assessment

  • This change has no security impact

Impact Description

Covered by unit tests; no interaction with anything security-sensitive.

(Reviewers: please confirm the security impact before approving)

cjllanwarne and others added 17 commits January 27, 2025 13:41
## Change Description

Stop importing prices for reserved resources. They cause clashes with
our database primary keys and can never be part of user billing. (By
definition, we can't be running jobs on reservations of future
resources.)
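
Illustratively, the filter amounts to something like the following (the helper name and match string are assumptions, not the PR's actual code):

```python
def should_import_price(sku_description: str) -> bool:
    # Hypothetical sketch: skip reserved-resource prices at import time.
    # Reserved resources can never appear on a user's bill, and importing
    # them clashes with existing database primary keys.
    return 'Reserved' not in sku_description
```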

## Security Assessment

- This change has a low security impact

### Impact Description

We are doing a simple string match and then choosing not to import
something that we otherwise might have into our database.

(Reviewers: please confirm the security impact before approving)
…ail-is#14750)

resolves hail-is#14749

Leave the override in Backend as well to avoid duplication. Future
enhancements may enable us to construct a ServiceBackend from argv
alone, allowing this to be reverted.

## Security Assessment

- This change has no security impact

### Impact Description
This change restores known good functionality.
…-is#14682)

Adds:
- An `n_max_attempts` field to the client job object (defaulting to 20)
- An `n_max_attempts` reader in the front_end which stores the field into
the database
  - And a database field to store it in
- A change in the batch driver to fail jobs which have already had their
allowed maximum number of attempts

Some judgement-call justification. Why fail the job on attempt "n+1" instead of during the tidy-up of attempt "n"? (See the sketch after this list.)

- Personally I prefer a code-driven max-out over adding any more business logic into database procedures.
- The mark-job-complete action is triggered from the worker VM, not the driver, and I didn't want to add max-out business logic into the worker either.
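
A minimal sketch of that code-driven check, with names assumed for illustration (not the actual driver code):

```python
N_MAX_ATTEMPTS_DEFAULT = 20  # the client-side default described above

def should_fail_for_max_attempts(n_prior_attempts: int,
                                 n_max_attempts: int | None) -> bool:
    # Hypothetical sketch: evaluated by the driver before scheduling a new
    # attempt, keeping the max-out logic out of the worker and the database.
    limit = N_MAX_ATTEMPTS_DEFAULT if n_max_attempts is None else n_max_attempts
    return n_prior_attempts >= limit
```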
…il-is#14754)

From @jmarshall on [zulip].

[zulip]:
https://hail.zulipchat.com/#narrow/channel/223457-Hail-Batch-support/topic/Batch.20UI.20Previous.20Page.20button.20causing.20500.20Server.20Errors/near/481042141

## Security Assessment

- This change has no security impact

### Impact Description
Tiny change in logic in a template that doesn't add any new information
to view/exploit.
…is#14756)

Right now, our `latest_product_version` table has a partially filled sku column. This is because we added sku without back-calculating the old skus, and we haven't assigned skus to 'our' products (ip fees, service fee). I'd like to fix that, so that most of our 'resources' can be tracked by sku rather than the more fragile name/version strings that have led to conflicts with duplicates in GCP.

## Security Assessment

- This change has low security impact

### Impact Description
Update to a mostly unused database field.

(Reviewers: please confirm the security impact before approving)
…ail-is#14758)

## Security Assessment

- This change has no security impact

### Impact Description

Logic fix within a query function; no new information exposed.
## Change Description

When going through the [documentation for installing Hail on
OSX](https://hail.is/docs/0.2/install/macosx.html), I noticed that the
syntax for installing Java via Homebrew was out of date.

This PR updates the documentation to use the latest syntax for Homebrew.
It also updates the command to install version 11 of Temurin instead of
version 8.

## Security Assessment

- This change has no security impact

### Impact Description

This change updates documentation only and has no immediate end user
impact.

This change updates the recommended version of Temurin to a newer
version (11 vs. 8). It is reasonable to assume that the newer version is
at least as secure as previous versions. So, there also should be no
negative security impact on future users of this documentation.
## Change Description
This change prevents packaging benchmark code in the hail wheel by
updating the rsync exclude patterns in `hail/Makefile` to exclude
`benchmark/`.

## Security Assessment

- This change has no security impact

### Impact Description

Low-level build configuration change that only affects which directories
are excluded during file copying operations.
## Change Description

Fixes a bug which was reported in
https://discuss.hail.is/t/arrayindexoutofboundsexception-using-cdf-combine/4008/6.

The bug was caused by an unenforced invariant in the ApproxCDF
aggregator state. Roughly, the state consists of a number of "levels"
(stored flattened into a single array), where each level contains a
number of samples from the data. There is also a notion of the
"capacity" of each level. The data structure assumes that whenever the
array containing the flattened levels becomes full, there must be at
least one level which is above capacity, and can therefore be compacted
to free space. In other words, the size of the array must be at least
the sum of the capacities of the present levels.

In hail-is#13935, we changed the ApproxCDF aggregator to return the raw
internal state, and moved the computation of the cdf from the internal
state to python. The problem is that we "compress" the states before
returning them as hail values by reducing the size of the array of
samples to only the number of present samples. But then, when the
exposed "combine" function recreates ApproxCDF aggregator states from
the returned values, the array of samples is no longer large enough to
satisfy the invariant.
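
To make the invariant concrete, here is an illustrative check in python (the real check added by this PR lives in the scala `ApproxCDFStateManager` constructor):

```python
def check_levels_invariant(items_capacity: int, level_capacities: list[int]) -> None:
    # Sketch of the invariant: the flattened sample array must be able to
    # hold every present level at capacity, so that whenever the array is
    # full there is at least one over-capacity level available to compact.
    if items_capacity < sum(level_capacities):
        raise ValueError('ApproxCDF state invariant violated: '
                         'array smaller than the summed level capacities')
```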

In this PR, I
* add a test case in python which fails in main, by running approx_cdf
over a very small dataset
* fix `ApproxCDFStateManager.fromData` to "uncompress" the aggregator
state
* add a check for the violated invariant in the `ApproxCDFStateManager`
constructor. (This isn't a perfect check, as the underlying state can be
changed after construction, but I didn't see a way to make a more
complete check without significant refactoring, and this would have
caught the current bug.)

## Security Assessment

- This change has no security impact

### Impact Description

Low-level refactoring of non-security code
This change combines cloud auth logic that was previously duplicated
between the various `FS` implementations and the `BatchClient`.

The main refactoring is to make the interface between the `ServiceBackend` and the `BatchClient` more high-level and leave json serialisation to the `BatchClient`. To do this, I've added a bunch of case classes that resemble the python objects the batch service expects (or a subset of the data). To simplify the interface, I've split batch creation from job submission (update). For QoB, the python client creates the batch before handing control to the query driver; batch creation is necessary for testing only.

This change has low security impact as there are minor changes to the creation and scoping of service account credentials. Note that for each `FS`, credentials are scoped to the default storage oauth2 scopes for each service.
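
As a rough python rendering of the reshaped interface (all names below are illustrative assumptions, not the scala API):

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the new case classes that mirror the python
# objects the batch service expects.
@dataclass
class CreateBatchRequest:
    billing_project: str
    token: str

@dataclass
class UpdateBatchRequest:
    batch_id: int
    n_jobs: int

class BatchClientSketch:
    def create_batch(self, req: CreateBatchRequest) -> int:
        """Serialise req to json, POST it, and return the new batch id."""
        raise NotImplementedError

    def update_batch(self, req: UpdateBatchRequest) -> None:
        """Submit jobs to an existing batch; creation happened earlier."""
        raise NotImplementedError
```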
We publish release artifacts independently of tests passing in the
release pipeline. Flaky tests were causing the benchmarks to not run.
This change takes the pragmatic view that if publishing does not depend
on tests, then neither should starting the benchmark batch.

## Security Assessment
- This change has no security impact
## Change Description

Update mill to 0.12.4. Modify the build script to use the new
`build.mill` style, and to move top level tasks into the root module.

Also needed to update scala from 2.12.15 to 2.12.20, for compatibility
with the semanticdb compiler plugin used by scalafix.

## Security Assessment

- This change has no security impact

### Impact Description

Small refactoring of the build system.
## Change Description

Factor out the location of the test resources directory in the scala
tests, so we only need to change one place when we rearrange the
directory structure.

## Security Assessment

- This change has no security impact

### Impact Description

Tests only
## Change Description

Updates the handling of the `next` query parameter on various auth URLs to accept only absolute paths within the same domain as the auth service or its equivalent batch service. This prevents a class of attacks exploiting unvalidated redirects.
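
A minimal sketch of this kind of validation (function and parameter names are assumptions, not the code in this PR):

```python
from urllib.parse import urlparse

def validated_next(next_param: str, allowed_hosts: frozenset[str]) -> str | None:
    # Hypothetical sketch: accept only same-site absolute paths, or URLs
    # whose host is the auth service or its equivalent batch service.
    parsed = urlparse(next_param)
    if not parsed.scheme and not parsed.netloc and next_param.startswith('/'):
        return next_param   # absolute path within the same domain
    if parsed.scheme == 'https' and parsed.hostname in allowed_hosts:
        return next_param   # explicitly allowed sibling service
    return None             # reject: potential unvalidated redirect
```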

## Security Assessment

- This change has a medium security impact

### Impact Description

Changes how an auth API works, but in a way that reduces its overall
functional surface. Defends against inappropriate redirections following
login.

(Reviewers: please confirm the security impact before approving)
## Change Description

Adds two endpoints in the `batch` service hosting an openapi
specification and a swagger page.

**Note to reviewers:** The API specification document itself is nice to get right, but relatively low risk. I'd prioritize making sure the framework that hosts it and the openapi setup sections are correct before scanning through the details of every individual API call.
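
For flavour, hosting such endpoints in aiohttp looks roughly like the sketch below (routes and app keys are assumptions, not the PR's actual endpoints):

```python
from aiohttp import web

routes = web.RouteTableDef()

@routes.get('/openapi.yaml')
async def openapi_spec(request: web.Request) -> web.Response:
    # Hypothetical sketch: serve a static spec; only the backend-generated
    # base url varies, and no user parameters are involved.
    return web.Response(text=request.app['openapi_spec_text'],
                        content_type='application/yaml')

@routes.get('/swagger')
async def swagger_page(request: web.Request) -> web.Response:
    # The swagger page is mostly static html pointing at the spec above.
    return web.Response(text=request.app['swagger_html'],
                        content_type='text/html')
```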

## Security Assessment

- This change has a high security impact
  - [x] Required: The impact has been assessed and approved by appsec

### Impact Description

We are adding two new endpoints in `batch`. Mitigating factors:

- There are no user parameters for these endpoints
- The pages being hosted are mostly static, apart from the "base url" that the API documentation and swagger page target, which is generated in the backend.

(Reviewers: please confirm the security impact before approving)
## Change Description

Adds a set of standard API and web security headers to all responses for compliance. Strict transport security ensures all clients use SSL/TLS connections, preventing man-in-the-middle attacks. The content security policy adds a layer of protection against loading or running malicious referenced or injected content.
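
Mechanically, this resembles the aiohttp middleware sketched below (the header values are placeholders, not necessarily the exact policy shipped here):

```python
from aiohttp import web

@web.middleware
async def security_headers(request: web.Request, handler) -> web.StreamResponse:
    # Hypothetical sketch: attach the headers to every response.
    response = await handler(request)
    # Require SSL/TLS on future visits, defeating downgrade attacks.
    response.headers['Strict-Transport-Security'] = \
        'max-age=63072000; includeSubDomains'
    # Restrict where referenced or injected content may be loaded from.
    response.headers.setdefault('Content-Security-Policy', "default-src 'self'")
    return response

# app = web.Application(middlewares=[security_headers])
```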

## Security Assessment

- This change has a medium security impact

### Impact Description

Adds standard-practice strict-transport-security and
content-security-policy headers. Makes our API and web responses
strictly less flexible in browsers or clients than they were previously.

(Reviewers: please confirm the security impact before approving)