[qob] cancel stage if any partitions fail #14803

Closed
grohli wants to merge 17 commits

Conversation

@grohli (Contributor) commented Jan 27, 2025

Change Description

Fixes #14597
Sets `cancel_after_n_failures = 1`, which cancels the whole job group as soon as one partition fails.

Update from a stale PR, found here.
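
For context, the client-side behaviour looks like the sketch below, assuming `hailtop.batch.Batch`'s `cancel_after_n_failures` parameter (this PR wires the same setting into the job groups that Query-on-Batch creates):

```python
import hailtop.batch as hb

# Minimal sketch; billing project and bucket names are placeholders.
backend = hb.ServiceBackend(billing_project='my-project',
                            remote_tmpdir='gs://my-bucket/tmp')
b = hb.Batch(backend=backend, name='demo', cancel_after_n_failures=1)

for i in range(4):
    j = b.new_job(name=f'partition-{i}')
    # One failing job cancels everything still pending in the group.
    j.command('sleep 30' if i else 'exit 1')

b.run()
```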

Security Assessment

  • This change has no security impact

Impact Description

Covered by unit tests; no interaction with anything security-sensitive.

(Reviewers: please confirm the security impact before approving)

cjllanwarne and others added 17 commits January 27, 2025 13:41
## Change Description

Stop importing prices for reserved resources. They cause clashes with
our database primary keys and can never be part of user billing. (By
definition, we can't be running jobs on reservations of future
resources.)
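
Illustratively, the filter amounts to something like the following (the helper name and match string are assumptions, not the PR's actual code):

```python
def should_import_price(sku_description: str) -> bool:
    # Hypothetical sketch: skip reserved-resource prices at import time.
    # Reserved resources can never appear on a user's bill, and importing
    # them clashes with existing database primary keys.
    return 'Reserved' not in sku_description
```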

## Security Assessment

- This change has a low security impact

### Impact Description

We are doing a simple string match and then choosing not to import
something that we otherwise might have into our database.

(Reviewers: please confirm the security impact before approving)
…ail-is#14750)

resolves hail-is#14749

Leave the override in Backend as well to avoid duplication. Future
enhancements may enable us to construct a ServiceBackend from argv
alone, allowing this to be reverted.

## Security Assessment

- This change has no security impact

### Impact Description
This change restores known good functionality.
…-is#14682)

Adds:
- An `n_max_attempts` field to the client job object (defaulting to 20)
- An `n_max_attempts` reader in the front_end which stores the field into
the database
  - And a database field to store it in
- A change in the batch driver to fail jobs which have already had their
allowed maximum number of attempts

Some judgement-call justification. Why fail the job on attempt "n+1" instead of during the tidy-up of attempt "n"? (See the sketch after this list.)

- Personally I prefer a code-driven max-out over adding any more business logic into database procedures.
- The mark-job-complete action is triggered from the worker VM, not the driver, and I didn't want to add max-out business logic into the worker either.
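
A minimal sketch of that code-driven check, with names assumed for illustration (not the actual driver code):

```python
N_MAX_ATTEMPTS_DEFAULT = 20  # the client-side default described above

def should_fail_for_max_attempts(n_prior_attempts: int,
                                 n_max_attempts: int | None) -> bool:
    # Hypothetical sketch: evaluated by the driver before scheduling a new
    # attempt, keeping the max-out logic out of the worker and the database.
    limit = N_MAX_ATTEMPTS_DEFAULT if n_max_attempts is None else n_max_attempts
    return n_prior_attempts >= limit
```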
…il-is#14754)

From @jmarshall on [zulip].

[zulip]:
https://hail.zulipchat.com/#narrow/channel/223457-Hail-Batch-support/topic/Batch.20UI.20Previous.20Page.20button.20causing.20500.20Server.20Errors/near/481042141

## Security Assessment

- This change has no security impact

### Impact Description
Tiny change in logic in a template that doesn't add any new information
to view/exploit.
…is#14756)

Right now, our `latest_product_version` table has a partially filled sku column. This is because we added sku without back-calculating the old skus, and we haven't assigned skus to 'our' products (ip fees, service fee). I'd like to fix that, so that most of our 'resources' can be tracked by sku rather than the more fragile name/version strings that have led to conflicts with duplicates in GCP.

## Security Assessment

- This change has low security impact

### Impact Description
Update to a mostly unused database field.

(Reviewers: please confirm the security impact before approving)
…ail-is#14758)

## Security Assessment

- This change has no security impact

### Impact Description

Logic fix within a query function; no new information exposed.
## Change Description

When going through the [documentation for installing Hail on
OSX](https://hail.is/docs/0.2/install/macosx.html), I noticed that the
syntax for installing Java via Homebrew was out of date.

This PR updates the documentation to use the latest syntax for Homebrew.
It also updates the command to install version 11 of Temurin instead of
version 8.

## Security Assessment

- This change has no security impact

### Impact Description

This change updates documentation only and has no immediate end user
impact.

This change updates the recommended version of Temurin to a newer
version (11 vs. 8). It is reasonable to assume that the newer version is
at least as secure as previous versions. So, there also should be no
negative security impact on future users of this documentation.
## Change Description
This change prevents packaging benchmark code in the hail wheel by
updating the rsync exclude patterns in `hail/Makefile` to exclude
`benchmark/`.

## Security Assessment

- This change has no security impact

### Impact Description

Low-level build configuration change that only affects which directories
are excluded during file copying operations.
## Change Description

Fixes a bug which was reported in
https://discuss.hail.is/t/arrayindexoutofboundsexception-using-cdf-combine/4008/6.

The bug was caused by an unenforced invariant in the ApproxCDF
aggregator state. Roughly, the state consists of a number of "levels"
(stored flattened into a single array), where each level contains a
number of samples from the data. There is also a notion of the
"capacity" of each level. The data structure assumes that whenever the
array containing the flattened levels becomes full, there must be at
least one level which is above capacity, and can therefore be compacted
to free space. In other words, the size of the array must be at least
the sum of the capacities of the present levels.

In hail-is#13935, we changed the ApproxCDF aggregator to return the raw
internal state, and moved the computation of the cdf from the internal
state to python. The problem is that we "compress" the states before
returning them as hail values by reducing the size of the array of
samples to only the number of present samples. But then, when the
exposed "combine" function recreates ApproxCDF aggregator states from
the returned values, the array of samples is no longer large enough to
satisfy the invariant.
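
To make the invariant concrete, here is an illustrative check in python (the real check added by this PR lives in the scala `ApproxCDFStateManager` constructor):

```python
def check_levels_invariant(items_capacity: int, level_capacities: list[int]) -> None:
    # Sketch of the invariant: the flattened sample array must be able to
    # hold every present level at capacity, so that whenever the array is
    # full there is at least one over-capacity level available to compact.
    if items_capacity < sum(level_capacities):
        raise ValueError('ApproxCDF state invariant violated: '
                         'array smaller than the summed level capacities')
```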

In this PR, I
* add a test case in python which fails in main, by running approx_cdf
over a very small dataset
* fix `ApproxCDFStateManager.fromData` to "uncompress" the aggregator
state
* add a check for the violated invariant in the `ApproxCDFStateManager`
constructor. (This isn't a perfect check, as the underlying state can be
changed after construction, but I didn't see a way to make a more
complete check without significant refactoring, and this would have
caught the current bug.)

## Security Assessment

- This change has no security impact

### Impact Description

Low-level refactoring of non-security code
This change combines cloud auth logic that was previously duplicated
between the various `FS` implementations and the `BatchClient`.

The main refactoring is to make the interface between the `ServiceBackend` and the `BatchClient` more high-level and leave json serialisation to the `BatchClient`. To do this, I've added a bunch of case classes that resemble the python objects the batch service expects (or a subset of the data). To simplify the interface, I've split batch creation from job submission (update). For QoB, the python client creates the batch before handing control to the query driver; batch creation is necessary for testing only.

This change has low security impact as there are minor changes to the creation and scoping of service account credentials. Note that for each `FS`, credentials are scoped to the default storage oauth2 scopes for each service.
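
As a rough python rendering of the reshaped interface (all names below are illustrative assumptions, not the scala API):

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the new case classes that mirror the python
# objects the batch service expects.
@dataclass
class CreateBatchRequest:
    billing_project: str
    token: str

@dataclass
class UpdateBatchRequest:
    batch_id: int
    n_jobs: int

class BatchClientSketch:
    def create_batch(self, req: CreateBatchRequest) -> int:
        """Serialise req to json, POST it, and return the new batch id."""
        raise NotImplementedError

    def update_batch(self, req: UpdateBatchRequest) -> None:
        """Submit jobs to an existing batch; creation happened earlier."""
        raise NotImplementedError
```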
We publish release artifacts independently of tests passing in the
release pipeline. Flaky tests were causing the benchmarks to not run.
This change takes the pragmatic view that if publishing does not depend
on tests, then neither should starting the benchmark batch.

## Security Assessment
- This change has no security impact
## Change Description

Update mill to 0.12.4. Modify the build script to use the new
`build.mill` style, and to move top level tasks into the root module.

Also needed to update scala from 2.12.15 to 2.12.20, for compatibility
with the semanticdb compiler plugin used by scalafix.

## Security Assessment

- This change has no security impact

### Impact Description

Small refactoring of the build system.
## Change Description

Factor out the location of the test resources directory in the scala
tests, so we only need to change one place when we rearrange the
directory structure.

## Security Assessment

- This change has no security impact

### Impact Description

Tests only
## Change Description

Updates the handling of the `next` query parameter on various auth URLs to accept only absolute paths within the same domain as the auth service or its equivalent batch service. This prevents a class of attacks exploiting unvalidated redirects.
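
A minimal sketch of this kind of validation (function and parameter names are assumptions, not the code in this PR):

```python
from urllib.parse import urlparse

def validated_next(next_param: str, allowed_hosts: frozenset[str]) -> str | None:
    # Hypothetical sketch: accept only same-site absolute paths, or URLs
    # whose host is the auth service or its equivalent batch service.
    parsed = urlparse(next_param)
    if not parsed.scheme and not parsed.netloc and next_param.startswith('/'):
        return next_param   # absolute path within the same domain
    if parsed.scheme == 'https' and parsed.hostname in allowed_hosts:
        return next_param   # explicitly allowed sibling service
    return None             # reject: potential unvalidated redirect
```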

## Security Assessment

- This change has a medium security impact

### Impact Description

Changes how an auth API works, but in a way that reduces its overall
functional surface. Defends against inappropriate redirections following
login.

(Reviewers: please confirm the security impact before approving)
## Change Description

Adds two endpoints in the `batch` service hosting an openapi
specification and a swagger page.

**Note to reviewers:** The API specification document itself is nice to get right, but relatively low risk. I'd prioritize making sure the framework that hosts it and the openapi setup sections are correct before scanning through the details of every individual API call.
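
For flavour, hosting such endpoints in aiohttp looks roughly like the sketch below (routes and app keys are assumptions, not the PR's actual endpoints):

```python
from aiohttp import web

routes = web.RouteTableDef()

@routes.get('/openapi.yaml')
async def openapi_spec(request: web.Request) -> web.Response:
    # Hypothetical sketch: serve a static spec; only the backend-generated
    # base url varies, and no user parameters are involved.
    return web.Response(text=request.app['openapi_spec_text'],
                        content_type='application/yaml')

@routes.get('/swagger')
async def swagger_page(request: web.Request) -> web.Response:
    # The swagger page is mostly static html pointing at the spec above.
    return web.Response(text=request.app['swagger_html'],
                        content_type='text/html')
```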

## Security Assessment

- This change has a high security impact
  - [x] Required: The impact has been assessed and approved by appsec

### Impact Description

We are adding two new endpoints in `batch`. Mitigating factors:

- There are no user parameters for these endpoints
- The pages being hosted are mostly static, apart from the "base url" that the API documentation and swagger page target, which is generated in the backend.

(Reviewers: please confirm the security impact before approving)
## Change Description

Adds a set of standard API and web security headers to all responses for compliance. Strict transport security ensures all clients use SSL/TLS connections, preventing man-in-the-middle attacks. The content security policy adds a layer of protection against loading or running malicious referenced or injected content.
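
Mechanically, this resembles the aiohttp middleware sketched below (the header values are placeholders, not necessarily the exact policy shipped here):

```python
from aiohttp import web

@web.middleware
async def security_headers(request: web.Request, handler) -> web.StreamResponse:
    # Hypothetical sketch: attach the headers to every response.
    response = await handler(request)
    # Require SSL/TLS on future visits, defeating downgrade attacks.
    response.headers['Strict-Transport-Security'] = \
        'max-age=63072000; includeSubDomains'
    # Restrict where referenced or injected content may be loaded from.
    response.headers.setdefault('Content-Security-Policy', "default-src 'self'")
    return response

# app = web.Application(middlewares=[security_headers])
```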

## Security Assessment

- This change has a medium security impact

### Impact Description

Adds standard-practice strict-transport-security and
content-security-policy headers. Makes our API and web responses
strictly less flexible in browsers or clients than they were previously.

(Reviewers: please confirm the security impact before approving)