Fix: Manifest content hash computation times out (#6123) #7258
base: develop
Conversation
b5212ca to 7ed3673 (Compare)
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #7258      +/-  ##
===========================================
- Coverage    85.13%   85.13%   -0.01%
===========================================
  Files          156      156
  Lines        22261    22289      +28
===========================================
+ Hits         18952    18975      +23
- Misses        3309     3314       +5

☔ View full report in Codecov by Sentry.
7ed3673 to d7e9024 (Compare)
2a9c76f to e4693d5 (Compare)
achave11-ucsc
left a comment
LGTM ✅
src/azul/service/manifest_service.py (Outdated)

        """
        return self._manifest_hash('bundles')

    def _manifest_hash(self, base: str) -> int:
PL (modal parameter smell, literal string)
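For context on the shorthand: the flag is the literal string used as a mode selector in _manifest_hash('bundles'). A minimal, hypothetical sketch of the direction the suggested patch below takes, replacing the string with a keyword-only boolean; the class and the hashing details are stand-ins, not Azul code:

import hashlib


class ExampleGenerator:

    def __init__(self, bundle_fqids, source_ids):
        self.bundle_fqids = bundle_fqids
        self.source_ids = source_ids

    def manifest_hash(self, *, by_bundle: bool) -> int:
        # The mode is now spelled out at the call site, e.g.
        # generator.manifest_hash(by_bundle=True), instead of being
        # encoded in a magic string argument.
        ids = self.bundle_fqids if by_bundle else self.source_ids
        digest = hashlib.sha256('\0'.join(sorted(ids)).encode()).digest()
        return int.from_bytes(digest[:8], 'big')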
bd5d131 to 81a2b78 (Compare)
f7e2b60 to 83f930f (Compare)
cd223e3 to b6669be (Compare)
While editing the docstring, it occurred to me that we will also need a check whether an index operation is currently in progress and prevent the manifest generation if it is. With the source-based hash, the risk is now considerably higher that we'd be caching an incomplete manifest. The HTTP status should be 503 and have a Retry-After header value that's derived from the notification queue length (you can use a heuristic that's based on prod and anvilprod reindexes). Please also investigate if there's a way to lock an index in Elasticsearch. The problem with checking the queues is that it is not specific to a catalog. If index locks aren't possible we might want to create a central index with one document per catalog. Please also explore what document locking mechanisms exist in Elasticsearch.
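A rough sketch of the kind of guard described above, assuming the backlog is read from the notification queue via SQS; the queue URL, the drain-rate heuristic and the exception-to-503 mapping are all assumptions, not existing Azul code:

import boto3

sqs = boto3.client('sqs')


class IndexUpdateInProgress(Exception):
    """The service layer would map this to HTTP 503 with a Retry-After header."""

    def __init__(self, retry_after: int):
        super().__init__(f'Index update in progress, retry in {retry_after}s')
        self.retry_after = retry_after


def guard_manifest_generation(notification_queue_url: str) -> None:
    # Approximate backlog of pending and in-flight notifications
    attrs = sqs.get_queue_attributes(
        QueueUrl=notification_queue_url,
        AttributeNames=['ApproximateNumberOfMessages',
                        'ApproximateNumberOfMessagesNotVisible']
    )['Attributes']
    backlog = sum(int(v) for v in attrs.values())
    if backlog > 0:
        # Placeholder heuristic: assume roughly ten notifications are
        # drained per second, with a 60 second floor. Real numbers would
        # come from observing prod and anvilprod reindexes.
        raise IndexUpdateInProgress(retry_after=max(60, backlog // 10))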
diff --git a/src/azul/service/manifest_service.py b/src/azul/service/manifest_service.py
--- a/src/azul/service/manifest_service.py (revision 839dceb6d0d7fc006d1817235ba013d1580d6ad0)
+++ b/src/azul/service/manifest_service.py (date 1754633437226)
@@ -982,7 +982,7 @@
filter_string = repr(sort_frozen(freeze(self.filters.explicit)))
# If incremental index changes are disabled, we don't need to worry
# about individual bundles, only sources.
- content_hash = str(self.manifest_hash(config.enable_bundle_notifications))
+ content_hash = str(self.manifest_hash(by_bundle=config.enable_bundle_notifications))
catalog = self.catalog
format = self.format()
manifest_hash_input = [
@@ -1147,29 +1147,38 @@
**args))
@cache
- def manifest_hash(self, by_bundle: bool) -> int:
+ def manifest_hash(self, *, by_bundle: bool) -> int:
"""
- Return a content hash for the manifest.
+ Return a hash of the input this generator builds the manifest from. The
+ input is the set of ES documents from the files index. For two generator
+ instances g1 and g2 created at two different points in time, and any
+ boolean value b, if
- If `by_bundle` is True, the hash is computed from the fully-qualified
- identifiers of all bundles containing files that match the current
- filter. The return value approximates a hash of the content of the
- manifest because a change of the file data requires a change to the file
- metadata which requires a new bundle or bundle version.
+ g1.manifest_hash(by_bundle=b) == g2.manifest_hash(by_bundle=b)
- If `by_bundle` is False, the hash is computed from the identifiers of
- the sources from which projects/datasets containing files matching the
- current filters were indexed. It's worth noting that a filter may match
- a project/dataset but none of the project's files. For example, if a
- project contains only files derived from either mouse brains or lion
- hearts, the project will match the filter `species=lion and
- organ=brain`, but none of its files will. If such a project/dataset is
- added/removed to/from the index, the manifest hash returned for a given
- filter will be different even though the contents of the manifest hasn't
- changed, as no matching files were added or removed.
+ then there is a high probability that the manifests generated by g1 and
+ g2 contain the same set of entries. This test can be used in deciding
+ whether g2 can reuse g1's manifest, thereby avoiding an expensive
+ operation. A false positive occurs when the hashes are equal but the
+ inputs differ. A false negative occurs when the hashes differ, but the
+ inputs are equal. False negatives are less problematic because they only
+ lead to redundant computations: the manifest is regenerated when it
+ could have been reused. False positives are problematic because they
+ lead to a manifest being reused erroneously, yielding an incorrect
+ manifest that is inconsistent with the input.
- So while the hash computed from the sources is less sensitive than the
- one computed from the bundles, it can be computed much more quickly.
+ If ``by_bundle`` is True, the hash is computed from the fully-qualified
+ identifiers (FQID) of all bundles (subgraphs) containing files that
+ match the current filter. The rate of false negatives is low because a
+ change to any file entity requires a new bundle or a new bundle version,
+ both of which have different FQIDs, leading to a different hash. This
+ mode is slower and should be used if the index is changing or is likely
+ to change due to the incremental incorporation of bundles.
+
+ If ``by_bundle`` is False, the hash is instead computed from the set of
+ identifiers of the sources that contributed files matching the current
+ filters. This mode should *not* be used if the index is changing or is
+ likely to change due to the incremental incorporation of bundles.
"""
log.debug('Computing content hash for manifest from %s using %r ...',
          'bundles' if by_bundle else 'sources', self.filters)
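As an aside, the reuse decision described in the new docstring boils down to a comparison like the following; the helper and the generator arguments are hypothetical, not part of the patch:

def can_reuse_cached_manifest(cached_generator, new_generator, *, by_bundle: bool) -> bool:
    # Equal hashes imply a high probability that both generators would emit
    # the same manifest entries, so the cached manifest can be served again;
    # unequal hashes merely force a (possibly redundant) regeneration.
    return (cached_generator.manifest_hash(by_bundle=by_bundle)
            == new_generator.manifest_hash(by_bundle=by_bundle))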
]
}
},
"/{catalog}/{action}": {
The script that generates this file should patch the environment variable to True
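A hedged sketch of what that could look like in the generating script. The comment asks for the environment variable to be patched; this sketch instead patches the config property that presumably reads it, reusing the pattern from the test excerpt further below. generate_spec is a hypothetical stand-in for whatever produces this document:

from unittest.mock import PropertyMock, patch

from azul import config


def generate_spec() -> dict:
    # Hypothetical placeholder for the code that actually builds the document
    return {'bundle_notifications': config.enable_bundle_notifications}


def generate_with_notifications_enabled() -> dict:
    # Force the flag to True for the duration of the generation
    with patch.object(type(config),
                      'enable_bundle_notifications',
                      new=PropertyMock(return_value=True)):
        return generate_spec()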
scripts/reindex.py (Outdated)

if args.local and not config.enable_bundle_notifications:
    parser.error('Local reindexing is not available while bundle '
                 'notifications are disabled.')
This is yet another client-side guard which we agreed in PL to remove.
scripts/reindex.py (Outdated)

- help='Do not offload the listing of subgraphs to the indexer Lambda function. When this option is '
-      'used, this script queries the repository without partitioning, and the indexer notification '
-      'endpoint is invoked for each subgraph individually and concurrently using worker threads. '
-      'This is magnitudes slower than remote (i.e. partitioned) indexing. If this option is not '
-      'used (the default), the set of subgraphs matching the query is partitioned using the '
-      'partition prefix length configured for each of the catalog sources being reindexed. Each '
-      'query partition is processed independently and remotely by the indexer lambda. The index '
-      'Lambda function queries the repository for each partition and queues a notification for each '
-      'matching subgraph in the partition.')
+ help=(
+     '' if config.enable_bundle_notifications else '**DISABLED** '
+ ) + (
+     'Do not offload the listing of subgraphs to the indexer Lambda function. When '
+     'this option is used, this script queries the repository without partitioning, '
+     'and the indexer notification endpoint is invoked for each subgraph '
+     'individually and concurrently using worker threads. This is magnitudes slower '
+     'than remote (i.e. partitioned) indexing. If this option is not used (the '
+     'default), the set of subgraphs matching the query is partitioned using the '
+     'partition prefix length configured for each of the catalog sources being '
+     'reindexed. Each query partition is processed independently and remotely by '
+     'the indexer lambda. The index Lambda function queries the repository for each '
+     'partition and queues a notification for each matching subgraph in the '
+     'partition.'
+ )
Please revert this. If you want to reformat the help text, you should do that in a separate commit. But then you should reformat all help texts in this file, to be consistent.
test/service/test_manifest.py (Outdated)

@patch.object(type(config),
              'enable_bundle_notifications',
              new=PropertyMock(return_value=True))
This exact same decorator is used four times. Please DRY.
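One possible way to DRY it, as a sketch: build the patcher once and apply it wherever it is needed. The patch target is copied from the excerpt above; the module-level name and the test class are hypothetical:

import unittest
from unittest.mock import PropertyMock, patch

from azul import config

patch_bundle_notifications = patch.object(type(config),
                                          'enable_bundle_notifications',
                                          new=PropertyMock(return_value=True))


# Decorating the class applies the patch to every test method in it; because
# new= is given, no extra mock argument is injected into the test methods.
@patch_bundle_notifications
class TestManifestWithBundleNotifications(unittest.TestCase):

    def test_flag_is_enabled(self):
        self.assertTrue(config.enable_bundle_notifications)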
5b7dcac to 2549975 (Compare)
21bf218 to 75cd24e (Compare)
75cd24e to 899bb08 (Compare)
Linked issues: #6123
Checklist
Author
- develop
- issues/<GitHub handle of author>/<issue#>-<slug>
- 1 when the issue title describes a problem, the corresponding PR title is Fix: followed by the issue title

Author (partiality)
- p tag to titles of partial commits
- partial or completely resolves all linked issues
- partial label

Author (reindex)
- r tag to commit title or the changes introduced by this PR will not require reindexing of any deployment
- reindex:dev or the changes introduced by it will not require reindexing of dev
- reindex:anvildev or the changes introduced by it will not require reindexing of anvildev
- reindex:anvilprod or the changes introduced by it will not require reindexing of anvilprod
- reindex:prod or the changes introduced by it will not require reindexing of prod
- reindex:partial and its description documents the specific reindexing procedure for dev, anvildev, anvilprod and prod or requires a full reindex or carries none of the labels reindex:dev, reindex:anvildev, reindex:anvilprod and reindex:prod

Author (API changes)
- API or this PR does not modify a REST API
- a (A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST API
- app.py or this PR does not modify a REST API

Author (upgrading deployments)
- make docker_images.json and committed the resulting changes or this PR does not modify azul_docker_images, or any other variables referenced in the definition of that variable
- u tag to commit title or this PR does not require upgrading deployments
- upgrade or does not require upgrading deployments
- deploy:shared or does not modify docker_images.json, and does not require deploying the shared component for any other reason
- deploy:gitlab or does not require deploying the gitlab component
- deploy:runner or does not require deploying the runner image

Author (hotfixes)
- F tag to main commit title or this PR does not include permanent fix for a temporary hotfix
- (anvilprod and prod) have temporary hotfixes for any of the issues linked to this PR

Author (before every review)
- develop, squashed fixups from prior reviews
- make requirements_update or this PR does not modify requirements*.txt, common.mk, Makefile, Dockerfile or environment.boot
- R tag to commit title or this PR does not modify requirements*.txt
- reqs or does not modify requirements*.txt
- make integration_test passes in personal deployment or this PR does not modify functionality that could affect the IT outcome

Peer reviewer (after approval)
Note that when requesting changes, the PR must be assigned back to the author.
System administrator (after approval)
- demo or no demo
- no demo
- no sandbox
- N reviews label is accurate

Operator
- reindex:… labels and r commit title tag
- no demo
- develop

Operator (deploy .shared and .gitlab components)
- _select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
- _select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
- _select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
- _select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
- deploy:gitlab
- deploy:gitlab

System administrator (post-deploy of .gitlab component)
- dev.gitlab are complete or this PR is not labeled deploy:gitlab
- anvildev.gitlab are complete or this PR is not labeled deploy:gitlab

Operator (deploy runner image)
- _select dev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
- _select anvildev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner

Operator (sandbox build)
- sandbox label or PR is labeled no sandbox
- dev or PR is labeled no sandbox
- anvildev or PR is labeled no sandbox
- sandbox deployment or PR is labeled no sandbox
- anvilbox deployment or PR is labeled no sandbox
- sandbox deployment or PR is labeled no sandbox
- anvilbox deployment or PR is labeled no sandbox
- sandbox or this PR does not remove catalogs or otherwise causes unreferenced indices in dev
- anvilbox or this PR does not remove catalogs or otherwise causes unreferenced indices in anvildev
- sandbox or this PR is not labeled reindex:dev
- anvilbox or this PR is not labeled reindex:anvildev
- sandbox or this PR is not labeled reindex:dev
- anvilbox or this PR is not labeled reindex:anvildev

Operator (merge the branch)
- p if the PR is also labeled partial

Operator (main build)
- dev
- anvildev
- dev
- dev
- anvildev
- anvildev
- _select dev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
- _select anvildev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
- dev
- anvildev

Operator (reindex)
- dev or this PR is neither labeled reindex:partial nor reindex:dev
- anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
- dev or this PR is neither labeled reindex:partial nor reindex:dev
- anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
- dev or this PR is neither labeled reindex:partial nor reindex:dev
- anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
- dev or this PR does not require reindexing dev
- anvildev or this PR does not require reindexing anvildev
- dev or this PR does not require reindexing dev
- anvildev or this PR does not require reindexing anvildev
- dev or this PR does not require reindexing dev
- anvildev or this PR does not require reindexing anvildev
- dev or this PR does not require reindexing dev
- dev or this PR does not require reindexing dev
- deploy_browser job in the GitLab pipeline for this PR in dev or this PR does not require reindexing dev
- anvildev or this PR does not require reindexing anvildev
- deploy_browser job in the GitLab pipeline for this PR in anvildev or this PR does not require reindexing anvildev

Operator (mirroring)
- dev or this PR does not require mirroring dev
- anvildev or this PR does not require mirroring anvildev
- dev or this PR does not require mirroring dev
- anvildev or this PR does not require mirroring anvildev
- dev or this PR does not require mirroring dev
- anvildev or this PR does not require mirroring anvildev

Operator
- deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels to the next promotion PRs or this PR carries none of these labels
- deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labels

Shorthand for review comments
- L: line is too long
- W: line wrapping is wrong
- Q: bad quotes
- F: other formatting problem