Conversation

@dsotirho-ucsc
Contributor

@dsotirho-ucsc dsotirho-ucsc commented Jul 8, 2025

Linked issues: #6123

Checklist

Author

  • PR is assigned to the author
  • PR is a draft
  • Target branch is develop
  • Name of PR branch matches issues/<GitHub handle of author>/<issue#>-<slug>
  • PR is linked to all issues it (partially) resolves
  • PR description links to connected issues
  • PR title matches¹ that of a linked issue or comment in PR explains why they're different
  • PR title references all linked issues
  • For each linked issue, there is at least one commit whose title references that issue

¹ When the issue title describes a problem, the corresponding PR
title is Fix: followed by the issue title

Author (partiality)

  • Added p tag to titles of partial commits
  • This PR is labeled partial or completely resolves all linked issues
  • This PR partially resolves each of the linked issues or does not have the partial label

Author (reindex)

  • Added r tag to commit title or the changes introduced by this PR will not require reindexing of any deployment
  • This PR is labeled reindex:dev or the changes introduced by it will not require reindexing of dev
  • This PR is labeled reindex:anvildev or the changes introduced by it will not require reindexing of anvildev
  • This PR is labeled reindex:anvilprod or the changes introduced by it will not require reindexing of anvilprod
  • This PR is labeled reindex:prod or the changes introduced by it will not require reindexing of prod
  • This PR is labeled reindex:partial and its description documents the specific reindexing procedure for dev, anvildev, anvilprod and prod or requires a full reindex or carries none of the labels reindex:dev, reindex:anvildev, reindex:anvilprod and reindex:prod

Author (API changes)

  • This PR and its linked issues are labeled API or this PR does not modify a REST API
  • Added a (A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST API
  • Updated REST API version number in app.py or this PR does not modify a REST API

Author (upgrading deployments)

  • Ran make docker_images.json and committed the resulting changes or this PR does not modify azul_docker_images, or any other variables referenced in the definition of that variable
  • Documented upgrading of deployments in UPGRADING.rst or this PR does not require upgrading deployments
  • Added u tag to commit title or this PR does not require upgrading deployments
  • This PR is labeled upgrade or does not require upgrading deployments
  • This PR is labeled deploy:shared or does not modify docker_images.json, and does not require deploying the shared component for any other reason
  • This PR is labeled deploy:gitlab or does not require deploying the gitlab component
  • This PR is labeled deploy:runner or does not require deploying the runner image

Author (hotfixes)

  • Added F tag to main commit title or this PR does not include permanent fix for a temporary hotfix
  • Reverted the temporary hotfixes for any linked issues or none of the stable branches (anvilprod and prod) have temporary hotfixes for any of the issues linked to this PR

Author (before every review)

  • Rebased PR branch on develop, squashed fixups from prior reviews
  • Ran make requirements_update or this PR does not modify requirements*.txt, common.mk, Makefile, Dockerfile or environment.boot
  • Added R tag to commit title or this PR does not modify requirements*.txt
  • This PR is labeled reqs or does not modify requirements*.txt
  • make integration_test passes in personal deployment or this PR does not modify functionality that could affect the IT outcome
  • PR is awaiting requested review from a peer
  • Status of PR is Review requested
  • PR is assigned to only the peer

Peer reviewer (after approval)

Note that when requesting changes, the PR must be assigned back to the author.

  • Actually approved the PR
  • PR is not a draft
  • PR is awaiting requested review from system administrator
  • Status of PR is Review requested
  • PR is assigned to only the system administrator

System administrator (after approval)

  • Actually approved the PR
  • Labeled linked issues as demo or no demo
  • Commented on linked issues about demo expectations or all linked issues are labeled no demo
  • Decided if PR can be labeled no sandbox
  • A comment to this PR details the completed security design review
  • PR title is appropriate as title of merge commit
  • N reviews label is accurate
  • Status of PR is Approved
  • PR is assigned to only the operator

Operator

  • Checked reindex:… labels and r commit title tag
  • Checked that demo expectations are clear or all linked issues are labeled no demo
  • Squashed PR branch and rebased onto develop
  • Sanity-checked history
  • Pushed PR branch to GitHub

Operator (deploy .shared and .gitlab components)

  • Ran _select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Ran _select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Checked the items in the next section or this PR is labeled deploy:gitlab
  • PR is assigned to only the system administrator or this PR is not labeled deploy:gitlab

System administrator (post-deploy of .gitlab component)

  • Background migrations for dev.gitlab are complete or this PR is not labeled deploy:gitlab
  • Background migrations for anvildev.gitlab are complete or this PR is not labeled deploy:gitlab
  • PR is assigned to only the operator

Operator (deploy runner image)

  • Ran _select dev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Ran _select anvildev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner

Operator (sandbox build)

  • Added sandbox label or PR is labeled no sandbox
  • Pushed PR branch to GitLab dev or PR is labeled no sandbox
  • Pushed PR branch to GitLab anvildev or PR is labeled no sandbox
  • Build passes in sandbox deployment or PR is labeled no sandbox
  • Build passes in anvilbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in sandbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in anvilbox deployment or PR is labeled no sandbox
  • Deleted unreferenced indices in sandbox or this PR does not remove catalogs or otherwise cause unreferenced indices in dev
  • Deleted unreferenced indices in anvilbox or this PR does not remove catalogs or otherwise cause unreferenced indices in anvildev
  • Started reindex in sandbox or this PR is not labeled reindex:dev
  • Started reindex in anvilbox or this PR is not labeled reindex:anvildev
  • Checked for failures in sandbox or this PR is not labeled reindex:dev
  • Checked for failures in anvilbox or this PR is not labeled reindex:anvildev

Operator (merge the branch)

  • All status checks passed and the PR is mergeable
  • The title of the merge commit starts with the title of this PR
  • Added PR # reference to merge commit title
  • Collected commit title tags in merge commit title but only included p if the PR is also labeled partial
  • Pushed merge commit to GitHub
  • Status of PR is Merged lower
  • Status of blocked issues is Triage or no issues are blocked on the linked issues

Operator (main build)

  • Pushed merge commit to GitLab dev
  • Pushed merge commit to GitLab anvildev
  • Build passes on GitLab dev
  • Reviewed build logs for anomalies on GitLab dev
  • Build passes on GitLab anvildev
  • Reviewed build logs for anomalies on GitLab anvildev
  • Ran _select dev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Ran _select anvildev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Deleted PR branch from GitHub
  • Deleted PR branch from GitLab dev
  • Deleted PR branch from GitLab anvildev
  • Status of linked issues is Lower

Operator (reindex)

  • Deindexed all unreferenced catalogs in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed all unreferenced catalogs in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Deindexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Indexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Indexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Started reindex in dev or this PR does not require reindexing dev
  • Started reindex in anvildev or this PR does not require reindexing anvildev
  • Checked for, triaged and possibly requeued messages in both fail queues in dev or this PR does not require reindexing dev
  • Checked for, triaged and possibly requeued messages in both fail queues in anvildev or this PR does not require reindexing anvildev
  • Emptied fail queues in dev or this PR does not require reindexing dev
  • Emptied fail queues in anvildev or this PR does not require reindexing anvildev
  • Restarted the Data Browser pipeline for the ucsc/hca/dev branch on GitLab in dev or this PR does not require reindexing dev
  • Restarted the Data Browser pipeline for the ucsc/lungmap/dev branch on GitLab in dev or this PR does not require reindexing dev
  • Restarted deploy_browser job in the GitLab pipeline for this PR in dev or this PR does not require reindexing dev
  • Restarted the Data Browser pipeline for the ucsc/anvil/anvildev branch on GitLab in anvildev or this PR does not require reindexing anvildev
  • Restarted deploy_browser job in the GitLab pipeline for this PR in anvildev or this PR does not require reindexing anvildev

Operator (mirroring)

  • Started mirroring in dev or this PR does not require mirroring dev
  • Started mirroring in anvildev or this PR does not require mirroring anvildev
  • Checked for, triaged and possibly requeued messages in mirror fail queue in dev or this PR does not require mirroring dev
  • Checked for, triaged and possibly requeued messages in mirror fail queue in anvildev or this PR does not require mirroring anvildev
  • Emptied mirror fail queue in dev or this PR does not require mirroring dev
  • Emptied mirror fail queue in anvildev or this PR does not require mirroring anvildev

Operator

  • Propagated the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels to the next promotion PRs or this PR carries none of these labels
  • Propagated any specific instructions related to the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labels
  • PR is assigned to no one

Shorthand for review comments

  • L line is too long
  • W line wrapping is wrong
  • Q bad quotes
  • F other formatting problem

@github-actions github-actions bot added the orange label Jul 8, 2025
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch from b5212ca to 7ed3673 Compare July 8, 2025 00:47
@codecov

codecov bot commented Jul 8, 2025

Codecov Report

❌ Patch coverage is 87.87879% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.13%. Comparing base (c566660) to head (899bb08).

Files with missing lines    Patch %   Lines
test/integration_test.py    0.00%     5 Missing ⚠️
src/azul/azulclient.py      33.33%    2 Missing ⚠️
src/azul/__init__.py        66.66%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7258      +/-   ##
===========================================
- Coverage    85.13%   85.13%   -0.01%     
===========================================
  Files          156      156              
  Lines        22261    22289      +28     
===========================================
+ Hits         18952    18975      +23     
- Misses        3309     3314       +5     

@coveralls

coveralls commented Jul 8, 2025

Coverage Status

coverage: 85.327% (-0.004%) from 85.331%
when pulling 899bb08 on issues/dsotirho-ucsc/6123-manifest-content-hash
into c566660 on develop.

@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch from 7ed3673 to d7e9024 Compare July 8, 2025 19:56
@dsotirho-ucsc dsotirho-ucsc added the API API change affecting callers label Jul 8, 2025
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch 10 times, most recently from 2a9c76f to e4693d5 Compare July 10, 2025 17:27
@dsotirho-ucsc
Contributor Author

7258_IT_2025-07-10.txt

achave11-ucsc
achave11-ucsc previously approved these changes Jul 11, 2025
Member

@achave11-ucsc achave11-ucsc left a comment

LGTM ✅

@achave11-ucsc achave11-ucsc marked this pull request as ready for review July 11, 2025 23:48
"""
return self._manifest_hash('bundles')

def _manifest_hash(self, base: str) -> int:
Member

PL (modal parameter smell, literal string)

@hannes-ucsc hannes-ucsc removed their assignment Jul 14, 2025
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch 3 times, most recently from bd5d131 to 81a2b78 Compare July 21, 2025 23:59
@hannes-ucsc hannes-ucsc added the 2 reviews [process] Lead requested changes twice label Jul 28, 2025
@hannes-ucsc hannes-ucsc removed their assignment Jul 28, 2025
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch 6 times, most recently from f7e2b60 to 83f930f Compare August 4, 2025 22:20
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch from cd223e3 to b6669be Compare August 4, 2025 23:42
@dsotirho-ucsc
Contributor Author

Member

@hannes-ucsc hannes-ucsc left a comment

While editing the docstring, it occurred to me that we also need to check whether an index operation is currently in progress and, if so, refuse to generate the manifest. With the source-based hash, the risk of caching an incomplete manifest is now considerably higher. The response should have HTTP status 503 and a Retry-After header whose value is derived from the notification queue length (you can use a heuristic based on prod and anvilprod reindexes). Please also investigate whether there's a way to lock an index in Elasticsearch. The problem with checking the queues is that the check is not specific to a catalog. If index locks aren't possible, we might want to create a central index with one document per catalog. Please also explore what document locking mechanisms exist in Elasticsearch.
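
For illustration only, here is a minimal sketch of how a Retry-After value could be derived from the notification queue length. The throughput constant, the clamping bounds and both function names are assumptions made up for this example; the actual heuristic would have to be calibrated against prod and anvilprod reindexes, and a queue-based check remains catalog-agnostic as noted above.

```python
import math

# Assumed figures, not measured values: a real heuristic would be
# calibrated from prod and anvilprod reindexes.
ASSUMED_NOTIFICATIONS_PER_SECOND = 50.0
MIN_RETRY_AFTER = 30  # seconds
MAX_RETRY_AFTER = 3600  # seconds


def retry_after_seconds(queue_length: int,
                        rate: float = ASSUMED_NOTIFICATIONS_PER_SECOND
                        ) -> int:
    """
    Estimate how many seconds a client should wait before retrying, by
    dividing the queue length by an assumed processing rate and clamping
    the result to a sane range.
    """
    if queue_length < 0:
        raise ValueError('queue_length must be non-negative')
    estimate = math.ceil(queue_length / rate)
    return max(MIN_RETRY_AFTER, min(MAX_RETRY_AFTER, estimate))


def manifest_response(queue_length: int) -> tuple[int, dict[str, str]]:
    """
    Hypothetical request guard: respond with 503 and a Retry-After header
    while notifications are still queued, otherwise allow the request.
    """
    if queue_length > 0:
        headers = {'Retry-After': str(retry_after_seconds(queue_length))}
        return 503, headers
    return 200, {}
```

Clamping keeps the header useful at both extremes: a nearly empty queue still yields a short back-off, and a very long queue does not tell clients to stay away for hours.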

Index: src/azul/service/manifest_service.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/service/manifest_service.py b/src/azul/service/manifest_service.py
--- a/src/azul/service/manifest_service.py	(revision 839dceb6d0d7fc006d1817235ba013d1580d6ad0)
+++ b/src/azul/service/manifest_service.py	(date 1754633437226)
@@ -982,7 +982,7 @@
         filter_string = repr(sort_frozen(freeze(self.filters.explicit)))
         # If incremental index changes are disabled, we don't need to worry
         # about individual bundles, only sources.
-        content_hash = str(self.manifest_hash(config.enable_bundle_notifications))
+        content_hash = str(self.manifest_hash(by_bundle=config.enable_bundle_notifications))
         catalog = self.catalog
         format = self.format()
         manifest_hash_input = [
@@ -1147,29 +1147,38 @@
                                           **args))
 
     @cache
-    def manifest_hash(self, by_bundle: bool) -> int:
+    def manifest_hash(self, *, by_bundle: bool) -> int:
         """
-        Return a content hash for the manifest.
+        Return a hash of the input this generator builds the manifest from. The
+        input is the set of ES documents from the files index. For two generator
+        instances g1 and g2 created at two different points in time, and any
+        boolean value b, if
 
-        If `by_bundle` is True, the hash is computed from the fully-qualified
-        identifiers of all bundles containing files that match the current
-        filter. The return value approximates a hash of the content of the
-        manifest because a change of the file data requires a change to the file
-        metadata which requires a new bundle or bundle version.
+        g1.manifest_hash(by_bundle=b) == g2.manifest_hash(by_bundle=b)
 
-        If `by_bundle` is False, the hash is computed from the identifiers of
-        the sources from which projects/datasets containing files matching the
-        current filters were indexed. It's worth noting that a filter may match
-        a project/dataset but none of the project's files. For example, if a
-        project contains only files derived from either mouse brains or lion
-        hearts, the project will match the filter `species=lion and
-        organ=brain`, but none of its files will. If such a project/dataset is
-        added/removed to/from the index, the manifest hash returned for a given
-        filter will be different even though the contents of the manifest hasn't
-        changed, as no matching files were added or removed.
+        then there is a high probability that the manifests generated by g1 and
+        g2 contain the same set of entries. This test can be used in deciding
+        whether g2 can reuse g1's manifest, thereby avoiding an expensive
+        operation. A false positive occurs when the hashes are equal but the
+        inputs differ. A false negative occurs when the hashes differ, but the
+        inputs are equal. False negatives are less problematic because they only
+        lead to redundant computations: the manifest is regenerated when it
+        could have been reused. False positives are problematic because they
+        lead to a manifest being reused erroneously, yielding an incorrect
+        manifest that is inconsistent with the input.
 
-        So while the hash computed from the sources is less sensitive than the
-        one computed from the bundles, it can be computed much more quickly.
+        If ``by_bundle`` is True, the hash is computed from the fully-qualified
+        identifiers (FQID) of all bundles (subgraphs) containing files that
+        match the current filter. The rate of false negatives is low because a
+        change to any file entity requires a new bundle or a new bundle version,
+        both of which have different FQIDs, leading to a different hash. This
+        mode is slower and should be used if the index is changing or is likely
+        to change due to the incremental incorporation of bundles.
+
+        If ``by_bundle`` is False, the hash is instead computed from the set of
+        identifiers of the sources that contributed files matching the current
+        filters. This mode should *not* be used if the index is changing or is
+        likely to change due to the incremental incorporation of bundles.
         """
         log.debug('Computing content hash for manifest from %s using %r ...',
                   'bundles' if by_bundle else 'sources', self.filters)
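
The trade-off described in the docstring above can be illustrated with a small, self-contained sketch. This is not the project's implementation: the snapshot structure, the identifiers and the helper names are hypothetical, and SHA-256 is used only as a convenient stand-in for whatever hash the service computes.

```python
import hashlib
from typing import Iterable


def identifier_hash(identifiers: Iterable[str]) -> int:
    """
    Order-independent hash of a set of identifiers: duplicates are removed
    and the remainder is sorted before being fed to the digest.
    """
    digest = hashlib.sha256()
    for identifier in sorted(set(identifiers)):
        digest.update(identifier.encode())
        digest.update(b'\0')  # separator so 'ab', 'c' differs from 'a', 'bc'
    return int.from_bytes(digest.digest()[:8], 'big')


# Hypothetical index snapshots. Each matching file is represented by the
# source it was indexed from and the FQID of the bundle containing it.
Snapshot = list[tuple[str, str]]

before: Snapshot = [('source-a', 'bundle-1/v1'), ('source-a', 'bundle-2/v1')]
# A bundle is incorporated incrementally into an already indexed source
after: Snapshot = before + [('source-a', 'bundle-3/v1')]


def manifest_hash(snapshot: Snapshot, *, by_bundle: bool) -> int:
    return identifier_hash(bundle if by_bundle else source
                           for source, bundle in snapshot)


# The bundle-based hash notices the new bundle ...
bundle_hash_changed = (manifest_hash(before, by_bundle=True)
                       != manifest_hash(after, by_bundle=True))
# ... but the source-based hash does not: a false positive
source_hash_changed = (manifest_hash(before, by_bundle=False)
                       != manifest_hash(after, by_bundle=False))
```

In this sketch the source-based hash stays the same even though the input changed, which is the false positive the docstring warns about when bundles are incorporated incrementally.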

]
}
},
"/{catalog}/{action}": {
Member

The script that generates this file should patch the environment variable to True

Comment on lines 144 to 135
if args.local and not config.enable_bundle_notifications:
    parser.error('Local reindexing is not available while bundle '
                 'notifications are disabled.')

Member

This is yet another client-side guard which we agreed in PL to remove.

Comment on lines 50 to 57
help='Do not offload the listing of subgraphs to the indexer Lambda function. When this option is '
'used, this script queries the repository without partitioning, and the indexer notification '
'endpoint is invoked for each subgraph individually and concurrently using worker threads. '
'This is magnitudes slower than remote (i.e. partitioned) indexing. If this option is not '
'used (the default), the set of subgraphs matching the query is partitioned using the '
'partition prefix length configured for each of the catalog sources being reindexed. Each '
'query partition is processed independently and remotely by the indexer lambda. The index '
'Lambda function queries the repository for each partition and queues a notification for each '
'matching subgraph in the partition.')
help=(
'' if config.enable_bundle_notifications else '**DISABLED** '
) + (
'Do not offload the listing of subgraphs to the indexer Lambda function. When '
'this option is used, this script queries the repository without partitioning, '
'and the indexer notification endpoint is invoked for each subgraph '
'individually and concurrently using worker threads. This is magnitudes slower '
'than remote (i.e. partitioned) indexing. If this option is not used (the '
'default), the set of subgraphs matching the query is partitioned using the '
'partition prefix length configured for each of the catalog sources being '
'reindexed. Each query partition is processed independently and remotely by '
'the indexer lambda. The index Lambda function queries the repository for each '
'partition and queues a notification for each matching subgraph in the '
'partition.'
)
)
Member

Please revert this. If you want to reformat the help text, you should do that in a separate commit. But then you should reformat all help texts in this file, to be consistent.

Comment on lines 893 to 895
@patch.object(type(config),
              'enable_bundle_notifications',
              new=PropertyMock(return_value=True))
Member

This exact same decorator is used four times. Please DRY.
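
One possible way to remove the repetition, sketched under assumptions: the patcher is created once at module level and reused as a decorator. The `Config` class below is a hypothetical stand-in for the project's config object, and the name `with_bundle_notifications` is invented for this example.

```python
from unittest.mock import PropertyMock, patch


class Config:
    """Hypothetical stand-in for the project's config object."""

    @property
    def enable_bundle_notifications(self) -> bool:
        return False


config = Config()

# Define the patcher once. Because `new` is given explicitly, the decorated
# functions do not receive an extra mock argument.
with_bundle_notifications = patch.object(type(config),
                                         'enable_bundle_notifications',
                                         new=PropertyMock(return_value=True))


@with_bundle_notifications
def test_first() -> None:
    # The property is patched only for the duration of the call
    assert config.enable_bundle_notifications is True


@with_bundle_notifications
def test_second() -> None:
    assert config.enable_bundle_notifications is True
```

Each decorated function enters and exits the patch around its own call, so the original property is restored between tests.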

@hannes-ucsc hannes-ucsc added 3 reviews [process] Lead requested changes thrice and removed 2 reviews [process] Lead requested changes twice labels Aug 8, 2025
@hannes-ucsc hannes-ucsc removed their assignment Aug 8, 2025
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch from 5b7dcac to 2549975 Compare August 26, 2025 01:37
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch 2 times, most recently from 21bf218 to 75cd24e Compare August 28, 2025 18:09
@hannes-ucsc hannes-ucsc linked an issue Sep 11, 2025 that may be closed by this pull request
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch from 75cd24e to 899bb08 Compare September 18, 2025 22:48
@hannes-ucsc hannes-ucsc removed the orange label Oct 2, 2025

Labels

3 reviews [process] Lead requested changes thrice API API change affecting callers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Manifest content hash computation times out

5 participants