@@ -985,7 +985,9 @@ def manifest_key(self) -> ManifestKey:
         filter_string = repr(sort_frozen(freeze(self.filters.explicit)))
         # If incremental index changes are disabled, we don't need to worry
         # about individual bundles, only sources.
-        content_hash = str(self.manifest_hash(config.enable_bundle_notifications))
+        content_hash = str(
+            self.manifest_hash(by_bundle=config.enable_bundle_notifications)
+        )
         catalog = self.catalog
         format = self.format()
         manifest_hash_input = [
@@ -1150,29 +1152,38 @@ def _azul_file_url(self,
                 **args))
 
     @cache
-    def manifest_hash(self, by_bundle: bool) -> int:
+    def manifest_hash(self, *, by_bundle: bool) -> int:
         """
-        Return a content hash for the manifest.
-
-        If `by_bundle` is True, the hash is computed from the fully-qualified
-        identifiers of all bundles containing files that match the current
-        filter. The return value approximates a hash of the content of the
-        manifest because a change of the file data requires a change to the file
-        metadata which requires a new bundle or bundle version.
-
-        If `by_bundle` is False, the hash is computed from the identifiers of
-        the sources from which projects/datasets containing files matching the
-        current filters were indexed. It's worth noting that a filter may match
-        a project/dataset but none of the project's files. For example, if a
-        project contains only files derived from either mouse brains or lion
-        hearts, the project will match the filter `species=lion and
-        organ=brain`, but none of its files will. If such a project/dataset is
-        added/removed to/from the index, the manifest hash returned for a given
-        filter will be different even though the contents of the manifest hasn't
-        changed, as no matching files were added or removed.
-
-        So while the hash computed from the sources is less sensitive than the
-        one computed from the bundles, it can be computed much more quickly.
+        Return a hash of the input this generator builds the manifest from. The
+        input is the set of ES documents from the files index. For two generator
+        instances g1 and g2 created at two different points in time, and any
+        boolean value b, if
+
+            g1.manifest_hash(by_bundle=b) == g2.manifest_hash(by_bundle=b)
+
+        then there is a high probability that the manifests generated by g1 and
+        g2 contain the same set of entries. This test can be used in deciding
+        whether g2 can reuse g1's manifest, thereby avoiding an expensive
+        operation. A false positive occurs when the hashes are equal but the
+        inputs differ. A false negative occurs when the hashes differ, but the
+        inputs are equal. False negatives are less problematic because they only
+        lead to redundant computations: the manifest is regenerated when it
+        could have been reused. False positives are problematic because they
+        lead to a manifest being reused erroneously, yielding an incorrect
+        manifest that is inconsistent with the input.
+
+        If ``by_bundle`` is True, the hash is computed from the fully-qualified
+        identifiers (FQIDs) of all bundles (subgraphs) containing files that
+        match the current filter. The rate of false positives is low because a
+        change to any file entity requires a new bundle or a new bundle version,
+        both of which have different FQIDs, leading to a different hash. This
+        mode is slower and should be used if the index is changing or is likely
+        to change due to the incremental incorporation of bundles.
+
+        If ``by_bundle`` is False, the hash is instead computed from the set of
+        identifiers of the sources that contributed files matching the current
+        filters. This mode should *not* be used if the index is changing or is
+        likely to change due to the incremental incorporation of bundles.
         """
         log.debug('Computing content hash for manifest from %s using %r ...',
                   'bundles' if by_bundle else 'sources', self.filters)
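The trade-off between the two modes of `manifest_hash` can be illustrated with a small, self-contained sketch. This is not the code from the diff: `ToyFile`, `toy_manifest_hash` and the sample identifiers are hypothetical stand-ins, and the real method queries the files index rather than an in-memory list. The sketch only assumes that the hash covers bundle FQIDs in one mode and source identifiers in the other.

```python
import hashlib
from typing import NamedTuple


class ToyFile(NamedTuple):
    # Hypothetical stand-in for one document in the files index
    name: str
    source: str
    bundle_fqid: str


def toy_manifest_hash(files: list[ToyFile], *, by_bundle: bool) -> str:
    """
    Hash the set of bundle FQIDs (by_bundle=True) or the set of source
    identifiers (by_bundle=False) of the given files. The bare `*` makes
    `by_bundle` keyword-only, mirroring the signature change in the diff.
    """
    ids = {f.bundle_fqid if by_bundle else f.source for f in files}
    digest = hashlib.sha256()
    for id_ in sorted(ids):
        # NUL separator so that adjacent identifiers cannot run together
        digest.update(id_.encode() + b'\0')
    return digest.hexdigest()


# The index before and after one more bundle from the same source is
# incorporated incrementally.
before = [ToyFile('a.fastq', 'source-1', 'bundle-1.v1')]
after = before + [ToyFile('b.fastq', 'source-1', 'bundle-2.v1')]

# Hashing by source cannot see the new bundle: the hashes are equal although
# the inputs differ, which is a false positive in the docstring's terms.
same_by_source = (toy_manifest_hash(before, by_bundle=False)
                  == toy_manifest_hash(after, by_bundle=False))

# Hashing by bundle picks up the new FQID, so the stale manifest is not reused.
same_by_bundle = (toy_manifest_hash(before, by_bundle=True)
                  == toy_manifest_hash(after, by_bundle=True))
```

In this toy model `same_by_source` is `True` and `same_by_bundle` is `False`, which is why the source-based mode is only safe while the index is not changing incrementally. A positional call such as `toy_manifest_hash(before, True)` raises `TypeError`, so every call site has to spell out which mode it is asking for.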