Skip to content
This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Revisit FileBasedSignatureProvider for a possible performance optimization #271

Open
imback82 opened this issue Nov 24, 2020 · 13 comments · May be fixed by #404
Open

Revisit FileBasedSignatureProvider for a possible performance optimization #271

imback82 opened this issue Nov 24, 2020 · 13 comments · May be fixed by #404

Comments

@imback82
Copy link
Contributor

imback82 commented Nov 24, 2020

@apoorvedave1 mentioned:

"could you also hide this feature behind a flag which defaults to false?
My concern is I think this would be unnecessary calculation for regular scenarios. In any case, if signature doesn't match, RuleUtils will make sure that if index can be used, it will be used right? I am just wondering why bother O(NlogN) for sorting always for a rare scenario which will still work without this change(assuming hybrid scan is enabled eventually by default)?

Alternately if we still want to fix this issue, could you avoid sorting? Instead you could convert fileInfos to set and then create fingerprint. Set will ensure the order of iteration is unique for a unique collection of elements. That way we can still achieve O(N)."

Originally posted by @apoorvedave1 in #268 (comment)

@imback82
Copy link
Contributor Author

Related thread: #268 (comment)

@sezruby
Copy link
Collaborator

sezruby commented Nov 24, 2020

I think we cannot guarantee the order of HashSet iterator. It can differ depending on the internal implementation. (e.g. different hash bucket size, hash functions..)

@clee704
Copy link

clee704 commented Mar 30, 2021

Maybe I misunderstood the issue, but why is the order important? Why can't we simply XOR or even just add signatures? I'm asking this because I don't get why "false positives" mentioned in #268 are problematic. After all you must check if two objects are really equal, even though their hash codes are equal.

Also, is there a specific reason we use String instead of byte[] and use md5Hex instead of md5 for signatures?

@sezruby
Copy link
Collaborator

sezruby commented Mar 30, 2021

Maybe I misunderstood the issue, but why is the order important? Why can't we simply XOR or even just add signatures? I'm asking this because I don't get why "false positives" mentioned in #268 are problematic. After all you must check if two objects are really equal, even though their hash codes are equal.

@clee704 Since currently we don't check object after that, to avoid the comparison overhead during the optimize phase. (but for hybrid scan, we do compare all FileInfos)

Also, is there a specific reason we use String instead of byte[] and use md5Hex instead of md5 for signatures?

Because the hash lib - import org.apache.commons.codec.digest.DigestUtils takes & returns String.

@clee704
Copy link

clee704 commented Mar 31, 2021

Maybe I misunderstood the issue, but why is the order important? Why can't we simply XOR or even just add signatures? I'm asking this because I don't get why "false positives" mentioned in #268 are problematic. After all you must check if two objects are really equal, even though their hash codes are equal.

@clee704 Since currently we don't check object after that, to avoid the comparison overhead during the optimize phase. (but for hybrid scan, we do compare all FileInfos)

Okay. So it's like how SHA-1 hashes are used in git to identify objects. But if we accept the 2^-128 (or whatever bits you use) probability of collision, then I believe XOR or addition should be sufficient too to combine hashes of unordered elements. Assuming random distribution and 128-bit hash, the probability that A_1 + A_2 + ... + A_k == B_1 + B_2 + ... + B_k mod 2^128 (where A_i and B_i are hash values) happens is the same as that of A == B mod 2^128, which is, 2^-128, regardless of k.

Also, is there a specific reason we use String instead of byte[] and use md5Hex instead of md5 for signatures?

Because the hash lib - import org.apache.commons.codec.digest.DigestUtils takes & returns String.

DigestUtils has both functions for returning byte[] and String. byte[] is a better option for internal handling for performance, and we can encode/decode the value to/from HEX whenever needed (e.g. for human or JSON).

@sezruby
Copy link
Collaborator

sezruby commented Apr 1, 2021

@clee704 Yea I'm good with XOR approach & using byte[] (be aware of its format in JSON)

cc @imback82

@clee704
Copy link

clee704 commented Apr 1, 2021

Yeah, we should do proper HEX encoding/decoding when JSON is involved.

By the way, I think we should also avoid being locked in MD5 and should make it easy to change the underlying hashing algorithm (SHA-1, SHA-256, etc.) in the future.

clee704 pushed a commit to clee704/hyperspace that referenced this issue Apr 1, 2021
Fixes microsoft#271.

With newly added classes for fingerprinting, now it is easy to make a
fingerprint from various types of data.

FingerprintBuilder can be used to build a fingerprint from data.
FileBasedRelation.signature now takes it as an argument, so the
implementations don't have to do the hashing themselves. They can just
use the provided builder.

For unorderd combining, one can use bitwise XOR to combine multiple
fingerprints. This way, the order becomes irrelavant.
@clee704 clee704 linked a pull request Apr 1, 2021 that will close this issue
@sezruby
Copy link
Collaborator

sezruby commented Apr 1, 2021

Btw changing signature will break backward compatibility; so would be good to go with other breaking changes.

clee704 pushed a commit to clee704/hyperspace that referenced this issue Apr 2, 2021
Fixes microsoft#271.

With newly added classes for fingerprinting, now it is easy to make a
fingerprint from various types of data.

FingerprintBuilder can be used to build a fingerprint from data.
FileBasedRelation.signature now takes it as an argument, so the
implementations don't have to do the hashing themselves. They can just
use the provided builder.

For unorderd combining, one can use bitwise XOR to combine multiple
fingerprints. This way, the order becomes irrelavant.
@clee704
Copy link

clee704 commented Apr 2, 2021

I assume that the signatures are stored as JSON in users' data lake. Therefore, any change to the signature computation will make existing signatures won't work (even if we don't change the hash function). Am I right? Then how could you merge the PR #268? @sezruby

Do we provide some migration utility to update their index data?

@rapoth
Copy link
Contributor

rapoth commented Apr 2, 2021

@clee704 Migration utility is a great idea! Can you please open a feature request and link ot here so we can keep track of it? While it may not be necessary for the current version, I think as we move towards stabilization, it'd be useful to provide one for users so we don't always have to worry about breaking changes.

@clee704
Copy link

clee704 commented Apr 5, 2021

@rapoth Release notes state that whenever there are breaking changes, indexes should be "reconstructed". Does it mean that users should manually create indexes again? Because refreshAction won't work because it can't load the previous version's IndexLogEntry, or weird things can happen if it can (e.g. only the signature computation logic changed). Right? So that could be the why we might need a migration utility?

@sezruby
Copy link
Collaborator

sezruby commented Apr 7, 2021

Does it mean that users should manually create indexes again?

Yes

Because refreshAction won't work because it can't load the previous version's IndexLogEntry, or weird things can happen if it can (e.g. only the signature computation logic changed). Right?

Right, usually it failed to create IndexLogEntry, but in that case we could reuse the previous index data.
I think we could catch the exception from building IndexLogEntry & trigger the utility to create a new log file.

@clee704
Copy link

clee704 commented Apr 7, 2021

Does it mean that users should manually create indexes again?

Yes

Hmm, okay. Now I see a need for a migration utility.

Because refreshAction won't work because it can't load the previous version's IndexLogEntry, or weird things can happen if it can (e.g. only the signature computation logic changed). Right?

Right, usually it failed to create IndexLogEntry, but in that case we could reuse the previous index data.
I think we could catch the exception from building IndexLogEntry & trigger the utility to create a new log file.

Blindly trying to create a log from JSON and catching an exception doesn't seem to be a good approach, because we can't be sure that the constructed log is valid. For example, if only the signature computation logic is updated between Hyperspace versions, the log will be created just fine but with invalid signatures.

clee704 pushed a commit to clee704/hyperspace that referenced this issue May 4, 2021
Fixes microsoft#271.

With newly added classes for fingerprinting, now it is easy to make a
fingerprint from various types of data.

FingerprintBuilder can be used to build a fingerprint from data.
FileBasedRelation.signature now takes it as an argument, so the
implementations don't have to do the hashing themselves. They can just
use the provided builder.

For unorderd combining, one can use bitwise XOR to combine multiple
fingerprints. This way, the order becomes irrelavant.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants