Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logarithmic histograms for sum #533

Merged
merged 8 commits into from
Jan 22, 2025
Merged

Conversation

dvadym
Copy link
Collaborator

@dvadym dvadym commented Jan 15, 2025

The current implementation of sum histogram uses linear buckets, namely 10^4 buckets between [min_value, max_value]. In practice significant downsides where found:

  1. Often the dataset is concentrated, so the biggest part of the dataset is in 1 bucket
  2. Bucket bounds not round numbers, which leads in candidate search for non-round numbers (or additional logic for choosing round numbers, which is not implemented yet).

This PR inroduces logarithmic histograms for sum, which has the following buckets:

[buckets_for_negative] + [bucket_for_zero] + [buckets_for_positive].

The number is assigned to Bucket(lower=get_lower(x), upper=get_upper(x))

where:

for x > 0:
get_lower(x) - returns the number with 2 digits the same as in x, and get_upper is the next 2 digits number e.g.

get_lower(123) == 120, get_upper(123) == 130, get_lower(0.1234) == 0.12,get_upper(0.1234) == 0.13

The old histograms will be removed in future, when they will be used everywhere

@dvadym dvadym changed the title (WIP) Log sum histograms Logarithmic histograms for sum Jan 17, 2025
@dvadym dvadym requested a review from RamSaw January 17, 2025 10:05
NUMBER_OF_BUCKETS)
col = backend.to_multi_transformable_collection(col)

return backend.flatten((_compute_frequency_histogram_per_key(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe extract and save results of _compute_frequency_histogram_per_key into variables to improve readability a little

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thx, done

Copy link
Collaborator Author

@dvadym dvadym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review!

NUMBER_OF_BUCKETS)
col = backend.to_multi_transformable_collection(col)

return backend.flatten((_compute_frequency_histogram_per_key(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thx, done

@dvadym dvadym merged commit fa6cd38 into OpenMined:main Jan 22, 2025
6 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants