-
-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tuning for multiple columns part 1: Computing Sum histograms for multiple columns #523
Conversation
analysis/contribution_bounders.py
Outdated
# output: v_0 + .... | ||
# k columns (k > 1): | ||
# input: values = [v_0=(v00, ... v0(k-1)), ...] | ||
# output: (00+v10+..., ...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 00 supposed to be v00?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx, yex
tests/dataset_histograms/__init__.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, why is this file showing up on the diff when it says "Empty file"?
NUMBER_OF_BUCKETS_SUM_HISTOGRAM. | ||
|
||
Args: | ||
col: collection with elements ((privacy_id, partition_key), value(s)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does value(s)
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added Where value(s) can be one float of list of floats.
return backend.flat_map(col, flat_values, "Flat values") | ||
|
||
|
||
def _compute_linf_sum_contributions_histogram( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the functions below these seem to be doing very similar things. Do you think it would be possible to merge them into a single function with a parameter for choosing the type of histogram? Would that be more readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of those functions have different input formats, all they have format:
<pre-processing>
_compute_frequency_histogram_per_key
And all heavy lifiting has been already extracted into _compute_frequency_histogram_per_key
, and pre-processing is simple.
input, expected, | ||
pre_aggregated): | ||
# Lambdas are used for returning input and expected. Passing lists | ||
# instead lead to printing whole lists as test names in the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: leads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is moved to sum_histogram_test. Fixed that in sum_histogram_test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments! PTAL
analysis/contribution_bounders.py
Outdated
# output: v_0 + .... | ||
# k columns (k > 1): | ||
# input: values = [v_0=(v00, ... v0(k-1)), ...] | ||
# output: (00+v10+..., ...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx, yex
# Computations is the following: | ||
# 1. Find min_x = min(X), max_x = max(X) of X | ||
# 2. Split the segment [min_x, max_x] in NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000 | ||
# equals size intervals [l_i, r_i), the last interval includes both endpoints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I meant that it includes max_x also to the left endpoint. I've updated
|
||
Attributes: | ||
left, right: bounds on interval on which we compute histogram. | ||
num_buckets: number of buckets on [left, right]. Buckets have the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
it can be different in tests and in principle in future it can be different, so it's better to have it as a parameter
right: float, | ||
num_buckets: int = NUMBER_OF_BUCKETS_SUM_HISTOGRAM, | ||
): | ||
assert left <= right, "The left bound must be smaller then the right one, but {left=} and {right=}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
== is not a problem. E.g. when all values are the same. It can be for example if somebody want's to compute count, but SUM is used, and the value is 1.
NUMBER_OF_BUCKETS_SUM_HISTOGRAM. | ||
|
||
Args: | ||
col: collection with elements ((privacy_id, partition_key), value(s)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added Where value(s) can be one float of list of floats.
@@ -254,6 +254,8 @@ def test_calculate_private_contribution_filters_partitions(self): | |||
result, | |||
pipeline_dp.PrivateContributionBounds(max_partitions_contributed=1)) | |||
|
|||
@unittest.skip( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear why it fails. Anyway this functionality is not used for now. If it's really the problem we will during development of "Multi column feature" :)
return backend.flat_map(col, flat_values, "Flat values") | ||
|
||
|
||
def _compute_linf_sum_contributions_histogram( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of those functions have different input formats, all they have format:
<pre-processing>
_compute_frequency_histogram_per_key
And all heavy lifiting has been already extracted into _compute_frequency_histogram_per_key
, and pre-processing is simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed comments in moved code
input, expected, | ||
pre_aggregated): | ||
# Lambdas are used for returning input and expected. Passing lists | ||
# instead lead to printing whole lists as test names in the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is moved to sum_histogram_test. Fixed that in sum_histogram_test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
|
||
NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000 | ||
# Functions _compute_* computes histogram for counts. TODO: move them to | ||
# a separate file, similar to sum_histogram_computation.py. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to split. This PR is already big.
"""Packs histograms from a list to ContributionHistograms.""" | ||
l0_contributions = l1_contributions = None | ||
linf_contributions = linf_sum_contributions = None | ||
count_per_partition = privacy_id_per_partition_count = None | ||
sum_per_partition_histogram = None | ||
for histogram in histograms: | ||
if histogram.name == hist.HistogramType.L0_CONTRIBUTIONS: | ||
if isinstance(histogram, Iterable): | ||
if not histogram: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[] is iterable. And that's valid case, when no values are provided (for example for count) or dataset is empty
|
||
NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000 | ||
# Functions _compute_* computes histogram for counts. TODO: move them to | ||
# a separate file, similar to sum_histogram_computation.py. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to split. This PR is already big.
Thanks for review! |
This PR introduces computing sum histograms for multiple columns. This covers cases when DP aggregations can be presented in the pseudo-SQL terms as
Histograms for each column is computed independently. This is implemented by changing existing code of computing histogram on colleciton
(value: float)
to computing per key histograms of collection(column_id:int, value:float)