Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuning for multiple columns part 1: Computing Sum histograms for multiple columns #523

Merged
merged 13 commits into from
Sep 9, 2024

Conversation

dvadym
Copy link
Collaborator

@dvadym dvadym commented Aug 30, 2024

This PR introduces computing sum histograms for multiple columns. This covers cases when DP aggregations can be presented in the pseudo-SQL terms as

SELECT partition_key, DP_SUM(column1), DP_SUM(columns2)
GROUP BY partition_key

Histograms for each column is computed independently. This is implemented by changing existing code of computing histogram on colleciton (value: float) to computing per key histograms of collection (column_id:int, value:float)

@dvadym dvadym changed the title (WIP) Tuning for multiple aggregations Tuning for multiple columns part 1: Computing Sum histograms for multiple columns Sep 3, 2024
analysis/contribution_bounders.py Outdated Show resolved Hide resolved
analysis/contribution_bounders.py Outdated Show resolved Hide resolved
# output: v_0 + ....
# k columns (k > 1):
# input: values = [v_0=(v00, ... v0(k-1)), ...]
# output: (00+v10+..., ...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 00 supposed to be v00?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, yex

analysis/contribution_bounders.py Outdated Show resolved Hide resolved
pipeline_dp/data_extractors.py Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, why is this file showing up on the diff when it says "Empty file"?

NUMBER_OF_BUCKETS_SUM_HISTOGRAM.

Args:
col: collection with elements ((privacy_id, partition_key), value(s)).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does value(s) mean?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added Where value(s) can be one float of list of floats.

return backend.flat_map(col, flat_values, "Flat values")


def _compute_linf_sum_contributions_histogram(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the functions below these seem to be doing very similar things. Do you think it would be possible to merge them into a single function with a parameter for choosing the type of histogram? Would that be more readable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of those functions have different input formats, all they have format:

<pre-processing>
_compute_frequency_histogram_per_key

And all heavy lifiting has been already extracted into _compute_frequency_histogram_per_key, and pre-processing is simple.

input, expected,
pre_aggregated):
# Lambdas are used for returning input and expected. Passing lists
# instead lead to printing whole lists as test names in the output.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: leads

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is moved to sum_histogram_test. Fixed that in sum_histogram_test

Copy link
Collaborator Author

@dvadym dvadym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comments! PTAL

# output: v_0 + ....
# k columns (k > 1):
# input: values = [v_0=(v00, ... v0(k-1)), ...]
# output: (00+v10+..., ...)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, yex

# Computations is the following:
# 1. Find min_x = min(X), max_x = max(X) of X
# 2. Split the segment [min_x, max_x] in NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000
# equals size intervals [l_i, r_i), the last interval includes both endpoints.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant that it includes max_x also to the left endpoint. I've updated


Attributes:
left, right: bounds on interval on which we compute histogram.
num_buckets: number of buckets on [left, right]. Buckets have the same
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.
it can be different in tests and in principle in future it can be different, so it's better to have it as a parameter

right: float,
num_buckets: int = NUMBER_OF_BUCKETS_SUM_HISTOGRAM,
):
assert left <= right, "The left bound must be smaller then the right one, but {left=} and {right=}"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== is not a problem. E.g. when all values are the same. It can be for example if somebody want's to compute count, but SUM is used, and the value is 1.

NUMBER_OF_BUCKETS_SUM_HISTOGRAM.

Args:
col: collection with elements ((privacy_id, partition_key), value(s)).
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added Where value(s) can be one float of list of floats.

@@ -254,6 +254,8 @@ def test_calculate_private_contribution_filters_partitions(self):
result,
pipeline_dp.PrivateContributionBounds(max_partitions_contributed=1))

@unittest.skip(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear why it fails. Anyway this functionality is not used for now. If it's really the problem we will during development of "Multi column feature" :)

return backend.flat_map(col, flat_values, "Flat values")


def _compute_linf_sum_contributions_histogram(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of those functions have different input formats, all they have format:

<pre-processing>
_compute_frequency_histogram_per_key

And all heavy lifiting has been already extracted into _compute_frequency_histogram_per_key, and pre-processing is simple.

Copy link
Collaborator Author

@dvadym dvadym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed comments in moved code

input, expected,
pre_aggregated):
# Lambdas are used for returning input and expected. Passing lists
# instead lead to printing whole lists as test names in the output.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is moved to sum_histogram_test. Fixed that in sum_histogram_test

Copy link
Collaborator Author

@dvadym dvadym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL


NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000
# Functions _compute_* computes histogram for counts. TODO: move them to
# a separate file, similar to sum_histogram_computation.py.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to split. This PR is already big.

"""Packs histograms from a list to ContributionHistograms."""
l0_contributions = l1_contributions = None
linf_contributions = linf_sum_contributions = None
count_per_partition = privacy_id_per_partition_count = None
sum_per_partition_histogram = None
for histogram in histograms:
if histogram.name == hist.HistogramType.L0_CONTRIBUTIONS:
if isinstance(histogram, Iterable):
if not histogram:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[] is iterable. And that's valid case, when no values are provided (for example for count) or dataset is empty


NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000
# Functions _compute_* computes histogram for counts. TODO: move them to
# a separate file, similar to sum_histogram_computation.py.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to split. This PR is already big.

pipeline_dp/data_extractors.py Outdated Show resolved Hide resolved
@dvadym
Copy link
Collaborator Author

dvadym commented Sep 9, 2024

Thanks for review!

@dvadym dvadym merged commit 71875ea into OpenMined:main Sep 9, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants