Tuning for multiple columns part 1: Computing Sum histograms for multiple columns #523

dvadym · 2024-08-30T15:23:13Z

This PR introduces computing sum histograms for multiple columns. This covers cases when DP aggregations can be presented in the pseudo-SQL terms as

SELECT partition_key, DP_SUM(column1), DP_SUM(columns2)
GROUP BY partition_key

Histograms for each column is computed independently. This is implemented by changing existing code of computing histogram on colleciton (value: float) to computing per key histograms of collection (column_id:int, value:float)

analysis/contribution_bounders.py

miracvbasaran · 2024-09-05T15:37:17Z

analysis/contribution_bounders.py

+                #   output: v_0 + ....
+                # k columns (k > 1):
+                #   input: values = [v_0=(v00, ... v0(k-1)), ...]
+                #   output: (00+v10+..., ...)


Is 00 supposed to be v00?

analysis/contribution_bounders.py

pipeline_dp/data_extractors.py

miracvbasaran · 2024-09-05T15:59:21Z

tests/dataset_histograms/__init__.py

Hmm, why is this file showing up on the diff when it says "Empty file"?

miracvbasaran · 2024-09-05T16:00:53Z

pipeline_dp/dataset_histograms/sum_histogram_computation.py

+    NUMBER_OF_BUCKETS_SUM_HISTOGRAM.
+
+    Args:
+        col: collection with elements ((privacy_id, partition_key), value(s)).


What does value(s) mean?

I've added Where value(s) can be one float of list of floats.

miracvbasaran · 2024-09-05T16:03:22Z

pipeline_dp/dataset_histograms/sum_histogram_computation.py

+    return backend.flat_map(col, flat_values, "Flat values")
+
+
+def _compute_linf_sum_contributions_histogram(


All the functions below these seem to be doing very similar things. Do you think it would be possible to merge them into a single function with a parameter for choosing the type of histogram? Would that be more readable?

All of those functions have different input formats, all they have format:

<pre-processing> _compute_frequency_histogram_per_key

And all heavy lifiting has been already extracted into _compute_frequency_histogram_per_key, and pre-processing is simple.

tests/dataset_histograms/computing_histograms_test.py

miracvbasaran · 2024-09-05T16:04:36Z

tests/dataset_histograms/computing_histograms_test.py

-                                                      input, expected,
-                                                      pre_aggregated):
-        # Lambdas are used for returning input and expected. Passing lists
-        # instead lead to printing whole lists as test names in the output.


nit: leads

this code is moved to sum_histogram_test. Fixed that in sum_histogram_test

dvadym

Thanks for comments! PTAL

dvadym · 2024-09-06T08:41:21Z

analysis/contribution_bounders.py

+                #   output: v_0 + ....
+                # k columns (k > 1):
+                #   input: values = [v_0=(v00, ... v0(k-1)), ...]
+                #   output: (00+v10+..., ...)


dvadym · 2024-09-06T08:48:30Z

pipeline_dp/dataset_histograms/sum_histogram_computation.py

+# Computations is the following:
+# 1. Find min_x = min(X), max_x = max(X) of X
+# 2. Split the segment [min_x, max_x] in NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000
+# equals size intervals [l_i, r_i), the last interval includes both endpoints.


Yes, I meant that it includes max_x also to the left endpoint. I've updated

dvadym · 2024-09-06T08:52:22Z

pipeline_dp/dataset_histograms/sum_histogram_computation.py

+
+    Attributes:
+        left, right: bounds on interval on which we compute histogram.
+        num_buckets: number of buckets on [left, right]. Buckets have the same


done.
it can be different in tests and in principle in future it can be different, so it's better to have it as a parameter

dvadym · 2024-09-06T08:53:36Z

pipeline_dp/dataset_histograms/sum_histogram_computation.py

+        right: float,
+        num_buckets: int = NUMBER_OF_BUCKETS_SUM_HISTOGRAM,
+    ):
+        assert left <= right, "The left bound must be smaller then the right one, but {left=} and {right=}"


== is not a problem. E.g. when all values are the same. It can be for example if somebody want's to compute count, but SUM is used, and the value is 1.

dvadym · 2024-09-06T08:55:38Z

pipeline_dp/dataset_histograms/sum_histogram_computation.py

+    NUMBER_OF_BUCKETS_SUM_HISTOGRAM.
+
+    Args:
+        col: collection with elements ((privacy_id, partition_key), value(s)).


I've added Where value(s) can be one float of list of floats.

dvadym · 2024-09-06T08:56:57Z

tests/dp_engine_test.py

@@ -254,6 +254,8 @@ def test_calculate_private_contribution_filters_partitions(self):
            result,
            pipeline_dp.PrivateContributionBounds(max_partitions_contributed=1))

+    @unittest.skip(


It's not clear why it fails. Anyway this functionality is not used for now. If it's really the problem we will during development of "Multi column feature" :)

dvadym · 2024-09-06T09:01:57Z

pipeline_dp/dataset_histograms/sum_histogram_computation.py

+    return backend.flat_map(col, flat_values, "Flat values")
+
+
+def _compute_linf_sum_contributions_histogram(


All of those functions have different input formats, all they have format:

<pre-processing> _compute_frequency_histogram_per_key

And all heavy lifiting has been already extracted into _compute_frequency_histogram_per_key, and pre-processing is simple.

dvadym

Missed comments in moved code

dvadym · 2024-09-06T13:23:54Z

tests/dataset_histograms/computing_histograms_test.py

-                                                      input, expected,
-                                                      pre_aggregated):
-        # Lambdas are used for returning input and expected. Passing lists
-        # instead lead to printing whole lists as test names in the output.


this code is moved to sum_histogram_test. Fixed that in sum_histogram_test

dvadym

PTAL

dvadym · 2024-09-06T13:31:40Z

pipeline_dp/dataset_histograms/computing_histograms.py


-NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000
+# Functions _compute_* computes histogram for counts. TODO: move them to
+# a separate file, similar to sum_histogram_computation.py.


It's better to split. This PR is already big.

dvadym · 2024-09-06T13:32:51Z

pipeline_dp/dataset_histograms/computing_histograms.py

    """Packs histograms from a list to ContributionHistograms."""
    l0_contributions = l1_contributions = None
    linf_contributions = linf_sum_contributions = None
    count_per_partition = privacy_id_per_partition_count = None
    sum_per_partition_histogram = None
    for histogram in histograms:
-        if histogram.name == hist.HistogramType.L0_CONTRIBUTIONS:
+        if isinstance(histogram, Iterable):
+            if not histogram:


[] is iterable. And that's valid case, when no values are provided (for example for count) or dataset is empty

dvadym · 2024-09-06T13:34:24Z

pipeline_dp/dataset_histograms/computing_histograms.py


-NUMBER_OF_BUCKETS_SUM_HISTOGRAM = 10000
+# Functions _compute_* computes histogram for counts. TODO: move them to
+# a separate file, similar to sum_histogram_computation.py.


It's better to split. This PR is already big.

pipeline_dp/data_extractors.py

dvadym · 2024-09-09T13:12:29Z

Thanks for review!

dvadym added 4 commits August 30, 2024 17:21

Tuning for multiple aggregations

a9ba1fe

pipeline functions tests

b10287c

Test for sum_histograms

bdca9e1

Docstring, more tests

ac24ac4

dvadym changed the title ~~(WIP) Tuning for multiple aggregations~~ Tuning for multiple columns part 1: Computing Sum histograms for multiple columns Sep 3, 2024

dvadym requested a review from miracvbasaran September 3, 2024 13:59

dvadym added 5 commits September 3, 2024 16:03

Type annotation fix

19529b1

fix test

ae2be8e

fix

bcc6997

skip

9874338

fix

cac1ba7

miracvbasaran reviewed Sep 5, 2024

View reviewed changes

Addressed comments

95800fc

dvadym commented Sep 6, 2024

View reviewed changes

Minor fix

23846ab

dvadym commented Sep 6, 2024

View reviewed changes

Addressing comments

8b70a4a

dvadym commented Sep 6, 2024

View reviewed changes

miracvbasaran approved these changes Sep 9, 2024

View reviewed changes

pipeline_dp/data_extractors.py Outdated Show resolved Hide resolved

Addressed comments

8665f83

dvadym merged commit 71875ea into OpenMined:main Sep 9, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tuning for multiple columns part 1: Computing Sum histograms for multiple columns #523

Tuning for multiple columns part 1: Computing Sum histograms for multiple columns #523

dvadym commented Aug 30, 2024 •

edited

Loading

miracvbasaran Sep 5, 2024

dvadym Sep 6, 2024

miracvbasaran Sep 5, 2024

miracvbasaran Sep 5, 2024

dvadym Sep 6, 2024

miracvbasaran Sep 5, 2024

dvadym Sep 6, 2024

miracvbasaran Sep 5, 2024

dvadym Sep 6, 2024

dvadym left a comment

dvadym Sep 6, 2024

dvadym Sep 6, 2024

dvadym Sep 6, 2024

dvadym Sep 6, 2024

dvadym Sep 6, 2024

dvadym Sep 6, 2024

dvadym Sep 6, 2024

dvadym left a comment

dvadym Sep 6, 2024

dvadym left a comment

dvadym Sep 6, 2024

dvadym Sep 6, 2024

dvadym Sep 6, 2024

dvadym commented Sep 9, 2024

		return backend.flat_map(col, flat_values, "Flat values")


		def _compute_linf_sum_contributions_histogram(

Tuning for multiple columns part 1: Computing Sum histograms for multiple columns #523

Tuning for multiple columns part 1: Computing Sum histograms for multiple columns #523

Conversation

dvadym commented Aug 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dvadym left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dvadym left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dvadym left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dvadym commented Sep 9, 2024

dvadym commented Aug 30, 2024 •

edited

Loading