speed/time-optimization of label-aggregation step #57

Merged
merged 2 commits into main from jsl/54-label-agg-optimization on Apr 13, 2021

Conversation

jesteria (Collaborator) commented Apr 7, 2021

This work attempts to reduce the bottleneck of – or anyway the time spent on – the label-aggregation step, by optimizing its (single-threaded) code.

In particular, this involves ensuring that data are only processed/iterated the minimum number of times, and relying on lower-level and likely more optimized tools (namely NumPy over Pandas) where appropriate.

While this likely does not satisfy #54 in its theoretical entirety – more could perhaps be done with this single-threaded code, and it's foreseeable that some implementation of multiprocessing would help further – nonetheless, insofar as this work can be shown to alleviate the high-priority need, and if it is otherwise shown to be acceptable, we might want to merge it as a resolution to #54 (and open separate issue(s) for further investigation of optimizations here).

Resolves #54.
Resolves #56.

@jesteria jesteria requested a review from JordanHolland April 7, 2021 16:13
@jesteria jesteria self-assigned this Apr 7, 2021
import numpy as np

from . import LabelAggregator, AggregationLengthError, AggregationPathError


NPT_DTYPE = np.dtype('int8')
jesteria (Collaborator, Author):

On quick consideration, this seems a reasonable optimization of NumPy/Pandas's handling of nPrint data 🤷

JordanHolland (Collaborator) commented Apr 8, 2021:

This is a totally reasonable assumption, and right for 99.9% of the features. Unfortunately, we have a few features that could extend beyond int8, such as the relative timestamp of each incoming packet (-R in nPrint). Should we just look for this column and, if we don't find it, use int8?

jesteria (Collaborator, Author):

Ah that makes sense.

What data type is appropriate for the relative timestamps? (Are they just unbounded?) I'm tempted to just choose a single dtype that is universally appropriate (though, depending on the implications of that, we can certainly make it check).

jesteria (Collaborator, Author):

This is now resolved.

At least for now, we seem to do fine by treating them as the default int at first (generally int64) → float (due to NaN) → int8+ (either int8 or larger, as allowed by downcast).

I've checked that this does indeed save a lot of RAM, and it's about as speedy as before.
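For illustration, a minimal sketch of that downcasting approach (the column names and fill value here are made up; the actual implementation may differ):

import numpy as np
import pandas as pd

# assume `df` holds aggregated nPrint rows read without a forced dtype:
# integer features containing NaN (padding from unequal-length rows)
# will have been promoted to float64
df = pd.DataFrame({
    'tcp_syn': [1.0, 0.0, np.nan],
    'rel_ts': [0.0, 70000.0, np.nan],  # may exceed the int8 range
})

# fill the padding (-1, per nPrint's convention for absent bits), then
# downcast each column to the smallest integer dtype that losslessly
# holds its values -- int8 where possible, larger where necessary
compact = df.fillna(-1).apply(pd.to_numeric, downcast='integer')

print(compact.dtypes)  # tcp_syn: int8, rel_ts: int32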

@@ -48,93 +55,113 @@ def normalize_npt(cls, npt_csv, path_input_base=None):
f"data result but stream was empty: {npt_csv}"
)

@staticmethod
@storeresults
jesteria (Collaborator, Author):

I devised this storeresults helper to exploit the workings of Python generators: it recreates, in a factored-out generator function, the way that an unfactored loop can both perform a filter/map and have side effects (by manipulating variables in a higher scope).

Indeed, that is what this method does internally, but now factored out in a useful, cleaner way. Thanks to the helper, the method can both provide an iterable stream of items via yield and deliver a separate, final report via return.

In summary, the helper wraps the iterator resulting from invocation of the generator method, such that it can be iterated as normal, and such that it features a new attribute, result, which is set to the value of the method's return.
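A minimal sketch of the pattern (the wrapper's core appears further down in this diff; the surrounding scaffolding here is illustrative):

import functools

def storeresults(func):
    """Wrap a generator function such that its return value is captured."""
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        return _ResultIterator(func(*args, **kwargs))
    return wrapped

class _ResultIterator:

    def __init__(self, iterator):
        self.iterator = iterator
        self.result = None

    def __iter__(self):
        # per PEP 380, `yield from` evaluates to the generator's return value
        self.result = yield from self.iterator

@storeresults
def doubled(values):
    count = 0
    for value in values:
        count += 1
        yield value * 2
    return count  # delivered via the wrapper's `result` attribute

stream = doubled([1, 2, 3])
print(list(stream))   # [2, 4, 6]
print(stream.result)  # 3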

Comment on lines 70 to 75
if header is None:
header = np.genfromtxt(npt_file, delimiter=',', max_rows=1, dtype=str)
usecols = np.arange(1, len(header)) # ignore data index (src_ip)
skip_header = 1 if isinstance(npt_file, (str, pathlib.Path)) else 0
else:
skip_header = 1
jesteria (Collaborator, Author):

Bear in mind that this branching is just an optimization (and perhaps a superfluous one): it ensures the header is grabbed only once, on the first item.

JordanHolland (Collaborator) commented Apr 8, 2021:

Clever. Instead of branching, why not grab it before the loop and avoid the check altogether?

jesteria (Collaborator, Author):

Sure, I thought the same. That's extra too 🤷 😸

jesteria (Collaborator, Author):

Had the same thought, and easily gave in 😸

Like I said, it's actually in some ways more complicated this way (more lines of code anyway, particularly to handle a lazy iterator, a file descriptor, etc.); but, in another way, it's more straightforward, and it has no performance relationship with the length of the iterator (not that that should've mattered anyway).
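For reference, the shape of that hoisted version might look like the following (a sketch only, assuming npt_csv is the incoming iterable of nPrint paths/file objects):

import itertools
import numpy as np

npt_files = iter(npt_csv)     # npt_csv: iterable of paths/file objects
first_file = next(npt_files)  # (empty-stream handling elided)

header = np.genfromtxt(first_file, delimiter=',', max_rows=1, dtype=str)
usecols = np.arange(1, len(header))  # ignore data index (src_ip)

# re-chain the first item for the main loop; note that a true file
# object has now advanced past its header line, whereas a path will be
# reopened from the top (still requiring skip_header=1) -- the extra
# handling alluded to above
for npt_file in itertools.chain([first_file], npt_files):
    ...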

dtype=NPT_DTYPE,
)

npt_dim1 = len(npt.shape) == 1
jesteria (Collaborator, Author):

If the nPrint result is only two rows (one of them the header), then genfromtxt will return a one-dimensional array; if there are multiple data rows, it'll return a two-dimensional array 🤷 …

JordanHolland (Collaborator):

This makes sense, I think?

jesteria (Collaborator, Author):

Yep 👍
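(A quick illustration of that genfromtxt behavior:)

import io
import numpy as np

single = io.StringIO('src_ip,f0,f1\n10.0.0.1,0,1\n')
multi = io.StringIO('src_ip,f0,f1\n10.0.0.1,0,1\n10.0.0.2,1,0\n')

print(np.genfromtxt(single, delimiter=',', skip_header=1, usecols=(1, 2)).shape)  # (2,)
print(np.genfromtxt(multi, delimiter=',', skip_header=1, usecols=(1, 2)).shape)   # (2, 2)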


npt_flat = npt if npt_dim1 else npt.ravel()

yield list(itertools.chain([file_index], npt_flat))
jesteria (Collaborator, Author):

As discussed elsewhere, this is how we then smuggle our row of index+data through into Pandas (despite its heterogeneous typing).

It's entirely possible that this can be further optimized (either for performance or simplicity), but it seems to work pretty well as is.
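(By way of illustration, with made-up values:)

import itertools
import pandas as pd

file_index = 'pcaps/stream-0.npt'  # hypothetical index value (a path)
npt_flat = [0, 1, 1, 0]            # flattened int8 feature row

row = list(itertools.chain([file_index], npt_flat))

# a list of such rows passes into pandas despite the mixed str/int typing
df = pd.DataFrame([row]).set_index(0)
print(df)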


# though column names might not be TOO important, they may be helpful.
# here we flatten these as well from the maximum size we might require:
(header, max_length) = npts_flat.result
jesteria (Collaborator, Author):

(And here we make use of that factored-out for-loop's side-effect 😉)
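(For instance, the flattened names might be generated along these lines; the exact naming scheme here is illustrative:)

header = ['tcp_syn', 'tcp_ack']  # per-packet feature names (made up)
max_length = 3                   # maximum packet count across all npts

columns = [f'{name}_{packet}'
           for packet in range(max_length)
           for name in header]
# ['tcp_syn_0', 'tcp_ack_0', 'tcp_syn_1', 'tcp_ack_1', 'tcp_syn_2', 'tcp_ack_2']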

stream of file objects, to indicate their common base path (even
if this is virtual), such that they may be matched with the
label index.
`path_input_base` is suggested when `npt_csv` specifies a stream
jesteria (Collaborator, Author):

(You'll see below that I simplified some previous work such that some of this path_input_base stuff is no longer required.)

@@ -8,7 +8,6 @@
import re
import sys
import textwrap
import time
jesteria (Collaborator, Author):

To be clear, the work in this file is "optimization" but really largely clean-up.

        self.result = None

    def __iter__(self):
        self.result = yield from self.iterator
jesteria (Collaborator, Author):

(The storeresults wrapper. …For whatever reason, modern Python offers this feature but hasn't brought it to as high a level as this helper does.)
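(The underlying language feature, for reference:)

def produce():
    yield 1
    return 'final report'  # a generator's return value

def consume():
    report = yield from produce()  # captured here, per PEP 380
    print('captured:', report)

list(consume())  # prints: captured: final report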

class _NamedIO:

    def __repr__(self):
        return f'<{self.__class__.__name__}: {self.name}>'
jesteria (Collaborator, Author):

May as well have a useful representation for logging.
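(A hypothetical use of the mixin, for illustration; the subclass here is mine:)

import io

class _NamedBytesIO(_NamedIO, io.BytesIO):

    def __init__(self, name, initial_bytes=b''):
        self.name = name
        super().__init__(initial_bytes)

buffer = _NamedBytesIO('pcaps/stream-0.npt')
print(buffer)  # <_NamedBytesIO: pcaps/stream-0.npt>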

@jesteria jesteria marked this pull request as ready for review April 13, 2021 20:09
jesteria (Collaborator, Author) commented:

This branch now includes the benchmarking work of #58 and as such resolves #56.

jesteria (Collaborator, Author) commented:

Issue #59 has been created and as such this work resolves #54.

jesteria (Collaborator, Author) commented Apr 13, 2021:

This branch features the following (rough) performance profile (as established by Benchmark #29):

net = [ 0.478609561920166, 0.492505204,]
generate_npts = [ 141.15912246704102, 142.682466065,]
label = [ 251.8321077823639, 253.28944440700002,]
learn = [ 209.753977060318, 229.66684026500002,]
total = [ 462.16988372802734, 483.53657390899997,]

This is roughly two and a half times as fast overall as what's in main (as established by Benchmark #22):

net = [ 1.1385128498077393, 1.1894039999999997,]
generate_npts = [ 265.42256808280945, 269.595662,]
label = [ 978.5154480934143, 982.064151,]
learn = [ 213.12949800491333, 245.99847499999998,]
total = [ 1192.977017879486, 1229.371341,]

(Note: the first number is the real time duration and the second the process time, both in seconds.)

also streamlined communication of npt "file" paths (really pcap file
paths) between Net and Label steps -- Net now sets referential/relative
path names on the in-memory buffers it returns, such that these may
match the labels' index, without requiring further munging --
(specifically the `path_input_base` is now optional).

---

ensures features use smallest appropriate dtype

forcing int8 is inappropriate as data may include a relative timestamp
column.

moreover, typing is initially mixed up by presence of NaN (floats) in
unequal-length rows.

instead, (for now), data is simply downcast as appropriate as a final
step.

---

resolves #54
Ensures support for macOS; Ubuntu 16.04, 18.04 & 20.04.
(Windows attempted and given up.)

Further summary of changes:

* write meta file meta.toml to results directory

  contains pre-existing nprint.cfg information and adds pipeline timing
  information.

* added support files for benchmarking and manual testing against snowflake
  fingerprintability dataset

* permit nprint-install to force thru whereis-reported missing dependencies

  at least on macOS it appears difficult to get whereis to do anything useful

* add argp to list of nPrint dependencies

* workflows ...

  * benchmark python3.8 on ubuntu-latest & macos-latest via snowflake
    dataset
  * attach benchmark timing in meta.toml as workflow artifact
  * cache snowflake data
  * install libpcap, libargp as needed
  * use space instead of = to ensure shell recognizes/expands ~
@jesteria jesteria force-pushed the jsl/54-label-agg-optimization branch from b9fe473 to 21828cc on April 13, 2021 20:38
@jesteria jesteria merged commit 7833646 into main Apr 13, 2021
@jesteria jesteria deleted the jsl/54-label-agg-optimization branch April 13, 2021 20:39