speed/time-optimization of label-aggregation step #57
Conversation
from . import LabelAggregator, AggregationLengthError, AggregationPathError

NPT_DTYPE = np.dtype('int8')
In my quick consideration, this is a reasonable optimization of NumPy/Pandas's handling of nPrint data 🤷
This is a totally reasonable assumption, and right for 99.9% of the features. Unfortunately, we have a few features that could extend beyond `int8`, such as the relative timestamps of each incoming packet (`-R` in nPrint). Should we just look for this column and, if we don't find it, use `int8`?
Ah, that makes sense.
What data type is appropriate for the relative timestamps? (Are they just unbounded?) I'm tempted to just choose a single dtype that is universally appropriate (though, depending on the implications of that, we can certainly make it check).
This is now resolved.
At least for now, we seem to do fine by treating them as the default int at first (generally `int64`) → `float` (due to NaN) → `int8`+ (either `int8` or larger, as allowed by downcast).
I've checked that this does indeed save a lot of RAM, and it's about as speedy as before.
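For concreteness, a minimal sketch of that downcast-as-a-final-step flow (the column names and the NaN fill value here are illustrative assumptions, not the actual pipeline):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tcp_flag': [0, 1, np.nan],    # NaN forces this column to float64
    'rel_ts':   [0, 1500, 30000],  # relative timestamps exceed int8's range
})

df = df.fillna(-1)  # illustrative fill; the real pipeline's value may differ

# final step: downcast each column to the smallest integer dtype that fits
for column in df.columns:
    df[column] = pd.to_numeric(df[column], downcast='integer')

print(df.dtypes)  # tcp_flag -> int8; rel_ts -> int16
```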
@@ -48,93 +55,113 @@ def normalize_npt(cls, npt_csv, path_input_base=None):
                f"data result but stream was empty: {npt_csv}"
            )

    @staticmethod
    @storeresults
I devised this helper, `storeresults`, to exploit the workings of Python generators: to recreate, in a factored-out generator function, the way in which unfactored loops can both perform a filter/map and have side effects (by manipulating variables in a higher scope).
Indeed, this is what this method does internally, but now factored out in a useful, cleaner way. Thanks to the helper, it can both provide an iterable stream of items via `yield` and deliver a separate, final "report" via `return`.
In summary, the helper wraps the iterator resulting from invocation of the generator method, such that it can be iterated as normal, and such that it features a new attribute, `result`, which is set to the value of the method's `return`.
if header is None:
    header = np.genfromtxt(npt_file, delimiter=',', max_rows=1, dtype=str)
    usecols = np.arange(1, len(header))  # ignore data index (src_ip)
    skip_header = 1 if isinstance(npt_file, (str, pathlib.Path)) else 0
else:
    skip_header = 1
Bear in mind that this branching is just an optimization (and perhaps superfluous): it grabs the header only once, on the first item.
Clever. Instead of branching, why not grab it before the loop and avoid the check altogether?
Sure, thought the same. That's extra too 🤷 😸
Had the same thought, and easily gave in 😸
Like I said, it's actually in some ways more complicated this way (more lines of code, anyway, particularly to handle a lazy iterator, a file descriptor, etc.); but, in another way, it's more straightforward, and has no performance relationship with the length of the iterator (not that that should've mattered anyway).
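For illustration, a hedged sketch of the hoisted alternative being discussed (the function name and details are assumptions, not the actual code); it hints at the extra handling a lazy iterator and an already-consumed file object would require:

```python
import itertools
import pathlib
import numpy as np

def iter_npts(npt_sources):
    """Read the header once, before the main loop (illustrative only)."""
    npt_sources = iter(npt_sources)  # may be a lazy iterator
    first = next(npt_sources)
    header = np.genfromtxt(first, delimiter=',', max_rows=1, dtype=str)
    usecols = np.arange(1, len(header))  # ignore data index (src_ip)

    for (index, npt_file) in enumerate(itertools.chain([first], npt_sources)):
        # if the first source was an open file object, the header read
        # above already consumed its header row; paths are re-opened by
        # genfromtxt, and later sources haven't been touched, so both
        # still need the header row skipped
        if index == 0 and not isinstance(npt_file, (str, pathlib.Path)):
            skip_header = 0
        else:
            skip_header = 1

        yield np.genfromtxt(npt_file, delimiter=',', skip_header=skip_header,
                            usecols=usecols, dtype='int8')
```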
    dtype=NPT_DTYPE,
)

npt_dim1 = len(npt.shape) == 1
If the nPrint result is only two rows (one being the header), then `genfromtxt` will return a one-dimensional array; or, if there are multiple data rows, then it'll return a two-dimensional array 🤷 ….
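A quick demonstration of that `genfromtxt` behavior (the data here is made up):

```python
import io
import numpy as np

one_data_row = io.StringIO("a,b,c\n1,2,3\n")
two_data_rows = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

npt1 = np.genfromtxt(one_data_row, delimiter=',', skip_header=1, dtype='int8')
npt2 = np.genfromtxt(two_data_rows, delimiter=',', skip_header=1, dtype='int8')

print(npt1.shape)  # (3,)   -- a single data row yields a 1-D array
print(npt2.shape)  # (2, 3) -- multiple data rows yield a 2-D array
```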
This makes sense, I think?
Yep 👍
npt_flat = npt if npt_dim1 else npt.ravel()

yield list(itertools.chain([file_index], npt_flat))
As discussed elsewhere, this is how we then smuggle our row of index+data through into Pandas (despite its heterogeneous typing).
It's entirely possible that this can be further optimized (either for performance or simplicity), but it seems to work pretty well as is.
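A toy illustration of the scheme (the file names and values are made up): each yielded row chains a string index ahead of the flattened numeric data, and Pandas then splits the heterogeneous first column back out as the index:

```python
import itertools
import numpy as np
import pandas as pd

rows = [
    list(itertools.chain(['fileA.npt'], np.array([0, 1, 1], dtype='int8'))),
    list(itertools.chain(['fileB.npt'], np.array([1, 0, 1], dtype='int8'))),
]

df = pd.DataFrame(rows).set_index(0)  # first column becomes the index
print(df)
print(df.dtypes)  # the data columns remain numeric
```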
# though column names might not be TOO important they may be helpful.
# here we flatten these as well from the maximum size we might require:
(header, max_length) = npts_flat.result
(And here we make use of that factored-out for-loop's side-effect 😉)
stream of file objects, to indicate their common base path (even
if this is virtual), such that they may be matched with the
label index.
`path_input_base` is suggested when `npt_csv` specifies a stream
(You'll see below that I simplified some previous work such that some of this `path_input_base` stuff is no longer required.)
@@ -8,7 +8,6 @@
 import re
 import sys
 import textwrap
-import time
To be clear, the work in this file is nominally "optimization" but really largely clean-up.
        self.result = None

    def __iter__(self):
        self.result = yield from self.iterator
(The `storeresults` wrapper. …For whatever reason, modern Python offers this feature, but hasn't brought it to as high a level as this helper makes it.)
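(The language feature in question, for reference: since PEP 380, `yield from` evaluates to the delegated generator's return value.)

```python
def inner():
    yield 1
    yield 2
    return 'done'  # a generator's return value rides on StopIteration

def outer():
    result = yield from inner()  # evaluates to inner()'s return value
    print('inner returned:', result)

print(list(outer()))  # prints "inner returned: done", then [1, 2]
```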
class _NamedIO:

    def __repr__(self):
        return f'<{self.__class__.__name__}: {self.name}>'
May as well have a useful representation for logging.
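For instance, mixed into an in-memory buffer (a hypothetical sketch; the commit message below notes that Net sets referential path names on the buffers it returns, and the path here is invented):

```python
import io

class _NamedIO:

    def __repr__(self):
        return f'<{self.__class__.__name__}: {self.name}>'

# hypothetical mix-in use: an in-memory buffer carrying a referential path
class NamedBytesIO(_NamedIO, io.BytesIO):

    def __init__(self, initial_bytes=b'', name=''):
        super().__init__(initial_bytes)
        self.name = name

print(NamedBytesIO(name='nets/fileA.npt'))  # <NamedBytesIO: nets/fileA.npt>
```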
This branch features the following (rough) performance profile (as established by Benchmark #29):
This is approximately twice as fast as what's in
(Note: the first number is the real time duration and the second the process time, both in seconds.)
also streamlined communication of npt "file" paths (really pcap file paths) between Net and Label steps -- Net now sets referential/relative path names on the in-memory buffers it returns, such that these may match the labels' index, without requiring further munging -- (specifically, the `path_input_base` is now optional).

---

ensures features use smallest appropriate dtype

forcing int8 is inappropriate, as data may include a relative timestamp column. moreover, typing is initially mixed up by the presence of NaN (floats) in unequal-length rows. instead, (for now), data is simply downcast as appropriate as a final step.

---

resolves #54
Ensures support for macOS and Ubuntu 16.04, 18.04 & 20.04. (Windows attempted and given up.)

Further summary of changes:

* write meta file meta.toml to the results directory: contains pre-existing nprint.cfg information and adds pipeline timing information
* added support files for benchmarking and manual testing against the snowflake fingerprintability dataset
* permit nprint-install to force through whereis-reported missing dependencies (at least on macOS it appears difficult to get whereis to do anything useful)
* add argp to the list of nPrint dependencies
* workflows:
  * benchmark python3.8 on ubuntu-latest & macos-latest via the snowflake dataset
  * attach benchmark timing in meta.toml as a workflow artifact
  * cache snowflake data
  * install libpcap and libargp as needed
  * use a space instead of = to ensure the shell recognizes/expands ~
Compare b9fe473 to 21828cc
This work attempts to reduce the bottleneck of – or anyway the time spent on – the label-aggregation step, by optimizing its (single-threaded) code.
In particular, this involves ensuring that data are only processed/iterated the minimum number of times, and relying on lower-level and likely more optimized tools (namely NumPy over Pandas) where appropriate.
While this likely does not satisfy #54 in its theoretical entirety – more could perhaps be done with this single-threaded code, and it's foreseeable that some implementation of multiprocessing would help further – nonetheless, insofar as this work can be shown to alleviate the high-priority need, and if it is otherwise shown to be acceptable, we might want to merge it as a resolution to #54 (and open separate issues for further investigation of optimizations here).
Resolves #54.
Resolves #56.