speed/time-optimization of label-aggregation step #57

Merged
merged 2 commits into main from jsl/54-label-agg-optimization on Apr 13, 2021

Conversation

jesteria (Collaborator) commented Apr 7, 2021

This work attempts to reduce the bottleneck of – or anyway the time spent on – the label-aggregation step, by optimizing its (single-threaded) code.

In particular, this involves ensuring that data are only processed/iterated the minimum number of times, and relying on lower-level and likely more optimized tools (namely NumPy over Pandas) where appropriate.

While this likely does not satisfy #54 in its theoretical entirety – more could perhaps be done with this single-threaded code, and it's foreseeable that some implementation of multiprocessing would help further – nonetheless, insofar as this work can be shown to alleviate the high-priority need, and if it is otherwise shown to be acceptable, we might want to merge it as a resolution to #54 (and open separate issue(s) for further investigation of optimizations here).

Resolves #54.
Resolves #56.

@jesteria jesteria requested a review from JordanHolland April 7, 2021 16:13
@jesteria jesteria self-assigned this Apr 7, 2021
import numpy as np

from . import LabelAggregator, AggregationLengthError, AggregationPathError


NPT_DTYPE = np.dtype('int8')
jesteria (Collaborator, Author):

On quick consideration, this seems a reasonable optimization of NumPy/Pandas's handling of nPrint data 🤷

JordanHolland (Collaborator) commented Apr 8, 2021:

This is a totally reasonable assumption, and right for 99.9% of the features. Unfortunately, we have a few features that could extend beyond int8, such as the relative timestamp of each incoming packet (-R in nPrint). Should we just look for this column and, if we don't find it, use int8?

jesteria (Collaborator, Author):

Ah that makes sense.

What data type is appropriate for the relative timestamps? (Are they just unbounded?) I'm tempted to just choose a single dtype that is universally appropriate (though, depending on the implications of that, we can certainly make it check).

jesteria (Collaborator, Author):

This is now resolved.

At least for now, we seem to do fine by treating them as the default int at first (generally int64) → float (due to NaN) → int8+ (either int8 or larger, as allowed by downcast).

I've checked that this does indeed save a lot of RAM, and it's about as speedy as before.
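For illustration, a minimal sketch of that downcasting approach (the column names and fill value here are made up; the actual implementation may differ):

import numpy as np
import pandas as pd

# assume `df` holds aggregated nPrint rows read without a forced dtype:
# integer features containing NaN (padding from unequal-length rows)
# will have been promoted to float64
df = pd.DataFrame({
    'tcp_syn': [1.0, 0.0, np.nan],
    'rel_ts': [0.0, 70000.0, np.nan],  # may exceed the int8 range
})

# fill the padding (-1, per nPrint's convention for absent bits), then
# downcast each column to the smallest integer dtype that losslessly
# holds its values -- int8 where possible, larger where necessary
compact = df.fillna(-1).apply(pd.to_numeric, downcast='integer')

print(compact.dtypes)  # tcp_syn: int8, rel_ts: int32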

@@ -48,93 +55,113 @@ def normalize_npt(cls, npt_csv, path_input_base=None):
f"data result but stream was empty: {npt_csv}"
)

@staticmethod
@storeresults
jesteria (Collaborator, Author):

I devised this storeresults helper to exploit the workings of Python generators: it recreates, in a factored-out generator function, the way that an unfactored loop can both perform a filter/map and have side effects (by manipulating variables in a higher scope).

Indeed, that is what this method does internally, but now factored out in a useful, cleaner way. Thanks to the helper, the method can both provide an iterable stream of items via yield and deliver a separate, final report via return.

In summary, the helper wraps the iterator resulting from invocation of the generator method, such that it can be iterated as normal, and such that it features a new attribute, result, which is set to the value of the method's return.
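A minimal sketch of the pattern (the wrapper's core appears further down in this diff; the surrounding scaffolding here is illustrative):

import functools

def storeresults(func):
    """Wrap a generator function such that its return value is captured."""
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        return _ResultIterator(func(*args, **kwargs))
    return wrapped

class _ResultIterator:

    def __init__(self, iterator):
        self.iterator = iterator
        self.result = None

    def __iter__(self):
        # per PEP 380, `yield from` evaluates to the generator's return value
        self.result = yield from self.iterator

@storeresults
def doubled(values):
    count = 0
    for value in values:
        count += 1
        yield value * 2
    return count  # delivered via the wrapper's `result` attribute

stream = doubled([1, 2, 3])
print(list(stream))   # [2, 4, 6]
print(stream.result)  # 3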

Comment on lines 70 to 75
if header is None:
header = np.genfromtxt(npt_file, delimiter=',', max_rows=1, dtype=str)
usecols = np.arange(1, len(header)) # ignore data index (src_ip)
skip_header = 1 if isinstance(npt_file, (str, pathlib.Path)) else 0
else:
skip_header = 1
jesteria (Collaborator, Author):

Bear in mind that this branching is just an optimization (and perhaps a superfluous one): it ensures the header is grabbed only once, on the first item.

JordanHolland (Collaborator) commented Apr 8, 2021:

Clever. Instead of branching, why not grab it before the loop and avoid the check altogether?

jesteria (Collaborator, Author):

Sure, I thought the same. That's extra too 🤷 😸

jesteria (Collaborator, Author):

Had the same thought, and easily gave in 😸

Like I said, it's actually in some ways more complicated this way (more lines of code anyway, particularly to handle a lazy iterator, a file descriptor, etc.); but, in another way, it's more straightforward, and it has no performance relationship with the length of the iterator (not that that should've mattered anyway).
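For reference, the shape of that hoisted version might look like the following (a sketch only, assuming npt_csv is the incoming iterable of nPrint paths/file objects):

import itertools
import numpy as np

npt_files = iter(npt_csv)     # npt_csv: iterable of paths/file objects
first_file = next(npt_files)  # (empty-stream handling elided)

header = np.genfromtxt(first_file, delimiter=',', max_rows=1, dtype=str)
usecols = np.arange(1, len(header))  # ignore data index (src_ip)

# re-chain the first item for the main loop; note that a true file
# object has now advanced past its header line, whereas a path will be
# reopened from the top (still requiring skip_header=1) -- the extra
# handling alluded to above
for npt_file in itertools.chain([first_file], npt_files):
    ...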

dtype=NPT_DTYPE,
)

npt_dim1 = len(npt.shape) == 1
jesteria (Collaborator, Author):

If the nPrint result is only two rows (one of them the header), then genfromtxt will return a one-dimensional array; if there are multiple data rows, it'll return a two-dimensional array 🤷 …

JordanHolland (Collaborator):

This makes sense, I think?

jesteria (Collaborator, Author):

Yep 👍
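(A quick illustration of that genfromtxt behavior:)

import io
import numpy as np

single = io.StringIO('src_ip,f0,f1\n10.0.0.1,0,1\n')
multi = io.StringIO('src_ip,f0,f1\n10.0.0.1,0,1\n10.0.0.2,1,0\n')

print(np.genfromtxt(single, delimiter=',', skip_header=1, usecols=(1, 2)).shape)  # (2,)
print(np.genfromtxt(multi, delimiter=',', skip_header=1, usecols=(1, 2)).shape)   # (2, 2)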


npt_flat = npt if npt_dim1 else npt.ravel()

yield list(itertools.chain([file_index], npt_flat))
jesteria (Collaborator, Author):

As discussed elsewhere, this is how we then smuggle our row of index+data through into Pandas (despite its heterogeneous typing).

It's entirely possible that this can be further optimized (either for performance or simplicity), but it seems to work pretty well as is.
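(By way of illustration, with made-up values:)

import itertools
import pandas as pd

file_index = 'pcaps/stream-0.npt'  # hypothetical index value (a path)
npt_flat = [0, 1, 1, 0]            # flattened int8 feature row

row = list(itertools.chain([file_index], npt_flat))

# a list of such rows passes into pandas despite the mixed str/int typing
df = pd.DataFrame([row]).set_index(0)
print(df)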


# though column names might not be TOO important, they may be helpful.
# here we flatten these as well from the maximum size we might require:
(header, max_length) = npts_flat.result
jesteria (Collaborator, Author):

(And here we make use of that factored-out for-loop's side-effect 😉)
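(For instance, the flattened names might be generated along these lines; the exact naming scheme here is illustrative:)

header = ['tcp_syn', 'tcp_ack']  # per-packet feature names (made up)
max_length = 3                   # maximum packet count across all npts

columns = [f'{name}_{packet}'
           for packet in range(max_length)
           for name in header]
# ['tcp_syn_0', 'tcp_ack_0', 'tcp_syn_1', 'tcp_ack_1', 'tcp_syn_2', 'tcp_ack_2']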

stream of file objects, to indicate their common base path (even
if this is virtual), such that they may be matched with the
label index.
`path_input_base` is suggested when `npt_csv` specifies a stream
jesteria (Collaborator, Author):

(You'll see below that I simplified some previous work such that some of this path_input_base stuff is no longer required.)

@@ -8,7 +8,6 @@
import re
import sys
import textwrap
import time
jesteria (Collaborator, Author):

To be clear, the work in this file is "optimization" but really largely clean-up.

        self.result = None

    def __iter__(self):
        self.result = yield from self.iterator
jesteria (Collaborator, Author):

(The storeresults wrapper. …For whatever reason, modern Python offers this feature but hasn't brought it to as high a level as this helper does.)
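(The underlying language feature, for reference:)

def produce():
    yield 1
    return 'final report'  # a generator's return value

def consume():
    report = yield from produce()  # captured here, per PEP 380
    print('captured:', report)

list(consume())  # prints: captured: final report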

class _NamedIO:

    def __repr__(self):
        return f'<{self.__class__.__name__}: {self.name}>'
jesteria (Collaborator, Author):

May as well have a useful representation for logging.
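(A hypothetical use of the mixin, for illustration; the subclass here is mine:)

import io

class _NamedBytesIO(_NamedIO, io.BytesIO):

    def __init__(self, name, initial_bytes=b''):
        self.name = name
        super().__init__(initial_bytes)

buffer = _NamedBytesIO('pcaps/stream-0.npt')
print(buffer)  # <_NamedBytesIO: pcaps/stream-0.npt>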

@jesteria jesteria marked this pull request as ready for review April 13, 2021 20:09
jesteria (Collaborator, Author) commented:

This branch now includes the benchmarking work of #58 and as such resolves #56.

jesteria (Collaborator, Author) commented:

Issue #59 has been created and as such this work resolves #54.

jesteria (Collaborator, Author) commented Apr 13, 2021:

This branch features the following (rough) performance profile (as established by Benchmark #29):

net = [ 0.478609561920166, 0.492505204,]
generate_npts = [ 141.15912246704102, 142.682466065,]
label = [ 251.8321077823639, 253.28944440700002,]
learn = [ 209.753977060318, 229.66684026500002,]
total = [ 462.16988372802734, 483.53657390899997,]

This is roughly two and a half times as fast overall as what's in main (as established by Benchmark #22):

net = [ 1.1385128498077393, 1.1894039999999997,]
generate_npts = [ 265.42256808280945, 269.595662,]
label = [ 978.5154480934143, 982.064151,]
learn = [ 213.12949800491333, 245.99847499999998,]
total = [ 1192.977017879486, 1229.371341,]

(Note: the first number is the real time duration and the second the process time, both in seconds.)

also streamlined communication of npt "file" paths (really pcap file
paths) between Net and Label steps -- Net now sets referential/relative
path names on the in-memory buffers it returns, such that these may
match the labels' index, without requiring further munging --
(specifically the `path_input_base` is now optional).

---

ensures features use smallest appropriate dtype

forcing int8 is inappropriate as data may include a relative timestamp
column.

moreover, typing is initially mixed up by presence of NaN (floats) in
unequal-length rows.

instead, (for now), data is simply downcast as appropriate as a final
step.

---

resolves #54
Ensures support for macOS; Ubuntu 16.04, 18.04 & 20.04.
(Windows attempted and given up.)

Further summary of changes:

* write meta file meta.toml to results directory

  contains pre-existing nprint.cfg information and adds pipeline timing
  information.

* added support files for benchmarking and manual testing against snowflake
  fingerprintability dataset

* permit nprint-install to force thru whereis-reported missing dependencies

  at least on macOS it appears difficult to get whereis to do anything useful

* add argp to list of nPrint dependencies

* workflows ...

  * benchmark python3.8 on ubuntu-latest & macos-latest via snowflake
    dataset
  * attach benchmark timing in meta.toml as workflow artifact
  * cache snowflake data
  * install libpcap, libargp as needed
  * use space instead of = to ensure shell recognizes/expands ~
@jesteria jesteria force-pushed the jsl/54-label-agg-optimization branch from b9fe473 to 21828cc on April 13, 2021 20:38
@jesteria jesteria merged commit 7833646 into main Apr 13, 2021
@jesteria jesteria deleted the jsl/54-label-agg-optimization branch April 13, 2021 20:39