Error using pickle #20
Thanks for posting this. There's basically no support for pickling right now, but it's a reasonable thing to add.
Hi, is it something that would be complicated to add?
Not really, but I have zero time to do it. Maybe over the summer, but not any time soon. Sorry.
No problem, I understand.
I am using a naive way to do this in my project:

```python
import joblib
import pytricia

def save_tree(tree, file_name):
    # Copy the trie into a plain dict so it can be serialised by joblib.
    tree_saver = {}
    for prefix in tree:
        tree_saver[prefix] = tree[prefix]
    joblib.dump(tree_saver, file_name)

def load_tree(file_name):
    # Rebuild the trie by reinserting every saved prefix.
    tree_saver = joblib.load(file_name)
    tree = pytricia.PyTricia()
    for prefix in tree_saver.keys():
        tree[prefix] = tree_saver[prefix]
    return tree
```
Hello, is this the only way to serialise an instantiated tree? I have a tree that takes about 30 seconds to build, so I would be keen to be able to serialise it for future use. Or is there a best practice to bulk import? Thanks
Yes, for now it is. The routeviews file is used by
Thanks for the reply, and glad to see you're still active on the project. The dataset I am currently using is about 9 million entries; it takes about 30 seconds to build, but only a few seconds to run a set of 2 million IPs against it.

Do you have any pointers as to where you'd start to implement such a solution? Alternatively, being able to serialise it to a binary file and then load this for future use would also work; it wouldn't need to be pickle necessarily. Pytricia is amazing for very fast lookups, and works very nicely with the apply function in Pandas across very large datasets.

Another advantage of pickle, or of any serialise/deserialise method, is that the trie construction could be run across multiple cores, with membership of each sub-trie checked individually. Ideally the sub-tries (e.g. one per /8) could be reparented, but this would be a little more involved.

I'd like to be able to help implement these, but some pointers would be very helpful. Thanks!
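As an illustration of the multi-core idea above, here is a rough sketch under stated assumptions (the helper names and the /8 split are mine, not part of pytricia). Because PyTricia objects themselves can't be pickled, each worker builds its own sub-trie for one /8 and returns only plain Python results, which are picklable:

```python
from collections import defaultdict
from multiprocessing import Pool

import pytricia

def _first_octet(addr):
    # Crude partition key: the first octet of an IPv4 prefix or address.
    return addr.split(".", 1)[0]

def _lookup_chunk(args):
    # Build a sub-trie for one /8 worth of prefixes, then resolve the query
    # IPs that fall into the same /8. Only plain tuples cross the process
    # boundary, so nothing un-picklable needs to be transferred.
    prefixes, ips = args
    subtrie = pytricia.PyTricia()
    for prefix, value in prefixes:
        subtrie[prefix] = value
    return [(ip, subtrie.get(ip)) for ip in ips]

def parallel_lookup(prefix_items, query_ips, workers=4):
    # Partition both the prefixes and the query IPs by first octet.
    by_octet = defaultdict(lambda: ([], []))
    for prefix, value in prefix_items:
        by_octet[_first_octet(prefix)][0].append((prefix, value))
    for ip in query_ips:
        by_octet[_first_octet(ip)][1].append(ip)
    # Note: on platforms that use "spawn", call this from under
    # `if __name__ == "__main__":`.
    with Pool(workers) as pool:
        results = pool.map(_lookup_chunk, list(by_octet.values()))
    return [pair for chunk in results for pair in chunk]
```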
I'd have to spend some time thinking about different ways to serialize and the tradeoffs. Not sure I'd have any meaningful pointers at this time, but I appreciate you asking! I would likely be able to spend some time on this starting in late October.
Thanks, that timeframe sounds good. I've been playing around with a pure-Python radix tree to look at possible performance enhancements, and there are a few potential avenues to improve bulk performance.

The time-consuming part is building the tree, and this is CPU-bound by the Python GIL. Splitting the input by prefix would allow the work to be divided amongst cores using multiprocessing, e.g. a tree for each /8 (or similar, depending on the tree distribution), with a simple lookup table to map an address to the appropriate trie. The next step from this would be to parent these sub-trees to the root node (and its immediate children). To be able to do this the user would need to sort the data, but that is fast using something like Pandas (which itself allows workloads to be parallelised using tools like Dask).

For very large trees, I think it is reasonable for the user to use the trie as an index lookup. This simplifies the nodes to only needing to store an integer key, rather than needing to handle arbitrary Python data structures, and would give very fast tree-building performance, relatively easily, even for very large datasets.

If the user provides the data sorted, a further optimisation would be to cache the parent node. In my pure-Python experiment I cached at the classic octet boundaries, keeping a lookup dictionary of the /8, /16 and /24 nodes as pointers to those specific nodes. This removed the many redundant tree traversals that occur when the prefixes being inserted share a close common ancestor.

These options would significantly increase performance for very large datasets and, based on my pure-Python experiments, shouldn't be too much new code to write (I'd do it myself if I were more familiar with C). Given that building the tree is time-consuming while saving/loading/searching are fast, being able to store the trie to a file would allow multiple tries (such as for multiple routers) to be loaded and unloaded as necessary, taking the CPU hit to build each tree only once. These improvements would really open up the PyData ecosystem to powerful network analytics.

This has strayed way beyond the initial pickle discussion; happy to take it to a new thread if that's preferable. Thanks!

Edit: actually the bulk import/bulk cross-referencing speedup could be simplified: this would save repeated lookup steps if there is clustering, which happens when the source (build) or lookup datasets are large and sorted. Sorting is cheap (and can be parallelised), so this would exploit the inherent locality of very large datasets.
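To make the index-lookup idea above concrete, here is a minimal sketch (the DataFrame columns and pandas usage are illustrative assumptions, not part of pytricia): the trie stores only integer row indices, while the full payload stays in a DataFrame.

```python
import pandas as pd
import pytricia

# Payloads live in a DataFrame; the trie only maps prefix -> row index.
routes = pd.DataFrame({
    "prefix": ["10.0.0.0/8", "10.1.0.0/16", "192.168.0.0/16"],
    "origin_as": [64496, 64497, 64498],
})

trie = pytricia.PyTricia()
for idx, prefix in routes["prefix"].items():
    trie[prefix] = idx  # integer key only; cheap to build and to serialise

ips = pd.Series(["10.1.2.3", "192.168.5.9", "8.8.8.8"])
idx = ips.map(trie.get)  # longest-prefix match; None when no covering prefix
print(pd.DataFrame({
    "ip": ips,
    "origin_as": [None if i is None else routes.at[i, "origin_as"] for i in idx],
}))
```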
Hey, are there workarounds to save the tree to disk?
I found another bug, related this time to pickling pytricia objects: attempting to pickle a PyTricia instance gives a Segmentation fault (core dumped). I'm using Python 3.
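A minimal reproduction consistent with this report might look like the following (a hypothetical snippet, not the original poster's code; it assumes the standard pickle module and a freshly built trie):

```python
import pickle
import pytricia

pyt = pytricia.PyTricia()
pyt["10.0.0.0/8"] = "a"
pyt["192.168.0.0/16"] = "b"

# With no pickle support in the C extension, this is where the reported
# crash (or, at best, a meaningless serialisation) would occur.
data = pickle.dumps(pyt)
```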