Making tldextract serializable and faster with Trie dictionary #339
First of all, let me say I'm a huge fan of @john-kurkowski's tldextract. I find it to be critical in my work with the Common Crawl dataset and other projects.

I have found, quite by accident, that the package is not serializable, but I believe it could be modified quite easily to be so, and in doing so I think the lookup function could be sped up by ~20% or so. Serializability could be important for big data projects using Spark broadcast or other distributed processing beyond a single core.

Here is what I'm seeing: serializing the extractor's trie errors out, both with json and when Spark tries to broadcast it (a repro follows below). This seems to be because the underlying Trie is a custom class.

This could be resolved in several ways, e.g. converting the Trie to a pure dict or subclassing dict. One untested way to do this would likely require no additional changes to the private calling class.

If this is of sufficient interest, I'd be glad to provide a PR.
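For concreteness, the failure can be reproduced along these lines (this sketch uses the private attributes discussed later in the thread, so it is illustrative rather than supported API):

```python
import json

import tldextract

extract = tldextract.TLDExtract()
extract("example.com")  # trigger the lazy load of the suffix list

trie = extract._extractor.tlds_incl_private_trie  # private API, per this thread
try:
    json.dumps(trie)
except TypeError as err:
    print(err)  # Object of type Trie is not JSON serializable
```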
Thank you for the kind words! I'm open to this. It sounds like not much change. While I like the idea, /cc @elliotwutingfeng, who wrote the trie.
Would be fairly straightforward. Something like this:
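A minimal sketch of the idea, assuming the node flags stay as attributes and the children move into the mapping itself (illustrative names, not the library's actual code):

```python
class Trie(dict):
    """Trie node that is itself a dict of label -> child Trie."""

    def __init__(self, end: bool = False, is_private: bool = False) -> None:
        super().__init__()
        self.end = end
        self.is_private = is_private

    def add_suffix(self, suffix: str, is_private: bool = False) -> None:
        """Insert a suffix, storing its labels in reverse order."""
        node = self
        for label in reversed(suffix.split(".")):
            node = node.setdefault(label, Trie())
        node.end = True
        node.is_private = is_private
```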
And:
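Continuing the sketch above, usage and serialization might look like this. Note that json.dumps on a dict subclass emits only the mapping entries and silently drops the end/is_private attributes, which is one of the tradeoffs to weigh:

```python
import json

trie = Trie()
trie.add_suffix("com")
trie.add_suffix("com.cn")

# Labels are stored in reverse order; attributes are lost in the JSON output.
print(json.dumps(trie))  # {"com": {}, "cn": {"com": {}}}
```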
The case for subclassing would be if you want built-in methods for search, adding matches, etc. Those methods would only be available at the root of the Trie, and they could help simplify subsequent use of the Trie (for example in your calling class that uses the trie during the extract process). Also, this does seem Pythonic; Python gives an example of this in its UserDict class. The case against it is probably backward compatibility, as subclassing dict is a newer Python thing. But I could go either way on this.
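If backward compatibility is the concern, a collections.UserDict variant is the long-standing alternative. A sketch under the same illustrative assumptions, with one caveat: unlike a dict subclass, a UserDict is not directly JSON-serializable, so the encoder needs to be pointed at the underlying data:

```python
import json
from collections import UserDict


class Trie(UserDict):
    """Trie node backed by UserDict; children live in self.data."""

    def __init__(self, end: bool = False) -> None:
        super().__init__()
        self.end = end


root = Trie()
root["com"] = Trie(end=True)

# json falls back to `default` for each UserDict node, recursing into .data.
# As with the dict subclass, the end attribute is dropped from the output.
print(json.dumps(root, default=lambda node: node.data))  # {"com": {}}
```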
I guess my worry about subclassing builtins came from posts like this SO answer and this blog. There are certainly nuances to subclassing. However, in this library's trie's case, it looks like we won't trip over the majority of gotchas, which concern juggling overridden methods, because we won't be overriding any methods.
If you tackle this, I have a couple of requests, please! 🙏
I think that's right. But I do wonder about backward compatibility; as I understand it, subclassing dict is a newer Python thing. However, to your blog's point, subclassing UserDict has been around a long time, so that may be the answer. As for benchmarking, my simple premise is that lookup from a dictionary is faster than lookup from a custom class, as it leverages the underlying C implementation. Some links above suggested as much as 20%. I think I've seen an example test setup on tldextract where it's been benchmarked before. Can you point me to that? I'll queue this up. May take me a little while to get back on it.
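One quick way to sanity-check that premise is a micro-benchmark of the two lookup shapes (a hypothetical setup, not the project's benchmark suite):

```python
import timeit


class Node:
    """Stand-in for the current Trie: children behind an attribute."""

    def __init__(self) -> None:
        self.matches = {"com": None}


obj = Node()
d = {"com": None}

# Attribute hop plus dict lookup vs. plain dict lookup, per million calls.
print("attr + dict:", timeit.timeit(lambda: obj.matches["com"]))
print("plain dict: ", timeit.timeit(lambda: d["com"]))
```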
Fair request! #283's description has some example input data to pass to the extractor.
@john-kurkowski, I've started this process with a fork and I've made progress on a "NewTrie" while keeping the "OldTrie" in the source so that I can benchmark the two. Although they seem intuitive, I'm realizing I don't have a full understanding of how you use the properties (matches, end, is_private). One question: do you maintain two tries, one that excludes private PSL suffixes and one that includes them, all refreshed from your cache/update mechanisms?
Thanks, Lee
I see you dug deep on this! Hmm. Maybe your recursion isn't going deep enough to reach a private node?

```python
import tldextract

extract = tldextract.TLDExtract(include_psl_private_domains=True)
extract("any.domain")

trie = extract._extractor.tlds_incl_private_trie
print(trie.matches["com"].matches["amazonaws"].matches["s3"].is_private)  # True I think
```
I'm just using
@john-kurkowski and @leeprevost Consider this private suffix: `s3.cn-north-1.amazonaws.com.cn`.

From https://publicsuffix.org/list/public_suffix_list.dat as of 13 November 2024, the following suffixes are valid:

- `cn`
- `com.cn`
- `s3.cn-north-1.amazonaws.com.cn`

But not these intermediates:

- `amazonaws.com.cn`
- `cn-north-1.amazonaws.com.cn`

The intermediate nodes exist in the trie but are not marked as ends. So the eTLD for something like `foo.cn-north-1.amazonaws.com.cn` would be recognized as `com.cn`.

```python
import tldextract

tlde = tldextract.TLDExtract()
tlde._get_tld_extractor()

trie = tlde._extractor.tlds_incl_private_trie
print(trie.matches['cn'].end)  # True
print(trie.matches['cn'].matches['com'].end)  # True
print(trie.matches['cn'].matches['com'].matches['amazonaws'].end)  # False
print(trie.matches['cn'].matches['com'].matches['amazonaws'].matches['cn-north-1'].end)  # False
print(trie.matches['cn'].matches['com'].matches['amazonaws'].matches['cn-north-1'].matches['s3'].end)  # True
```

The relevant test cases can be found at line 555 in 100da81.
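In other words, suffix matching walks the trie label by label and keeps the deepest node flagged end=True. A sketch of that walk, assuming the matches/end attributes shown above and ignoring the PSL's wildcard and exception rules:

```python
def longest_suffix(trie, domain: str) -> str:
    """Return the longest listed suffix of domain, per the end flags."""
    labels = domain.split(".")
    node = trie
    best = 0  # label count of the deepest match with end=True
    for depth, label in enumerate(reversed(labels), start=1):
        if label not in node.matches:
            break
        node = node.matches[label]
        if node.end:
            best = depth
    return ".".join(labels[len(labels) - best:])


# longest_suffix(trie, "foo.cn-north-1.amazonaws.com.cn")
#   -> "com.cn" (the intermediates are not ends)
# longest_suffix(trie, "foo.s3.cn-north-1.amazonaws.com.cn")
#   -> "s3.cn-north-1.amazonaws.com.cn"
```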
OK @elliotwutingfeng, that makes sense. I was mistakenly trying to manage the intermediate leaves in the Trie myself; the previous thinking was that the intermediate nodes needed their own end handling.

@john-kurkowski, I confirmed your example.
OK guys, I have a new branch that passes all tests; I have not created a PR just yet. I have a few notes on it. I'll let you absorb this and point me toward how (or if) you want me to finish it.
I also have a question about usage of tldextract, related to extracting either the fqdn or registered_domain from a list of potentially problematic Google Sites-style URLs, where tldextract has problems with either the suffix or the domain. A third group relates to super-domains (like Google Sites) used for hosting websites, where the registered domain does not really relate to the entity (pay-level domain), e.g. mbtigers.weebly.com. These could be just data entry problems, but I am wondering if you see a pattern I can add to extra_suffixes that would process these?
Thanks for the detailed breakdown on your branch and the chance to absorb. I will absorb and get back to you. As for your usage questions: in the "probs with domain" and "probs with suffix" cases, it's hard to say what problems you're seeing. What is an example call of yours to tldextract? What is your expected and actual output? For the third group, you're right.
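Not from the thread, but for the weebly.com example specifically: weebly.com appears in the private section of the PSL (at the time of writing; the list changes), so opting in to private domains may already give the grouping wanted here:

```python
import tldextract

# With private PSL entries included, weebly.com is treated as the suffix,
# so each hosted site keeps its own registered_domain instead of all
# collapsing to weebly.com.
extract = tldextract.TLDExtract(include_psl_private_domains=True)
result = extract("mbtigers.weebly.com")
print(result.suffix)  # expected: weebly.com
print(result.domain)  # expected: mbtigers
```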
@leeprevost again, thanks for your detailed breakdown in your branch. I see a path to converting your branch to a PR. However, I saw your test case and want to back up a bit. In your test case, and what I missed in your original post, I see you're trying to JSON serialize a private property.

Maybe there's a public interface change to discuss instead? Maybe I'm missing something about your use case. I figure most libraries in Python are not JSON serializable without the caller manually picking what to serialize (there's no `__json__` protocol, for example).

Is pickling still an option? You say you're encountering errors with this library and pickle, but the following tests pass for me:

```python
import pickle

import tldextract


def test_pickle_lazy_load_before() -> None:
    """Test that the library is pickleable, before lazy loading."""
    extract = tldextract.TLDExtract()
    pickle.loads(pickle.dumps(extract))


def test_pickle_lazy_load_after() -> None:
    """Test that the library is pickleable, after lazy loading."""
    extract = tldextract.TLDExtract()
    extract._get_tld_extractor()
    pickle.loads(pickle.dumps(extract))
```
We could remove that, and agreed, it's probably not good to guarantee. I added it only for my use case, as it was how I originally noticed the serialization issue: I was trying to serialize your trie to broadcast to nodes, and because it wasn't a pure dict, the serializer errored out. So it was an added test of whether the trie can be serialized (using json.dumps/loads).
An update this morning. It was bothering me that we didn't get speed gains, so I tried to implement item 5 above (collapsing matches into the node itself by creating an attrib key). The Trie is substantially smaller with the more streamlined structure, but there was no speed gain, so I rolled back.
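For readers following along, "collapsing matches into the node itself" presumably means storing children directly as dict entries and moving the flags under a sentinel key, so the whole trie is pure data (the key name here is hypothetical):

```python
# Hypothetical collapsed layout: children are plain dict entries; node
# flags live under a sentinel "__flags__" key, making the trie pure data.
trie = {
    "cn": {
        "__flags__": {"end": True, "is_private": False},
        "com": {
            "__flags__": {"end": True, "is_private": False},
        },
    },
}
```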
Odd. I had convinced myself that the library wouldn't pickle (see the original post in this issue). I'll go back and try to recreate this. I agree with your logic on the JSON-related points. And again, I arrived at your door by trying to broadcast your Trie (a mechanism Spark uses to speed up large joins, where small tables or small Python objects are serialized and sent to remote nodes for further processing).
Now I'm wondering if I tried to pickle the private extractor. If your tests above work (pre- and post-lazy-load) on the current rev, I'm now wondering if a pure Python dict Trie really adds anything for you at all. My issue about serializing the Trie for broadcast in Spark is a separate issue unique to my use case. But if your tests work, I agree that your extractor is pickleable and should work as a UDF in Spark.
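For reference, the broadcast pattern in question looks roughly like this in PySpark (a sketch, assuming a running SparkSession; broadcast pickles the object once and ships it to each executor):

```python
from pyspark.sql import SparkSession

import tldextract

spark = SparkSession.builder.getOrCreate()

extract = tldextract.TLDExtract()
extract("example.com")  # force the lazy load before shipping to workers

# Broadcast the pickled extractor once so every executor reuses it.
bc = spark.sparkContext.broadcast(extract)

urls = spark.sparkContext.parallelize(["mbtigers.weebly.com", "a.com.cn"])
domains = urls.map(lambda u: bc.value(u).registered_domain).collect()
print(domains)
```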
Googling your pickle error, this SO answer may be a guiding set of links.
One sentence in particular applies to the current release Trie. The current release Trie is essentially class instances nested as data inside a dictionary, so it would seem a pure-dict trie would resolve that problem. But from your POV, that's got to be a very small user community, limited to big data analytics. It would be hard for me to make the case for you to take a PR just for that, especially without other gains. No difference to me; y'all just let me know, and I'll either finish it or we can save it if it comes up later.

But I could have sworn I produced the pickle error outside of Spark, in standard Python. Now I'm thinking I could have tried to pickle the lazily loaded version of the extractor. I didn't realize how the load works when I flagged it earlier.
I interpreted the quote a bit differently.
Right, given the private attribute use case and no other gains, I'm leaning toward declining this issue. Sorry your hard work didn't pan out; sometimes you don't know until you try. 😕 I'm still open to public interface changes that make this library easier to broadcast in a distributed setting.
Ok, I'll try to circle back and reproduce that error. No problem on not pursuing it; I would do the same if I were you. As to a public interface, I'll give some thought to that.