
Making tldextract serializable and faster with Trie dictionary #339

Open
leeprevost opened this issue Sep 16, 2024 · 24 comments
@leeprevost

leeprevost commented Sep 16, 2024

First of all, let me say I'm a huge fan of @john-kurkowski's tldextract. I find it critical in my work with the Common Crawl dataset and other projects.

I have found, quite by accident, that the package is not serializable, but I believe it could quite easily be modified to be, and that doing so could speed up the lookup function by ~20%. Serializability could be important for big data projects using Spark broadcast or other distributed processing beyond a single core.

Here is what I'm seeing:

import json
ext = tldextract.TLDExtract()
ext._extractor.tlds_incl_private_trie
Out[14]: <tldextract.tldextract.Trie at 0x2305deb1840>
json.dumps(ext._extractor.tlds_incl_private_trie)
Traceback (most recent call last):
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-d4e5d6e8c9ec>", line 1, in <module>
    json.dumps(ext._extractor.tlds_incl_private_trie)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Trie is not JSON serializable

also:

import pickle
pickle.dumps(ext)
Traceback (most recent call last):
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-17-188183203f90>", line 1, in <module>
    pickle.dumps(extract)
_pickle.PicklingError: Can't pickle <function ext at 0x000002305F541480>: attribute lookup extract on __main__ failed

This seems to be because the underlying Trie is a custom class.

This could be resolved in several ways:

  1. Add a method to the Trie class telling it how to serialize/deserialize (a bit hacky, in my opinion).
  2. Tell json or pickle how to serialize/deserialize (again, a band-aid).
  3. Rewrite the Trie class as a standard dict (I think this is the best way, as a dict lookup would likely be ~20% faster).
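For completeness, option 2 could be sketched as a `default` hook for `json.dumps` that flattens trie nodes into plain dicts. The `Trie` class below is a minimal stand-in using the `matches`/`end`/`is_private` attributes discussed later in this thread, not the library's actual class:

```python
import json

class Trie:
    """Stand-in for the library's Trie node (attribute names per this thread)."""
    def __init__(self):
        self.matches = {}
        self.end = False
        self.is_private = False

def trie_default(obj):
    """json.dumps 'default' hook: flatten a Trie node into a plain dict."""
    if isinstance(obj, Trie):
        return {"end": obj.end, "is_private": obj.is_private, "matches": obj.matches}
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

root = Trie()
root.matches["com"] = Trie()
root.matches["com"].end = True
print(json.dumps(root, default=trie_default))
```

The hook is applied recursively, so nested Trie values inside `matches` are flattened too.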

Here is an untested way to do this that would likely require no additional changes to the private calling class.

If this is of sufficient interest, I'd be glad to provide a PR.

Updated 10/11/24

class Trie(dict):
    """Alternative trie for tldextract using the built-in dict class."""

    __getattr__ = dict.get  # allow calling functions to keep using dot-attribute access

    @staticmethod
    def create(
        match_list: Collection[str],
        reverse: bool = False,
        is_private: bool = False,
    ) -> "Trie":
        """Create a Trie from a list of matches and return its root node."""
        root_node = Trie()
        for m in match_list:
            root_node._add_match(m, reverse=reverse, is_private=is_private)
        return root_node

    def _add_match(self, match: str, reverse: bool = False, is_private: bool = False) -> None:
        """Append a suffix's labels to this Trie node."""
        labels = match.split(".")
        if reverse:
            labels = labels[::-1]
        node = self
        for label in labels:
            node = node.setdefault(label, {})
        node["is_private"] = is_private
@leeprevost leeprevost changed the title Making tldextract serializable Making tldextract serializable and faster with Trie dictionary Oct 11, 2024
@john-kurkowski
Owner

Thank you for the kind words!

I'm open to this. It sounds like not much change.

While I like dicts, I'm always wary of subclassing, especially such a common Python type. (I guess another option would be to avoid classes altogether, and pass dicts between standalone trie functions?)

/cc @elliotwutingfeng who wrote the trie

@leeprevost
Author

(I guess another option would be to avoid classes altogether, and pass dicts between standalone trie functions?)

Would be fairly straightforward. Something like this:

def make_tld_trie(matches, is_private=False):
    root = dict()
    for match in matches:
        node = root
        for label in match.split("."):
            node = node.setdefault(label, {})
        node['is_private'] = is_private
    return root

And:
matches = ['com.github', 'com.kurkowski', 'net.prevost.lee']
make_tld_trie(matches)

{'com': {'github': {'is_private': False}, 'kurkowski': {'is_private': False}},
 'net': {'prevost': {'lee': {'is_private': False}}}}
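To round out the function-based approach, a matching standalone lookup could walk labels through the same structure. make_tld_trie is restated so the snippet is self-contained, and longest_suffix is a hypothetical helper, not part of the library:

```python
def make_tld_trie(matches, is_private=False):
    """Build a nested-dict trie from dot-separated (reversed-order) suffixes."""
    root = {}
    for match in matches:
        node = root
        for label in match.split("."):
            node = node.setdefault(label, {})
        node["is_private"] = is_private
    return root

def longest_suffix(trie, reversed_labels):
    """Count how many leading labels match a path in the trie."""
    node, depth = trie, 0
    for label in reversed_labels:
        if label not in node:
            break
        node = node[label]
        depth += 1
    return depth

trie = make_tld_trie(["com.github", "com.kurkowski", "net.prevost.lee"])
# "pages.github.com" reversed -> ["com", "github", "pages"]
print(longest_suffix(trie, ["com", "github", "pages"]))  # 2
```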

@leeprevost
Author

leeprevost commented Oct 17, 2024

While I like dicts, I'm always wary of subclassing, especially such a common Python type.

The case for subclassing would be if you want built-in methods for searching, adding matches, etc. Those methods would only be available at the root of the Trie, and they could help simplify subsequent use of the Trie (for example, in your calling class that uses the trie during the extract process).

Also, this does seem to be Pythonic; Python gives an example of this in its UserDict class.

The case against it is probably backward compatibility as subclassing dict is a newer Python thing.

But, could go either way on this.

@john-kurkowski
Owner

I guess my worry about subclassing builtins came from posts like this SO answer and this blog post. There are certainly nuances to subclassing. However, in this library's trie's case, it looks like we won't trip over the majority of gotchas, which concern juggling overridden methods, because we won't be overriding any methods.

@john-kurkowski
Owner

If you tackle this, I have a couple requests please! 🙏

  1. Add test coverage that whatever you want to serialize is serializable.
  2. And deserializable, if that's your use case?
  3. Share benchmarks that the speed of the library is same or better.

@leeprevost
Author

However, in this library's trie's case, it looks like we won't be tripping over the majority of gotchas, which concern juggling overridden methods, because we won't be overriding any methods.

I think that's right. But I do wonder about backward compatibility. As I understand it, subclassing dict is a newer Python thing. However, to your blog's point, subclassing UserDict has been around a long time, so that may be the answer.

As for benchmarking, my simple premise is that lookup in a dictionary is faster than attribute lookup on a custom class, since it leverages CPython's underlying C implementation. Some links above suggested as much as 20%.
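That premise is easy to micro-benchmark in isolation. This sketch compares an attribute-plus-key lookup on a small custom class against a plain dict key lookup; it is a toy comparison, not the library's actual trie:

```python
import timeit

class Node:
    """Toy stand-in for a custom trie node."""
    def __init__(self):
        self.matches = {}

class_node = Node()
class_node.matches["com"] = 1
dict_node = {"com": 1}

t_class = timeit.timeit(lambda: class_node.matches["com"], number=1_000_000)
t_dict = timeit.timeit(lambda: dict_node["com"], number=1_000_000)
print(f"class attr+key: {t_class:.3f}s  plain dict: {t_dict:.3f}s")
```

Results will vary by machine and Python version, so it's worth running this before committing to the rewrite.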

I think I've seen an example test setup on tldextract where it's been benchmarked before. Can you point me to that?

I'll queue this up. May take me a little while to get back on it.

@john-kurkowski
Owner

I think I've seen an example test setup on tldextract where it's been benchmarked before. Can you point me to that?

Fair request! #283's description has some example input data to pass to timeit. I think the existing test suite (i.e. pytest) duration would be useful too.

@leeprevost
Author

leeprevost commented Nov 12, 2024

@john-kurkowski , I've started this process with a fork and I've made progress on a "NewTrie" while saving "OldTrie" in the src so that I can benchmark the two.

Although they seem intuitive, I'm realizing I don't have a full understanding of how you use the end and is_private properties in Trie.

Here are some questions:

  1. I see that _PublicSuffixListExtractor initializes two Tries in its init as follows:

         self.tlds_incl_private_trie = Trie.create(
             self.tlds_excl_private, frozenset(private_tlds)
         )
         self.tlds_excl_private_trie = Trie.create(self.tlds_excl_private)

     One excludes private PSL entries and one includes them all, refreshed from your cache/update mechanisms.

  2. However, when I capture those Tries (initialized using the current Trie) after they are built via tldextract.TLDExtract(include_psl_private_domains=True)._extractor.tlds_incl_private_trie, I can find no nodes in the Trie where the "is_private" flag is set to True. As an aside, I think the "end" and "is_private" properties could be deprecated anyway, since a search can iterate to the end of the Trie until it returns {}, and since you are keeping separate Tries, neither flag seems important now?
  3. As I write this, I now see that your Trie search is largely here. Initially, I had convinced myself that the "end" and "is_private" attributes weren't used anywhere, but I see they are used here. I don't have a full understanding of what you must do with decoding labels in your search. I see in your other note a lot of work around making that search faster using reverse search, etc., so I don't really need to know how all that works, nor will I break what doesn't need fixing! Could it be that I missed the "is_private" flags when I first checked? I was using a recursive iterator to write those flags out when found; I could have missed something.
  4. Finally, I was trying to avoid setting up a venv for this, as I thought I could test in my own environment, but I'm realizing I need several dependencies for testing. Are you using Poetry to manage your environment? Any pointers on how to set it up from your .toml settings? I can write the test script for the benchmarking and serialization without those dependencies, but I wanted to make sure I didn't break anything. I found some problems with the Trie test script, so I wrote my own. But the above questions on Trie properties make me wonder whether I'm replicating it fully in the new Trie.

Thanks,

Lee

@john-kurkowski
Owner

I can find no nodes in the Trie where the flag "is_private" is set to True. As an aside, to some degree, I think the properties "end" and "is_private" could be deprecated anyway since you can iterate to end of Trie on a search until search returns {} and since you are keeping separate Tries, neither flag seems important now? … Could it be that I missed the "is_private" flags when I first checked? I was using a recursive iterator to try to write those flags out when found. I could have missed something on this.

I see you dug deep on this! Hmm. Maybe your recursion isn't going deep enough to reach a private node? is_private=True nodes are in there. Here's an example.

import tldextract

extract = tldextract.TLDExtract(include_psl_private_domains=True)
extract("any.domain")
trie = extract._extractor.tlds_incl_private_trie
print(trie.matches["com"].matches["amazonaws"].matches["s3"].is_private) # True

I think is_private is still beneficial to store within the trie, because we don't want to duplicate the trie search 2x per query, when we could always search 1x in a trie that has all the info. I do agree there could be an optimization that eliminates end.

@john-kurkowski
Owner

I was trying to avoid setting up a venv for this as I thought I could test using my environment but am realizing I need several dependencies for testing. Are you using Poetry to manage your environment? Any pointers on how to set that up using your .toml settings?

I'm just using pip to get the test dependencies. See this install command in the README.

@elliotwutingfeng
Contributor

@john-kurkowski and @leeprevost

Consider this private suffix s3.cn-north-1.amazonaws.com.cn.

From https://publicsuffix.org/list/public_suffix_list.dat as of 13 November 2024, the following suffixes are valid:

  • s3.cn-north-1.amazonaws.com.cn
  • com.cn
  • cn

But not these intermediates

  • cn-north-1.amazonaws.com.cn
  • amazonaws.com.cn

The end=False boolean flag is used to mark these intermediates as invalid in the trie.

So the eTLD for something like

tldextract.extract("tldextractyay.cn-north-1.amazonaws.com.cn", include_psl_private_domains=True)

would be recognized as com.cn and not cn-north-1.amazonaws.com.cn or amazonaws.com.cn

import tldextract

tlde = tldextract.TLDExtract()
tlde._get_tld_extractor()

trie = tlde._extractor.tlds_incl_private_trie

print(trie.matches['cn'].end)  # True
print(trie.matches['cn'].matches['com'].end)  # True
print(trie.matches['cn'].matches['com'].matches['amazonaws'].end)  # False
print(trie.matches['cn'].matches['com'].matches['amazonaws'].matches['cn-north-1'].end)  # False
print(trie.matches['cn'].matches['com'].matches['amazonaws'].matches['cn-north-1'].matches['s3'].end)  # True
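The way end gates matching can be sketched with a toy dict-shaped trie and search, where only nodes flagged end=True count as a valid suffix boundary. This is a hypothetical illustration, not the library's actual search:

```python
def longest_valid_suffix(trie, reversed_labels):
    """Return the label count of the longest suffix whose node has end=True."""
    node, best, depth = trie, 0, 0
    for label in reversed_labels:
        if label not in node:
            break
        node = node[label]
        depth += 1
        if node.get("end"):
            best = depth  # only record matches at valid suffix boundaries
    return best

# Toy trie: cn, com.cn, and s3.cn-north-1.amazonaws.com.cn are valid suffixes;
# the intermediates are present but flagged end=False.
trie = {
    "cn": {"end": True,
           "com": {"end": True,
                   "amazonaws": {"end": False,
                                 "cn-north-1": {"end": False,
                                                "s3": {"end": True}}}}}
}

print(longest_valid_suffix(trie, ["cn", "com", "amazonaws", "cn-north-1"]))  # 2 -> com.cn
```

Without the end flag, the search above would stop at depth 4 and wrongly accept cn-north-1.amazonaws.com.cn.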

The relevant test cases can be found at

assert tldextract.extract(

@leeprevost
Author

OK @elliotwutingfeng, that makes sense. I was mistakenly trying to manage the intermediate leaves in the Trie with:

for n, label in enumerate(labels):
    node = node.setdefault(label, Trie())
    last = len(labels) == n
    if not last:  # handle case of a growing tree
        if node.end:
            for key in trailer.keys():
                _ = node.pop(key)

# only do this if no other keys in the leaf
if node == {}:
    node.update(trailer)

The previous thinking was that end and is_private were only added to the leaves of the trie. I'll now rethink this with my new understanding and just replicate building the leaves in the pure Python dict.

@john-kurkowski I confirmed your is_private example, but what was throwing me was finding end in nodes along the way:

trie.matches['com'].end
Out[7]: True

@leeprevost
Author

OK guys, I have a new branch that tests completely. I have not created a PR just yet. Here is the new branch:
https://github.com/leeprevost/tldextract

A few notes:

  1. New Trie that is now a pure Python dictionary.
  2. All tests pass, including new serialization tests. The added use case for you is that both the extractor and the Trie can be serialized using standard methods (json.dumps/loads and pickle.dumps/loads). The benefit is that tldextract could be used in Spark-cluster-type applications (really big data, such as CommonCrawl.org). In those use cases, the Trie could be broadcast to all nodes for lookup, and the extractor could be packaged as a user-defined function and used on remote nodes.
  3. The odd thing is that my first pass resulted in slower benchmark results: 60ms for 10 loops on 15K URLs vs. 39ms.
  4. I refactored the extractor search function (suffix_index) to use keys rather than dot accessors and got the performance back to roughly the same as the baseline.
  5. I was surprised a pure dict is not faster than a custom class, based on some things I had read. But I do have some thoughts for you about further optimizations, including replacing 'matches' with the label keys and deprecating 'end'. The first of those seems like it could yield substantial speed/size gains.
  6. John - I had to go with subclassing dict rather than UserDict as we originally discussed. That may limit backward compatibility, as subclassing dict is a newer thing in Python than UserDict. I think this problem is solvable if you think it important.

The best way to follow the work is probably through my changes.md file here.

I'll let you absorb this and point me on how/if you want me to finish this.

@leeprevost
Author

I also have a question about tldextract usage, related to extracting either fqdn or registered_domain from a list of potentially problematic Google Sites URLs where tldextract has problems with either the suffix or the domain.

eg:

probs with domain:
org/harmonyelementaryschool.sites/a/aos94.google.com
org/cayucos-elementary-school-district.sites/a/cayucosschool.google.com
com/educational-solutions.sites/a/edsolns.google.com
us/hhs.mt.k12.sites/a/harrison.google.com

probs with suffix:
sites/site/arkomak12.google.com
sites/site/lagunitaschooldistrict.google.com
sites/site/mattoleunified.google.com
sites/site/mcclellanccschool/home.google.com

A third group relates to super-domains (like Google Sites) used for hosting websites, where the registered domain does not really relate to the entity (the pay-level domain).

mbtigers.weebly.com
ractc.weebly.com
shermancountyschooldistrict.weebly.com
strangeschool.weebly.com
triplains.weebly.com
usd209.weebly.com

These could be just data entry problems. But am wondering if you see a pattern I can add to extra_suffixes that would process these?

@john-kurkowski
Owner

Thanks for the detailed breakdown on your branch and the chance to absorb. I will absorb and get back to you.

As for your usage questions, in the probs with domain and probs with suffix cases, it's hard to say what problems you're seeing. What is an example call of yours to tldextract? What is your expected and actual output?

For the third group, you're right, extra_suffixes would let you treat https://weebly.com the same way the PSL treats e.g. https://blogspot.com.

@john-kurkowski
Owner

@leeprevost again thanks for your detailed breakdown in your branch. I see a path to converting your branch to a PR. However, I saw your test case and want to back up a bit.

In your test case, and what I missed in your original post, I see you're trying to JSON serialize a private property, extractor._get_tld_extractor().tlds_incl_private_trie. I don't necessarily want to make any public guarantees about a private property.

Maybe there's a public interface change to discuss instead?

Maybe I'm missing something about your use case. I figure most libraries in Python are not JSON serializable without the caller manually picking what to serialize (there's no e.g. __json__ standard method one can implement). In your distributed computing environment (Spark?), what is the expectation for job authors to serialize library data? Is this not an issue for other libraries?

Is pickling still an option? You say you're encountering errors with this library and pickle. Yet the following tests pass for me in 5.1.3.

import pickle

import tldextract


def test_pickle_lazy_load_before() -> None:
    """Test that the library is pickleable, before lazy loading."""
    extract = tldextract.TLDExtract()
    pickle.loads(pickle.dumps(extract))


def test_pickle_lazy_load_after() -> None:
    """Test that the library is pickleable, after lazy loading."""
    extract = tldextract.TLDExtract()
    extract._get_tld_extractor()
    pickle.loads(pickle.dumps(extract))

@leeprevost
Author

In your test case, and what I missed in your original post, I see you're trying to JSON serialize a private property, extractor._get_tld_extractor().tlds_incl_private_trie. I don't necessarily want to make any public guarantees about a private property.

Maybe there's a public interface change to discuss instead?

We could remove that, and I agree it's probably not good to guarantee. I added it only for my use case, as it was how I originally noticed the serialization issue. I was trying to serialize your trie to broadcast to nodes, and because it wasn't a pure dict, the serializer errored out. So this was an added test of whether the trie can be serialized (using json.dumps/loads).

@leeprevost
Author

@leeprevost again thanks for your detailed breakdown in your branch.

An update this morning: it was bothering me that we didn't get speed gains, so I tried to implement item 5 above (collapsing matches into its parent by creating an attribute key, %attrib%, so as not to conflict with actual TLD-related keys).

The Trie is substantially smaller with the more streamlined structure. But, no speed gain. So, I rolled back.

@leeprevost
Author

Is pickling still an option? You say you're encountering errors with this library and pickle. Yet the following tests pass for me in 5.1.3.

Odd. I had convinced myself that the library wouldn't pickle (see the original post in this issue). I'll go back and try to recreate this. I agree with your logic on the JSON-related points. And again, I arrived at your door by trying to broadcast your Trie (a mechanism Spark uses to speed up large joins, where small tables or small Python objects are serialized and sent to remote nodes for further processing).

@leeprevost
Author

Now I'm wondering if I tried to pickle the private extractor. If your tests above work (pre- and post-lazy-load) on the current rev, I'm wondering whether a pure Python dict Trie really adds anything for you at all. My issue about serializing the Trie for broadcast in Spark is separate and unique to my use case. But if your tests pass, I agree that your extractor looks pickleable and should work as a UDF in Spark.

@john-kurkowski
Owner

_pickle.PicklingError: Can't pickle <function ext at 0x000002305F541480>: attribute lookup extract on __main__ failed

Googling your pickle error, this SO may be a guiding set of links.

@leeprevost
Author

Googling your pickle error, this SO may be a guiding set of links.

This sentence in particular applies to the current release Trie:

So, instances of classes that don't have a module name, but rather live as attribute in other classes, or data inside lists and dictionaries, typically will not work.

The current release Trie is essentially an instance of a class stored as data inside a dictionary.

So, it would seem a pure dict trie would resolve that problem. But from your POV, that's got to be a very small user community, limited to big data analytics. It would be hard for me to make the case for you to take a PR just for that, especially without other gains.

No difference to me. Y'all just let me know and I'll finish it or we can save it if it comes up later?

But I could have sworn I produced the pickle error outside of Spark, in standard Python. Now I'm thinking I could have tried to pickle the lazy-loaded version of it. I didn't realize how the load works when I flagged it earlier.

@john-kurkowski
Owner

So, instances of classes that don't have a module name, but rather live as attribute in other classes, or data inside lists and dictionaries, typically will not work.

The current release Trie is essentially an instance of a Class inside as data in a dictionary.

I interpreted the quote as, Python pickle has a problem with instances of classes that don't have a module name. This library isn't using any e.g. dynamic classes, so all classes have modules. Per my test case above, all attributes seem to pickle, whether a leaf Trie instance or starting at a TLDExtract instance all the way down to the leaves. I'm interested if you can reproduce your pickle error.
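That interpretation is easy to check: an instance of a class defined at module level pickles fine, while an instance of a class defined inside a function (which pickle cannot reach via module-attribute lookup) does not. A minimal demonstration:

```python
import pickle

class ModuleLevel:
    """Reachable as a module attribute, so instances pickle fine."""
    pass

def make_local_class():
    class Local:  # lives only inside this function's scope
        pass
    return Local

Local = make_local_class()

# Module-level class: round trip succeeds
assert isinstance(pickle.loads(pickle.dumps(ModuleLevel())), ModuleLevel)

# Local class: pickle can't resolve make_local_class.<locals>.Local
try:
    pickle.dumps(Local())
except (pickle.PicklingError, AttributeError) as exc:
    print("cannot pickle:", exc)
```

tldextract's Trie is defined at module level, which is consistent with the passing pickle tests above.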

So, it would seem a pure dict trie would resolve that problem. But from your POV, that's got to be a very small user community limited to Big Data analytics. It would be hard for me to make the case for you to take a PR just for that, especially without other gains.

No difference to me. Y'all just let me know and I'll finish it or we can save it if it comes up later?

Right, given the private attribute use case and no other gains, I'm leaning toward declining this issue. Sorry your hard work didn't pan out. Sometimes you don't know until you try. 😕

I'm still open to public interface changes that make this library easier to broadcast in a distributed setting.

@leeprevost
Author

Ok, I'll try to circle back to reproduce that error.

No problem on not pursuing it. I would do the same if I were you.

As to a public interface, I'll give some thought to that.
