Skip to content

Pickling a customized Constants silently discards all customizations and corrupts prefixes #167

Description

@derek73

Summary

A Constants instance does not survive a pickle round-trip. Any customizations (added/removed titles, prefixes, suffixes, etc.) are silently lost, every collection reverts to its module default, and prefixes is left holding garbage — the names of the config attributes rather than actual prefixes. No exception is raised, so callers get a silently-wrong parser.

This is pre-existing on master (unrelated to the _pst cache work in #166) but was noticed while reviewing that PR.

Reproduction

import pickle
from nameparser.config import Constants

c = Constants()
c.titles.add('customtitle')
c.prefixes.add('customprefix')

r = pickle.loads(pickle.dumps(c))

print('customtitle'  in r.titles)    # False  (expected True)
print('customprefix' in r.prefixes)  # False  (expected True)
print(sorted(r.prefixes)[:5])
# ['capitalization_exceptions', 'capitalize_name', 'conjunctions',
#  'empty_attribute_default', 'first_name_titles']   <-- attribute names, not prefixes
print(len(r.prefixes))               # 15  (the number of public attrs, not the 36 default prefixes)

Root cause

nameparser/config/__init__.py:

def __setstate__(self, state: Mapping[str, Any]) -> None:
    Constants.__init__(self, state)

def __getstate__(self) -> Mapping[str, Any]:
    attrs = [x for x in dir(self) if not x.startswith('_')]
    return dict([(a, getattr(self, a)) for a in attrs])

__getstate__ returns a dict of every public attribute. __setstate__ then passes that whole dict as the first positional argument to __init__, i.e. as prefixes. So:

  • prefixes becomes SetManager(state_dict) — iterating a dict yields its keys, which is why prefixes ends up containing attribute names like conjunctions, capitalize_name, …
  • every other __init__ parameter (titles, suffix_acronyms, conjunctions, …) falls back to its module default, discarding the pickled values entirely.

Impact

  • Any workflow that pickles a customized Constants (e.g. caching, passing config across process boundaries / multiprocessing, some serialization layers) gets a silently corrupted parser rather than the configured one.
  • The failure is silent — no error, just wrong parsing results — which makes it easy to miss.

Suggested fix

__setstate__ should restore the saved attributes onto the instance rather than re-running __init__ with the dict as a positional arg. Note that __getstate__ mixes two kinds of attributes that need different handling:

  • __init__ collection parameters: prefixes, suffix_acronyms, suffix_not_acronyms, titles, first_name_titles, conjunctions, capitalization_exceptions, regexes
  • plain class-level scalars that are not __init__ parameters: string_format, initials_format, initials_delimiter, empty_attribute_default, capitalize_name, force_mixed_case_capitalization

So a simple Constants.__init__(self, **state) would raise TypeError on the scalar keys. Restoring via setattr(self, key, value) for each item in state is the more robust direction (and naturally re-wires the _pst cache invalidation for the four cached set attributes, since assignment goes through their descriptor). A round-trip test asserting customizations survive should accompany the fix.

A couple of related things worth deciding while in there:

  • copy.deepcopy(Constants()) currently raises TypeError: 're.Pattern' object is not callable via RegexTupleManager.__reduce__; the test suite's conftest.py already works around this by rebuilding managers instead of deep-copying. Worth confirming whether deepcopy should be supported too.
  • Consider whether Constants needs to be picklable at all, or whether documenting it as non-picklable is acceptable — but if it stays picklable, the current silent corruption should be fixed.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions