ENH: Skip costly is_unique call while creating Categorical arrays #60981

Open
1 of 3 tasks
boxblox opened this issue Feb 21, 2025 · 1 comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

boxblox commented Feb 21, 2025

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I often create Categorical data structures. In some cases the number of unique categories is quite large, and the Categorical itself can be very long (hundreds of millions of records). I always create these arrays with Categorical.from_codes for performance (my codes are stored in a NumPy array). Even so, I would like to bypass the expensive is_unique call that is made when the categories are created.

My simple (and somewhat contrived) example:

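import numpy as np
import pandas as pd
# 10M unique categories, each code repeated 10 times (100M codes in total)
arr = np.array(list(range(10_000_000)) * 10, dtype=np.int32, order="C")
cats = [f"a{i}" for i in range(10_000_000)]
pd.Categorical.from_codes(codes=arr, categories=cats, validate=False)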

Profiling this with cProfile shows:

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.877    1.877    1.877    1.877 base.py:2313(is_unique)
       93    1.539    0.017    1.539    0.017 {built-in method numpy.array}
        1    0.709    0.709    4.574    4.574 extract_test.py:1(<module>)
        1    0.120    0.120    0.120    0.120 missing.py:305(_isna_string_dtype)
        4    0.092    0.023    0.098    0.024 cast.py:1579(construct_1d_object_array_from_listlike)
        4    0.032    0.008    0.131    0.033 construction.py:517(sanitize_array)

Checking that the categories are unique takes a large chunk of the time. I've tried to bypass the public API to avoid this is_unique call, but I keep running into trouble. In general, I would like to stick to public features only. I know with certainty that my categories are unique.

Feature Description

There could be a couple of solutions here:

  1. Perhaps someone knows how to create a Categorical array very fast, assuming that I have pristine data (no NaNs or bad codes, plus guaranteed unique categories)? I'd welcome a solution with current methods!

  2. If no solution is currently available, perhaps a new is_unique argument could be introduced to the Categorical.from_codes classmethod (with a safe default of False)? The user could turn this on at their own peril. This doesn't seem to be without precedent; the existing validate argument is documented as:

validate : bool, default True

If True, validate that the codes are valid for the dtype.

If False, don't validate that the codes are valid. Be careful about skipping validation, as invalid codes can lead to severe problems, such as segfaults.

I'm willing to risk segfaults for speed.

Many hats off to the pandas team/community. I appreciate your hard work!

Alternative Solutions

I'm not aware of any other package that would satisfy the goal here.

Additional Context

No response

@boxblox boxblox added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 21, 2025

boxblox commented Feb 21, 2025

The _simple_new classmethod in categorical.py seems to be the culprit. Specifically, the is_unique call is triggered by the update_dtype call:

@classmethod
# error: Argument 2 of "_simple_new" is incompatible with supertype
# "NDArrayBacked"; supertype defines the argument type as
# "Union[dtype[Any], ExtensionDtype]"
def _simple_new(  # type: ignore[override]
    cls, codes: np.ndarray, dtype: CategoricalDtype
) -> Self:
    # NB: This is not _quite_ as simple as the "usual" _simple_new
    codes = coerce_indexer_dtype(codes, dtype.categories)
    dtype = CategoricalDtype(ordered=False).update_dtype(dtype)
    return super()._simple_new(codes, dtype)
