ENH: Skip costly is_unique call while creating Categorical arrays
#60981
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
I often create Categorical data structures. In certain circumstances the number of unique categories can be quite large, and the overall length of the Categorical can be very long indeed (hundreds of millions of records). I always create these arrays using the Categorical.from_codes path for performance (my codes are stored in a numpy array). Even so, I would like to bypass an expensive is_unique call that is made during the creation of the categories.
My simple (and somewhat contrived) example:
shows with cProfile:
Checking that the categories are unique takes a large chunk of time. I've tried to bypass the public API in order to avoid this is_unique call, but I keep running into trouble. And, generally, I would like to stick to public features only. I know with certainty that my categories are unique.
Feature Description
There could be a couple of solutions here:
Perhaps someone knows how to create a Categorical array very fast, assuming that I have pristine data (no NaNs or bad codes, plus guaranteed unique categories)? I'd welcome a solution with current methods!
If no solution is currently available, perhaps a new is_unique argument could be introduced to the Categorical.from_codes classmethod (with a safe default of False)? The user could turn this on at their own peril. This doesn't seem to be without precedent:
I'm willing to risk segfaults for speed.
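One possible approach with current public methods, offered as a sketch rather than a confirmed fix: build a CategoricalDtype once and pass it to Categorical.from_codes through the dtype argument, instead of passing the raw categories on every call. This assumes pandas validates the categories when the dtype is constructed and does not repeat that check when an existing dtype is supplied, which is worth confirming with cProfile on your pandas version. The data sizes and names below (n_categories, n_rows, the cat_ labels) are hypothetical stand-ins, not the data from this report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: unique categories and in-range codes.
n_categories = 1_000
n_rows = 10_000
categories = np.array([f"cat_{i}" for i in range(n_categories)], dtype=object)
rng = np.random.default_rng(0)
codes = rng.integers(0, n_categories, size=n_rows)

# Build the dtype once; any validation of the categories happens here.
dtype = pd.CategoricalDtype(categories=categories, ordered=False)

# Reuse the prebuilt dtype for every array created from codes.
cat = pd.Categorical.from_codes(codes, dtype=dtype)

# Baseline path for comparison: raw categories passed on each call.
baseline = pd.Categorical.from_codes(codes, categories=categories)
```

If many arrays share the same categories, this would move the cost of the check to a single dtype construction. It does not remove the check for the first construction, so it only partly addresses the request above.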
Many hats off to the pandas team/community. I appreciate your hard work!
Alternative Solutions
I'm not aware of any other package that would satisfy the goal here.
Additional Context
No response