Description
PDEP-7: https://pandas.pydata.org/pdeps/0007-copy-on-write.html
An initial implementation was merged in #46958 (with the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit / discussed in #36195).
In #36195 (comment) I mentioned some next steps that are still needed; moving this to a new issue.
Implementation
Complete the API surface:
- Use the lazy copy (with CoW) mechanism in more methods where it can be used. Overview issue at ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473
- What to do with the existing `copy` keyword?
- Implement the new behaviour for constructors (eg constructing a DataFrame/Series from an existing DataFrame/Series should follow the same rules as indexing -> behaves as a copy through CoW). Although, what about creating from a numpy array?
- API / CoW: constructing Series from Series creates lazy copy (with default copy=False) #49524
- API / CoW: constructing DataFrame from DataFrame/BlockManager creates lazy copy #51239
- API / CoW: DataFrame(<dict of Series>, copy=False) constructor now gives lazy copy #50777
- API / CoW: copy/view behaviour when constructing DataFrame/Series from a numpy array #50776
- Constructing DataFrame/Series from Index object (also keep track of references): API / CoW: Respect CoW for DataFrame(Index) #52276 and CoW: Avoid copying Index in Series constructor #52008 and CoW: Add lazy copy mechanism to DataFrame constructor for dict of Index #52947
- Explore / update the APIs that return numpy arrays (`.values`, `to_numpy()`). Potential idea is to make the returned array read-only by default.
- API / CoW: return read-only numpy arrays in .values/to_numpy() #51082
- CoW: Return read-only array in Index.values #53704
- We need to do the same for EAs? (now only for numpy arrays) -> CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925
- Warning / error for chained setitem that no longer works -> API / CoW: detect and raise error for chained assignment under Copy-on-Write #49467
- BUG: ChainedAssignmentError for CoW not working when setitem is called from cython #51315
- BUG: ChainedAssignmentError for CoW not working for chained inplace methods when passing `*args` or `**kwargs` #56456
- Add the same warning/error for inplace methods that are called in a chained context (eg `df['a'].fillna(.., inplace=True)`)
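To make the items above concrete, here is a minimal sketch of the behaviours they describe. It assumes pandas 2.0 or later; the `enable_cow` helper is hypothetical (not part of pandas) and only sets the `mode.copy_on_write` option where that option exists, ignoring any warning or error on versions where CoW is already the default.

```python
import warnings

import numpy as np
import pandas as pd


def enable_cow():
    """Hypothetical helper: opt in to Copy-on-Write where it is opt-in."""
    with warnings.catch_warnings():
        # Some pandas versions may warn when this option is set.
        warnings.simplefilter("ignore")
        try:
            pd.set_option("mode.copy_on_write", True)
        except (KeyError, AttributeError):
            # Option not available in this pandas version; nothing to set.
            pass


enable_cow()

# Lazy copy in a method: the result shares memory with the parent
# until one of the two objects is written to.
base = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
lazy = base.reset_index(drop=True)
shares_before_write = np.shares_memory(
    base["a"].to_numpy(), lazy["a"].to_numpy()
)
lazy.iloc[0, 0] = 100  # the first write triggers the real copy
parent_after_write = base.iloc[0, 0]

# Constructor: Series(Series) follows the same rules as indexing.
s = pd.Series([1, 2, 3])
s2 = pd.Series(s)
s2.iloc[0] = 100
constructor_isolated = bool(s.iloc[0] == 1)

# .values returns a read-only array, so the parent cannot be
# modified through the numpy array.
values_writeable = s.values.flags.writeable

# Chained setitem: df["a"] is a temporary object, so writing to it
# never reaches df. The resulting warning is silenced here.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df["a"][0] = 10
chained_write_ignored = bool(df.loc[0, "a"] == 1)
```

In each case the parent object keeps its original values, which is the user-visible guarantee that the checklist items are working towards.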
Improve the performance
- Optimize setitem operations to prevent copies of whole blocks (eg splitting the block could help keeping a view for all other columns, and we only take a copy for the columns that are modified)
- Check overall performance impact (eg run asv with / without CoW enabled by default and see the difference)
Provide upgrade path:
- Add a warning mode that gives deprecation warnings for all cases where the current behaviour would change (initially also behind an option): CoW warning mode for cases that will change behaviour #56019
- We can also update the message of the existing SettingWithCopyWarnings to point users towards enabling CoW as a way to get rid of the warnings
- Add a general FutureWarning "on first use that would change" that is only raised a single time
Documentation / feedback
Aside from finalizing the implementation, we also need to start documenting this. It will also be super useful to have people give this a try and run their code or test suites with it, so we can iron out bugs and missing warnings, or discover unexpected consequences that need to be addressed/discussed.
- Document this new feature (how it works, how you can test it)
- We can still add a note to the 1.5 whatsnew linking to those docs
- Write a set of blogposts on the topic
- Gather feedback from users / downstream packages
- Update existing documentation:
- Write an upgrade guide
Some remaining aspects of the API to figure out:
- What to do with the `Series.view()` method -> is deprecated
- Let `head()` / `tail()` return eager copies? (to avoid that using those methods for exploration triggers CoW) -> API/CoW: Return copies for head and tail #54011
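As a small illustration of the `head()` / `tail()` question, the sketch below (assuming pandas 2.0 or later, with the `mode.copy_on_write` option set where it exists) shows the user-visible guarantee: writing to the preview never changes the parent. Whether the copy is taken eagerly or lazily is an implementation detail that this example does not observe.

```python
import warnings

import pandas as pd

with warnings.catch_warnings():
    # Some pandas versions may warn when this option is set.
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except (KeyError, AttributeError):
        # Option not available in this pandas version; nothing to set.
        pass

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

# Take a preview for exploration, then modify it.
preview = df.head(2)
preview.iloc[0, 0] = 99

# The parent keeps its original value.
parent_value = df.iloc[0, 0]
preview_value = preview.iloc[0, 0]
print(parent_value, preview_value)
```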