Change Hash Code for Worksheet #4207

Merged 5 commits on Nov 13, 2024

oleibman
Collaborator

Fix #4192. Although that issue can be dealt with by changing user code, it would be better to fix it within PhpSpreadsheet. A cloned worksheet may have a pointer to a spreadsheet to which it is not attached. Code can assume it does belong to the spreadsheet, and throw an exception when the spreadsheet cannot find the worksheet in question. It may also not throw an exception when it should.

In my comments to the issue, I was concerned that adding in the needed protection would add overhead to an extremely common situation (setting a cell's value) in order to avoid a pretty rare problem. However, there are problems with both the accuracy and efficiency of the existing code, and I think any performance losses caused by the additional checks will be offset by the performance gains and accuracy of the new code.

Spreadsheet getIndex attempts to find the index of a worksheet within its spreadsheet collection. It does so by comparing the hash code of each sheet in its collection with the hash code of the sheet it is looking for. Its major problem is performance-related, namely that it recomputes the hash code of the target sheet with each iteration.
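
To make the cost concrete, here is a minimal sketch of the lookup pattern described above, next to a variant that hashes the target once. `ToySheet`, `indexOfSlow`, and `indexOfHoisted` are hypothetical names for illustration only; this is not the PhpSpreadsheet source.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a worksheet; it counts how often its hash is computed.
final class ToySheet
{
    public int $hashCalls = 0;

    public function __construct(private string $title)
    {
    }

    public function getHashCode(): string
    {
        ++$this->hashCalls;

        return md5($this->title);
    }
}

// Recomputes the target's hash on every iteration (the pattern described above).
function indexOfSlow(array $sheets, ToySheet $target): int
{
    foreach ($sheets as $index => $sheet) {
        if ($sheet->getHashCode() === $target->getHashCode()) {
            return $index;
        }
    }

    return -1;
}

// Computes the target's hash once, before the loop starts.
function indexOfHoisted(array $sheets, ToySheet $target): int
{
    $wanted = $target->getHashCode();
    foreach ($sheets as $index => $sheet) {
        if ($sheet->getHashCode() === $wanted) {
            return $index;
        }
    }

    return -1;
}

$sheets = [new ToySheet('A'), new ToySheet('B'), new ToySheet('C')];
$target = $sheets[2];

$slowIndex = indexOfSlow($sheets, $target);
$slowCalls = $target->hashCalls;

$target->hashCalls = 0; // reset the counter before the second run
$hoistedIndex = indexOfHoisted($sheets, $target);
$hoistedCalls = $target->hashCalls;
```

Both functions return the same index, but the first hashes the target once per sheet visited, so the wasted work grows with the number of sheets.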

A more severe problem is the accuracy of the hash code. It generates this by hashing together the sheet title, the string range of its auto-filter, and a character representation of whether sheet protection is enabled. Title should definitely be part of the calculation (it must be unique for all sheets attached to a spreadsheet), but it is not clear why this subset of the other properties of Worksheet is used. It tries to save some cycles by using a dirty property to indicate whether re-hashing is necessary. It sets that property whenever the title changes, or when setProtection is called. So, it doesn't set it when auto-filter changes, and you can easily bypass setProtection when changing any of the Protection properties. Not to mention the many other properties of worksheet that can be changed. Additionally, if you clone a worksheet, the clone and the original will have the same hash code, which can lead to problems:

$clone = clone $original;
$spreadsheet->getSheet($spreadsheet->getIndex($clone))
    ->setCellValue('A1', 100);

That code will change the value of A1 in the original, not the clone.
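
The two accuracy problems can be reproduced with a toy model of the old scheme. `OldStyleSheet` is a hypothetical class that hashes only a title and an auto-filter range and caches the result behind a dirty flag; the exact inputs and hash function of the real Worksheet are simplified here.

```php
<?php

declare(strict_types=1);

// Hypothetical model of the old scheme: a property-derived hash cached behind a dirty flag.
final class OldStyleSheet
{
    private bool $dirty = true;
    private string $hash = '';

    public function __construct(private string $title, private string $autoFilterRange = '')
    {
    }

    public function setTitle(string $title): void
    {
        $this->title = $title;
        $this->dirty = true; // a title change marks the cached hash as stale
    }

    public function setAutoFilterRange(string $range): void
    {
        $this->autoFilterRange = $range; // the dirty flag is NOT set here
    }

    public function getHashCode(): string
    {
        if ($this->dirty) {
            $this->hash = md5($this->title . $this->autoFilterRange);
            $this->dirty = false;
        }

        return $this->hash;
    }
}

// Problem 1: a clone is a distinct object but hashes identically to its original.
$original = new OldStyleSheet('Data');
$clone = clone $original;
$sameHash = $original->getHashCode() === $clone->getHashCode();

// Problem 2: changing a hashed property without touching the dirty flag leaves a stale hash.
$before = $original->getHashCode();
$original->setAutoFilterRange('A1:C9');
$staleHash = $original->getHashCode() === $before;
```

In this model both `$sameHash` and `$staleHash` end up `true`, which is why a hash lookup can land on the original when it was asked for the clone.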

The hash property in Worksheet will now be calculated immediately when the object is constructed or cloned or unserialized. It will not be recalculated, and there is no longer a need for the dirty property, which is removed. Hash will be generated by spl_object_id, which was designed for this purpose. (So was spl_object_hash, but many online references suggest that _id performs much better than _hash.) Our problem example above will now throw an Exception, as it should, rather than changing the wrong cell. setValueExplicit, the problem in the original issue, will now test that the worksheet is attached to the spreadsheet before doing any style manipulation. In order that this not be a breaking change, getHashCode will continue to return string, but it is deprecated in favor of getHashInt, and Worksheet will no longer implement IComparable to facilitate the deprecation.
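
A minimal sketch of the new approach, under stated assumptions: `IdHashedSheet` and `ToyBook` are hypothetical stand-ins for Worksheet and Spreadsheet, and returning the integer cast to string from `getHashCode` is an assumption about how the string form is kept. Only `spl_object_id` and the method names `getHashInt` and `getHashCode` come from the description above.

```php
<?php

declare(strict_types=1);

// Hypothetical worksheet: the hash is fixed at construction, clone, and unserialize.
final class IdHashedSheet
{
    private int $hash;

    public function __construct(public string $title)
    {
        $this->hash = spl_object_id($this);
    }

    public function __clone()
    {
        $this->hash = spl_object_id($this); // the clone gets its own id
    }

    public function __wakeup(): void
    {
        $this->hash = spl_object_id($this); // a serialized id is meaningless after unserialize
    }

    public function getHashInt(): int
    {
        return $this->hash;
    }

    // Assumption: the string form is simply the integer id cast to string.
    public function getHashCode(): string
    {
        return (string) $this->hash;
    }
}

// Hypothetical spreadsheet holding a list of attached sheets.
final class ToyBook
{
    /** @var IdHashedSheet[] */
    private array $sheets = [];

    public function addSheet(IdHashedSheet $sheet): void
    {
        $this->sheets[] = $sheet;
    }

    public function getIndex(IdHashedSheet $sheet): int
    {
        $wanted = $sheet->getHashInt();
        foreach ($this->sheets as $index => $candidate) {
            if ($candidate->getHashInt() === $wanted) {
                return $index;
            }
        }

        throw new RuntimeException('Sheet does not exist.');
    }
}

$book = new ToyBook();
$original = new IdHashedSheet('Data');
$book->addSheet($original);
$clone = clone $original; // never attached to $book

$originalIndex = $book->getIndex($original);
$cloneRejected = false;
try {
    $book->getIndex($clone);
} catch (RuntimeException $e) {
    $cloneRejected = true;
}
```

Because `spl_object_id` is unique only among objects that are alive at the same time, this identifies sheets held in a collection but should not be treated as a persistent identifier.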

I had a vague hope that this change might help with issue #641. It doesn't.

This is:

  • a bugfix
  • a new feature
  • refactoring
  • additional unit tests

Checklist:

  • Changes are covered by unit tests
    • Changes are covered by existing unit tests
    • New unit tests have been added
  • Code style is respected
  • Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
  • CHANGELOG.md contains a short summary of the change and a link to the pull request if applicable
  • Documentation is updated as necessary

Why this change is needed?

Provide an explanation of why this change is needed, with links to any Issues (if appropriate).
If this is a bugfix or a new feature, and there are no existing Issues, then please also create an issue that will make it easier to track progress with this PR.

See if typehint helps.
@oleibman oleibman added this pull request to the merge queue Nov 13, 2024
Merged via the queue into PHPOffice:master with commit 2cbf08c Nov 13, 2024
13 of 14 checks passed
@oleibman oleibman deleted the sheetindex branch November 14, 2024 01:37
Development

Successfully merging this pull request may close these issues.

Exception 'PhpOffice\PhpSpreadsheet\Exception' with message 'Sheet does not exist.'