-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow valid XML characters outside of BMP when storing property #163
Conversation
stackoverflow-driven development :-D thanks for looking into this one! can you please replace the test Jackalope\Transport\Jackrabbit\ClientTest::testOutOfRangeCharacterOccurrence with something that validates that we can actually successfully store utf-8 to jackrabbit and read it again? does utf-8 work for both content and node/property names? |
The tests are failing and it seems this regex was changed in #65. Did test some jackrabbit versions and at current state I have the following state not sure how we should handle this over an option? Or is there a version constraint/detection currently so we can build to logic on top of that?:
|
There exists a version parameter can we maybe use that and use based on that another regular expression?
@dbu do you know if If its always set we could maybe go with something like: $regex = '/[^\x{9}\x{a}\x{d}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}]+/u';
if (version_compare($this->descriptors['jcr.repository.version'], '2.18.0', '>=')) {
// add support for storage of emoticons
$regex = '/[^\x{9}\x{a}\x{d}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
} |
we could say that if the version descriptor exists and the version is high enough, we use the new regex? |
Just tested this. Looks like |
@nnatter That is really sad, I think we should not call getRepositoryDescriptors, as triggering a additional Request I think is bad and should be avoided. So as we have no version I see the only way by adding a additonal config parameter. Something like @dbu What do you think? Do you see another solution then a parameter? |
c3dea89
to
8143fde
Compare
Added a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like a clean solution, great job!
and i agree that we should prefer configuring the version. i added some nitpicks.
i was wondering if we should initialize the version from the descriptors if it is not set, if we happen to fetch descriptors anyways. but that would lead to super weird behaviour when something works or does not work depending on whether one did something else before that...
what we should do however is compare the configured version to the actual version around line
if (! version_compare(self::VERSION, $this->descriptors['jcr.repository.version'], '<=')) { |
@@ -1328,6 +1333,11 @@ protected function isStringValid($string) | |||
{ | |||
$regex = '/[^\x{9}\x{a}\x{d}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}]+/u'; | |||
|
|||
if ($this->version && version_compare($this->version, '2.18.0', '>=')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please extract this version into a constant declared next to the VERSION constant. lets call it UTF8_SUPPORT_MINIMAL_VERSION or something like that.
ah, and please add a note to the changelog. do we need a PR on the phpcr-odm bundle as well to allow configuring this, or is that already generic enough that it just works? if it "just works" we should document the new parameter at https://symfony.com/doc/current/cmf/bundles/phpcr_odm/configuration.html#phpcr-session-with-jackalope-jackrabbit |
Co-authored-by: David Buchmann <[email protected]>
@dbu Thanks for your review! I implemented the suggestions and created a PR for the documentation here: symfony-cmf/symfony-cmf-docs#866 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i did some cleanup and merge this in #164
thanks a lot for your effort, this is good now!
This PR adjusts the validation of valid XML characters in the
Client
to allow UTF8 characters that are not in the BMP. At the moment, treating these characters as invalid makes it impossible to store emojis such as 🤓 or 🤯 (see #85 (comment)).Per specification, the characters in the range
U+10000–U+10FFFF
are valid inside of XML documents: https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.0It looks like the stackoverflow answer that was used as a reference when implementing the RegEx does not include this range, but the range is included in multiple other answers: