-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Microdata parser not understanding itemscope properly #2711
Fix Microdata parser not understanding itemscope properly #2711
Conversation
Closes nextcloud#1617 Signed-off-by: Krzysztof Haładyn <[email protected]>
60718ca
to
a937535
Compare
Test Results 12 files 588 suites 1m 54s ⏱️ Results for commit a8aa43d. ♻️ This comment has been updated with latest results. |
Signed-off-by: Christian Wolf <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the HTML pages from the linked issue as test cases and also the test case from you. I tried to extract useful output information mainly based on the current implementation (so, this might be wrong, please check the added JSON files).
All in all, your tests made quite some tests fail. I do not see any obvious errors in the test data. Can you please elaborate what is the problem?
Adding a changelog will at least make this test succeed easily.
@@ -208,7 +208,7 @@ private function searchSimpleProperties(DOMNode $recipeNode, string $property): | |||
* @return DOMNodeList A list of all found child nodes with the given property | |||
*/ | |||
private function searchChildEntries(DOMNode $recipeNode, string $prop): DOMNodeList { | |||
return $this->xpath->query(".//*[@itemprop='$prop']", $recipeNode); | |||
return $this->xpath->query(".//*[@itemprop='$prop' and count(ancestor::*[@itemscope]) = 1]", $recipeNode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure about this XPath stuff right now. Especially the hard-coded 1
makes me nervous.
Also one test case, that I can think of (and that might cause problems) but I have not found a concrete example of: Let the top-most element be a non-recipe. Like a WebPage. The WebPage is about a Recipe. That way, the Recipe is no longer the top-most element. How will ancestor::*[@itemscope]
then behave? Will it detect the recipe's props or the Webpages?
Maybe you could even create a test case (similar to the PR introduction) and add similarly to the test suite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the failing test cases fail because... they are not actually valid Microdata. If you run them through https://validator.schema.org/ you will see that it can't see any elements either, same goes for https://foolip.org/microdatajs/live/. The spec says:
The
itemtype
attribute can only be specified on elements which have anitemscope
attribute specified.
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/itemtype
But these files specify an itemtype
without an itemscope
. I added this requirement to make sure the lookup for the parent itemscope
works correctly. We can try to relax this requirement, but if even the official validator can't see them, then I don't expect this happens often in the real world. How would you like me to proceed here, add the missing itemscope
to the test cases, or make the parser more lenient?
This is the case for: caseB.html
, caseC.html
, caseD.html
, caseE.html
, caseImageAttribute.html
, caseImageContent.html
, caseIngredient.html
and caseInstruction.html
.
The only actually failing test case is caseFix1209.html
, and it seems to be exactly the issue with the parent WebPage
itemscope you mentioned. I'm not really good with XPath so I wrongly assumed the ancestor
query would be limited to the subtree of the node specified through $recipeNode
, but this it not the case. Sadly I can't find a way around this without using XPath 2.0, which is not supported by DOMXPath
. The easiest way I can see to fix this is to create a child DOMDocument
containing only the recipe using DOMDocument::importNode
, create a separate DOMXPath
instance for it, and keep the current query that checks if we are not deeper than 1 itemscope from the root. Would you like me to do that, or perhaps we should replace the XPath queries with manual iteration over the nodes in PHP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think, I fixed the cases you mentioned (B through E, Image*, as well as Ingredient and Instruction). I think I fixed these in the tests.
Regarding the test Fix1209, this was added for #1209 and its fix in #1220. I am a bit more concerned as this might break the existing code somewhat. For the given test case, this is maybe even a good break (as it actually separates the instructions at least partially). However, it drops the first instruction without notice.
I marked it as incomplete for the meantime. If you see any good way to fix this or to work around it, feel free to give me a hint (or better open another PR 😉 )
Regarding the new parser routines, please do not put too much effort into it. There is a plan to rework the complete parser routines (also for the JSON+LD and other parsers). So, this would be wasted time. You are welcome to help with the rework if you want to. The point is to put the parsers on a common base.
Signed-off-by: Christian Wolf <[email protected]>
Signed-off-by: Christian Wolf <[email protected]>
Signed-off-by: Christian Wolf <[email protected]>
Signed-off-by: Christian Wolf <[email protected]>
Signed-off-by: Christian Wolf <[email protected]>
4173391
to
a8aa43d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The missing changes are longer term and this should only be a quick fix PR. So, let's do smaller steps.
Topic and Scope
This makes the simple Microdata parser properly handle nested objects. Here is a minimal repro example of the bug it fixes:
Previously, the import would crash because both "name" props would be interpreted as the name of the recipe. Now, itemprops nested inside child itemscopes are ignored.
Closes #1617
Concerns/issues
This PR should be extended with unit tests, but I don't have the entire dev environment set up to do it right now. I mostly just wanted to bodge a fix for personal use because the bug bothered me. I only tested that the three URLs mentioned in #1617 work fine now.
Formal requirements
There are some formal requirements that should be satisfied. Please mark those by checking the corresponding box.