Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Microdata parser not understanding itemscope properly #2711

Merged
merged 7 commits into from
Mar 29, 2025

Conversation

krzys-h
Copy link
Contributor

@krzys-h krzys-h commented Mar 24, 2025

Topic and Scope

This makes the simple Microdata parser properly handle nested objects. Here is a minimal repro example of the bug it fixes:

<div itemscope itemtype="https://schema.org/Recipe">
  <span itemprop="name">Sample recipe name</span>
  <div itemprop="author" itemscope itemtype="https://schema.org/Person">
    <span itemprop="name">Sample person</span>
  </div>
</div>

Previously, the import would crash because both "name" props would be interpreted as the name of the recipe. Now, itemprops nested inside child itemscopes are ignored.

Closes #1617

Concerns/issues

This PR should be extended with unit tests, but I don't have the entire dev environment set up to do it right now. I mostly just wanted to bodge a fix for personal use because the bug bothered me. I only tested that the three URLs mentioned in #1617 work fine now.

Formal requirements

There are some formal requirements that should be satisfied. Please mark those by checking the corresponding box.

  • I did check that the app can still be opened and does not throw any browser logs
  • I created tests for newly added PHP code (check this if no PHP changes were made)
  • I updated the OpenAPI specs and added an entry to the API changelog (check if API was not modified)
  • I notified the matrix channel if I introduced an API change

@krzys-h krzys-h force-pushed the fix-microdata-itemscope branch from 60718ca to a937535 Compare March 24, 2025 18:31
Copy link

github-actions bot commented Mar 27, 2025

Test Results

   12 files    588 suites   1m 54s ⏱️
  581 tests   580 ✅ 1 💤 0 ❌
2 324 runs  2 319 ✅ 5 💤 0 ❌

Results for commit a8aa43d.

♻️ This comment has been updated with latest results.

Copy link
Collaborator

@christianlupus christianlupus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the HTML pages from the linked issue as test cases and also the test case from you. I tried to extract useful output information mainly based on the current implementation (so, this might be wrong, please check the added JSON files).

All in all, your tests made quite some tests fail. I do not see any obvious errors in the test data. Can you please elaborate what is the problem?

Adding a changelog will at least make this test succeed easily.

@@ -208,7 +208,7 @@ private function searchSimpleProperties(DOMNode $recipeNode, string $property):
* @return DOMNodeList A list of all found child nodes with the given property
*/
private function searchChildEntries(DOMNode $recipeNode, string $prop): DOMNodeList {
return $this->xpath->query(".//*[@itemprop='$prop']", $recipeNode);
return $this->xpath->query(".//*[@itemprop='$prop' and count(ancestor::*[@itemscope]) = 1]", $recipeNode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure about this XPath stuff right now. Especially the hard-coded 1 makes me nervous.

Also one test case, that I can think of (and that might cause problems) but I have not found a concrete example of: Let the top-most element be a non-recipe. Like a WebPage. The WebPage is about a Recipe. That way, the Recipe is no longer the top-most element. How will ancestor::*[@itemscope] then behave? Will it detect the recipe's props or the Webpages?

Maybe you could even create a test case (similar to the PR introduction) and add similarly to the test suite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the failing test cases fail because... they are not actually valid Microdata. If you run them through https://validator.schema.org/ you will see that it can't see any elements either, same goes for https://foolip.org/microdatajs/live/. The spec says:

The itemtype attribute can only be specified on elements which have an itemscope attribute specified.
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/itemtype

But these files specify an itemtype without an itemscope. I added this requirement to make sure the lookup for the parent itemscope works correctly. We can try to relax this requirement, but if even the official validator can't see them, then I don't expect this happens often in the real world. How would you like me to proceed here, add the missing itemscope to the test cases, or make the parser more lenient?

This is the case for: caseB.html, caseC.html, caseD.html, caseE.html, caseImageAttribute.html, caseImageContent.html, caseIngredient.html and caseInstruction.html.


The only actually failing test case is caseFix1209.html, and it seems to be exactly the issue with the parent WebPage itemscope you mentioned. I'm not really good with XPath so I wrongly assumed the ancestor query would be limited to the subtree of the node specified through $recipeNode, but this it not the case. Sadly I can't find a way around this without using XPath 2.0, which is not supported by DOMXPath. The easiest way I can see to fix this is to create a child DOMDocument containing only the recipe using DOMDocument::importNode, create a separate DOMXPath instance for it, and keep the current query that checks if we are not deeper than 1 itemscope from the root. Would you like me to do that, or perhaps we should replace the XPath queries with manual iteration over the nodes in PHP?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think, I fixed the cases you mentioned (B through E, Image*, as well as Ingredient and Instruction). I think I fixed these in the tests.

Regarding the test Fix1209, this was added for #1209 and its fix in #1220. I am a bit more concerned as this might break the existing code somewhat. For the given test case, this is maybe even a good break (as it actually separates the instructions at least partially). However, it drops the first instruction without notice.

I marked it as incomplete for the meantime. If you see any good way to fix this or to work around it, feel free to give me a hint (or better open another PR 😉 )


Regarding the new parser routines, please do not put too much effort into it. There is a plan to rework the complete parser routines (also for the JSON+LD and other parsers). So, this would be wasted time. You are welcome to help with the rework if you want to. The point is to put the parsers on a common base.

Signed-off-by: Christian Wolf <[email protected]>
Signed-off-by: Christian Wolf <[email protected]>
Signed-off-by: Christian Wolf <[email protected]>
Signed-off-by: Christian Wolf <[email protected]>
@christianlupus christianlupus force-pushed the fix-microdata-itemscope branch from 4173391 to a8aa43d Compare March 29, 2025 17:45
Copy link
Collaborator

@christianlupus christianlupus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

The missing changes are longer term and this should only be a quick fix PR. So, let's do smaller steps.

@christianlupus christianlupus merged commit c736c64 into nextcloud:master Mar 29, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot parse recipe from swrfernsehen.de/kaffee-oder-tee
2 participants