Ignore invalid jsonld elements on the page source.#189
Ignore invalid jsonld elements on the page source.#189naveen17797 wants to merge 7 commits intoscrapinghub:masterfrom
Conversation
lopuhin
left a comment
There was a problem hiding this comment.
Thanks @naveen17797 , the feature makes sense, I left some minor comments regarding the code, however I also have a more general question - previously, if we had invalid JSON, we'd crash or log it (depending on the error setting). Now, it would be silently ignored. What do you think about still having some logging if our last attempt at parsing the JSON fails, similar to this?
Line 111 in 547c8a0
extruct/jsonld.py
Outdated
| try: | ||
| return json.loads(script, strict=False) | ||
| except Exception: | ||
| return False |
There was a problem hiding this comment.
minor, but to me it would make more sense to return None here, what do you think? Both None and False can be valid results of JSON parsing, but None looks like a more "neutral" value to use for error case.
| # TODO: `strict=False` can be configurable if needed | ||
| data = json.loads(script, strict=False) | ||
| except ValueError: | ||
| # sometimes JSON-decoding errors are due to leading HTML or JavaScript comments |
There was a problem hiding this comment.
I'd rather not remove this comment, I think it's useful
There was a problem hiding this comment.
Added the comment back to the code.
extruct/jsonld.py
Outdated
| data = self._may_be_get_json(script) | ||
| # check if valid json. | ||
| if not data: | ||
| script = jstyleson.dispose( HTML_OR_JS_COMMENTLINE.sub('', script)) |
There was a problem hiding this comment.
Nitpick - let's remove a an extra space here
| script = jstyleson.dispose( HTML_OR_JS_COMMENTLINE.sub('', script)) | |
| script = jstyleson.dispose(HTML_OR_JS_COMMENTLINE.sub('', script)) |
extruct/jsonld.py
Outdated
| data = self._may_be_get_json(script) | ||
| # After processing check if json is still valid. | ||
| if not data: | ||
| return False |
There was a problem hiding this comment.
Nitpick - we don't use return value, as this is a generator, so return would make more sense
| return False | |
| return |
|
|
||
| <head> | ||
| <script type="application/ld+json"> | ||
| { foo: bar} |
There was a problem hiding this comment.
Thanks for adding the test! From what I understand, it would fail previously, right?
There was a problem hiding this comment.
yes, it should ignore this jsonld element since it is not valid.
i have added the log statement
|
|
i understand if we merge this PR this would remove the ability to stop parsing the page with invalid jsonld elements, previously extruct will raise an exception and fail for jsonld, i am not sure how could i retain this behaviour. may be i can pass errors argument from extract() function to jsonld extractor and determine if we need to raise an exception or just return all valid elements @lopuhin ? |
|
Is there a timeline on this issue. It would be good to get a fix for this issue. |
This PR alters the behaviour such that If there are invalid jsonld elements with valid elements on the page source, it returns only the valid jsonld elements.