Some web pages contain badly formatted JSON-LD data. For example, the JSON-LD in one such page is:
{
  "@context": "http://schema.org",
  "@type": "Product",
  "name": "Black 'Clint' FT0511 cat eye sunglasses",
  "image": "https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001",
  "brand": {
    "@type": "Thing",
    "name": "Tom Ford"
  },
  "offers": {
    "@type": "Offer",
    "priceCurrency": "GBP",
    "price": "285.00",
    "itemCondition": "http://schema.org/NewCondition",
    "availability": "http://schema.org/InStock"
  }
}
}
In the JSON-LD above, the last } is extra, and neither extruct nor json.loads handles it properly.
In Python 3.5 and later, json.loads raises a JSONDecodeError with detailed error information, here "Extra data: line 19 column 1 (char 624)":
In [7]: try:
   ...:     data = json.loads(json_ld_string)
   ...: except json.JSONDecodeError as err:
   ...:     print(err)
   ...:     print(err.msg)
   ...:     print(err.pos)
   ...:
Extra data: line 19 column 1 (char 624)
Extra data
624
The err.msg and err.pos attributes give hints for fixing the JSON-LD data. In this case, we can remove the character at position 624 and parse the string again to get the correct result:
{'@context': 'http://schema.org',
 '@type': 'Product',
 'brand': {'@type': 'Thing', 'name': 'Tom Ford'},
 'image': 'https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001',
 'name': "Black 'Clint' FT0511 cat eye sunglasses",
 'offers': {'@type': 'Offer',
            'availability': 'http://schema.org/InStock',
            'itemCondition': 'http://schema.org/NewCondition',
            'price': '285.00',
            'priceCurrency': 'GBP'}}
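The repair described above could be sketched roughly as follows. This is only an illustration, not extruct code: the helper name loads_dropping_extra_char and the max_attempts parameter are made up for this example, and it only handles the "Extra data" case, re-raising every other decode error.

```python
import json


def loads_dropping_extra_char(text, max_attempts=5):
    """Hypothetical helper: parse JSON, dropping stray trailing characters.

    On an "Extra data" error, remove the single character at err.pos and
    retry, up to max_attempts times. Any other decode error is re-raised.
    """
    for _ in range(max_attempts):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if err.msg != "Extra data":
                raise
            # err.pos is the index of the first unexpected character.
            text = text[:err.pos] + text[err.pos + 1:]
    # Last try; raises JSONDecodeError if the string is still invalid.
    return json.loads(text)


if __name__ == "__main__":
    broken = '{"@type": "Product", "offers": {"price": "285.00"}}\n}'
    print(loads_dropping_extra_char(broken))
```

Removing one character per attempt keeps the fix conservative: it repairs a stray closing brace like the one above, but it gives up quickly on longer trailing garbage.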
There are many possible format errors. Some can be fixed easily, while others might be harder or even impossible to fix.
I propose 3 ways to improve the situation:
1. extruct tries various ways to fix the JSON-LD data case by case. This requires Python >= 3.5 in order to get the detailed error information.
2. extruct allows the user to pass in a function to parse the JSON data, so that users can handle their own error types.
3. extruct can output the extracted JSON-LD string instead of the parsed data, so that users can parse it and handle their own error types.
I personally recommend the latter two options.
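To make the latter two options concrete, here is a minimal sketch of what such an interface might look like. Everything in it is an assumption for illustration: extract_json_ld, its parse parameter, lenient_parse, and the regex-based script lookup are hypothetical and are not extruct's actual API (a real extractor would use an HTML parser rather than a regex).

```python
import json
import re

# Hypothetical pattern for illustration only.
_JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_json_ld(html, parse=json.loads):
    """Hypothetical extractor, not extruct's API.

    parse is called on each raw JSON-LD string (option 2).
    Passing parse=None returns the raw strings unparsed (option 3).
    """
    raw_blocks = [block.strip() for block in _JSON_LD_RE.findall(html)]
    if parse is None:
        return raw_blocks
    return [parse(block) for block in raw_blocks]


def lenient_parse(text):
    """Example user-supplied parser: decode the first JSON value
    and ignore anything that follows it."""
    obj, _end = json.JSONDecoder().raw_decode(text.lstrip())
    return obj


if __name__ == "__main__":
    html = '<script type="application/ld+json">{"@type": "Product"}\n}</script>'
    print(extract_json_ld(html, parse=lenient_parse))
    print(extract_json_ld(html, parse=None))
```

With the default parse=json.loads the behaviour stays strict, so callers who do not opt in see the same JSONDecodeError as before.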
Thanks.