Skip to content

Handle badly formatted JSON-LD data. #87

@shiquanwang

Description

@shiquanwang

Some web pages contain badly formatted JSON-LD data, e.g., an example

The JSON-LD in this page is:


{
  "@context": "http://schema.org",
        "@type": "Product",
                "name": "Black 'Clint' FT0511 cat eye sunglasses",
                "image": "https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001",
		"brand": {
                  "@type": "Thing",
                  "name": "Tom Ford"
                },
                "offers": {
                	"@type": "Offer",
                	"priceCurrency": "GBP",
                	"price": "285.00",
                	"itemCondition": "http://schema.org/NewCondition",
                	"availability": "http://schema.org/InStock"
                }
    }
}

In the JSON-LD above, the last } is extra. And extruct or json.loads won't handle it properly.

The json.loads in Python after 3.5 will give detailed error information as JSONDecodeError: Extra data: line 19 column 1 (char 624)

In [7]: try:
   ...:     data = json.loads(json_ld_string)
   ...: except json.JSONDecodeError as err:
   ...:     print(err)
   ...:     print(err.msg)
   ...:     print(err.pos)
   ...:
Extra data: line 19 column 1 (char 624)
Extra data
624

The error.msg and error.pos can give some hint to fix the JSON-LD data, e.g., this one we can remove the character at position 624 and parse the data string again to correctly get:

{'@context': 'http://schema.org',
 '@type': 'Product',
 'brand': {'@type': 'Thing', 'name': 'Tom Ford'},
 'image': 'https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001',
 'name': "Black 'Clint' FT0511 cat eye sunglasses",
 'offers': {'@type': 'Offer',
            'availability': 'http://schema.org/InStock',
            'itemCondition': 'http://schema.org/NewCondition',
            'price': '285.00',
            'priceCurrency': 'GBP'}}

There're many possible format errors and some can be fixed easily some might be harder or even impossible.

I propose 3 ways to improve the situation:

  • extruct try various ways to fix the json-ld data case by case, but need to adapt to Python >= 3.5 to allow to get detailed error info
  • extruct allow the user to pass in a function to parse JSON data, and let the user to handle his own possible error types
  • extruct can output the extracted JSON-LD string not parsed data and let the user to parse and handle his own possible error types

I personally recommend the latter 2 ways.

Thanks.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions