Additional format support for the Python Libraries, Google Summer of Code 2019
SPDX is an open standard for conveying the component, license and copyright information of software in an unambiguous way that both humans and machines can read.
The SPDX community has developed several resources, such as the SPDX specification and tools for various programming languages.
Among those tools is a Python library that lets its users read and write SPDX documents in two formats: RDF/XML and tag/value.
This Google Summer of Code 2019 project extends that format support to include JSON, YAML and XML.
A wider range of interchange formats should make SPDX documents easier to adopt, because the formats fit the habits and guidelines of more development communities. This helps spread the standard and brings SPDX closer to its goals.
To achieve the project goals, three main components had to be implemented: parsers, builders and writers. Because this project extends an existing capability (parsing and creating SPDX documents), every piece of code was written to fit the established code style.
Parsers take the information from format-specific files and perform shallow validations, such as checking the presence or absence of required fields.
Once the parsers have done their job, the parsed information is passed to the builders, which perform deeper validations, such as verifying the field formats and the building order, and finally store all the information in the library models.
In the other direction, writers take SPDX document information from the library models and create a format-specific file containing it.
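The division of labour among the three components can be sketched as follows. This is a minimal illustration, not the library's actual code: the names `Document`, `parse`, `build` and `write`, and the choice of required fields, are hypothetical.

```python
import json
from dataclasses import dataclass


# Hypothetical model; the real library models are much richer.
@dataclass
class Document:
    name: str
    spdx_version: str


REQUIRED_FIELDS = ("name", "spdxVersion")  # assumed for illustration


def parse(text: str) -> dict:
    """Parser: read the format-specific input and do shallow checks only."""
    data = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data


def build(data: dict) -> Document:
    """Builder: validate field formats, then store values in the model."""
    version = data["spdxVersion"]
    if not version.startswith("SPDX-"):
        raise ValueError(f"malformed version: {version!r}")
    return Document(name=data["name"], spdx_version=version)


def write(doc: Document) -> str:
    """Writer: serialize the model back into a format-specific string."""
    return json.dumps({"name": doc.name, "spdxVersion": doc.spdx_version})


doc = build(parse('{"name": "example", "spdxVersion": "SPDX-2.1"}'))
round_tripped = write(doc)
```

Keeping presence checks in the parser and format checks in the builder means the deeper validation does not depend on the file format.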
Parsers were created from scratch, and a single set of them was enough to parse JSON, YAML and XML. To handle the three new formats, format-specific interfaces were created to load the files using the most suitable library for each (the json Python module, PyYAML or xmltodict). Once a file is loaded that way, the information has the same structure regardless of whether it came from JSON, YAML or XML, so it can be handled the same way.
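A rough sketch of that idea, assuming PyYAML and xmltodict are installed for the YAML and XML cases: each loader returns plain Python dictionaries and lists, so the code that follows does not need to know the source format. The function names and the `Document` key are hypothetical, not the library's API.

```python
import json


def load_json(text):
    return json.loads(text)


def load_yaml(text):
    import yaml  # PyYAML, third-party; imported lazily

    return yaml.safe_load(text)


def load_xml(text):
    import xmltodict  # third-party; imported lazily

    return xmltodict.parse(text)


LOADERS = {"json": load_json, "yaml": load_yaml, "xml": load_xml}


def load(text, fmt):
    """Return plain dicts and lists, whatever the source format."""
    return LOADERS[fmt](text)


def document_name(data):
    # Format-agnostic code: it only ever sees dictionaries.
    return data["Document"]["name"]


name = document_name(load('{"Document": {"name": "example"}}', "json"))
```

One caveat worth keeping in mind: xmltodict returns every leaf value as a string, so numeric fields loaded from XML may still need converting later.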
Writers were created the same way. The information from the models is first converted into a common representation, and then the format-specific library creates the file.
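The writing direction can be sketched in the same spirit. The `Package` model and the function names below are hypothetical stand-ins; only `json.dumps`, `yaml.safe_dump` and `xmltodict.unparse` are real library calls, and the last two assume the third-party packages are installed.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Package:  # hypothetical stand-in for a library model
    name: str
    version: str


def to_common(package):
    """Convert the model into plain dicts shared by every writer."""
    return {"Package": asdict(package)}


def dump_json(data):
    return json.dumps(data, indent=2)


def dump_yaml(data):
    import yaml  # PyYAML, third-party; imported lazily

    return yaml.safe_dump(data)


def dump_xml(data):
    import xmltodict  # third-party; imported lazily

    # unparse expects exactly one root key, here "Package".
    return xmltodict.unparse(data, pretty=True)


WRITERS = {"json": dump_json, "yaml": dump_yaml, "xml": dump_xml}


def write(package, fmt):
    return WRITERS[fmt](to_common(package))


output = write(Package(name="example", version="1.0"), "json")
```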
Since builders do not differ much across formats, much of the legacy code was reused through inheritance.
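As an illustration of reuse through inheritance, a new-format builder can override only the small part that differs and delegate the rest to the existing one. The class and method names here are hypothetical, and the whitespace handling is an assumed difference chosen only to keep the example short.

```python
class LegacyDocBuilder:
    """Stands in for an existing builder (hypothetical name)."""

    def set_doc_name(self, doc, name):
        if not isinstance(name, str) or not name:
            raise ValueError("document name must be a non-empty string")
        doc["name"] = name
        return True


class NewFormatDocBuilder(LegacyDocBuilder):
    """Builder for the new formats: inherits validation, adapts the input."""

    def set_doc_name(self, doc, name):
        # Assumed difference: values may arrive with surrounding whitespace.
        cleaned = name.strip() if isinstance(name, str) else name
        return super().set_doc_name(doc, cleaned)


doc = {}
NewFormatDocBuilder().set_doc_name(doc, "  example ")
```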
There are several examples that show the new parsing and creating capabilities in action.
Related Pull Request: https://github.com/spdx/tools-python/pull/96
Besides the main project tasks, some additional work was done: adding full license expression support and fixing legacy bugs. Here is a summary.
This library lacks the capability to parse complex license expressions. It is only possible to parse expressions with a single operator (AND or OR), but not combinations of them or WITH exceptions.
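To see why single-operator parsing falls short, consider a hypothetical parser that splits on one operator at a time. This is not the library's code, only a sketch of the limitation described above.

```python
def parse_single_operator(expression):
    """Hypothetical parser that understands one operator kind at a time."""
    for operator in (" AND ", " OR "):
        if operator in expression:
            parts = [part.strip() for part in expression.split(operator)]
            return operator.strip(), parts
    return None, [expression.strip()]


simple = parse_single_operator("MIT AND Apache-2.0")
mixed = parse_single_operator(
    "MIT AND (Apache-2.0 OR GPL-2.0-only WITH Classpath-exception-2.0)"
)
```

For the simple expression, every operand is a license identifier. For the mixed one, the second operand is the whole parenthesized sub-expression, left unparsed, so the nested OR and the WITH exception are never interpreted.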
This additional work consists of integrating the license-expression library to add full support for license expressions. This work is not finished yet.
Related Pull Request: https://github.com/spdx/tools-python/pull/111
Several legacy bugs and issues were encountered while working on the project. They were fixed not just to unblock the project, but also to improve spdx-tools itself. The following is a list of them.
The list of extracted licenses contained duplicate objects when they came from RDF documents. Extracted licenses were being added as many times as they were encountered in other license types, such as concluded or declared licenses. Now extracted licenses are parsed and added only once, when the extracted licenses section of the document is parsed.
Related issue: https://github.com/spdx/tools-python/issues/97
Related Pull Request: https://github.com/spdx/tools-python/pull/98
Because a Version object was created explicitly in a method's default parameter, Version fields that are supposed to be integers were being stored as strings, causing problems in type-sensitive formats such as JSON. Now that Version object is created with the method suited to handling string-like integers.
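The effect on a type-sensitive format can be shown with a small stand-in. The `Version` class below is hypothetical and deliberately simplified, not the library's actual class; `from_parts` is an invented name for a constructor that coerces string-like integers.

```python
import json


class Version:
    """Hypothetical stand-in, not the library's actual class."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    @classmethod
    def from_parts(cls, major, minor):
        # Coerce string-like integers such as "2" into real ints.
        return cls(int(major), int(minor))


def to_json(version):
    return json.dumps({"major": version.major, "minor": version.minor})


as_strings = to_json(Version("2", "1"))           # {"major": "2", "minor": "1"}
as_ints = to_json(Version.from_parts("2", "1"))   # {"major": 2, "minor": 1}
```

The two outputs differ only in quoting, but a JSON consumer that expects numbers would reject or misread the first.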
Related issue: https://github.com/spdx/tools-python/issues/102
Related Pull Request: https://github.com/spdx/tools-python/pull/103
When parsing RDF documents, some fields were being stored as rdflib objects (rdflib is the library used to parse RDF files). This caused difficulties when converting from RDF to other formats. Now all information from RDF files is stored as Python types or as this library's models.
Related issue: https://github.com/spdx/tools-python/issues/91
Related Pull Request: https://github.com/spdx/tools-python/pull/110
RDF parsers do not handle the projectUri field, which is part of the artifactOf section, and writers do not write the artifactOf section at all. This work is not finished yet.
Related issue: https://github.com/spdx/tools-python/issues/104
Related Pull Request: https://github.com/spdx/tools-python/pull/115
Some of the pull requests linked above have already been merged, and others are expected to be merged after Google Summer of Code 2019 ends.