
Software Development Plan, December 2017

Gaurav Vaidya edited this page Jan 12, 2018 · 6 revisions

This is an archived copy of the software development plan Gaurav wrote up in December 2017. The plan itself is now maintained as a GitHub project. The overall list of project aims and goals has been moved to this Wiki and should be edited there.

This software development plan is intended to list all the software components we would like to build as part of the Phyloreferencing project. It will provide us with a comprehensive overview of current and future development efforts, organize development into phases, and allow us to precisely scope each component. Given that this is a research project, these plans (and especially the time estimates) should be considered extremely approximate and may change dramatically based on what we learn at each stage of development.

Software components

These have been entirely transferred to https://github.com/phyloref/organization/projects/1.

These project goals can be broken down into a series of software components. I summarize them here, and then provide a thorough description of each software component in the following sections. I measure progress on each component through two metrics:

  • Specific aims and goals are high-level requirements taken from the project proposal or added subsequently by project personnel. They are intended to be used as a checklist to show how close we are to meeting the overall expectations of our project.
  • User stories describe individual features that are necessary to make these software components usable and valuable to users, and show how well we understand upcoming requirements. They are intended to be converted to GitHub issues and form the basis of unit tests.
| Component | Status | Specific aims | User stories | Complete by |
|---|---|---|---|---|
| Phyloreferencing test suite | In development | 0/5 | 0/1 | January 2018 |
| Phyloreferencing Python library | In development | 3/6 | 2/5 | January 2018 |
| Phyloreferencing Java library | Prototyping | 2/4 | 0/3 | January 2018 |
| Phylogeny to OWL conversion tool | In development | 1/4 | 1/4 | January 2018 |
| Phyloreferencing specification | Prototyping | 4/22 | 0/12 | March 2018 |
| Phyloreference curation tool | → January 2018 | 0/4 | 0/1 | March 2018 |
| Phyloreference interface for Regnum | → March 2018 | 0/0 | 0/3 | July 2018 |
| Phyloreference navigator and query tool | → March 2018 | 0/6 | 0/14 | July 2018 |

The component statuses listed above are in the following sequence:

  1. Not started: the estimated start date is listed.
  2. Planning: the component is being scoped, either in this document or as a separate document or blog post.
  3. Prototyping: a prototype software tool is being developed that acts as a proof-of-concept and suggests the best way to develop this tool.
  4. In development: the prototype is being expanded to a fully documented, well-designed software program.
  5. Testing: a potential release is ready and is being tested by users inside and outside the project. A comprehensive test suite might be added at this stage.
  6. Complete: all planned development has taken place, the software has been tested and has been used by users outside of the project.

Goals and specific aims from the project proposal

Entirely moved to https://github.com/phyloref/organization/wiki/Project-aims-and-goals

The following goals of the Phyloreferencing project were listed in the project proposal (pg 3):

  1. A specification for encoding phyloreferences and phylogenies in OWL.

    1. We will specify and test templates and a supporting ontology for constructing phyloreferences in OWL, guided by phylogenetic definitions used in the literature.
    2. In parallel, we will develop a model and an automatic transformation tool for representing phylogenies in OWL such that phyloreferences can be resolved by OWL reasoning.
  2. An OWL ontology of vetted phyloreferences.

    1. To ground-proof the specification, we will create a tool to transform the published phylogenetic definitions contained in the RegNum database to an OWL ontology of phyloreferences.
    2. We will also supplement the RegNum content with phylogenetic definitions culled from the angiosperm (flowering plants) systematics literature.
  3. A proof-of-concept application for utility and correctness of phyloreferences.

    1. Using a comprehensive phylogenetic tree for angiosperms and the previously curated phyloreferences, we will create an online application that uses OWL reasoning to allow users to query the tree using phyloreferences, and to find phyloreferences based on chosen nodes of the tree.
  4. A proof-of-concept application for navigating large-scale data resources.

    1. We will extend the proof-of-concept application to allow users to query and navigate EOL with phyloreferences, using the full synthetic Open Tree of Life.

The project proposal also includes a list of 12 specific aims (pg 13). These are:

  1. (1a) Development of phyloreferencing ontology
  2. (1b) Specification for ontology-based phyloreference construction
  3. (1c) Tool for converting phylogenies into OWL ontologies
  4. (2a) Ontology-based interface for RegNum
  5. (2b) Literature extraction of phylogenetic definitions
  6. (2c) Tool for transforming RegNum content to OWL
  7. (3) Webapp for querying of large tree with phyloreferences
  8. (4a) Algorithm to map between tree terminal nodes and taxonomies
  9. (4b) Algorithm to map between tree internal nodes and taxonomies
  10. Test cases, software testing, query result vetting
  11. Development of online instructional module
  12. Development of Museum exhibit

I listed the first nine specific aims under the descriptions of individual software components below. The final three specific aims are unrelated to any one software tool, and so are not included in this development plan.

Deliverables: Documents

Phyloreferencing specification

A specification that describes how phyloreferences can be defined conceptually and how they can be implemented in OWL. It includes examples of phyloreferences and discusses the limitations of this approach.

User stories (0/12):

  • As a software developer, I need:
    • The minimum information necessary to describe a phyloreference
    • Detailed algorithms for how matching and reasoning is intended to work at a conceptual level and in OWL
    • Concrete examples that demonstrate what phyloreferences look like
    • Clear figures showing the OWL and data model
  • As a user of phylogenetic clade definitions, I expect that this specification will provide implementation details for:
    • Specifiers
    • Matching specifiers with phylogenetic nodes
    • Node-based phyloreferences (minimum-clade definitions)
    • Branch-based phyloreferences (maximum-clade definitions)
    • Apomorphy-based phyloreferences
    • Qualifying phrases
  • As an ontologist, I would like to know the strengths and weaknesses of an OWL-based approach as compared to directly implementing phyloreferences in software.
  • As an ontologist with limited time, I would like the model created in this specification to be computationally efficient, so that it can be used and manipulated easily and will work on large phylogenies.
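
To make the node-based (minimum-clade) case in the stories above concrete, here is a minimal Python sketch that resolves such a definition directly on a toy tree, without OWL. It only illustrates the result a reasoner would be expected to produce and is not the approach the specification itself describes. The tree topology, the node names (n1, n2) and the function names are all assumptions made for this example.

```python
# Toy rooted tree stored as a child -> parent map.
# The topology and the internal node names are hypothetical.
PARENT = {
    "Panthera tigris": "n2",
    "Panthera leo": "n2",
    "Felis catus": "n1",
    "n2": "n1",
    "n1": None,  # root
}


def ancestors(node):
    """Return the path from `node` up to the root, starting with `node` itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def minimum_clade(internal_specifiers):
    """Resolve a node-based definition on the toy tree.

    The result is the most recent common ancestor of all internal
    specifiers, i.e. the root node of the smallest clade containing them.
    """
    first, *rest = internal_specifiers
    shared = [set(ancestors(s)) for s in rest]
    # Walk upwards from the first specifier; the first node that is also an
    # ancestor of every other specifier is the most recent common ancestor.
    for candidate in ancestors(first):
        if all(candidate in s for s in shared):
            return candidate
    return None


print(minimum_clade(["Panthera tigris", "Panthera leo"]))  # n2
print(minimum_clade(["Panthera tigris", "Felis catus"]))   # n1
```

A direct implementation like this is also a convenient baseline for the comparison requested in the ontologist's story: the same definitions can be resolved both ways and the answers compared.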

Specific aims (4/22):

  • (Specific Aim 1a) Development of phyloreferencing ontology
    • As a written specification
    • As an OWL ontology
  • (Specific Aim 1b) Specification for ontology-based phyloreference construction
  • As per the requirements of phyloreferences (the project proposal, pg 1):
    • Because it consists of uniquely identified ontology terms and properties, it is unambiguous.
    • Although it expresses a pattern of shared ancestry, it can be defined and communicated independent of a concrete phylogeny.
    • It may be named (labeled), but a name only aids communication and carries no semantics.
    • Its semantics are interpretable by any off-the-shelf OWL reasoner implementation, and do not require custom, bespoke tools.
    • To promote reuse and consistency of frequently used phyloreferences, they can be compiled and published in the form of community-vetted OWL ontologies.
    • Tools and algorithms exist that use the ontologies from which the terms used within a phyloreference are drawn to compute quantitative metrics between phyloreferences, including distance and semantic similarity.
  • As per the requirements set out in the Research and Development Plan (project proposal, pg 7):
    • Any element of a phylogeny is referenceable, including nodes, branches, clades and tips.
    • Phyloreferences are unambiguous. Given a phyloreference and a phylogeny, the matching of the phyloreference should be “axiomatic” (self-evident or unquestionable).
    • Phyloreferences are fully computable: they can be resolved by a machine.
    • Phyloreferences are portable. The definition of a phyloreference is neutral with respect to the phylogeny against which one chooses to apply it.
  • As per the description of deliverables (project proposal, pg 8-9):
    • Determine if phyloreferences should resolve to a node, a branch or a set of nodes
    • Determine how to use specifiers, both taxa and apomorphies
    • Determine how efficiently we can carry this out in OWL2-DL and whether we can simplify it to OWL2-EL.
    • Determine how efficiently we can carry this out with Tbox axioms instead of Abox axioms.
    • Evaluate how phyloreferences would work on unrooted trees, polytomies, anastomosing phylogenies (?) and phylogenetic networks.
  • The semantics of phyloreference clade definitions should be “transparent, unambiguous, and readable by any OWL-aware tool, including machine reasoners”.
  • As per the use cases in the project proposal, pg 2-3:
    • Provide alternate definitions of differently defined clades within the Campanulaceae
    • Provide a simple, one-sentence phylogenetic clade definition in OWL
    • Provide definitions for clades within and around polyphyletic species
    • Provide definitions for unnamed phylotypes for “dark taxa”, such as bacteria and viruses

Deliverables: Software tools

Phyloreference navigator and query tool

Our overall goal is the development of a graphical tool that can be used to demonstrate the value of phyloreferencing when compared with current approaches. It will be a single, online tool that will allow users to:

  1. Construct phyloreferences by indicating internal and external specifiers,
  2. Validate phyloreferences to identify common mistakes or to help determine why a phyloreference has failed,
  3. Match custom phylogenies against constructed phyloreferences or against the phyloreferences included in the Phyloreferencing test suite, and
  4. Match custom and included phyloreferences against the Open Tree of Life.

While we have had some success in reasoning on-the-fly over small-to-medium sized phylogenies, demonstration phylogenies and phyloreferences will probably be prereasoned.

Gaurav’s stories (0/14):

  • As a biologist looking to contextualize a particular phylogeny, I would like to be able to annotate it with all known matching phyloreferences in our database.
  • As a biologist looking to make it easier for readers to navigate my large phylogenies, I would like to be able to annotate phyloreferences on large phylogenies.
  • As a biologist looking to navigate and annotate the Open Tree of Life, I would like to be able to:
    • apply phyloreferences on the Open Tree of Life synthetic tree.
  • As a person familiar with some species but not their higher taxonomy, I would like to be able to:
    • navigate it by using specifiers directly, e.g. “highlight clade containing Panthera tigris but excluding Panthera leo”.
    • identify pages on the Encyclopedia of Life corresponding to the tips matched by a particular phyloreference, i.e. which species are included within a particular clade
    • identify pages on the Encyclopedia of Life corresponding to the internal node matched by a phyloreference, i.e. which taxon does this clade correspond most closely to
  • As a person unfamiliar with phyloreferences:
    • Constructing phyloreferences should be an easy process to get started, i.e. with step-by-step guidance
    • Constructing phyloreferences should be easy and quick to carry out once you are familiar with the system
  • As an ontologist, I would like to be able to:
    • view the OWL representation of a phyloreference with syntax highlighting, likely in Manchester or Turtle formats
    • export the OWL representation of a phylogeny as RDF/XML
    • export the OWL representation of a phyloreference as RDF/XML
  • As a user with limited time, I would like:
    • phylogenies to be annotated very quickly, or a submission system that e-mails me once my job is complete
    • approaches involving pre-reasoning phyloreferences to be compared with approaches in which reasoning occurs on-the-fly, possibly in OWL2-EL rather than OWL2-DL
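
The specifier-based navigation story above ("highlight clade containing Panthera tigris but excluding Panthera leo") is a branch-based (maximum-clade) query. The following Python sketch shows one way such a query could be answered directly on a small tree. It is an illustration only: the tree is a made-up toy topology rather than a real phylogeny, the node names (n1 to n3) and function names are hypothetical, and the planned tool is expected to use OWL reasoning rather than this traversal.

```python
# Toy rooted tree stored as a child -> parent map.
# This is NOT a real phylogeny; topology and node names are hypothetical.
PARENT = {
    "Panthera tigris": "n3",
    "Panthera uncia": "n3",
    "n3": "n2",
    "Panthera leo": "n2",
    "n2": "n1",
    "Felis catus": "n1",
    "n1": None,  # root
}


def lineage(node):
    """Return the path from `node` up to the root, starting with `node` itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def maximum_clade(internal, external):
    """Return the root of the largest clade containing `internal` but not `external`."""
    excluded = set(lineage(external))
    best = None
    # Climb from the internal specifier towards the root and stop just
    # before reaching a node that is also an ancestor of the external one.
    for node in lineage(internal):
        if node in excluded:
            break
        best = node
    return best


def tips_of(clade):
    """List the tip labels descending from `clade`, e.g. for highlighting."""
    internal_nodes = set(PARENT.values())
    return sorted(t for t in PARENT if t not in internal_nodes and clade in lineage(t))


clade = maximum_clade("Panthera tigris", "Panthera leo")
print(clade)           # n3
print(tips_of(clade))  # ['Panthera tigris', 'Panthera uncia']
```

The list returned by `tips_of` corresponds to the species a user would see highlighted, and is the kind of tip list that the Encyclopedia of Life stories would then map to pages.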

Specific aims met (0/6):

  • (Specific Aim 3) Webapp for querying of large tree with phyloreferences (specifically, a large tree of angiosperms)
  • (Specific Aim 4a) Algorithm to map between tree terminal nodes and taxonomies
  • (Specific Aim 4b) Algorithm to map between tree internal nodes and taxonomies
  • (Goal 3a) Using a comprehensive phylogenetic tree for angiosperms and the previously curated phyloreferences, we will create an online application that uses OWL reasoning to allow users to query the tree using phyloreferences, and to find phyloreferences based on chosen nodes of the tree.
  • (Goal 4a) We will extend the proof-of-concept application to allow users to query and navigate EOL with phyloreferences, using the full synthetic Open Tree of Life.
  • Demonstrate the results of querying the tree by using it to navigate the Encyclopedia of Life

Phyloreferencing test suite

The phyloreferencing test suite is made up of individual test suites, each consisting of a group of phyloreferences and phylogenies. The phylogenies have been annotated to indicate which nodes we expect will be resolved by each phyloreference. The test suite resolves the phyloreferences on the provided phylogenies and tests whether they resolve to the expected nodes. Since it acts as a test suite for the ontology itself, the definitive copy of the ontology should be stored with the test suite. The test suite will provide us with three main outputs:

  1. A set of curated, documented phyloreferences side-by-side with phylogenies upon which they resolve,
  2. A set of phyloreferences that we can include in the Phyloreferencing navigator and query tool, providing pre-created phyloreferences that can be resolved against custom phylogenies or against the Open Tree of Life synthetic tree, and
  3. A regression testing platform that will detect if later changes to our software or ontology cause previously defined phyloreferences to fail or to resolve incorrectly.
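
The regression-testing behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the test suite's actual code: the dictionary keys, the `run_suite` function and the stand-in `resolve` function (which replaces a real OWL reasoner with canned answers) are all hypothetical.

```python
def run_suite(cases, resolve):
    """Resolve every case and return (label, expected, actual) for each failure."""
    failures = []
    for case in cases:
        actual = resolve(case["phyloreference"], case["phylogeny"])
        if actual != case["expected_nodes"]:
            failures.append((case["label"], case["expected_nodes"], actual))
    return failures


# Hypothetical stand-in for a reasoner: canned node sets per
# (phyloreference, phylogeny) pair.
REASONER_OUTPUT = {
    ("Felidae", "tree1"): {"n1"},
    ("Panthera", "tree1"): {"n2", "n3"},  # resolves to two nodes
}


def resolve(phyloreference, phylogeny):
    """Return the set of node IDs the phyloreference resolved to (empty if none)."""
    return REASONER_OUTPUT.get((phyloreference, phylogeny), set())


CASES = [
    {"label": "Felidae on tree1", "phyloreference": "Felidae",
     "phylogeny": "tree1", "expected_nodes": {"n1"}},
    {"label": "Panthera on tree1", "phyloreference": "Panthera",
     "phylogeny": "tree1", "expected_nodes": {"n2"}},
]

failures = run_suite(CASES, resolve)
for label, expected, actual in failures:
    print(f"FAILED {label}: expected {sorted(expected)}, got {sorted(actual)}")
```

Comparing sets of nodes rather than single nodes means that a phyloreference resolving to nothing, or to more nodes than expected, is reported as a failure in the same way as one resolving to the wrong node.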

User stories (0/5):

  • As someone curious about phyloreferences, I would:
    • like a “worked example” showing how phyloreferences can be constructed, documented and published.
    • like information on how many phyloreferences are part of our test suite, which branches of the tree of life are covered, and whether all current phyloreferences passed all tests
  • As an ontologist, I would like to:
    • download the entire OWL test suite as an ontology in RDF/XML
    • download a small subset of the OWL test suite as an ontology in RDF/XML
  • As a developer, I would like every existing curated phyloreference to be tested every time our ontology changes.

Specific aims met (0/1):

  • (Goal 1a) We will specify and test templates and a supporting ontology for constructing phyloreferences in OWL, guided by phylogenetic definitions used in the literature.

Phyloreference curation tool

This curation tool will allow journal articles containing phylogenetic clade definitions to be curated and added to the test suite. It will essentially act as a graphical editor for the JSON files in the Phyloreferencing test suite. Ideally, it will allow interactive editing of phyloreferences with immediate feedback on whether the phyloreference matched or, if not, why it failed.

User stories (0/4):

  • As a busy curator, I would like:
    • Phyloreferences to be added quickly and with minimal metadata
    • An interactive mode in which phyloreferences can be immediately executed and fixed
    • Checklists to ensure that all necessary metadata were incorporated
  • As a developer, I would like to use this tool to prototype the Phyloreference navigator and query tool and to provide initial feedback on phyloreference-related user experience

Specific aims met (0/1):

  • (Specific Aim 2b) Literature extraction of phylogenetic definitions

Phylogeny to OWL conversion tool

phylo2owl will allow phylogenies to be converted into OWL, keeping as many annotations and labels as possible.
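
As a rough illustration of what converting a phylogeny into statements a reasoner can consume involves, the Python sketch below parses a very simple Newick string into child-parent triples. It is not phylo2owl's actual implementation or output format: the `has_parent` predicate and the generated `nodeN` labels are hypothetical, and the parser deliberately ignores branch lengths, quoted labels and comments.

```python
import itertools


def newick_to_triples(newick):
    """Parse a simple Newick string into (child, "has_parent", parent) triples.

    Supports nested parentheses and plain labels only. Unlabelled nodes
    get generated names (node1, node2, ...). Returns (root_label, triples).
    """
    text = newick.strip().rstrip(";")
    counter = itertools.count(1)
    triples = []
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                    continue
                pos += 1  # consume ")"
                break
        # Read the (possibly empty) label that follows the node.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        label = text[start:pos].strip() or f"node{next(counter)}"
        for child in children:
            triples.append((child, "has_parent", label))
        return label

    root = parse_node()
    return root, triples


root, triples = newick_to_triples("((A,B)AB,C);")
print(root)  # node1
for triple in triples:
    print(triple)
```

Labels are carried over unchanged, in the spirit of keeping as many annotations and labels as possible. A real converter would also need to handle branch lengths and the Nexus and NeXML formats.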

User stories (1/4):

  • As a user of command line tools,
    • I should be able to install phylo2owl from the command line using pip install.
    • I should be able to read documentation about this tool by running man phylo2owl.
    • I should be able to read documentation about this tool by running phylo2owl --help.
  • As a software engineer, I would like to avoid code duplication by refactoring phylo2owl to use the Phyloreferencing Python library.

Specific aims met (1/4):

  • (Goal 1b) In parallel, we will develop a model and an automatic transformation tool for representing phylogenies in OWL such that phyloreferences can be resolved by OWL reasoning.
  • (Specific Aim 1c) Tool for converting phylogenies into OWL ontologies
  • Should be able to convert Newick, Nexus and NeXML input files.
  • Should have its own test suite

Phyloreference interface for Regnum

This tool provides close integration between Regnum, an online database of phylogenetic clade definitions, and our toolset. In particular, it is designed to facilitate the conversion of phyloreferences into OWL representations in the Phyloreference curation tool, where phyloreferences could be curated for inclusion into the test suite. This will probably involve making changes to Regnum to bring it more in line with the Phyloreferencing specification, which will be based on the Phenex phenotype curation tool and the Hymenoptera Anatomy Ontology (HAO) project.

User stories (0/0):

  • None so far

Specific aims met (0/3):

  • (Specific Aim 2a) Ontology-based interface for RegNum
  • (Specific Aim 2c) Tool for transforming RegNum content to OWL
    • (Goal 2a) To ground-proof the specification, we will create a tool to transform the published phylogenetic definitions contained in the RegNum database to an OWL ontology of phyloreferences.
  • (Goal 2b) We will also supplement the RegNum content with phylogenetic definitions culled from the angiosperm (flowering plants) systematics literature.

Deliverables: Software libraries

Phyloreferencing Java library

Because of its integration with OWLAPI, the Java library can reason over ontologies to resolve phyloreferences. The library can therefore provide wrappers around that functionality, including identifying phyloreferences in an OWL ontology, resolving those phyloreferences to nodes, and reporting nodes that failed to resolve correctly. It may later include improved debugging tools and optimizations that allow for processing large phylogenies efficiently. While this library currently includes a command-line tool, jphyloref, this is not currently intended as a final product except as a part of other tools.

User stories (2/4):

  • As a software developer, I should be able to:
    • Load and reason over an OWL ontology in RDF/XML.
    • Identify nodes resolved by phyloreferences.
    • Through abstraction, work directly with Phylogenies, Phyloreferences, Specifiers and other phyloreferencing elements.
    • Determine why a phyloreference failed to resolve: check all its specifiers, identify cases where a “phyloreference” matched multiple nodes, and other common failure causes.
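
The failure checks named in the last story can be sketched as follows. The sketch is written in Python for brevity, although the library itself is in Java, and it does not reflect the library's actual API: the `diagnose` function and its two inputs (a map from each specifier to the nodes it matched, and the list of nodes the phyloreference resolved to) are assumptions made for illustration.

```python
def diagnose(specifier_matches, resolved_nodes):
    """Return a list of human-readable problems; an empty list means no problem found.

    specifier_matches: dict mapping each specifier to the nodes it matched.
    resolved_nodes: the nodes the phyloreference as a whole resolved to.
    """
    problems = []
    # Check every specifier: one that matches nothing is a common failure cause.
    for specifier, nodes in specifier_matches.items():
        if not nodes:
            problems.append(f"specifier {specifier!r} matched no node")
    # A phyloreference is expected to resolve to exactly one node.
    if len(resolved_nodes) == 0:
        problems.append("phyloreference resolved to no node")
    elif len(resolved_nodes) > 1:
        problems.append(f"phyloreference resolved to {len(resolved_nodes)} nodes")
    return problems


report = diagnose({"Panthera tigris": ["t1"], "Panthera leo": []}, [])
for problem in report:
    print(problem)
```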

Specific aims met (0/3):

  • Provides a complete level of abstraction to phyloreferences in OWL ontologies
  • Is thoroughly documented
  • Includes a test suite

Phyloreferencing Python library

Given the difficulty of carrying out OWL reasoning in Python, this library is designed to be used to read and understand JSON files describing a test case with annotated phylogenies and phyloreferences. The phylogenies can be read as Newick, Nexus or NeXML formats, thanks to DendroPy. It contains the code required to convert phylogenies and phyloreferences into OWL representations, so that eventually phylo2owl will essentially be just a command line wrapper for this library.

User stories (2/5):

  • A software developer should be able to read a Phyloreference test suite written in JSON into memory.
  • A software developer should be able to export the Phyloreference test suite as OWL in JSON-LD.
  • A software developer should be able to list all the phylogenies and phyloreferences in a test suite.
  • A software developer should be able to obtain a list of nodes within a phylogeny (even if via DendroPy).
  • A software developer should be able to determine which nodes have been annotated as expected Phyloreference targets.
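
The reading and listing stories above might look like the following in practice. This sketch uses only the standard `json` module and an invented file layout: every key name (`phylogenies`, `phylorefs`, `expectedPhyloreferences` and so on) and every function name is hypothetical and should not be read as the library's real schema or API.

```python
import json

# A made-up test case; the real JSON schema may differ in every detail.
TEST_CASE_JSON = """
{
  "phylogenies": [
    {"label": "tree1",
     "newick": "((A,B)AB,C);",
     "expectedPhyloreferences": {"AB": ["Clade AB"]}}
  ],
  "phylorefs": [
    {"label": "Clade AB",
     "internalSpecifiers": ["A", "B"],
     "externalSpecifiers": []}
  ]
}
"""


def load_test_case(text):
    """Read a test case from its JSON text into plain Python objects."""
    return json.loads(text)


def list_contents(case):
    """Return (phylogeny labels, phyloreference labels) for a test case."""
    phylogenies = [p["label"] for p in case["phylogenies"]]
    phylorefs = [r["label"] for r in case["phylorefs"]]
    return phylogenies, phylorefs


def expected_targets(case):
    """Map each phyloreference label to its annotated (phylogeny, node) targets."""
    targets = {}
    for phylogeny in case["phylogenies"]:
        for node, labels in phylogeny["expectedPhyloreferences"].items():
            for label in labels:
                targets.setdefault(label, []).append((phylogeny["label"], node))
    return targets


case = load_test_case(TEST_CASE_JSON)
print(list_contents(case))     # (['tree1'], ['Clade AB'])
print(expected_targets(case))  # {'Clade AB': [('tree1', 'AB')]}
```

Parsing the Newick string itself is left out here, since the plan is to delegate that to DendroPy.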

Specific aims met (3/6):

  • Reading JSON files describing a test case with annotated phylogenies and phyloreferences.
  • Reading phylogenies in Newick, Nexus or NeXML formats, thanks to DendroPy.
  • Writing out JSON-LD files that contain:
    • The phylogenies, converted into an OWL representation, and
    • The phyloreferences, converted into OWL expressions.
  • Is thoroughly documented
  • Includes a test suite
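
To give a sense of what writing a phylogeny out as JSON-LD could involve, here is a minimal Python sketch using only the standard library. The `@context`, `@graph`, `@id` and `@type` keywords are standard JSON-LD, but the `http://example.org/phyloref#` namespace, the `has_parent` property and the `to_jsonld` function are placeholders invented for this example and are not the project's ontology or API.

```python
import json

# Placeholder vocabulary: the namespace and property are hypothetical.
CONTEXT = {
    "ex": "http://example.org/phyloref#",
    # Values of has_parent are identifiers of other nodes, not string literals.
    "has_parent": {"@id": "ex:has_parent", "@type": "@id"},
}


def to_jsonld(child_parent_pairs):
    """Serialize (child, parent) pairs as a JSON-LD document string."""
    graph = [
        {"@id": f"ex:{child}", "has_parent": f"ex:{parent}"}
        for child, parent in child_parent_pairs
    ]
    return json.dumps({"@context": CONTEXT, "@graph": graph}, indent=2)


document = to_jsonld([("A", "AB"), ("B", "AB"), ("AB", "root")])
print(document)
```

Because JSON-LD is ordinary JSON, the same files remain readable by tools that know nothing about RDF, which fits the library's role of working with JSON test cases.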