Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor/tag info codegen #169

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

bpeake-illuscio
Copy link
Contributor

@bpeake-illuscio bpeake-illuscio commented Dec 30, 2020

Hello!

I have taken a crack here at #147. I'll break this summary into sections where I discuss the reasoning behind certain choices.

Overview of Changes

Innolitics Git Submodule

issue #147 mentions using the JSON dump found in the Innolitics Dicom Standard repo. Rather than downloading and copy/pasting the attributes.json, I decided to add the entire repo as a git submodule.

The submodule lives in /pkg/tag/dicom-standard, and here arre the tradeoffs of using this approach, IMO:

Pros

  • The source of our data is very explicit, down to the exact commit we are pulling from
  • Updating to the latest version becomes as easy as updating the submodule.

Cons

  • Git submodules are not super-ubiquitous, so it does add a small amount of complexity.
  • The Innolitics repo is a little heavy. However, this is offset by submodules not being fetched on clone by default, so only developers who are updating the codegen feature will need to fully fetch it.

Overall I feel the pros are worth the cons.

Codegen migrated from Python to Go

I took the opportunity to move the codegen feature from Python to Go. This lowers the barrier to entry for any developers that wish to make PRs around the codgen feature, but may not know python. Go compiles quickly enough that I do not see this being an issue, especially with how few and far between regeneration of the code is going to be need.

The codgen tool now lives in /pkg/tag/codgen.

I'll go into more about the structure of this new codegen below.

Added Keyword-Indexed Map

Added tagDictByKeyword that indexes tag.Info objects by Info.Name, allowing us to do a map-lookup when calling tag.FindByName, rather than iterating over tagDict.

Added Codegen to Makefile

You can now type make codegen to run go generate. Since there were already make commands for tests I figured I would add one for this too.

Challenges

The main challenge was that the Innolitics JSON dump only contains tags for the DICOM file spec, and does not include tags for 0x0000 Group, which is used by the DIMSE / Dicom-Net spec.

Fo this PR, I have retained this group from the original CSV currently being used for code generation, and paired it down to just items in group 0x0000. However, this meant reading from multiple sources of truth, and the architecture of the code reflects that.

New Codgen Architecture.

Interfaces

The new codegen module relies on two interfaces to do the Heavy lifting:

TagReader

// Interface to implement for reading tags from a source.
type TagReader interface {
	// Name of reader for error messages.
	Name() string
	// Yield next tag or io.EOF if done.
	Next() (TagInfo, error)
	// Implements io.Closer() for closing underlying reader.
	Close() error
}

This interface reads tag information from a source file. A master implementation of TagReader is defined to pull from multiple TagReaders and return parsed tags until all child readers are exhausted. This means that if we add multiple sources, we can just define a new TagReader and register it.

CodeWriter

// CodeWriter is an interface for writing to a dicom tag codegen file.
type CodeWriter interface {
	// Name to use in error messages related to this writer.
	Name() string

	// WriteLeading writes the opening part of a file. Called once before any calls to
	// WriteTag()
	WriteLeading() error

	// WriteTag writes codegen for a given tag. It may be called many times between
	// WriteLeading and WriteTrailing. WriteTag should return io.EOF when all tags
	// are successfully written.
	WriteTag(info TagInfo) error

	// WriteTrailing writes all codegen after WriteTag calls are complete.
	WriteTrailing() error

	// Close closes any underlying open resources.
	Close() error
}

The CodeWriter interface is responsible for writing the codegen for a single go file. A Master implementation of this interface distributes each event to all child CodeWriters.

Basic Flow

The above interfaces allow the main flow to be simple:

  1. A master TagReader is created with child readers for each source file we are parsing.
  2. A master CodeWriter is created with child writers for each taget codegen file we are writing.
  3. The master CodeWriter writes any leading code it's children need to put down before tag information is written.
  4. We iterate over the master TagReader, passing each tag to the master CodeWriter as it is parsed to generate the code for that tag.
  5. When iteration is complete, the master writer writes any trailing code needed to complete the codegen files and closes all resources.

This architecture allows us to add new generated files or new tag metadata sources by defining and registering new readers and writers.

Open Questions

  1. I noticed the original also had a number of attributes like ACR_NEMA_2C_CoefficientsSDDN. These are not present in the Innolitics jump and I am afraid I am not quite sure where they come from. I can copy and paste them all from the old CSV into the current dimse.csv, but I would like to understand more about what those values are, why their keywords conform to a different convention, etc.
  2. I have not done any sort of in-depth diff of any other consts that may have disappeared. Are we worried about breaking changes if there are missing values from the old CSV that are not in the current?

Looking Forward

If you like this approach, I think I would like to start adding some more information into the tag.Info struct that is currently not being captured or fully parsed. Some possibilities include:

  • Pydicom also collates a collection of well-known private tags from this source. We could do the same.
  • More granular ValueMultiplicity information ala Feature/Value Multiplicity Info #166
  • Right now retired tags are skipped but many users may be working with older dicoms, so I would propose adding a Retired bool field to tag.Info, and including them.
  • Some tags have multiple possible Value Representations. For instance (0018,9810) ZeroVelocityPixelValue has a VR of 'US or SS'. Right now I am pulling the second value, as that aligned with what the previous CSV did, but I think we should consider capturing both for completeness, especially if users want to use this lib to validate DICOM writes.
  • Right now we are using the Name field to hold the Keyword defined for the tag, but the spec defines both machine-useable Keyword and a human-readable Name. It would be nice to capture both.

Let me know what you think of this PR. If you like it, I think all of the above will be pretty low-hanging fruit and I am happy to knock them out in a series of PRs that build on this one.

@suyashkumar suyashkumar self-requested a review January 4, 2021 01:45
@suyashkumar suyashkumar self-assigned this Jan 4, 2021
@suyashkumar suyashkumar added the enhancement New feature or request label Jan 4, 2021
@suyashkumar
Copy link
Owner

Thank you @bpeake-illuscio @peake100, apologies for this slipping through the cracks. I think much of this is on the right track. I'm going to take a closer look, and may fork your branch to add commits to make various changes and updates in the next couple of weeks (unless you'd like to make them yourself, but I figure you may be busy no pressure at all).

Thank you again for all your contributions!

@peake100
Copy link

Hey! Thanks so much for reaching out. I'm no longer at Illuscio, so this is my current user.

I probably don't have the bandwidth for this right now, but I appreciate you reaching out and am happy to field questions about what I did if you have any!

@suyashkumar
Copy link
Owner

Sounds good, and totally understood! Will do :) Thanks again!

suyashkumar pushed a commit that referenced this pull request May 29, 2024
This PR is resolving #147.

It updates the generate_tag_definitions.py script to read the DICOM standard JSON dump from the Innolitics repo. They provide a few extra fields, so we introduce them in Tag definition, namely, Keyword and Retired. Keyword represents what Name used to represent, see comments on the field definitions for details.

Innolitics also captures all possible VRs for tags that allow multiple VR values, so I updated the writer verification code to consider that. That ended up resolving #299 as well.

This is a breaking change because some tag variables have been deleted or renamed. In particular, the command tags (group 0000) have been all deleted since Innolitics doesn't provide them. I thought that was okay because these are only related to DIMSE and should never be present in a DICOM file. I can put them back somehow if you think we should keep them, but maybe we could sort that out if we ever work on #181. But mostly, tags were added or unchanged.

I ended up writing this from scratch and not branching from #169 because I thought that was a little bit too complicated for this. But we could reconsider that design if there are requests for supporting private data dictionaries.

Misc: fixes some warnings (i.e. removed deprecated ioutil, etc.), added a helper Uint32 method.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants