Avoid unnecessary nesting of annotation features #286

Open
johann-petrak opened this issue Feb 13, 2023 · 13 comments

Comments

@johann-petrak

Currently, the user's choices are not represented directly as features of the annotation created for an annotator, but as key/value pairs in a dict that is the value of a single "label" feature in the annotation.

Would it make sense to make each of them directly an annotation feature?

@davidwilby
Contributor

davidwilby commented Feb 13, 2023

Thanks @johann-petrak, would you be able to add an example please so that we can follow exactly what you're looking for in the JSON structure.

@johann-petrak
Author

From the documentation: https://gatenlp.github.io/gate-teamware/development/manageradminguide/documents_annotations_management.html#exporting-documents

Apparently the user response (annotation) gets represented as a dict in the annotations array. Each dict looks like this:

          {
             "type":"Document",
             "start":0,
             "end":10,
             "id":0,
             "features":{
                "label":{
                   "text":"Annotation text",
                   "radio":"val3",
                   "checkbox":[
                      "val2",
                      "val4"
                   ]
                }
             }
          }

This representation is compatible with the Python gatenlp json representation of an annotation.

However as you can see the information added by the user is all contained in the dict which is the value for the "label" feature.
This means that when importing the document into Python gatenlp, in order to get the values provided by the user, we have to get them via that dict.
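To illustrate the extra lookup, here is a minimal sketch that uses plain Python dicts and the standard `json` module rather than the gatenlp API. The annotation is the example from the documentation; the variable names are made up for illustration.

```python
import json

# The exported annotation from the documentation example above.
exported = json.loads("""
{
  "type": "Document",
  "start": 0,
  "end": 10,
  "id": 0,
  "features": {
    "label": {
      "text": "Annotation text",
      "radio": "val3",
      "checkbox": ["val2", "val4"]
    }
  }
}
""")

features = exported["features"]

# Every user-provided value sits one level down, inside the "label" dict.
user_values = features["label"]
radio_choice = user_values["radio"]
checked = user_values["checkbox"]

# A direct lookup on the features finds nothing, because "label" is the only key.
direct_radio = features.get("radio")
```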

I think when we originally talked about this I was suggesting to make them instead directly accessible as features like so:

          {
             "type":"Document",
             "start":0,
             "end":10,
             "id":0,
             "features":{
                   "text":"Annotation text",
                   "radio":"val3",
                   "checkbox":[
                      "val2",
                      "val4"
                   ]
             }
          }

This is also more in line with how this worked in Java GATE in the past.

Is there anything that would speak against doing it like that?

@ianroberts
Member

Can there be other items in the "features" that come from the original document rather than from the annotator?

@johann-petrak
Author

> Can there be other items in the "features" that come from the original document rather than from the annotator?

This is basically just the equivalent of GATE annotations in an annotation set, so if the gatenlp format gets imported, there might be sets with annotations with features already.

Not sure what the current plan/implementation is for dealing with such existing annotations. But since annotations are grouped in one set per annotator, it should be easy to avoid clashes.
It may make it even easier to avoid any clashes if all per-user annotation sets in TW follow some naming scheme like "user:xxxx" or "annotator:aaaaa" or similar.
Then TW can do its thing with regard to e.g. re-using existing annotation information from those sets, while unrelated annotations can be kept in separate sets.
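A minimal sketch of how such a naming scheme could be used to keep the two kinds of sets apart. The prefix `annotator:` and the function name are assumptions for illustration only; the thread does not settle on a scheme, and this is not Teamware's implementation.

```python
# Hypothetical prefix: "user:xxxx" and "annotator:aaaaa" are only
# suggestions in this thread, not an agreed convention.
ANNOTATOR_PREFIX = "annotator:"


def split_annotation_sets(annotation_sets, prefix=ANNOTATOR_PREFIX):
    """Separate per-annotator sets from unrelated sets by set name."""
    per_annotator = {}
    other = {}
    for name, annset in annotation_sets.items():
        if name.startswith(prefix):
            # Key the per-annotator sets by the bare user name.
            per_annotator[name[len(prefix):]] = annset
        else:
            other[name] = annset
    return per_annotator, other


# Made-up set names, used only to exercise the function.
annotation_sets = {
    "annotator:alice": {"name": "annotator:alice", "annotations": []},
    "annotator:bob": {"name": "annotator:bob", "annotations": []},
    "Original markups": {"name": "Original markups", "annotations": []},
}
per_annotator, other = split_annotation_sets(annotation_sets)
```

With a fixed prefix, sets that came with the uploaded document can never be mistaken for sets created by an annotator, as long as uploaded set names do not use the same prefix.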

But this is perhaps a different topic - I just thought it may be more convenient to avoid the additional layer and avoid having a map as the value of a single feature.

@johann-petrak
Author

Is there a plan to include this in an upcoming release soon?
This change will not be backwards compatible, so I think the earlier data format issues are fixed, the better.
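Until the format changes, downstream code could read both layouts with a small helper. This is only a sketch under the assumption that a dict-valued "label" feature always means the current nested layout; the helper name `user_values` is hypothetical.

```python
def user_values(features, wrapper_key="label"):
    """Return the annotator's values from either layout.

    Assumption: a dict-valued "label" feature means the current nested
    layout; anything else is treated as the proposed flat layout.
    """
    wrapped = features.get(wrapper_key)
    if isinstance(wrapped, dict):
        return wrapped
    return features


nested = {"label": {"text": "Annotation text", "radio": "val3"}}
flat = {"text": "Annotation text", "radio": "val3"}

# Both layouts yield the same values.
same = user_values(nested) == user_values(flat)
```

The heuristic would misread a flat annotation whose project happens to define a dict-valued widget named "label", which is one more reason to settle the format early.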

@davidwilby
Contributor

Hi @johann-petrak, can you help me understand please? Is this an issue with Teamware's implementation of the GATE annotation format or a problem with that format itself?

Regarding when features will be completed, issues are prioritised at regular meetings and you can see the priority order on Teamware's project board.

@johann-petrak
Author

Hmmm, sorry, maybe I do not understand things properly. I think I should have another look at the documentation first, to understand how the information gathered in the annotator GUI is organized.

What I was originally getting at is that all the information interesting to me is grouped into a single feature called "label", which is dictionary-valued in the example, while I was expecting the content of the dictionary to be directly represented as features.

Where does that name "label" come from, in other words, what is the intended name and value of the feature(s) teamware creates? Is this documented somewhere?

@ianroberts
Member

So we just want to change

"features": {
"label": a_data
}

to

"features": a_data

right?
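As a rough sketch, the change amounts to the transformation below. The function name and the clash handling are assumptions, not Teamware code; raising on a clash is just one possible answer to the earlier question about features that come from the original document.

```python
def unnest_label(features, wrapper_key="label"):
    """Return a copy of `features` with the wrapper dict merged into the top level."""
    a_data = features.get(wrapper_key)
    if not isinstance(a_data, dict):
        # Nothing to unnest: already flat, or no annotator data present.
        return dict(features)
    rest = {k: v for k, v in features.items() if k != wrapper_key}
    clashes = set(rest) & set(a_data)
    if clashes:
        # Assumed policy: refuse to silently overwrite existing features.
        raise ValueError(f"feature name clash: {sorted(clashes)}")
    return {**rest, **a_data}


before = {"label": {"text": "Annotation text", "radio": "val3"}}
after = unnest_label(before)
```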

@davidwilby
Contributor

davidwilby commented Apr 21, 2023

Notes from meeting 21/4/23:

To do:

@davidwilby
Contributor

> right?

We might just want to check this function as well.

def get_annotations(request, project_id):

@twinkarma
Collaborator

Check that if you upload the BDOC format

{
  "name": 32,
  "text": "Document text",
  "features": {
    "text2": "Document text 2",
    "feature1": "Feature text"
  },
  "offset_type":"p",
  "annotation_sets": {...}
}


the output when exporting as the GATE format is still:

{
  "name": 32,
  "text": "Document text",
  "features": {
    "text2": "Document text 2",
    "feature1": "Feature text"
  },
  "offset_type":"p",
  "annotation_sets": {...}
}


and NOT nested again e.g.:

{
  "name": 32,
  "text": "Document text",
  "features": {
    "features": {
      "text2": "Document text 2",
      "feature1": "Feature text"
    },
    "offset_type":"p"
  },
  "annotation_sets": {...}
}
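The check described above could be sketched roughly as follows. `check_roundtrip` is a hypothetical helper working on plain dicts, not part of Teamware's test suite, and the "annotation_sets" content is elided in this thread, so it is left empty and not compared here.

```python
import copy


def check_roundtrip(uploaded, exported):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    for key in ("name", "text", "features", "offset_type"):
        if exported.get(key) != uploaded.get(key):
            problems.append(f"{key!r} changed on export")
    # Detect the specific failure mode: features wrapped in another "features".
    feats = exported.get("features")
    original_feats = uploaded.get("features") or {}
    if (
        isinstance(feats, dict)
        and isinstance(feats.get("features"), dict)
        and "features" not in original_feats
    ):
        problems.append("document features were nested again")
    return problems


uploaded = {
    "name": 32,
    "text": "Document text",
    "features": {"text2": "Document text 2", "feature1": "Feature text"},
    "offset_type": "p",
    "annotation_sets": {},  # elided in the thread, left empty here
}
good_export = copy.deepcopy(uploaded)
bad_export = {
    "name": 32,
    "text": "Document text",
    "features": {"features": dict(uploaded["features"]), "offset_type": "p"},
    "annotation_sets": {},
}

good_problems = check_roundtrip(uploaded, good_export)
bad_problems = check_roundtrip(uploaded, bad_export)
```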


@twinkarma
Collaborator

Also, if someone uploads documents with an existing "annotation_sets" field, annotations done in Teamware should be added to that dictionary rather than overwriting it.
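A minimal sketch of that merge behaviour, assuming annotation sets are keyed by set name. The function name, the example set names, and the decision to raise on a name clash are all illustrative assumptions; how clashes should actually be handled is not decided in this thread.

```python
def merge_annotation_sets(existing, from_teamware):
    """Return a new dict with the uploaded sets plus the sets created in Teamware."""
    clashes = set(existing) & set(from_teamware)
    if clashes:
        # Assumed policy: never replace a set that came with the upload.
        raise ValueError(f"annotation set name clash: {sorted(clashes)}")
    merged = dict(existing)
    merged.update(from_teamware)
    return merged


# Made-up set names, used only to exercise the function.
existing = {"Original markups": {"name": "Original markups", "annotations": []}}
from_teamware = {"alice": {"name": "alice", "annotations": []}}
merged = merge_annotation_sets(existing, from_teamware)
```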

@twinkarma
Collaborator

I've split the tasks into three issues: #346, #347 and #348.
