Avoid unnecessary nesting of annotation features #286

johann-petrak · 2023-02-13T16:46:09Z

Currently the user choices are not represented directly as features within the annotation for some annotator, but as key/values in a dict which is the value of a single "label" feature in the annotation.

Would it make sense to make each of them directly an annotation feature?

davidwilby · 2023-02-13T16:50:31Z

Thanks @johann-petrak, would you be able to add an example please, so that we can follow exactly what you're looking for in the JSON structure?

johann-petrak · 2023-02-13T17:28:07Z

From the documentation: https://gatenlp.github.io/gate-teamware/development/manageradminguide/documents_annotations_management.html#exporting-documents

Apparently the user response (annotation) gets represented as a dict in the annotations array. Each dict looks like this:

          {
             "type":"Document",
             "start":0,
             "end":10,
             "id":0,
             "features":{
                "label":{
                   "text":"Annotation text",
                   "radio":"val3",
                   "checkbox":[
                      "val2",
                      "val4"
                   ]
                }
             }
          }

This representation is compatible with the Python gatenlp json representation of an annotation.

However as you can see the information added by the user is all contained in the dict which is the value for the "label" feature.
This means that when importing the document into Python gatenlp, in order to get the values provided by the user, we have to get them via that dict.

I think when we originally talked about this I was suggesting to make them instead directly accessible as features like so:

          {
             "type":"Document",
             "start":0,
             "end":10,
             "id":0,
             "features":{
                "text":"Annotation text",
                "radio":"val3",
                "checkbox":[
                   "val2",
                   "val4"
                ]
             }
          }

This is also more in line with how this worked in Java GATE in the past.
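
To make the difference concrete, here is a minimal sketch using plain Python dicts that mirror the two JSON examples above. It deliberately uses no gatenlp API, only dict lookups, and the variable names are made up for illustration.

```python
# Annotation as currently exported: user input sits inside the "label" dict.
nested = {
    "type": "Document", "start": 0, "end": 10, "id": 0,
    "features": {
        "label": {"text": "Annotation text", "radio": "val3", "checkbox": ["val2", "val4"]},
    },
}

# Annotation as proposed: user input is stored directly as features.
flat = {
    "type": "Document", "start": 0, "end": 10, "id": 0,
    "features": {"text": "Annotation text", "radio": "val3", "checkbox": ["val2", "val4"]},
}

# Current layout: every lookup has to go through the "label" dict first.
radio_nested = nested["features"]["label"]["radio"]

# Proposed layout: the value is an ordinary feature.
radio_flat = flat["features"]["radio"]

assert radio_nested == radio_flat
```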

Is there anything that would speak against doing it like that?

ianroberts · 2023-02-13T17:50:09Z

Can there be other items in the "features" that come from the original document rather than from the annotator?

johann-petrak · 2023-03-06T13:30:36Z

> Can there be other items in the "features" that come from the original document rather than from the annotator?

This is basically just the equivalent of GATE annotations in an annotation set, so if the gatenlp format gets imported, there might be sets with annotations with features already.

Not sure what the current plan/implementation is for dealing with such existing annotations. But since annotations are grouped in one set per annotator, it should be easy to avoid clashes.
It may make it even easier to avoid any clashes if all per-user annotation sets in TW follow some naming scheme like "user:xxxx" or "annotator:aaaaa" or similar.
Then TW can do its thing with regard to e.g. re-using existing annotation information from those sets, while unrelated annotations can be kept in separate sets.
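
A rough sketch of what such a naming scheme could look like. The prefix "annotator:" and both function names are hypothetical and only illustrate the idea; nothing here reflects how Teamware currently names its sets.

```python
# Hypothetical prefix marking annotation sets that Teamware itself created.
ANNOTATOR_PREFIX = "annotator:"


def annotator_set_name(username: str) -> str:
    """Build the (assumed) set name for one annotator."""
    return f"{ANNOTATOR_PREFIX}{username}"


def split_sets(annotation_sets: dict) -> tuple[dict, dict]:
    """Separate per-annotator sets from unrelated, pre-existing sets."""
    own = {n: s for n, s in annotation_sets.items() if n.startswith(ANNOTATOR_PREFIX)}
    other = {n: s for n, s in annotation_sets.items() if not n.startswith(ANNOTATOR_PREFIX)}
    return own, other


# Usage: set contents are left empty here, only the names matter.
sets = {annotator_set_name("alice"): {}, "Original markups": {}}
own, other = split_sets(sets)
```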

But this is perhaps a different topic - I just thought it may be more convenient to avoid the additional layer and avoid having a map as the value of a single feature.

johann-petrak · 2023-04-06T08:19:48Z

Is there a plan to include this in an upcoming release soon?
This is a change that will not be backwards compatible, so I think the earlier data format issues are fixed, the better.

davidwilby · 2023-04-12T12:44:52Z

Hi @johann-petrak, can you help me understand please? Is this an issue with Teamware's implementation of the GATE annotation format or a problem with that format itself?

Regarding when features will be completed, issues are prioritised at regular meetings and you can see the priority order on Teamware's project board.

johann-petrak · 2023-04-13T09:44:00Z

Hmmm, sorry, maybe I do not understand things properly. I think I should have another look at the documentation first, to understand how the information gathered in the annotator GUI is organized.

What I was originally on about is that all the information interesting to me is grouped into a single feature called "label" which is dictionary-valued in the example while I was expecting the content of the dictionary to be directly represented as features.

Where does that name "label" come from, in other words, what is the intended name and value of the feature(s) teamware creates? Is this documented somewhere?

ianroberts · 2023-04-21T11:37:21Z

So we just want to change

gate-teamware/backend/models.py

Lines 1001 to 1003 in 427821e

    "features": {
        "label": a_data
    }

to

    "features": a_data

right?
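
As a sketch of the effect of that change, assuming `a_data` is the dict of user-entered values as in the snippet above: the helper name `annotation_features` and the `nest_under_label` flag are hypothetical and do not exist in `models.py`.

```python
def annotation_features(a_data: dict, nest_under_label: bool = False) -> dict:
    """Return the "features" value for one exported annotation.

    nest_under_label=True reproduces the current behaviour,
    False gives the flattened form discussed in this issue.
    """
    if nest_under_label:
        return {"label": a_data}
    # Copy so that the exported dict does not alias the stored data.
    return dict(a_data)


a_data = {"text": "Annotation text", "radio": "val3", "checkbox": ["val2", "val4"]}

current = {"type": "Document", "start": 0, "end": 10, "id": 0,
           "features": annotation_features(a_data, nest_under_label=True)}
proposed = {"type": "Document", "start": 0, "end": 10, "id": 0,
            "features": annotation_features(a_data)}
```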

davidwilby · 2023-04-21T11:37:42Z

Notes from meeting 21/4/23:

- There is no standard GATE annotation JSON format; GATE (Java) and Python GATENLP do different things: https://gatenlp.github.io/python-gatenlp/formats
- bdocjs format
  - https://gatenlp.github.io/gateplugin-Format_Bdoc/bdoc_document.html
  - this is what is currently called 'gate' format in Teamware

To do:

- flatten out the label feature to make the JSON export bdoc-compatible by default
- ensure this does not compromise the CSV export - TK: not a problem because of internal representation
- correct the documentation for gate and raw: https://gatenlp.github.io/gate-teamware/development/manageradminguide/documents_annotations_management.html#exporting-documents
- check that the exported formats are correct; apparently they are the same when output
- avoid overwriting existing annotation sets on export from Teamware

davidwilby · 2023-04-21T11:39:26Z

> right?

We might just want to check this function as well.

gate-teamware/backend/rpc.py

Line 675 in 427821e

    def get_annotations(request, project_id):

twinkarma · 2023-04-21T11:44:25Z

Check that if you upload the BDOC format

{
  "name": 32,
  "text": "Document text",
  "features": {
    "text2": "Document text 2",
    "feature1": "Feature text"
  },
  "offset_type":"p",
  "annotation_sets": {...}
}

the output when exporting as the GATE format is still:

{
  "name": 32,
  "text": "Document text",
  "features": {
    "text2": "Document text 2",
    "feature1": "Feature text"
  },
  "offset_type":"p",
  "annotation_sets": {...}
}

and NOT nested again e.g.:

{
  "name": 32,
  "text": "Document text",
  "features": {
    "features": {
      "text2": "Document text 2",
      "feature1": "Feature text"
    },
    "offset_type":"p"
  },
  "annotation_sets": {...}
}
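
A minimal sketch of how this check could be automated. `is_renested` is a hypothetical helper, and the two dicts stand in for the uploaded document and whatever the GATE export returns; the real export call is not shown because its signature is not given here.

```python
def is_renested(uploaded: dict, exported: dict) -> bool:
    """True if the export wrapped the uploaded features in another "features" key."""
    outer = exported.get("features")
    if not isinstance(outer, dict):
        return False
    inner = outer.get("features")
    return inner is not None and inner == uploaded.get("features")


uploaded = {
    "name": 32,
    "text": "Document text",
    "features": {"text2": "Document text 2", "feature1": "Feature text"},
    "offset_type": "p",
    "annotation_sets": {},
}

# What a correct export should look like: identical to the upload.
good_export = dict(uploaded)

# The unwanted result: the original features nested one level deeper.
bad_export = {
    "name": 32,
    "text": "Document text",
    "features": {"features": uploaded["features"], "offset_type": "p"},
    "annotation_sets": {},
}

assert not is_renested(uploaded, good_export)
assert is_renested(uploaded, bad_export)
```

Note that a document with a genuine feature called "features" holding a different value would not be flagged, since the helper only reports an exact copy of the uploaded features.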

twinkarma · 2023-04-21T11:46:54Z

Also, if someone uploads documents with an existing "annotation_sets" field, annotations done in Teamware should be appended to the dictionary rather than just overwriting it.
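
Something along these lines, as a sketch only. `merge_annotation_sets` is a hypothetical name, and raising on a name clash is an assumption; how clashes should actually be handled has not been decided in this thread.

```python
def merge_annotation_sets(existing: dict, teamware_sets: dict) -> dict:
    """Add Teamware's annotation sets to the uploaded ones without replacing any."""
    merged = dict(existing)  # keep the uploaded sets untouched
    for name, annotation_set in teamware_sets.items():
        if name in merged:
            # Assumption: refuse to overwrite rather than silently replace.
            raise ValueError(f"annotation set {name!r} already exists in the document")
        merged[name] = annotation_set
    return merged


# Usage: set contents are left empty, only the set names matter here.
uploaded_sets = {"Original markups": {}}
teamware_sets = {"annotator1": {}}
merged = merge_annotation_sets(uploaded_sets, teamware_sets)
```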

twinkarma · 2023-04-21T21:37:22Z

I've split the tasks into three issues: #346, #347 and #348.

johann-petrak added the feature request label Feb 13, 2023

twinkarma mentioned this issue Apr 21, 2023

Remove unnecessary nesting when exporting annotation features #347

Open

5 tasks
