Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding relations #158

Open
wants to merge 24 commits into
base: main
Choose a base branch
from
Open

adding relations #158

wants to merge 24 commits into from

Conversation

kyleclo
Copy link
Collaborator

@kyleclo kyleclo commented Oct 11, 2022

This PR extends this library functionality substantially -- Adding a new Annotation type called Relation. A Relation is a link between 2 annotations (e.g. a Citation linked to its Bib Entry). The input Annotations are called key and value.

A few things needed to change to support Relations:

Annotation Names

Relations store references to Annotation objects. But we didn't want Relation.to_json() to also .to_json() those objects. We only want to store minimal identifiers of the key and value. Something short like bib_entry-5 or sentence-13. We call these short strings names.

To do this, we added to Annotation class, an optional attribute called field: str which stores this name. It's automatically populated when you run Document.annotate(new_field = list_of_annotations); each of those input annotations will have the new field name stored under .field.

We also added a method name() that returns the name of a particular Annotation object that is unique at the document-level. Names are a minimal class that basically stores .field and .id.

In short, now after you annotate a Document with annotations, you can do stuff like:

doc.tokens[15].name   ==   AnnotationName(field='tokens', id=15)
str(annotation_name)  ==   'tokens-15'
AnnotationName.from_str('tokens-15')  ==  AnnotationName(field='tokens', id=15)

Lookups based on names

To support reconstructing a Relation object given the names of key and value, we need the ability to lookup those involved Annotations. We introduce a new method to enable this:

annotation_name = AnnotationName.from_str('paragraphs-99')
a = document.locate_annotation( annotation_name )   -->  returns the specific Annotation object
assert a.id == 99
assert a.field == 'paragraphs'

to and from JSON

Finally, we need some way to serializing to JSON and reconstructing from JSON. For serialization, now that we have Names, this makes the JSON quite minimal:

{'key': <name_of_key>, 'value': <name_of_value>, ...other stuff that all Annotation objects have,  like Metadata...}

Reconstructing a Relation from JSON is more tricky because it's meaningless without a Document object. The Document object must also store the specific Annotations correctly so we can correctly perform the lookup based on these Names.

The API for this is similar, but you must also pass in the Document object:

relation = Relation.from_json(my_relation_dict, my_document_containing_necessary_fields)

@kyleclo kyleclo requested review from soldni, cmwilhelm and rauthur and removed request for soldni and rauthur October 14, 2022 01:07
Copy link
Member

@soldni soldni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall design seems good to me! I don't quite understand why we need AnnotationName classes though. What does the extra overhead of this class get us?

Comment on lines 110 to 111
# TODO[kylel] - when does this ever get called? infinite loop?
return self.__getattribute__(field)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This SO answer has a good explainer. I don't think you want to keep this here--- __getattribute__ is called before __getattr__.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -142,47 +178,38 @@ def __deepcopy__(self, memo):

@property
def type(self) -> str:
logging.warning(msg='`.type` to be deprecated in future versions. Use `.metadata.type`')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would not use the root logger. I would add

Logger = logger.getLogger(__file__)

after imports, and then call Logger.warning instead.

Comment on lines 110 to 111
# TODO[kylel] - when does this ever get called? infinite loop?
return self.__getattribute__(field)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably can remove it; this stack overflow post has a good explanation, but tl;dr is that __getattribute__ supersedes __getattr__ if defined, so you will never be in a situation where __getattribute__ should be called after __getattr__.

I suspect that the function of this call here is to raise an error if field is not defined; if that is the case, I would explicitly raise an AttributeError instead with a more descriptive error message.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines 117 to 121
boxes: List[Box],
id: Optional[int] = None,
doc: Optional['Document'] = None,
field: Optional[str] = None,
metadata: Optional[Metadata] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR specifically, but I think we should document these arguments in a docstring.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we'll need that


return doc

def locate_annotation(self, name: AnnotationName) -> Annotation:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure on why we need AnnotationName. If a name is just a combo of relation name + id, can we just expect a tuple of the two here? or require two args? Overall, having a class to just hold annotation names would just result in computational overhead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment below

@kyleclo kyleclo changed the title [wip] adding relations adding relations Oct 21, 2022
@kyleclo
Copy link
Collaborator Author

kyleclo commented Oct 21, 2022

@soldni

The overall design seems good to me! I don't quite understand why we need AnnotationName classes though. What does the extra overhead of this class get us?

Without the class, we would need to code somewhere how IDs are constructed in the library. For now, it's field_name - integer_id, but it's possible in the future this will need to be extended.

As well, we need some way to parse this ID for use in lookup of that specific element within a Document. I don't want field, id = obj.split('-') everywhere throughout the code as it gets hard to maintain in case we ever change something. The class allows us to have methods .field and .id for use here.

def from_str(cls, s: str) -> 'AnnotationName':
field, id = s.split('-')
id = int(id)
return AnnotationName(field=field, id=id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return AnnotationName(field=field, id=id)
return cls(field=field, id=id)

Prevents issues when inheriting (if we ever decide to).

@@ -34,6 +35,22 @@ def warn_deepcopy_of_annotation(obj: "Annotation") -> None:
warnings.warn(msg, UserWarning, stacklevel=2)


class AnnotationName:
"""Stores a name that uniquely identifies this Annotation within a Document"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__slots__ = ("field", "id")

Speeds up class creation by roughly 20%:

>>> %timeit AnnotationName('relation', 1)
140 ns ± 0.391 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

# after adding slots

>>> %timeit AnnotationName('relation', 1)
115 ns ± 0.163 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@soldni
Copy link
Member

soldni commented Oct 21, 2022

@soldni

The overall design seems good to me! I don't quite understand why we need AnnotationName classes though. What does the extra overhead of this class get us?

Without the class, we would need to code somewhere how IDs are constructed in the library. For now, it's field_name - integer_id, but it's possible in the future this will need to be extended.

As well, we need some way to parse this ID for use in lookup of that specific element within a Document. I don't want field, id = obj.split('-') everywhere throughout the code as it gets hard to maintain in case we ever change something. The class allows us to have methods .field and .id for use here.

@kyleclo Sounds good! added two small suggestions to improve it, but otherwise ok to merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants