Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Unify names of generic relations to media classes #4599

Closed

Conversation

sarayourfriend
Copy link
Collaborator

Fixes

Fixes #4553 by @sarayourfriend

Description

This is still WIP because I need to test a bit more thoroughly. So far things appear to work with manual testing, but I haven't run unit tests, and I suspect there are edge cases in the model admins that I have missed.

Heads up through, @dhruvkb, if you want to take a look at the approach I've taken. It's a bit abstract, and I could definitely see wanting to use a more concrete, straightforward, declarative approach that just manually creates the back references onto the media classes.

The approach overrides the metaclass for Django models with one specifically for models related to a media type. The abstract models reference that metaclass, along with a related name, and then subclasses are processed to get assigned to the media class as {related_name}_class, and the related name from the abstract class is passed down to the Meta of the final models (which Django doesn't do on its own, only a small set of fields are inherited from Meta by default).

I initially tried this with __init_subclasses__ but you can kind of see what that results in with the admin models and forms... but a bit more unwieldy because of the related name. It ended up being a lot of specific, fiddly implementation details on each of the related abstract classes, which the metaclass was able to tidy up into a single implementation pretty nicely. Of course, it is a bit magical, so if we want to just, say, hand-maintain a nested dictionary of these relations, kind of like the media_type_config fixture in the tests (which could be replaced or reduced by some of this work), and prefer that for its straightforwardness (if being slightly more clunky and requiring more manual configuration of relation names) then I can probably be convinced.

My goal was to prevent any SQL migrations, and indeed, sqlmigrate shows the current migration as a complete noop.

Anyway, let me know what you think, Dhruv, if you get a chance to look at this while it's still a draft. Very curious to see what you'll think of this introspective/meta-programming based approach, and whether you can think of an easier way to do it or think a more concrete and less meta approach would be better.

Testing Instructions

TBD.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (./ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (./ov just catalog/generate-docs media-props
    for the catalog or ./ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Copy link

github-actions bot commented Jul 4, 2024

This PR has migrations. Please rebase it before merging to ensure that conflicting migrations are not introduced.

@openverse-bot openverse-bot added 🧱 stack: api Related to the Django API migrations Modifications to Django migrations 🟨 priority: medium Not blocking but should be addressed soon 🧰 goal: internal improvement Improvement that benefits maintainers, not users 💻 aspect: code Concerns the software code in the repository labels Jul 4, 2024
Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is quite novel and interesting, and the final code undoubtedly looks much cleaner.

However, I feel like metaprogramming is a bit of a huge sledgehammer to wield against what was a relatively small nut of being able to connect media models with each other.

Using common related names like sensitive_media and report (instead of media-specific ones like sensitive_image and audio_report) already simplifies the code massively by removing almost all cases where we needed to handle media types separately.

What I'm trying to say is that I don't feel like the metaprogramming part adds much to this PR and the simplifications would remain even if that particular aspect of this PR was dropped.

I am making these statements based on a quick pass through the PR changes and so it is possible that I am not realising the full importance of the metaprogramming aspect. So I will definitely look into the PR more closely soon. But I wanted to leave my first impressions so that you can also have time to explore other directions if you are in an exploratory mood in the coming days.

@sarayourfriend
Copy link
Collaborator Author

Yeah, totally agree metaprogramming is a pretty intense approach for this. I suppose the motivation of using metaprogramming to solve this, as opposed to explicitly configuring the related names on each real model (and additional explicit __init_subclass__ or other type methods) was to make the wiring of these happen automatically, so new media related models (including other subclasses for new future media types) behave identically without needing to repeat the wiring everywhere.

The alternatives I considered were:

  • as mentioned, __init_subclass__ on each of the related media models. It is still metaprogramming, just maybe metaprogramming "lite", and required manually assigning related names on fields, because __init_subclass__ happens after the Django metaclass initialisation, and so after _meta is constructed, and critically after contribute_to_class is called on each of the fields... meaning __init_subclass__ cannot be the source of default_related_name on the subclasses' Meta (it doesn't happen early enough in the lifecycle). So ultimately this required a lot of manual configuration still, for every abstract related model and for each of the subclasses (which needed Meta.default_related_name defined individually or to manually inherit Meta from the parent class).
  • The same as above, but instead of __init_subclass__ on each related model, create a caching @property \ @classmethod on AbstractMedia for each of the related media models, which would just call <Related media model>.__subclassess__() and find the one for the media type of the abstract media subclass (and then cache it so the runtime cost of accessing the classes this way is amortised). As with __init_subclass__, this still requires manually configuring the related_names on the related models fields or Meta class.

What I like about the metaclass approach is that it fully centralises the configuration of the related name and the backreferences of the actual classes from the media models. so that subclasses of the related models don't have to do anything (other than define media_class, which most already did anyway). It removes the need to manually configure anything beyond the abstract classes, except for edge cases like the moderator reference, where the default related name clashed.

So, the metaclass approach looks pretty intense at face value, but it does, I think, ultimately simplify the configuration and make it less fragile or prone to random accidental variations (for example, when new related models are added).


(Here's an aside about our data model, inspired by thinking about the different approaches and where this complexit comes from).

To be honest though, I think it's worth us considering whether the per-media type segmentation of the models (and tables) is entirely useful, and whether there is a generic approach that would actually eliminate the need to repeat structures for each media type, especially when considering aspects of the data model like the complexities I touched on in #1290, where our current model is so focused on individual works, it doesn't allow for the actually complex relationships that sometimes exist. Audio having Audioset as a distinct audio-specific implementation is interesting, for example. We already have images in our catalogue that are parts of real, meaningful sets (which could, for example, be extremely useful for related media queries). Mostly these are scans of pages from the same books (Biodiversity Heritage Library being the source that immediately comes to mind), but there are other forms of images-in-series or collections... not to mention the idea of being able to create Openverse-user curated collections, which we've touched on frequently.

So where metaprogramming in this PR seems like it could be overly complex or abstract, it might also (or instead) point to a problem with our data model, which pre-encodes this level of abstraction and complexity that might neither be necessary or even real in a fundamental sense, and for which meaningful differences between media types could be accounted in less complex ways.

Notably, other systems for modelling the metadata of individual works all work under the assumption of linked data built off core metadata shared by most works regardless of the kind. For example DublinCore and the more complex RDF (which both came up in #4430) work from the same set of basic terms shared by all works, and enabling much more flexible modeling and metadata storage than we currently have, demonstrated by Fedora Repository1 as an example. Our per-media-type approach is not necessarily the best way of doing things generally, and other systems that pre-date ours and which have been manually tested and applied to many millions of digital and analogue works do things a little differently.

Crucially broken in our system, for example, is how a multimedia work or collection of works would be described. We don't have any way of modelling audio that are related to images, for example, that wouldn't require creating a separate container for those works from containers of audio (which we have as audio set) and containers of images (which we do not have).

Anyway, that's not really important to the question of what the best way of solving this problem of generic relations between these Python classes... I just thought it was interesting to consider why these relations exist, whether they're necessary and whether they're even real (in the sense of, does this division actually exist when works are reduced down to an identifier pointing to a binary blob, with a few to many of potentially infinite number of data points associated). If we step back from the model we've implemented and especially if we look at how others have tried to solve this problem, our approach is rather distinct. Whether that's good, purposeful, or serves our use case in a meaningful way that is worth the trade-offs... good for us to think about!

Footnotes

  1. Unrelated to the Linux distro.

Comment on lines -280 to +279
related_name="deleted_audio",
related_name="deleted_media",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, these actually do not need to be defined anymore! I accidentally left these in when I was trying one of the other approaches I mentioned that still required manually configuring the related names. The metaclass approach intentionally removes the need for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the approach that I was actually preferring over the metaclass because it is much more explicit and clearer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't solve the back-references to the class though. You can go through the related set but if it's a one-to-one, there's no good way to do it, as far as I can tell, that won't cause an implicit query.

The complexity comes from trying to solve both problems:

  • Make it possible to use generic references to instance relations
  • Make it possible to use generic references to _model_s of those relations

Comment on lines +21 to +22
class OpenverseMediaModel(Protocol):
media_type: MediaType
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unused, left over from another experimental approach I tried (one that didn't leverage metaclass kwargs and instead generated a new metaclass per-abstract related media model, or used a decorator, both of which were a little ugly or didn't work).

Anyway, it's meant to be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave a comment in the code saying it is meant to be removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be removed from this PR. I'll do so when I get back around to making updates to this based on feedback for the overall approach.

Comment on lines +150 to +156
@property
def index(self) -> str:
return settings.MEDIA_INDEX_MAPPING[self.media_type]

@property
def filtered_index(self) -> str:
return f"{self.index}-filtered"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be cached classmethod properties.

@@ -275,17 +289,29 @@ def save(self, *args, **kwargs):
super().save(*args, **kwargs)


class AbstractMediaDecision(OpenLedgerModel):
# Do not subclass or it will cause an unnecessary migration due to the change in class name
def MediaModeratorForeignKey(related_name: str):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be a static method on the AbstractMediaDecision instead.

@krysal
Copy link
Member

krysal commented Aug 2, 2024

This is looking quite interesting. Similar to Dhruv's opinion, I like how much code this saves, and the resulting one looks clean. I also worry about the metaprogramming part, it diverges quite a lot from our current approach. If we can reach an intermediate stage that would be wonderful. Perhaps with your second alternative:

The same as above, but instead of init_subclass on each related model, create a caching @Property \ @classmethod on AbstractMedia for each of the related media models, which would just call .subclassess() and find the one for the media type of the abstract media subclass (and then cache it so the runtime cost of accessing the classes this way is amortised). As with init_subclass, this still requires manually configuring the related_names on the related models fields or Meta class.


So where metaprogramming in this PR seems like it could be overly complex or abstract, it might also (or instead) point to a problem with our data model, which pre-encodes this level of abstraction and complexity that might neither be necessary or even real in a fundamental sense, and for which meaningful differences between media types could be accounted in less complex ways.

Regarding data modeling, I find our current approach still simple, as we currently have a few central tables. Actually, most of the complexities I usually find come more from maintaining a sort of mirror with the catalog DB. Of course, I'd be interested in exploring new ways to store/represent the data, but I don't see this as a problem for our current use case but with expanding the use cases that we could potentially include.

Crucially broken in our system, for example, is how a multimedia work or collection of works would be described. We don't have any way of modelling audio that are related to images, for example, that wouldn't require creating a separate container for those works from containers of audio (which we have as audio set) and containers of images (which we do not have).

The key is that the current models weren't exactly planned for collections, so it's understandable that they cannot be fit currently. It's important to note this because relational models are designed with very specific requirements so they can work better. It's not set in stone, of course, but trying to change it one way or another has several implications.

If we step back from the model we've implemented and especially if we look at how others have tried to solve this problem, our approach is rather distinct.

I'd like to know more about these cases you mentioned so I can understand them better! And see how they relate to what we are trying to resolve. I agree we can reevaluate all of these, but we also need to define the requirements/constraints beforehand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users migrations Modifications to Django migrations 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API
Projects
Archived in project
4 participants