-
Notifications
You must be signed in to change notification settings - Fork 29
feat(vcs): new data model #192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: vcs-staging
Are you sure you want to change the base?
feat(vcs): new data model #192
Conversation
* Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the `github_...` tables to `vcs_...`, adding a new column to the relevant tables to identify which provider the records relate to, and more. * Added an Alembic migration, including moving the repository data from `oauthclient_remoteaccount` to the `vcs_repositories` table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The difference between whether to use the automated migration or the manual one will be clarified in the docs. * Added a repo-user m-to-m mapping table. By not storing repos in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs. * This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful. * This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the `vcs-staging` branch and then merge them all into `master` once we have a fully release-ready prototype. At that point, we will create a squash commit.
9f1e07b
to
449f41d
Compare
79ba5a6
to
fc8faf7
Compare
fc8faf7
to
66c42c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@palkerecsenyi I dont see something worrying but it is also difficult without having the tests and the functional usage of the model... We might need to revisit that across the next PRs. @slint any major thing you see?
"provider_id", | ||
name="uq_vcs_repositories_provider_provider_id", | ||
), | ||
# Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I commented this because I wasn't 100% sure about the indexes/uniques. I'm fairly certain I've arranged them correctly for the new models but I'm not super experienced with these so I'm not sure.
"""Which VCS provider the repository is hosted by (and therefore the context in which to consider the provider_id)""" | ||
|
||
description = db.Column(db.String(10000), nullable=True) | ||
html_url = db.Column(db.String(10000), nullable=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are exported from the extra_data column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, previously the extra_data
column was of the format:
{
"last_sync":"2025-10-15T12:30:01.027133+00:00",
"repos": {
"123": {
"id": "123",
"full_name": "org/repo",
"description": "An example repository",
"default_branch": "main"
}
}
}
description
was already stored so this is just simply moving that.
The html_url
was calculated in the Jinja templates but given the diversity of how URLs can work between providers (and even within providers with different configs) I decided to make them be stored explicitly.
|
||
description = db.Column(db.String(10000), nullable=True) | ||
html_url = db.Column(db.String(10000), nullable=False) | ||
license_spdx = db.Column(db.String(255), nullable=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a totally new value or was it fetched also in the past?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was previously extracted from the repository payload stored in the webhook event during the RDM-level metadata extraction. It felt more natural to me to keep it in the model itself as it's an inherent property of the repo rather than being specific to a release event, and it didn't make sense to only have it available on the RDM level while other repo metadata (description, default branch, etc.) is available on the invenio-vcs
level.
Yes indeed it's quite an annoying way to review sadly. If it helps, you can see the non-fragmented diff of all the code on the For example the |
Closes #188
Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the
github_...
tables tovcs_...
, adding a new column to the relevant tables to identify which provider the records relate to, and more.Added an Alembic migration, including moving the repository data from
oauthclient_remoteaccount
to thevcs_repositories
table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The difference between whether to use the automated migration or the manual one will be clarified in the docs.Edit: see here for the upgrade guide for large instances.
We can improve the performance of this migration when perf(models): change
extra_data
toJSONB
invenio-oauthclient#360 is merged (assuming users run the migration in that PR before this one). But that's not essential.Added a repo-user m-to-m mapping table. By not storing repos in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs.
This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful.
This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the
vcs-staging
branch and then merge them all intomaster
once we have a fully release-ready prototype. At that point, we will create a squash commit.