feat(vcs): new data model #192

palkerecsenyi · 2025-09-25T10:10:42Z

Closes #188

* Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the `github_...` tables to `vcs_...`, adding a new column to the relevant tables to identify which provider the records relate to, and more.
* Added an Alembic migration, including moving the repository data from `oauthclient_remoteaccount` to the `vcs_repositories` table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The docs will clarify when to use the automated migration and when to use the manual one.
  - Edit: see here for the upgrade guide for large instances.
  - We can improve the performance of this migration once "perf(models): change extra_data to JSONB" (invenio-oauthclient#360) is merged, assuming users run the migration in that PR before this one. But that's not essential.
* Added a repo-user m-to-m mapping table. Because repos are no longer stored in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs.
* This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful.
* This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the `vcs-staging` branch and then merge them all into `master` once we have a fully release-ready prototype. At that point, we will create a squash commit.
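To make the repo-user mapping concrete, here is a minimal sketch of how such a many-to-many table could look. It uses plain `sqlite3` so it runs on its own; the real models are SQLAlchemy classes. Apart from `vcs_repositories`, `provider` and `provider_id`, every table and column name here (including `vcs_repository_users`) is a hypothetical stand-in, not taken from this PR.

```python
import sqlite3

# Hypothetical schema: only `vcs_repositories`, `provider` and `provider_id`
# are named in the PR; everything else is an illustrative assumption.
SCHEMA = """
CREATE TABLE vcs_repositories (
    id INTEGER PRIMARY KEY,
    provider TEXT NOT NULL,      -- which VCS provider hosts the repository
    provider_id TEXT NOT NULL,   -- repository ID within that provider
    full_name TEXT NOT NULL,
    UNIQUE (provider, provider_id)
);
CREATE TABLE vcs_repository_users (
    repository_id INTEGER NOT NULL REFERENCES vcs_repositories (id),
    user_id INTEGER NOT NULL,
    PRIMARY KEY (repository_id, user_id)
);
"""


def build_demo():
    """Create an in-memory database with two repos and two users."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO vcs_repositories (id, provider, provider_id, full_name) "
        "VALUES (?, ?, ?, ?)",
        [
            # The same provider_id can exist once per provider.
            (1, "github", "123", "org/repo"),
            (2, "gitlab", "123", "org/repo"),
        ],
    )
    conn.executemany(
        "INSERT INTO vcs_repository_users (repository_id, user_id) VALUES (?, ?)",
        [(1, 1), (2, 1), (1, 2)],
    )
    return conn


def repos_for_user(conn, user_id):
    """Return (provider, full_name) for every repo the user can access."""
    rows = conn.execute(
        """
        SELECT r.provider, r.full_name
        FROM vcs_repositories AS r
        JOIN vcs_repository_users AS m ON m.repository_id = r.id
        WHERE m.user_id = ?
        ORDER BY r.provider, r.full_name
        """,
        (user_id,),
    )
    return rows.fetchall()
```

With this layout, access is resolved by a join on the mapping table rather than by reading a per-account JSON blob.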


zzacharo

@palkerecsenyi I don't see anything worrying, but it is also difficult to judge without the tests and the functional usage of the model... We might need to revisit this in the next PRs. @slint, do you see any major issues?

zzacharo · 2025-10-15T12:41:31Z

invenio_vcs/models.py

+            "provider_id",
+            name="uq_vcs_repositories_provider_provider_id",
+        ),
+        # Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"),


I think I commented this out because I wasn't 100% sure about the indexes/uniques. I'm fairly certain I've arranged them correctly for the new models, but I don't have much experience with these, so I'm not sure.
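As a sanity check on the index question: a unique constraint is normally enforced through an index that the database creates by itself (PostgreSQL documents this for unique constraints, and SQLite behaves similarly), so a separate `Index` on the same columns would most likely be redundant. The sketch below shows this with `sqlite3`. It assumes the constraint covers `(provider, provider_id)`, as its name suggests, and the rest of the table is simplified.

```python
import sqlite3


def unique_constraint_demo():
    """Show that a UNIQUE constraint brings its own index and rejects duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE vcs_repositories (
            id INTEGER PRIMARY KEY,
            provider TEXT NOT NULL,
            provider_id TEXT NOT NULL,
            CONSTRAINT uq_vcs_repositories_provider_provider_id
                UNIQUE (provider, provider_id)
        )
        """
    )
    # index_list rows are (seq, name, unique, origin, partial).
    auto_indexes = [
        row[1]
        for row in conn.execute("PRAGMA index_list(vcs_repositories)")
        if row[2] == 1
    ]

    insert = "INSERT INTO vcs_repositories (provider, provider_id) VALUES (?, ?)"
    conn.execute(insert, ("github", "123"))
    conn.execute(insert, ("gitlab", "123"))  # same ID, other provider: allowed
    try:
        conn.execute(insert, ("github", "123"))  # exact duplicate: rejected
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    return auto_indexes, duplicate_rejected
```

An explicit index would still be worth adding for lookups on other column combinations, which the unique constraint does not cover.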


zzacharo · 2025-10-15T13:04:50Z

invenio_vcs/models.py

+    """Which VCS provider the repository is hosted by (and therefore the context in which to consider the provider_id)"""
+
+    description = db.Column(db.String(10000), nullable=True)
+    html_url = db.Column(db.String(10000), nullable=False)


Are these extracted from the `extra_data` column?

Yes, previously the extra_data column was of the format:

{
  "last_sync": "2025-10-15T12:30:01.027133+00:00",
  "repos": {
    "123": {
      "id": "123",
      "full_name": "org/repo",
      "description": "An example repository",
      "default_branch": "main"
    }
  }
}

`description` was already stored there, so this simply moves it.

The `html_url` was previously calculated in the Jinja templates, but given the diversity of how URLs can work between providers (and even within providers with different configs), I decided to store it explicitly.
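To illustrate the shape of that move, here is a rough sketch of flattening one legacy `extra_data` payload into row-like dicts. It is not the Alembic migration from this PR. The function name, the `html_url_template` parameter, and the `full_name` and `default_branch` output keys are assumptions for illustration; only `description`, `html_url`, `provider` and `provider_id` appear in the diff and discussion.

```python
import json


def repos_from_extra_data(extra_data, provider, html_url_template):
    """Flatten a legacy extra_data payload into one dict per repository.

    html_url_template is a hypothetical per-provider format string, since
    the URL is now stored explicitly instead of being built in templates.
    """
    rows = []
    for repo_id, repo in extra_data.get("repos", {}).items():
        rows.append(
            {
                "provider": provider,
                "provider_id": str(repo_id),
                "full_name": repo["full_name"],
                "description": repo.get("description"),
                "default_branch": repo.get("default_branch"),
                "html_url": html_url_template.format(full_name=repo["full_name"]),
            }
        )
    return rows


LEGACY = json.loads(
    """
    {
      "last_sync": "2025-10-15T12:30:01.027133+00:00",
      "repos": {
        "123": {
          "id": "123",
          "full_name": "org/repo",
          "description": "An example repository",
          "default_branch": "main"
        }
      }
    }
    """
)

ROWS = repos_from_extra_data(LEGACY, "github", "https://github.com/{full_name}")
```

Passing the URL pattern in as a parameter reflects the point above: each provider, or each configured instance of a provider, can supply its own.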

zzacharo · 2025-10-15T13:07:01Z

invenio_vcs/models.py

+
+    description = db.Column(db.String(10000), nullable=True)
+    html_url = db.Column(db.String(10000), nullable=False)
+    license_spdx = db.Column(db.String(255), nullable=True)


Is this a totally new value, or was it also fetched in the past?

This was previously extracted from the repository payload stored in the webhook event during the RDM-level metadata extraction. It felt more natural to me to keep it in the model itself as it's an inherent property of the repo rather than being specific to a release event, and it didn't make sense to only have it available on the RDM level while other repo metadata (description, default branch, etc.) is available on the invenio-vcs level.
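For reference, reading the SPDX identifier from a repository payload could look roughly like the sketch below. It assumes a GitHub-style repository object whose `license` field is either null or an object with an `spdx_id` key; other providers will differ, and the function name is hypothetical.

```python
def license_spdx_from_payload(repository):
    """Return the SPDX identifier from a GitHub-style repository payload.

    Returns None when no license is set or when the provider could not
    identify it (GitHub reports that case as "NOASSERTION").
    """
    license_info = repository.get("license") or {}
    spdx_id = license_info.get("spdx_id")
    if not spdx_id or spdx_id == "NOASSERTION":
        return None
    return spdx_id
```

Returning `None` for the unknown cases matches the nullable `license_spdx` column in the diff above.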

palkerecsenyi · 2025-10-15T13:12:09Z

@palkerecsenyi I don't see anything worrying, but it is also difficult to judge without the tests and the functional usage of the model... We might need to revisit this in the next PRs. @slint, do you see any major issues?

Yes indeed it's quite an annoying way to review sadly. If it helps, you can see the non-fragmented diff of all the code on the master branch of my fork which is kept up-to-date with the fragmented PRs.

For example the models.py file: master...palkerecsenyi:invenio-vcs:master#diff-a232ee65b447a8d90fbac12501761c411764f3570061d1b18e3e8181668fcc39

palkerecsenyi changed the title ~~WIP: feat(vcs): new data model~~ feat(vcs): new data model Sep 25, 2025

palkerecsenyi force-pushed the data-layer branch from 9f1e07b to 449f41d Compare September 25, 2025 10:11

palkerecsenyi mentioned this pull request Aug 15, 2025

Make invenio-github support other VCS providers #188


palkerecsenyi linked an issue Sep 25, 2025 that may be closed by this pull request

Make invenio-github support other VCS providers #188


palkerecsenyi force-pushed the data-layer branch from 79ba5a6 to fc8faf7 Compare October 9, 2025 09:10

chore: pydoc

66c42c0

palkerecsenyi force-pushed the data-layer branch from fc8faf7 to 66c42c0 Compare October 9, 2025 15:57

zzacharo reviewed Oct 15, 2025


WIP: models: JSONB for errors column

bf91a21

WIP: chore: license

24cfce3
