Skip to content

Conversation

palkerecsenyi
Copy link
Member

@palkerecsenyi palkerecsenyi commented Sep 25, 2025

Closes #188


  • Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the github_... tables to vcs_..., adding a new column to the relevant tables to identify which provider the records relate to, and more.

  • Added an Alembic migration, including moving the repository data from oauthclient_remoteaccount to the vcs_repositories table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The difference between whether to use the automated migration or the manual one will be clarified in the docs.

  • Added a repo-user m-to-m mapping table. By not storing repos in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs.

  • This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful.

  • This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the vcs-staging branch and then merge them all into master once we have a fully release-ready prototype. At that point, we will create a squash commit.

@palkerecsenyi palkerecsenyi changed the title WIP: feat(vcs): new data model feat(vcs): new data model Sep 25, 2025
* Updated the data model to accommodate the new generic approach to VCS
integration. This involves renaming the `github_...` tables to
`vcs_...`, adding a new column to the relevant tables to identify which
provider the records relate to, and more.

* Added an Alembic migration, including moving the repository data from
`oauthclient_remoteaccount` to the `vcs_repositories` table, which is a
complex and long-running operation. This will be supplemented by a
manual migration guide for instances like Zenodo where a several-minute
full DB lock is not feasible. The difference between whether to use the
automated migration or the manual one will be clarified in the docs.

* Added a repo-user m-to-m mapping table. By not storing repos in the
Remote Accounts table, we need a different way of associating users with
the repos they have access to. This table is synced using code that will
be included in other PRs.

* This PR contains only the data model changes themselves and not the
associated functional changes needed to do anything useful.

* This commit on its own is UNRELEASABLE. We will merge multiple commits
related to the VCS upgrade into the `vcs-staging` branch and then merge
them all into `master` once we have a fully release-ready prototype. At
that point, we will create a squash commit.
Copy link
Member

@zzacharo zzacharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@palkerecsenyi I dont see something worrying but it is also difficult without having the tests and the functional usage of the model... We might need to revisit that across the next PRs. @slint any major thing you see?

"provider_id",
name="uq_vcs_repositories_provider_provider_id",
),
# Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I commented this because I wasn't 100% sure about the indexes/uniques. I'm fairly certain I've arranged them correctly for the new models but I'm not super experienced with these so I'm not sure.

"""Which VCS provider the repository is hosted by (and therefore the context in which to consider the provider_id)"""

description = db.Column(db.String(10000), nullable=True)
html_url = db.Column(db.String(10000), nullable=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are exported from the extra_data column?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, previously the extra_data column was of the format:

{
  "last_sync":"2025-10-15T12:30:01.027133+00:00",
  "repos": {
    "123": {
      "id": "123",
      "full_name": "org/repo",
      "description": "An example repository",
      "default_branch": "main"
    }
  }
}

description was already stored so this is just simply moving that.

The html_url was calculated in the Jinja templates but given the diversity of how URLs can work between providers (and even within providers with different configs) I decided to make them be stored explicitly.


description = db.Column(db.String(10000), nullable=True)
html_url = db.Column(db.String(10000), nullable=False)
license_spdx = db.Column(db.String(255), nullable=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a totally new value or was it fetched also in the past?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously extracted from the repository payload stored in the webhook event during the RDM-level metadata extraction. It felt more natural to me to keep it in the model itself as it's an inherent property of the repo rather than being specific to a release event, and it didn't make sense to only have it available on the RDM level while other repo metadata (description, default branch, etc.) is available on the invenio-vcs level.

@palkerecsenyi
Copy link
Member Author

@palkerecsenyi I dont see something worrying but it is also difficult without having the tests and the functional usage of the model... We might need to revisit that across the next PRs. @slint any major thing you see?

Yes indeed it's quite an annoying way to review sadly. If it helps, you can see the non-fragmented diff of all the code on the master branch of my fork which is kept up-to-date with the fragmented PRs.

For example the models.py file: master...palkerecsenyi:invenio-vcs:master#diff-a232ee65b447a8d90fbac12501761c411764f3570061d1b18e3e8181668fcc39

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make invenio-github support other VCS providers

2 participants