-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: POC of a new Catalog Github Module #4
Comments
Progressively opting into GitHub processing,
|
If you don't need discovery, then you would use this, locations:
- type: github-organization
location: https://github.com/thefrontside
- type: github-organization
location: http://github.com/microstates |
Processor & Provider
Questions
TODOProviders
Processors
|
Motivation
The current @backstage/plugin-catalog-backend-module-github is a mix of processors that evolved gradually because existing processors didn't satisfy all of the use cases. The result is a mishmash of functionality. It takes a non-trivial effort to figure out what each processor does and its limitations. As a result, each organization integrating with Github creates its version of GitHub processors. Instead, we want to have a consistent, predictable, and flexible plugin.
In this issue, I will define requirements for a POC for a new Github Plugin. We will use this POC to create an RFC in Backstage to introduce a more robust Github integration for Backstage.
Detailed Design
The new plugin will use architecture principles and a new naming convention.
Architecture Principles
A location and its URL is a root of a processing pipeline
Backstage catalog's ingestion pipeline aggregates and relates information from external systems. Backstage is responsible for processing data from a growing number of external integrations. As the number of integrations grows, so does the latency in the ingestion pipeline. An efficient ingestion pipeline aims to keep data up to date with as little latency as possible. To keep the processing latency down, the developers writing processors must design their processors to allow Backstage to optimize the processing. Backstage can optimize processing with caching and parallelization. Caching in Backstage processors is scoped to a location. Likewise, paralyzation is performed by concurrently processing locations. To reduce latency in the ingestion pipeline, developers must ensure that their processors can cache and paralyze processing based on a location. One sure way to increase the performance of your ingestion pipeline is by designing your ingestion to utilize locations.
Consider the following use case: we want to ingest all of the repositories of a Github Organization and show who's contributing to these repositories. We could write a processor that fetched a list of all repositories for the organization, iterated over returned repositories, and fetched all contributors for each repository. We would then emit each repository, relationship between repository and users, followed by inverse relationships to mark what repositories a user is contributing to.
This is a lot of work that needs to happen in a single processing job. If we encounter an error, the entire job can fail. If we handle the error gracefully, the entire job will get delayed. To improve the performance and resilience of this job, we can break it up into multiple smaller jobs by emitting a location for each repository.
The result is new locations in the catalog that can be paralyzed by the processing engine and processing of each location can be cached.
Naming Conventions
Discovery processors emit locations
Locations being such an important part of an efficient processing pipeline, it's important that we highlight where locations are created. Having a dedicated processor for emitting locations makes that very clear. The convention that I'm proposing is to designate the Discovery prefix to mean processors that emit locations. For example,
GithubOrganizationDiscoveryProcessor
would emit Github Organization locations. Likewise,GithubRepositoryDiscoveryProcessor
would emit repositories that are owned by the organization or user.Relevant Links
The text was updated successfully, but these errors were encountered: