WIP: POC of a new Catalog Github Module #4

Open
taras opened this issue Mar 23, 2022 · 3 comments


@taras
Member

taras commented Mar 23, 2022

Motivation

The current @backstage/plugin-catalog-backend-module-github is a mix of processors that evolved gradually because existing processors didn't satisfy all of the use cases. The result is a mishmash of functionality, and it takes non-trivial effort to figure out what each processor does and what its limitations are. As a result, each organization integrating with GitHub creates its own version of the GitHub processors. Instead, we want a consistent, predictable, and flexible plugin.

In this issue, I will define the requirements for a POC of a new GitHub plugin. We will use this POC to create an RFC in Backstage that introduces a more robust GitHub integration.

Detailed Design

The new plugin will use architecture principles and a new naming convention.

Architecture Principles

A location and its URL are the root of a processing pipeline

The Backstage catalog's ingestion pipeline aggregates and relates information from external systems. Backstage is responsible for processing data from a growing number of external integrations, and as the number of integrations grows, so does the latency in the ingestion pipeline. An efficient ingestion pipeline keeps data up to date with as little latency as possible. To keep processing latency down, developers must design their processors so that Backstage can optimize the processing, which it does through caching and parallelization. Caching in Backstage processors is scoped to a location. Likewise, parallelization is performed by processing locations concurrently. Processors that divide their work by location can therefore be both cached and parallelized, so designing your ingestion around locations is a reliable way to improve its performance.

Consider the following use case: we want to ingest all of the repositories of a GitHub organization and show who is contributing to them. We could write a processor that fetches a list of all repositories for the organization, iterates over the returned repositories, and fetches all contributors for each repository. We would then emit each repository, the relationship between the repository and its users, and the inverse relationships that mark which repositories a user is contributing to.

[Diagram: single processing job]

This is a lot of work that needs to happen in a single processing job. If we encounter an error, the entire job can fail. If we handle the error gracefully, the entire job will get delayed. To improve the performance and resilience of this job, we can break it up into multiple smaller jobs by emitting a location for each repository.

[Diagram: many processing jobs]

The result is a set of new locations in the catalog that the processing engine can process in parallel, and the processing of each location can be cached.
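
A minimal sketch of the split described above, using simplified shapes rather than the real Backstage processor API. The names `RepoLocation`, `emitRepositoryLocations`, `processRepositoryLocation`, and the injected `fetchContributors` function are hypothetical and exist only for illustration; the repository and contributor names are made up.

```typescript
// Simplified, hypothetical shapes; not the real Backstage processor API.
type RepoLocation = { type: string; url: string };
type RepoJobResult = { url: string; ok: boolean; contributors: string[] };

// Stand-in for a GitHub API call that lists a repository's contributors.
type FetchContributors = (repoUrl: string) => string[];

// The organization job does one cheap thing: emit a location per repository.
function emitRepositoryLocations(orgUrl: string, repoNames: string[]): RepoLocation[] {
  return repoNames.map((repoName) => ({
    type: "github-repository",
    url: `${orgUrl}/${repoName}`,
  }));
}

// Each repository location is processed as its own small job, so an error
// affects only that location instead of failing the whole organization.
function processRepositoryLocation(
  location: RepoLocation,
  fetchContributors: FetchContributors,
): RepoJobResult {
  try {
    return { url: location.url, ok: true, contributors: fetchContributors(location.url) };
  } catch {
    return { url: location.url, ok: false, contributors: [] };
  }
}

// Usage with made-up data: the second repository fails, the first still succeeds.
const repoLocations = emitRepositoryLocations("https://github.com/thefrontside", [
  "repo-a",
  "repo-b",
]);
const repoResults = repoLocations.map((location) =>
  processRepositoryLocation(location, (repoUrl) => {
    if (repoUrl.endsWith("/repo-b")) throw new Error("simulated API failure");
    return ["alice"];
  }),
);
```

The sketch runs the per-repository jobs one after another for simplicity. The point is only that each job depends on a single location, which is what would let an engine run them concurrently and cache them independently.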

Naming Conventions

Discovery processors emit locations

Because locations are such an important part of an efficient processing pipeline, we should highlight where locations are created. Having a dedicated processor for emitting locations makes that very clear. The convention I'm proposing is to reserve Discovery in a processor's name for processors that emit locations. For example, GithubOrganizationDiscoveryProcessor would emit GitHub organization locations. Likewise, GithubRepositoryDiscoveryProcessor would emit locations for repositories that are owned by the organization or user.

Relevant Links

@taras
Member Author

taras commented Mar 31, 2022

Progressively opting into GitHub processing:

  1. Start by adding your GitHub instance to your app-config.yaml:
    locations:
      - type: github-organization-discovery
        location: https://github.com
    You added a location, but nothing is ingested yet because you didn't add the processor to the pipeline.
  2. Open catalog.ts and add GithubOrganizationDiscoveryProcessor to your pipeline.
  3. GithubOrganizationDiscoveryProcessor matches on type: github-organization-discovery and uses the location to retrieve all organizations, emitting a location for each one (type: github-organization, location: https://github.com/{organization_name}).
    The discovery processor is now emitting locations for organizations, but no entities are emitted for these locations because the entity processor has not been added to the pipeline.
  4. To include each organization from GitHub in Backstage's catalog, add GithubOrganizationLocationProcessor to your pipeline.
  5. For each organization URL, GithubOrganizationLocationProcessor is called. It matches on type: github-organization and uses the URL to emit an entity of kind GithubOrganization. Now a GithubOrganization entity is emitted for each organization.
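
The steps above can be sketched as a tiny in-memory pipeline. This is an illustration under stated assumptions, not the Backstage catalog engine: `LocationSpec`, `LocationProcessor`, `runPipeline`, and `listOrganizations` are hypothetical names, and the list of organizations is hard-coded where a real processor would call the GitHub API.

```typescript
// Simplified, hypothetical shapes; not the real Backstage catalog types.
type LocationSpec = { type: string; url: string };
type Entity = { kind: string; name: string };
type Emitted = { location: LocationSpec } | { entity: Entity };

// A processor returns undefined when the location type is not its concern.
type LocationProcessor = (location: LocationSpec) => Emitted[] | undefined;

// Stand-in for asking a GitHub instance for its organizations.
const listOrganizations = (_instanceUrl: string): string[] => ["thefrontside", "microstates"];

// Step 3: matches the discovery type and emits one location per organization.
const organizationDiscovery: LocationProcessor = (location) =>
  location.type === "github-organization-discovery"
    ? listOrganizations(location.url).map((orgName) => ({
        location: { type: "github-organization", url: `${location.url}/${orgName}` },
      }))
    : undefined;

// Step 5: matches organization locations and emits an entity for each.
const organizationLocation: LocationProcessor = (location) =>
  location.type === "github-organization"
    ? [{ entity: { kind: "GithubOrganization", name: location.url.split("/").pop() ?? "" } }]
    : undefined;

// Feeds every emitted location back into the queue until nothing is left.
function runPipeline(roots: LocationSpec[], processors: LocationProcessor[]) {
  const queue = [...roots];
  const locations: LocationSpec[] = [];
  const entities: Entity[] = [];
  while (queue.length > 0) {
    const current = queue.shift() as LocationSpec;
    for (const processor of processors) {
      for (const item of processor(current) ?? []) {
        if ("location" in item) {
          locations.push(item.location);
          queue.push(item.location);
        } else {
          entities.push(item.entity);
        }
      }
    }
  }
  return { locations, entities };
}

const roots: LocationSpec[] = [{ type: "github-organization-discovery", url: "https://github.com" }];

// Steps 2-3: discovery alone yields locations but no entities.
const discoveryOnly = runPipeline(roots, [organizationDiscovery]);

// Steps 4-5: adding the location processor yields an entity per organization.
const withEntities = runPipeline(roots, [organizationDiscovery, organizationLocation]);
```

Each processor only reacts to the location type it owns, which is what makes the opt-in progressive: leaving a processor out of the list simply leaves its locations unprocessed.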

@taras
Member Author

taras commented Mar 31, 2022

If you don't need discovery, you would use this instead:

locations:
  - type: github-organization
    location: https://github.com/thefrontside
  - type: github-organization
    location: http://github.com/microstates

@minkimcello
Contributor

minkimcello commented Apr 13, 2022

Processor & Provider

  • It seems the ingestion pipeline isn't being replaced by entity providers. Rather, the recommendation is that people stop using discovery processors as if they were providers and start using entity providers instead. As the docs already show, entity providers are not a new thing; they're just not being used, right?

  • An entity provider should do two things:

    1. Update the database per webhook event
      • This can be done in the catalog plugin:
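        // configure webhook on github.com
        // use URL backstage.frontside.services/api/catalog/github/webhook
        router.post("/github/webhook", async (req, res) => {
          // isValidSignature is a hypothetical helper that checks GitHub's
          // X-Hub-Signature-256 header against webhook_secret
          if (!isValidSignature(req, webhook_secret)) {
            res.sendStatus(403);
            return;
          }
          // forward request to smee.io for local development
          // GitHub sends the event name in the X-GitHub-Event header;
          // `issue` and `org` are placeholders for the parsed payload
          const event = req.get("X-GitHub-Event");
          if (event === "issues" && req.body.action === "opened") {
            await applyMutation(issue);
          }
          if (event === "organization" && req.body.action === "member_added") {
            await applyMutation(org);
          }
          res.sendStatus(200);
        });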
        • Webhooks can be configured on organizations, repositories, and GitHub Apps
          • Webhooks configured at the organization or app level apply to all child repositories, so developers do not need to update each repository's webhook configuration individually for it to work with the provider
    2. At the time of connection, it should do a full crawl of organizations and update the database
      • It should continue to do the full crawl on a schedule, but much less frequently (like once a day), in case it misses any webhook events
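
A minimal sketch of the two responsibilities listed above, loosely modeled on the catalog's entity provider connection with its full and delta mutations. The local types are simplified assumptions and synchronous for brevity (a real connection is asynchronous), and `GithubProviderSketch`, `crawl`, `onWebhook`, and the in-memory store are hypothetical names used only for illustration.

```typescript
// Simplified local types; synchronous for brevity.
type CatalogEntity = { kind: string; name: string };
type ProviderMutation =
  | { type: "full"; entities: CatalogEntity[] }
  | { type: "delta"; added: CatalogEntity[]; removed: CatalogEntity[] };
type ProviderConnection = { applyMutation(mutation: ProviderMutation): void };

class GithubProviderSketch {
  private connection: ProviderConnection | undefined;
  private readonly crawl: () => CatalogEntity[];

  // `crawl` is a hypothetical stand-in for fetching everything from GitHub.
  constructor(crawl: () => CatalogEntity[]) {
    this.crawl = crawl;
  }

  // (2) On connection, do a full crawl and replace what the provider owns.
  connect(connection: ProviderConnection): void {
    this.connection = connection;
    this.fullCrawl();
  }

  // Also meant to run on a slow schedule (for example once a day)
  // to pick up anything a missed webhook event left behind.
  fullCrawl(): void {
    this.connection?.applyMutation({ type: "full", entities: this.crawl() });
  }

  // (1) Apply a small delta for each webhook event.
  onWebhook(change: "added" | "removed", entity: CatalogEntity): void {
    this.connection?.applyMutation(
      change === "added"
        ? { type: "delta", added: [entity], removed: [] }
        : { type: "delta", added: [], removed: [entity] },
    );
  }
}

// In-memory stand-in for the catalog database.
const catalogStore = new Map<string, CatalogEntity>();
const inMemoryConnection: ProviderConnection = {
  applyMutation(mutation) {
    if (mutation.type === "full") {
      catalogStore.clear();
      mutation.entities.forEach((entity) => catalogStore.set(entity.name, entity));
    } else {
      mutation.removed.forEach((entity) => catalogStore.delete(entity.name));
      mutation.added.forEach((entity) => catalogStore.set(entity.name, entity));
    }
  },
};

const provider = new GithubProviderSketch(() => [
  { kind: "GithubOrganization", name: "thefrontside" },
]);
provider.connect(inMemoryConnection);
provider.onWebhook("added", { kind: "GithubOrganization", name: "microstates" });
```

The full crawl replaces everything the provider owns, so it also acts as the safety net: any state that drifted because of a missed webhook is corrected the next time it runs.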

Questions

  • Should the provider log the webhook settings of the supplied integration? Or, if possible, should we make specific webhook settings a requirement? Figure out later.

  • In big organizations, we're going to have a large number of webhook events triggering the provider. How can we put them into a queue to avoid conflicts?

  • Where do we draw the line between providers and processors? Should the provider be the "gateway" to the internet? And processors process entities emitted from the database and other processors? Yes

    Like this:
    [Image: Untitled Diagram]
    Whereas at the moment, it's like this:
    [Image: Screen Shot 2022-04-13 at 8 58 52 AM]

  • When a processor emits an entity, is it written to the database? Or does only the provider, through its mutation function, write to the database? They all get put into the database, but in different tables
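
For the queueing question above, one possible approach (an assumption for discussion, not something decided in this thread) is to buffer incoming webhook events, keep only the latest event per entity, and apply the buffer one event at a time. `WebhookEventQueue`, `QueuedWebhookEvent`, and the `key` and `action` fields are hypothetical.

```typescript
// Hypothetical event shape; `key` identifies the affected entity.
type QueuedWebhookEvent = { key: string; action: "upsert" | "remove" };

class WebhookEventQueue {
  // A Map preserves insertion order, so draining is deterministic.
  private pending = new Map<string, QueuedWebhookEvent>();

  // A newer event for the same entity replaces the older one.
  enqueue(event: QueuedWebhookEvent): void {
    this.pending.delete(event.key);
    this.pending.set(event.key, event);
  }

  // Apply events one at a time so two mutations never race each other.
  // Returns how many events were applied.
  drain(apply: (event: QueuedWebhookEvent) => void): number {
    const batch = Array.from(this.pending.values());
    this.pending.clear();
    batch.forEach((event) => apply(event));
    return batch.length;
  }
}

// Usage with made-up events: the later "remove" supersedes the earlier "upsert".
const webhookQueue = new WebhookEventQueue();
webhookQueue.enqueue({ key: "repo-a", action: "upsert" });
webhookQueue.enqueue({ key: "repo-b", action: "upsert" });
webhookQueue.enqueue({ key: "repo-a", action: "remove" });

const appliedEvents: string[] = [];
const drainedCount = webhookQueue.drain((event) => {
  appliedEvents.push(`${event.key}:${event.action}`);
});
```

Coalescing by key trades event history for fewer writes. If every intermediate event matters, a plain first-in, first-out list drained the same way would be the alternative.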

TODO

Providers

  • Create GithubMultiOrgEntityProvider

    • Takes the GitHub integration from app-config (PAT or app)
    • The current GitHub integration needs to be modified so that it doesn't require an organization in the URL
  • Create GithubOrgEntityProvider

    GithubOrgEntityProvider({ orgUrl: "https://github.com/thefrontside" })
    • Expects org to be provided

Processors

  • GithubOrganizationProcessor

    • emits: orgs, teams, users, repos
  • GithubRepositoryProcessor

    • emits: issues, commits
