thefrontside / playhouse

Frontside's Backstage Portal

Home Page: https://backstage.frontside.services/

WIP: POC of a new Catalog Github Module

taras opened this issue

Motivation

The current @backstage/plugin-catalog-backend-module-github is a mix of processors that evolved gradually because the existing processors didn't satisfy all of the use cases. The result is a mishmash of functionality. It takes non-trivial effort to figure out what each processor does and what its limitations are. As a result, each organization integrating with GitHub creates its own version of the GitHub processors. Instead, we want a consistent, predictable, and flexible plugin.

In this issue, I will define the requirements for a POC of a new GitHub plugin. We will use this POC to create an RFC for introducing a more robust GitHub integration into Backstage.

Detailed Design

The new plugin will follow the architecture principles and naming conventions described below.

Architecture Principles

A location and its URL are the root of a processing pipeline

The Backstage catalog's ingestion pipeline aggregates and relates information from external systems. Backstage is responsible for processing data from a growing number of external integrations, and as the number of integrations grows, so does the latency of the ingestion pipeline. An efficient ingestion pipeline keeps data up to date with as little latency as possible. To keep processing latency down, developers writing processors must design them so that Backstage can optimize the processing. Backstage can optimize processing with caching and parallelization: caching in Backstage processors is scoped to a location, and parallelization is performed by processing locations concurrently. To reduce latency in the ingestion pipeline, developers must therefore ensure that their processors can cache and parallelize work per location. One sure way to increase the performance of your ingestion pipeline is to design your ingestion to utilize locations.

Consider the following use case: we want to ingest all of the repositories of a GitHub organization and show who's contributing to these repositories. We could write a processor that fetches a list of all repositories for the organization, iterates over the returned repositories, and fetches all contributors for each repository. We would then emit each repository, relationships between the repository and its users, followed by inverse relationships marking which repositories a user contributes to.

(diagram: single processing job)

This is a lot of work that needs to happen in a single processing job. If we encounter an error, the entire job can fail; even if we handle the error gracefully, the entire job gets delayed. To improve the performance and resilience of this job, we can break it up into multiple smaller jobs by emitting a location for each repository.

(diagram: many processing jobs)

The result is new locations in the catalog that the processing engine can parallelize, and the processing of each location can be cached.
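
Here is a rough sketch of what such a discovery processor could look like. It assumes the CatalogProcessor interface and the processingResult helpers from @backstage/plugin-catalog-backend; the listRepositories helper and the github-repository location type are illustrative assumptions rather than part of the proposal.

import {
  CatalogProcessor,
  CatalogProcessorEmit,
  processingResult,
} from '@backstage/plugin-catalog-backend';
import { LocationSpec } from '@backstage/catalog-model';

// Hypothetical helper that lists the repositories of an organization.
declare function listRepositories(orgUrl: string): Promise<{ url: string }[]>;

export class GithubRepositoryDiscoveryProcessor implements CatalogProcessor {
  getProcessorName(): string {
    return 'GithubRepositoryDiscoveryProcessor';
  }

  async readLocation(
    location: LocationSpec,
    _optional: boolean,
    emit: CatalogProcessorEmit,
  ): Promise<boolean> {
    // Only handle organization locations; other processors handle the rest.
    if (location.type !== 'github-organization') {
      return false;
    }

    // Instead of fetching contributors here, emit one location per repository.
    // Each emitted location becomes its own processing job that the engine can
    // cache and run in parallel.
    for (const repo of await listRepositories(location.target)) {
      emit(
        processingResult.location({
          type: 'github-repository', // assumed type name, not part of the proposal
          target: repo.url,
        }),
      );
    }
    return true;
  }
}

Fetching contributors for a single repository then happens in a separate processor that matches on github-repository locations, so a failure or slow response for one repository no longer delays the whole organization.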

Naming Conventions

Discovery processors emit locations

Because locations are such an important part of an efficient processing pipeline, it's important to highlight where they are created. Having a dedicated processor for emitting locations makes that very clear. The convention I'm proposing is to use Discovery in a processor's name to mean that it emits locations. For example, GithubOrganizationDiscoveryProcessor would emit GitHub organization locations. Likewise, GithubRepositoryDiscoveryProcessor would emit locations for the repositories owned by an organization or user.

Relevant Links

Progressively opting into GitHub processing:

  1. Start by adding your GitHub instance to your app-config.yaml
    locations:
      - type: github-organization-discovery
        location: https://github.com
    You added the location, but nothing is ingested because you haven't added the processor to the pipeline.
  2. Open catalog.ts and add GithubOrganizationDiscoveryProcessor to your pipeline (see the sketch after this list).
  3. GithubOrganizationDiscoveryProcessor matches on type: github-organization-discovery and uses the location to retrieve all organizations, emitting a location for each organization with type: github-organization, location: https://github.com/{organization_name}.
    The discovery processor is now emitting locations for organizations, but entities for these locations are not being emitted because an entity processor has not been added to the pipeline.
  4. To include each organization from GitHub in Backstage's catalog, add GithubOrganizationLocationProcessor to your pipeline.
  5. For each organization URL, GithubOrganizationLocationProcessor will be called; it matches on type: github-organization and uses the URL to emit an entity with kind GithubOrganization. Now you have an organization entity being emitted for each organization.
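
For step 2, registering the processors could look roughly like this sketch of packages/backend/src/plugins/catalog.ts. The CatalogBuilder API shown here already exists; the import path for the proposed processors and their fromConfig constructors are assumptions, since the processors themselves don't exist yet.

import { CatalogBuilder } from '@backstage/plugin-catalog-backend';
import { Router } from 'express';
import { PluginEnvironment } from '../types';
// Proposed processors from this issue; they do not exist yet.
import {
  GithubOrganizationDiscoveryProcessor,
  GithubOrganizationLocationProcessor,
} from '@backstage/plugin-catalog-backend-module-github';

export default async function createPlugin(
  env: PluginEnvironment,
): Promise<Router> {
  const builder = await CatalogBuilder.create(env);

  // Step 2: emit a github-organization location for each organization.
  builder.addProcessor(
    GithubOrganizationDiscoveryProcessor.fromConfig(env.config, {
      logger: env.logger,
    }),
  );

  // Step 4: turn each github-organization location into an entity.
  builder.addProcessor(
    GithubOrganizationLocationProcessor.fromConfig(env.config, {
      logger: env.logger,
    }),
  );

  const { processingEngine, router } = await builder.build();
  await processingEngine.start();
  return router;
}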

If you don't need discovery, then you would use this instead:

locations:
  - type: github-organization
    location: https://github.com/thefrontside
  - type: github-organization
    location: https://github.com/microstates

Processor & Provider

  • It seems the ingestion pipeline isn't being replaced by entity providers; rather, the recommendation is that people stop using discovery processors as providers and start utilizing entity providers instead. As the docs already show, entity providers are not a new thing - they're just not being used, right?

  • An entity provider should be doing two things:

    1. Update the database per webhook event
      • This can be done in the catalog plugin:
        // Sketch: configure a webhook on github.com that points at
        // https://backstage.frontside.services/api/catalog/github/webhook
        // (forward deliveries to smee.io for local development)
        router.post("/github/webhook", async (req, res) => {
          // Pseudocode: verify the webhook secret (in practice, validate the
          // X-Hub-Signature-256 header rather than comparing raw secrets)
          if (req.secret == webhook_secret) {
            // Pseudocode: dispatch on the event type (GitHub sends the event
            // name in the X-GitHub-Event header) and apply a catalog mutation
            if (req.body == "issue_created") {
              await applyMutation(issue);
            }
            if (req.body == "user_added_org") {
              await applyMutation(org);
            }
            res.sendStatus(200);
          } else {
            res.sendStatus(403);
          }
        });
        • Webhooks can be configured on organizations, repositories, and GitHub Apps
          • Webhooks configured at the organization/app level apply to all of their child repositories, so developers do not need to update each repository's webhook configuration individually for it to work with the provider
    2. At the time of connection, it should do a full crawl of organizations and update the database (a sketch of such a provider follows this list)
      • It should continue to do the full crawl on a schedule, but much less frequently (e.g. once a day), in case it misses any webhook events
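
A minimal sketch of such a provider, assuming Backstage's existing EntityProvider and EntityProviderConnection interfaces; fetchAllOrgEntities and onWebhookEvent are illustrative placeholders for the GitHub crawling and webhook wiring described above.

import {
  EntityProvider,
  EntityProviderConnection,
} from '@backstage/plugin-catalog-backend';
import { Entity } from '@backstage/catalog-model';

// Placeholder for the full GitHub crawl described above (orgs, teams, users, repos).
declare function fetchAllOrgEntities(orgUrl: string): Promise<Entity[]>;

export class GithubOrgEntityProvider implements EntityProvider {
  private connection?: EntityProviderConnection;

  constructor(private readonly options: { orgUrl: string }) {}

  getProviderName(): string {
    return `GithubOrgEntityProvider:${this.options.orgUrl}`;
  }

  // Called by the catalog when the provider is registered; a good place to
  // kick off the initial full crawl and to schedule the infrequent re-crawl.
  async connect(connection: EntityProviderConnection): Promise<void> {
    this.connection = connection;
    await this.fullCrawl();
  }

  // Full crawl: replace everything this provider owns in the database.
  async fullCrawl(): Promise<void> {
    const entities = await fetchAllOrgEntities(this.options.orgUrl);
    await this.connection?.applyMutation({
      type: 'full',
      entities: entities.map(entity => ({
        entity,
        locationKey: this.options.orgUrl,
      })),
    });
  }

  // The webhook route sketched earlier would call something like this with
  // just the entities affected by the event, instead of re-crawling everything.
  async onWebhookEvent(added: Entity[], removed: Entity[]): Promise<void> {
    await this.connection?.applyMutation({
      type: 'delta',
      added: added.map(entity => ({ entity, locationKey: this.options.orgUrl })),
      removed: removed.map(entity => ({ entity, locationKey: this.options.orgUrl })),
    });
  }
}

The 'full' mutation replaces everything the provider owns, which fits the scheduled crawl, while the 'delta' mutation matches the per-webhook-event updates from point 1.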

Questions

  • Should the provider log the webhook settings of the supplied integration? Or, if possible, should we make specific webhook settings a requirement? Figure out later.

  • In big organizations, we're going to have a large number of webhook events triggering the provider. How can we put them into a queue to avoid conflicts?

  • Where do we draw the line between providers and processors? Should the provider be the "gateway" to the internet? And processors process entities emitted from the database and other processors? Yes

    Like this: (diagram)
    Whereas at the moment, it's like this: (screenshot)

  • When a processor emits an entity, does it get written to the database? Or does only the provider, through its mutation function, write to the database? They all get put into the database, but into different tables.

TODO

Providers

  • Create GithubMultiOrgEntityProvider

    • Takes the GitHub integration from app-config (PAT or GitHub App)
    • The current GitHub integration needs to be modified so that it doesn't require an organization in the URL
  • Create GithubOrgEntityProvider

    GithubOrgEntityProvider({ orgUrl: "https://github.com/thefrontside" })
    • Expects org to be provided

Processors

  • GithubOrganizationProcessor

    • emits: orgs, teams, users, repos (see the sketch below)
  • GithubRepositoryProcessor

    • emits: issues, commits
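
To make the GithubOrganizationProcessor item above concrete, here is a sketch of emitting org data as entities rather than locations. It reuses the CatalogProcessor shape from the earlier sketches; the fetchTeams helper, the matched location type, and the Group mapping are assumptions for illustration only.

import {
  CatalogProcessor,
  CatalogProcessorEmit,
  processingResult,
} from '@backstage/plugin-catalog-backend';
import { LocationSpec } from '@backstage/catalog-model';

// Hypothetical helper that lists the teams of an organization.
declare function fetchTeams(orgUrl: string): Promise<{ slug: string }[]>;

export class GithubOrganizationProcessor implements CatalogProcessor {
  getProcessorName(): string {
    return 'GithubOrganizationProcessor';
  }

  async readLocation(
    location: LocationSpec,
    _optional: boolean,
    emit: CatalogProcessorEmit,
  ): Promise<boolean> {
    if (location.type !== 'github-organization') {
      return false;
    }

    // Emit a Group entity per team; users and repositories would follow the same pattern.
    for (const team of await fetchTeams(location.target)) {
      emit(
        processingResult.entity(location, {
          apiVersion: 'backstage.io/v1alpha1',
          kind: 'Group',
          metadata: { name: team.slug },
          spec: { type: 'team', children: [] },
        }),
      );
    }
    return true;
  }
}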