
GitHub

This page contains the setup guide and reference information for the GitHub source connector.

Prerequisites

  • List of GitHub Repositories (and access for them in case they are private)

For Airbyte Cloud:

For Airbyte Open Source:

Setup guide

Step 1: Set up GitHub

Create a GitHub Account.

Airbyte Open Source additional setup steps

Log into GitHub and then generate a personal access token. To load balance your API quota consumption across multiple API tokens, input multiple tokens separated by commas (,).
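As a rough illustration of how a comma-separated token list can spread quota consumption, the sketch below splits the input and hands tokens out in round-robin order. The helper names `parse_tokens` and `token_rotation` are hypothetical, and this is not the connector's own rotation logic, only a minimal sketch of the idea.

```python
from itertools import cycle


def parse_tokens(raw: str) -> list[str]:
    """Split a comma-separated token string, dropping blanks and whitespace."""
    return [token.strip() for token in raw.split(",") if token.strip()]


def token_rotation(raw: str):
    """Return an endless round-robin iterator over the configured tokens."""
    tokens = parse_tokens(raw)
    if not tokens:
        raise ValueError("at least one token is required")
    return cycle(tokens)


# Placeholder token values, for illustration only.
rotation = token_rotation("token_a, token_b,token_c")
first_four = [next(rotation) for _ in range(4)]
```

Each request would take the next token from the iterator, so with three tokens every token serves roughly a third of the calls.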

Step 2: Set up the GitHub connector in Airbyte

For Airbyte Cloud:

  1. Log into your Airbyte Cloud account.
  2. Click Sources and then click + New source.
  3. On the Set up the source page, select GitHub from the Source type dropdown.
  4. Enter a name for the GitHub connector.
  5. To authenticate:
  • For Airbyte Cloud: Authenticate your GitHub account to authorize the connection. Airbyte uses the GitHub account you are already logged in to, so make sure you are logged into the right account.

  • For Airbyte Open Source: Authenticate with a personal access token. Log into GitHub, generate a personal access token, and enter it in Airbyte. To load balance your API quota consumption across multiple API tokens, input multiple tokens separated by commas (,).

  6. GitHub Repositories - Enter a space-separated list of GitHub organizations/repositories, e.g. airbytehq/airbyte for a single repository or airbytehq/airbyte airbytehq/another-repo for multiple repositories. To receive data from all repositories in an organization, use a wildcard, e.g. airbytehq/*.
caution

Repositories that do not exist or whose names are in the wrong format will be skipped, with a WARN message in the logs.
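To check a repository list before saving the configuration, a format check along these lines may help. The regular expression and the helper `split_repositories` are assumptions made for illustration, not the connector's actual validation, and the check covers only the name format (it cannot tell whether a repository exists).

```python
import re

# Assumed pattern: "<organization>/<repository>" or "<organization>/*".
REPO_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+/(\*|[A-Za-z0-9_.-]+)$")


def split_repositories(entries):
    """Separate entries that match the expected format from those that do not."""
    valid, skipped = [], []
    for entry in entries:
        (valid if REPO_PATTERN.match(entry) else skipped).append(entry)
    return valid, skipped


valid, skipped = split_repositories(
    ["airbytehq/airbyte", "airbytehq/*", "airbytehq", "airbytehq/airbyte/extra"]
)
```

Here the first two entries land in `valid`, while the entry without a repository part and the entry with an extra path segment land in `skipped`.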

  7. Start date (Optional) - The date from which you'd like to replicate data. For streams that support this setting, only data generated on or after the start date will be replicated.
  • These streams will only sync records generated on or after the Start Date: comments, commit_comment_reactions, commit_comments, commits, deployments, events, issue_comment_reactions, issue_events, issue_milestones, issue_reactions, issues, project_cards, project_columns, projects, pull_request_comment_reactions, pull_requests, pull_requeststats, releases, review_comments, reviews, stargazers, workflow_runs, workflows.

  • The Start Date does not apply to the streams below and all data will be synced for these streams: assignees, branches, collaborators, issue_labels, organizations, pull_request_commits, pull_request_stats, repositories, tags, teams, users

  8. Branch (Optional) - A space-separated list of GitHub repository branches to pull commits from, e.g. airbytehq/airbyte/master airbytehq/airbyte/my-branch. If no branches are specified for a repository, the default branch will be pulled.
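The branch format above (organization/repository/branch) can be read as a mapping from each repository to its branches. The following is a minimal sketch of that interpretation; `group_branches` is a hypothetical helper, and treating everything after the second slash as the branch name is an assumption.

```python
from collections import defaultdict


def group_branches(entries):
    """Group "org/repo/branch" entries by their "org/repo" prefix."""
    grouped = defaultdict(list)
    for entry in entries:
        # Split at most twice so branch names containing "/" stay intact.
        parts = entry.split("/", 2)
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected org/repo/branch, got {entry!r}")
        org, repo, branch = parts
        grouped[f"{org}/{repo}"].append(branch)
    return dict(grouped)


branches = group_branches(
    ["airbytehq/airbyte/master", "airbytehq/airbyte/my-branch"]
)
```

A repository that does not appear in the resulting mapping would fall back to its default branch, as described above.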

For Airbyte Open Source:

  1. Navigate to the Airbyte Open Source dashboard. Click Sources and then click + New source.
  2. On the Set up the source page, select GitHub from the Source type dropdown.
  3. Enter a name for the GitHub connector.

Supported sync modes

The GitHub source connector supports the following sync modes:

Supported Streams

This connector outputs the following full refresh streams:

This connector outputs the following incremental streams:

Entity-Relationship Diagram (ERD)

Notes

  1. Only 4 of the streams listed above (comments, commits, issues and review comments) are pure incremental, meaning that they:

    • read only new records;
    • output only new records.
  2. The workflow_runs and workflow_jobs streams are almost pure incremental:

    • read new records and some portion of old records (from the past 30 days; see docs);
    • workflow_jobs depends on workflow_runs to read its data, so both follow the same logic (see docs);
    • output only new records.
  3. The other 19 incremental streams are also incremental, with one difference. They:

    • read all records;
    • output only new records. Keep this behaviour in mind when using those 19 incremental streams, because it may affect your API call limits.
  4. For large streams, specifying a start_date far in the past may cause GitHub to keep returning errors instead of records (a corresponding WARN message is logged). In this case, specifying a more recent start_date may help.

The "Start date" configuration option does not apply to the streams below, because the GitHub API does not include dates which can be used for filtering:

  • assignees
  • branches
  • collaborators
  • issue_labels
  • organizations
  • pull_request_commits
  • pull_request_stats
  • repositories
  • tags
  • teams
  • users
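The "read all records, output only new records" behaviour described in note 3 can be sketched as a client-side filter against saved state. This is an illustration under assumptions, not the connector's code: `emit_new_records` is a hypothetical helper, `updated_at` is an assumed cursor field, and timestamps are compared as ISO 8601 strings in a single format.

```python
def emit_new_records(records, cursor_field, state_value):
    """Scan every record, but output only those newer than the saved state."""
    emitted = []
    new_state = state_value
    for record in records:  # every record is still fetched, costing API calls
        value = record[cursor_field]
        # ISO 8601 UTC strings in one format compare correctly as plain strings.
        if value > state_value:
            emitted.append(record)
            new_state = max(new_state, value)
    return emitted, new_state


records = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-03-01T00:00:00Z"},
]
emitted, new_state = emit_new_records(records, "updated_at", "2024-02-01T00:00:00Z")
```

A pure incremental stream would instead ask the API for records after the saved state, so the older record would never be read at all. That difference is why the non-pure streams consume more of the rate limit.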

Limitations & Troubleshooting


Connector limitations

Rate limiting

You can use a personal access token to make API requests. Additionally, you can authorize a GitHub App or OAuth app, which can then make API requests on your behalf. All of these requests count towards your personal rate limit of 5,000 requests per hour (15,000 requests per hour if the app is owned by a GitHub Enterprise Cloud organization).

REST API and GraphQL API rate limits are counted separately.
tip

If limits are reached before all streams have been read, we recommend the following actions:

  1. Utilize Incremental sync mode.
  2. Set a higher sync interval.
  3. Divide the sync into separate connections with a smaller number of streams.

Refer to the GitHub article "Rate limits for the REST API".
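GitHub's REST API exposes a GET /rate_limit endpoint whose response reports the REST ("core") and GraphQL buckets separately under "resources". The sketch below works on a hand-written sample payload that mirrors that shape rather than on a live response. `seconds_until_reset` is a hypothetical helper, and the numbers are illustrative.

```python
def seconds_until_reset(payload: dict, resource: str, now: float) -> float:
    """Return how long to wait before the given bucket has quota again."""
    bucket = payload["resources"][resource]
    if bucket["remaining"] > 0:
        return 0.0
    # "reset" is a Unix timestamp (seconds) at which the bucket refills.
    return max(0.0, bucket["reset"] - now)


# Hand-written sample: REST quota exhausted, GraphQL quota still available.
sample_payload = {
    "resources": {
        "core": {"limit": 5000, "remaining": 0, "reset": 1_700_003_600},
        "graphql": {"limit": 5000, "remaining": 4200, "reset": 1_700_003_600},
    }
}
now = 1_700_000_000
rest_wait = seconds_until_reset(sample_payload, "core", now)
graphql_wait = seconds_until_reset(sample_payload, "graphql", now)
```

Because the two buckets are independent, a sync can be blocked on REST-based streams while GraphQL-based streams still have quota, as in this sample.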

Permissions and scopes

If you use the OAuth authentication method, the OAuth 2.0 application requests the following scopes: repo, read:org, read:repo_hook, read:user, read:discussion, read:project, workflow. For a personal access token, you need to select the required scopes manually.

Your token should have at least the repo scope. Depending on which streams you want to sync, the user generating the token needs more permissions:

  • For syncing Collaborators, the user who generates the personal access token must be a collaborator. To become a collaborator, they must be invited by an owner. If there are no collaborators, no records will be synced. Read more about access permissions here.
  • Syncing Teams is only available to authenticated members of a team's organization. Personal user accounts and repositories belonging to them don't have access to Teams features. In this case no records will be synced.
  • To sync the Projects stream, the repository must have the Projects feature enabled.
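For classic personal access tokens, GitHub returns the granted scopes in the X-OAuth-Scopes response header as a comma-separated list. A small check like the one below can compare that header against the scopes listed above. `missing_scopes` is a hypothetical helper, and the comparison is a plain set difference that does not account for broader scopes implying narrower ones.

```python
# Scopes requested by the OAuth application, as listed above.
OAUTH_SCOPES = [
    "repo",
    "read:org",
    "read:repo_hook",
    "read:user",
    "read:discussion",
    "read:project",
    "workflow",
]


def missing_scopes(granted_header: str, required: list[str]) -> list[str]:
    """Return the required scopes absent from an X-OAuth-Scopes header value."""
    granted = {scope.strip() for scope in granted_header.split(",") if scope.strip()}
    return [scope for scope in required if scope not in granted]


missing = missing_scopes("repo, read:org", OAUTH_SCOPES)
```

An empty result for `missing_scopes(header, ["repo"])` would indicate that the token meets the minimum requirement stated above.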

Troubleshooting

  • Check out common troubleshooting issues for the GitHub source connector on our Airbyte Forum

Reference

Config fields reference

Type | Property name
object | credentials
array<string> | repositories
string | repository
string | start_date
string | api_url
string | branch
array<string> | branches
integer | max_waiting_time
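The property names and types above can be put together into a configuration like the following sketch. All values are illustrative assumptions: the keys inside `credentials`, the date format, the API URL, and the waiting time are not taken from this page, and `type_errors` is a hypothetical helper that checks only top-level types.

```python
# Python types corresponding to the table above.
EXPECTED_TYPES = {
    "credentials": dict,
    "repositories": list,
    "repository": str,
    "start_date": str,
    "api_url": str,
    "branch": str,
    "branches": list,
    "max_waiting_time": int,
}


def type_errors(config: dict) -> list[str]:
    """List properties whose values do not match the documented type."""
    errors = []
    for name, expected in EXPECTED_TYPES.items():
        if name in config and not isinstance(config[name], expected):
            actual = type(config[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    return errors


example_config = {
    "credentials": {"personal_access_token": "<your-token>"},  # assumed key name
    "repositories": ["airbytehq/airbyte", "airbytehq/*"],
    "start_date": "2021-03-01T00:00:00Z",  # assumed date format
    "api_url": "https://api.github.com/",  # assumed value
    "branches": ["airbytehq/airbyte/master"],
    "max_waiting_time": 10,  # illustrative value
}
problems = type_errors(example_config)
```

Passing a single string where an array is expected, for example `{"repositories": "airbytehq/airbyte"}`, would be reported as a type mismatch.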

Changelog

Version | Date | Pull Request | Subject
1.8.11 | 2024-09-21 | 45742 | Update dependencies
1.8.10 | 2024-09-14 | 45557 | Update dependencies
1.8.9 | 2024-09-07 | 45320 | Update dependencies
1.8.8 | 2024-08-23 | 44592 | Fix state handling for stream WorkflowRuns
1.8.7 | 2024-08-31 | 45061 | Update dependencies
1.8.6 | 2024-08-24 | 44703 | Update dependencies
1.8.5 | 2024-08-17 | 44227 | Update dependencies
1.8.4 | 2024-08-12 | 43749 | Update dependencies
1.8.3 | 2024-08-10 | 42671 | Update dependencies
1.8.2 | 2024-08-20 | 42966 | Bump cdk version and enable RFR for all non-incremental streams
1.8.1 | 2024-07-20 | 42342 | Update dependencies
1.8.0 | 2024-07-16 | 41677 | Update to 3.4.0 CDK
1.7.13 | 2024-07-13 | 41746 | Update dependencies
1.7.12 | 2024-07-10 | 41354 | Update dependencies
1.7.11 | 2024-07-09 | 41221 | Update dependencies
1.7.10 | 2024-07-06 | 41000 | Update dependencies
1.7.9 | 2024-06-25 | 40289 | Update dependencies
1.7.8 | 2024-06-22 | 40128 | Update dependencies
1.7.7 | 2024-06-17 | 39513 | Update deprecated state handling method
1.7.6 | 2024-06-04 | 39078 | [autopull] Upgrade base image to v1.2.1
1.7.5 | 2024-05-29 | 38341 | Add max_waiting_time to configuration
1.7.4 | 2024-05-21 | 38341 | Update CDK authenticator package
1.7.3 | 2024-05-20 | 38299 | Fixed spec typo
1.7.2 | 2024-04-19 | 36636 | Updating to 0.80.0 CDK
1.7.1 | 2024-04-12 | 36636 | schema descriptions
1.7.0 | 2024-03-19 | 36267 | Pin airbyte-cdk version to ^0
1.6.5 | 2024-03-12 | 35986 | Handle rate limit exception as config error
1.6.4 | 2024-03-08 | 35915 | Fix per stream error handler; Make use the latest CDK version
1.6.3 | 2024-02-15 | 35271 | Update branches schema
1.6.2 | 2024-02-12 | 34933 | Update Airbyte CDK for integration tests
1.6.1 | 2024-02-09 | 35087 | Manage dependencies with Poetry.
1.6.0 | 2024-02-02 | 34700 | Continue Sync on Stream failure
1.5.7 | 2024-01-29 | 34598 | Fix MultipleToken sleep time
1.5.6 | 2024-01-26 | 34503 | Fix MultipleToken rotation logic
1.5.5 | 2023-12-26 | 33783 | Fix retry for 504 error in GraphQL based streams
1.5.4 | 2023-11-20 | 32679 | Return AirbyteMessage if max retry exeeded for 202 status code
1.5.3 | 2023-10-23 | 31702 | Base image migration: remove Dockerfile and use the python-connector-base image
1.5.2 | 2023-10-13 | 31386 | Handle ContributorActivity continuous ACCEPTED response
1.5.1 | 2023-10-12 | 31307 | Increase backoff_time for stream ContributorActivity
1.5.0 | 2023-10-11 | 31300 | Update Schemas: Add date-time format to fields
1.4.6 | 2023-10-04 | 31056 | Migrate spec properties' repository and branch type to <array>
1.4.5 | 2023-10-02 | 31023 | Increase backoff for stream Contributor Activity
1.4.4 | 2023-10-02 | 30971 | Mark start_date as optional.
1.4.3 | 2023-10-02 | 30979 | Fetch archived records in Project Cards
1.4.2 | 2023-09-30 | 30927 | Provide actionable user error messages
1.4.1 | 2023-09-30 | 30839 | Update CDK to Latest version
1.4.0 | 2023-09-29 | 30823 | Add new stream issue Timeline Events
1.3.1 | 2023-09-28 | 30824 | Handle empty response in stream ContributorActivity
1.3.0 | 2023-09-25 | 30731 | Add new stream ProjectsV2
1.2.1 | 2023-09-22 | 30693 | Handle 404 error in TeamMemberShips
1.2.0 | 2023-09-22 | 30647 | Add support for self-hosted GitHub instances
1.1.1 | 2023-09-21 | 30654 | Rewrite source connection error messages
1.1.0 | 2023-08-03 | 30615 | Add new stream Contributor Activity
1.0.4 | 2023-08-03 | 29031 | Reverted advancedAuth spec changes
1.0.3 | 2023-08-01 | 28910 | Updated advancedAuth broken references
1.0.2 | 2023-07-11 | 28144 | Add archived_at property to Organizations schema parameter
1.0.1 | 2023-05-22 | 25838 | Deprecate "page size" input parameter
1.0.0 | 2023-05-19 | 25778 | Improve repo(s) name validation on UI
0.5.0 | 2023-05-16 | 25793 | Implement client-side throttling of requests
0.4.11 | 2023-05-12 | 26025 | Added more transparent depiction of the personal access token expired
0.4.10 | 2023-05-15 | 26075 | Add more specific error message description for no repos case.
0.4.9 | 2023-05-01 | 24523 | Add undeclared columns to spec
0.4.8 | 2023-04-19 | 00000 | Fix repo name validation
0.4.7 | 2023-03-24 | 24457 | Add validation and transformation for repositories config
0.4.6 | 2023-03-24 | 24398 | Fix caching for get_starting_point in stream "Commits"
0.4.5 | 2023-03-23 | 24417 | Add pattern_descriptors to fields with an expected format
0.4.4 | 2023-03-17 | 24255 | Add field groups and titles to improve display of connector setup form
0.4.3 | 2023-03-04 | 22993 | Specified date formatting in specification
0.4.2 | 2023-03-03 | 23467 | Added user friendly messages, added AirbyteTracedException config_error, updated SAT
0.4.1 | 2023-01-27 | 22039 | Set AvailabilityStrategy for streams explicitly to None
0.4.0 | 2023-01-20 | 21457 | Use GraphQL for issue_reactions stream
0.3.12 | 2023-01-18 | 21481 | Handle 502 Bad Gateway error with proper log message
0.3.11 | 2023-01-06 | 21084 | Raise Error if no organizations or repos are available during read
0.3.10 | 2022-12-15 | 20523 | Revert changes from 0.3.9
0.3.9 | 2022-12-14 | 19978 | Update CDK dependency; move custom HTTPError handling into AvailabilityStrategy classes
0.3.8 | 2022-11-10 | 19299 | Fix events and workflow_runs datetimes
0.3.7 | 2022-10-20 | 18213 | Skip retry on HTTP 200
0.3.6 | 2022-10-11 | 17852 | Use default behaviour, retry on 429 and all 5XX errors
0.3.5 | 2022-10-07 | 17715 | Improve 502 handling for comments stream
0.3.4 | 2022-10-04 | 17555 | Skip repository if got HTTP 500 for WorkflowRuns stream
0.3.3 | 2022-09-28 | 17287 | Fix problem with "null" cursor_field for WorkflowJobs stream
0.3.2 | 2022-09-28 | 17304 | Migrate to per-stream state.
0.3.1 | 2022-09-21 | 16947 | Improve error logging when handling HTTP 500 error
0.3.0 | 2022-09-09 | 16534 | Add new stream WorkflowJobs
0.2.46 | 2022-08-17 | 15730 | Validate input organizations and repositories
0.2.45 | 2022-08-11 | 15420 | "User" object can be "null"
0.2.44 | 2022-08-01 | 14795 | Use GraphQL for pull_request_comment_reactions stream
0.2.43 | 2022-07-26 | 15049 | Bugfix schemas for streams deployments, workflow_runs, teams
0.2.42 | 2022-07-12 | 14613 | Improve schema for stream pull_request_commits added "null"
0.2.41 | 2022-07-03 | 14376 | Add Retry for GraphQL API Resource limitations
0.2.40 | 2022-07-01 | 14338 | Revert: "Rename field mergeable to is_mergeable"
0.2.39 | 2022-06-30 | 14274 | Rename field mergeable to is_mergeable
0.2.38 | 2022-06-27 | 13989 | Use GraphQL for reviews stream
0.2.37 | 2022-06-21 | 13955 | Fix "secondary rate limit" not retrying
0.2.36 | 2022-06-20 | 13926 | Break point added for workflows_runs stream
0.2.35 | 2022-06-16 | 13763 | Use GraphQL for pull_request_stats stream
0.2.34 | 2022-06-14 | 13707 | Fix API sorting, fix get_starting_point caching
0.2.33 | 2022-06-08 | 13558 | Enable caching only for parent streams
0.2.32 | 2022-06-07 | 13531 | Fix different result from get_starting_point when reading by pages
0.2.31 | 2022-05-24 | 13115 | Add incremental support for streams WorkflowRuns
0.2.30 | 2022-05-09 | 12294 | Add incremental support for streams CommitCommentReactions, IssueCommentReactions, IssueReactions, PullRequestCommentReactions, Repositories, Workflows
0.2.29 | 2022-05-04 | 12482 | Update input configuration copy
0.2.28 | 2022-04-21 | 11893 | Add new streams TeamMembers, TeamMemberships
0.2.27 | 2022-04-02 | 11678 | Fix "PAT Credentials" in spec
0.2.26 | 2022-03-31 | 11623 | Re-factored incremental sync for Reviews stream
0.2.25 | 2022-03-31 | 11567 | Improve code for better error handling
0.2.24 | 2022-03-30 | 9251 | Add Streams Workflow and WorkflowRuns
0.2.23 | 2022-03-17 | 11212 | Improve documentation and spec for Beta
0.2.22 | 2022-03-10 | 10878 | Fix error handling for unavailable streams with 404 status code
0.2.21 | 2022-03-04 | 10749 | Add new stream ProjectCards
0.2.20 | 2022-02-16 | 10385 | Add new stream Deployments, ProjectColumns, PullRequestCommits
0.2.19 | 2022-02-07 | 10211 | Add human-readable error in case of incorrect organization or repo name
0.2.18 | 2021-02-09 | 10193 | Add handling secondary rate limits
0.2.17 | 2021-02-02 | 9999 | Remove BAD_GATEWAY code from backoff_time
0.2.16 | 2021-02-02 | 9868 | Add log message for streams that are restricted for OAuth. Update oauth scopes.
0.2.15 | 2021-01-26 | 9802 | Add missing fields for auto_merge in pull request stream
0.2.14 | 2021-01-21 | 9664 | Add custom pagination size for large streams
0.2.13 | 2021-01-20 | 9619 | Fix logging for function should_retry
0.2.11 | 2021-01-17 | 9492 | Remove optional parameter Accept for reaction`s streams to fix error with 502 HTTP status code in response
0.2.10 | 2021-01-03 | 7250 | Use CDK caching and convert PR-related streams to incremental
0.2.9 | 2021-12-29 | 9179 | Use default retry delays on server error responses
0.2.8 | 2021-12-07 | 8524 | Update connector fields title/description
0.2.7 | 2021-12-06 | 8518 | Add connection retry with GitHub
0.2.6 | 2021-11-24 | 8030 | Support start date property for PullRequestStats and Reviews streams
0.2.5 | 2021-11-21 | 8170 | Fix slow check connection for organizations with a lot of repos
0.2.4 | 2021-11-11 | 7856 | Resolve $ref fields in some stream schemas
0.2.3 | 2021-10-06 | 6833 | Fix config backward compatability
0.2.2 | 2021-10-05 | 6761 | Add oauth worflow specification
0.2.1 | 2021-09-22 | 6223 | Add option to pull commits from user-specified branches
0.2.0 | 2021-09-19 | 5898 and 6227 | Don't minimize any output fields & add better error handling
0.1.11 | 2021-09-15 | 5949 | Add caching for all streams
0.1.10 | 2021-09-09 | 5860 | Add reaction streams
0.1.9 | 2021-09-02 | 5788 | Handling empty repository, check method using RepositoryStats stream
0.1.8 | 2021-09-01 | 5757 | Add more streams
0.1.7 | 2021-08-27 | 5696 | Handle negative backoff values
0.1.6 | 2021-08-18 | 5456 | Add MultipleTokenAuthenticator
0.1.5 | 2021-08-18 | 5456 | Fix set up validation
0.1.4 | 2021-08-13 | 5136 | Support syncing multiple repositories/organizations
0.1.3 | 2021-08-03 | 5156 | Extended existing schemas with users property for certain streams
0.1.2 | 2021-07-13 | 4708 | Fix bug with IssueEvents stream and add handling for rate limiting
0.1.1 | 2021-07-07 | 4590 | Fix schema in the pull_request stream
0.1.0 | 2021-07-06 | 4174 | New Source: GitHub