Singer/Meltano: Add example github-to-cratedb
It uses the `meltano-target-cratedb` Singer component.
https://github.com/crate-workbench/meltano-target-cratedb
amotl committed Dec 8, 2023
1 parent c9a59ec commit ed998d6
Showing 12 changed files with 503 additions and 1 deletion.
72 changes: 72 additions & 0 deletions .github/workflows/test-singer-meltano.yml
@@ -0,0 +1,72 @@
name: Singer/Meltano

on:
  pull_request:
    branches: ~
    paths:
    - '.github/workflows/test-singer-meltano.yml'
    - 'framework/singer-meltano/**'
    - 'requirements.txt'
  push:
    branches: [ main ]
    paths:
    - '.github/workflows/test-singer-meltano.yml'
    - 'framework/singer-meltano/**'
    - 'requirements.txt'

  # Allow job to be triggered manually.
  workflow_dispatch:

  # Run job each night after CrateDB nightly has been published.
  schedule:
    - cron: '0 3 * * *'

# Cancel in-progress jobs when pushing to the same branch.
concurrency:
  cancel-in-progress: true
  group: ${{ github.workflow }}-${{ github.ref }}

jobs:
  test:
    name: "
      Python: ${{ matrix.python-version }}
      CrateDB: ${{ matrix.cratedb-version }}
      on ${{ matrix.os }}"
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ 'ubuntu-latest' ]
        python-version: [ '3.10', '3.11' ]
        cratedb-version: [ 'nightly' ]

    services:
      cratedb:
        image: crate/crate:nightly
        ports:
          - 4200:4200
          - 5432:5432

    steps:

      - name: Acquire sources
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
          cache: 'pip'
          cache-dependency-path: |
            requirements.txt
            framework/singer-meltano/requirements.txt
            framework/singer-meltano/requirements-dev.txt
      - name: Install utilities
        run: |
          pip install -r requirements.txt
      - name: Validate framework/singer-meltano
        run: |
          ngr test --accept-no-venv framework/singer-meltano
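
The `strategy.matrix` block in this workflow produces one job per combination of its axes. The following Python sketch is not part of the commit. It only illustrates that expansion with the values from the workflow. The helper names `expand_matrix` and `job_name` are made up for illustration, and the rendered job name is an approximation of the multi-line `name` template.

```python
from itertools import product

# Matrix axes copied from the workflow definition.
MATRIX = {
    "os": ["ubuntu-latest"],
    "python-version": ["3.10", "3.11"],
    "cratedb-version": ["nightly"],
}


def expand_matrix(matrix):
    """Return one dict per job, i.e. the Cartesian product of all axes."""
    keys = list(matrix)
    combinations = product(*(matrix[key] for key in keys))
    return [dict(zip(keys, values)) for values in combinations]


def job_name(job):
    """Approximate the job name template, collapsed onto a single line."""
    return (
        f"Python: {job['python-version']} "
        f"CrateDB: {job['cratedb-version']} "
        f"on {job['os']}"
    )


for job in expand_matrix(MATRIX):
    print(job_name(job))
```

With the axes above, the expansion yields two jobs, one per Python version.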
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,8 +1,9 @@
.DS_Store
.idea
.env
.venv*
__pycache__
.coverage
coverage.xml
mlruns/
archive/
logs.log
2 changes: 2 additions & 0 deletions framework/singer-meltano/.gitignore
@@ -0,0 +1,2 @@
.meltano
output
45 changes: 45 additions & 0 deletions framework/singer-meltano/README.md
@@ -0,0 +1,45 @@
# Meltano Examples

Concise examples of working with [CrateDB] and [Meltano], for conceiving and
running flexible ELT tasks. All recipes use [meltano-target-cratedb] to load
data into CrateDB.

## What's inside

- `singerfile-to-cratedb`: Acquire data from a Singer file, and load it into
  a CrateDB database table.

- `github-to-cratedb`: Acquire repository metadata from the GitHub API, and load
  it, separated per entity, into 32 CrateDB database tables.

## Prerequisites

Before running any of the examples within the subdirectories, make sure to install
Meltano and its dependencies.

```shell
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Usage

Then, explore the individual Meltano projects. Invoke them either from within
their directories, or by using the `--cwd` option from the root folder.

```shell
meltano --cwd github-to-cratedb install
meltano --cwd github-to-cratedb run tap-github target-cratedb
```
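
When the same steps need to run from a script, the two commands can be wrapped in Python. This is a minimal sketch rather than part of the examples. The function names `meltano_command` and `run_project` are made up for illustration, and the sketch assumes the `meltano` executable is on the `PATH` of the active virtualenv.

```python
import shlex
import subprocess


def meltano_command(project_dir, *args):
    """Build a Meltano command line that targets a project via `--cwd`."""
    return ["meltano", "--cwd", project_dir, *args]


def run_project(project_dir, extractor, loader, dry_run=True):
    """Install plugins, then run an extractor/loader pair.

    With dry_run=True (the default), the commands are only printed.
    """
    commands = [
        meltano_command(project_dir, "install"),
        meltano_command(project_dir, "run", extractor, loader),
    ]
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)
    return commands


# Prints the two command lines without executing them.
run_project("github-to-cratedb", "tap-github", "target-cratedb")
```

Passing `dry_run=False` executes the commands and stops at the first failure.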

## Software Tests
```shell
pip install -r requirements-dev.txt
poe check
```


[CrateDB]: https://cratedb.com/product
[Meltano]: https://meltano.com/
[meltano-target-cratedb]: https://github.com/crate-workbench/meltano-target-cratedb
82 changes: 82 additions & 0 deletions framework/singer-meltano/github-to-cratedb/README.md
@@ -0,0 +1,82 @@
# Meltano GitHub -> CrateDB example

## About

Acquire repository metadata from the GitHub API, and insert it into CrateDB database
tables, using [meltano-target-cratedb].

It follows the canonical example demonstrated in the [Meltano Getting Started Tutorial].

## Configuration

### tap-github

For accessing the GitHub API, you will need an authentication token. It
can be acquired at [GitHub Developer Settings » Tokens].

To configure the recipe, store the token in the `TAP_GITHUB_AUTH_TOKEN`
environment variable, either interactively, or by creating a dotenv
configuration file `.env`.

```shell
TAP_GITHUB_AUTH_TOKEN='ghp_hmQR3XTFWkfIcuyjRTBuVrRt6mnL1j2mMPT8'
```

Then, in `meltano.yml`, identify the `tap-github` section in `plugins.extractors`,
and adjust the value of `config.repositories` to correspond to the repository
you intend to scrape.

### target-cratedb

Within the `target-cratedb` section of `plugins.loaders`, adjust `config.sqlalchemy_url` to
match your database connectivity settings.
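
For orientation, the value of `config.sqlalchemy_url` can be composed from individual connection settings. The following sketch uses only the Python standard library. The helper name `build_sqlalchemy_url`, the host `cratedb.example.org`, and the credentials are placeholders, and port 4200 is assumed because it is CrateDB's HTTP port. Percent-encoding the password is a common convention for SQLAlchemy URLs, so please verify it against your setup.

```python
from urllib.parse import quote, urlsplit


def build_sqlalchemy_url(host, user="crate", password=None, port=None):
    """Compose a `crate://` URL from individual connection settings."""
    credentials = quote(user, safe="")
    if password:
        # Percent-encode the password so characters like `@` or `/` stay unambiguous.
        credentials += ":" + quote(password, safe="")
    netloc = f"{credentials}@{host}"
    if port is not None:
        netloc += f":{port}"
    return f"crate://{netloc}/"


# Reproduces the default value used in `meltano.yml`.
print(build_sqlalchemy_url("localhost"))

# Hypothetical remote server; host name and credentials are placeholders.
url = build_sqlalchemy_url(
    "cratedb.example.org", user="admin", password="p@ss/word", port=4200
)
parts = urlsplit(url)
print(parts.username, parts.hostname, parts.port)
```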


## Usage

Install dependencies.
```shell
meltano install
```

Invoke data transfer to JSONL files.
```shell
meltano run tap-github target-jsonl
cat github-to-cratedb/output/commits.jsonl
```
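
Beyond `cat`, the JSONL output can be inspected with a few lines of Python. This sketch makes no assumption about the fields `tap-github` emits. It counts the records and collects their top-level keys. The names `parse_jsonl` and `summarize` are made up for illustration.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Yield one decoded JSON document per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(lines):
    """Return the record count and the sorted union of top-level keys."""
    count = 0
    keys = set()
    for record in parse_jsonl(lines):
        count += 1
        keys.update(record)
    return count, sorted(keys)


# Only inspect the file if a previous run has produced it.
path = Path("github-to-cratedb/output/commits.jsonl")
if path.exists():
    with path.open(encoding="utf-8") as handle:
        print(summarize(handle))
```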

Invoke data transfer to CrateDB database.
```shell
meltano run tap-github target-cratedb
```

## Screenshot

Enjoy the release notes.
```sql
SELECT repo, tag_name, body FROM melty.releases ORDER BY tag_name DESC;
```

![image](https://github.com/crate-workbench/cratedb-toolkit/assets/453543/ac37c9cc-8e42-4c7c-84aa-64498bf48f4d)
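
The same query can also be submitted programmatically through CrateDB's HTTP `_sql` endpoint. The following is a minimal sketch using only the Python standard library. It assumes CrateDB listens on `http://localhost:4200` and accepts unauthenticated requests from localhost. The function names are made up for illustration.

```python
import json
from urllib.request import Request, urlopen

QUERY = "SELECT repo, tag_name, body FROM melty.releases ORDER BY tag_name DESC"


def build_sql_request(stmt, base_url="http://localhost:4200"):
    """Prepare a POST request for CrateDB's HTTP `_sql` endpoint."""
    payload = json.dumps({"stmt": stmt}).encode("utf-8")
    return Request(
        f"{base_url}/_sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(stmt, base_url="http://localhost:4200"):
    """Send the statement and return a tuple of (column names, rows)."""
    with urlopen(build_sql_request(stmt, base_url), timeout=10) as response:
        result = json.load(response)
    return result["cols"], result["rows"]
```

Calling `run_query(QUERY)` requires a running CrateDB instance that already contains the `melty.releases` table.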

## Troubleshooting

If you see errors like the following, verify the GitHub authentication
token stored in the `TAP_GITHUB_AUTH_TOKEN` environment variable.
```python
singer_sdk.exceptions.RetriableAPIError: 401 Client Error: b'{"message":"This endpoint requires you to be authenticated.","documentation_url":"https://docs.github.com/graphql/guides/forming-calls-with-graphql#authenticating-with-graphql"}' (Reason: Unauthorized) for path: /graphql cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
```
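
To catch a missing token before Meltano starts, a small pre-flight check can help. The sketch below is a simplified illustration. `parse_dotenv` handles only plain `KEY=value` lines and is not a full dotenv implementation, and letting the environment take precedence over `.env` content is an assumption. Both function names are made up for illustration.

```python
import os


def parse_dotenv(text):
    """Parse simple `KEY=value` lines; comments and blank lines are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in TAP_GITHUB_AUTH_TOKEN='...'.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_token(environ, dotenv_text="", name="TAP_GITHUB_AUTH_TOKEN"):
    """Return the token from the environment, else from dotenv content, else None."""
    return environ.get(name) or parse_dotenv(dotenv_text).get(name)


if find_token(os.environ) is None:
    print("TAP_GITHUB_AUTH_TOKEN is not set in the environment")
```

To include a dotenv file in the check, read its content and pass it as `dotenv_text`.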

## Development
In order to link the sandbox to a development installation of [meltano-target-cratedb],
configure the `pip_url` of the component like this:
```yaml
pip_url: --editable=/path/to/sources/meltano-target-cratedb
```


[GitHub Developer Settings » Tokens]: https://github.com/settings/tokens
[Meltano Getting Started Tutorial]: https://docs.meltano.com/getting-started/part1
[meltano-target-cratedb]: https://github.com/crate-workbench/meltano-target-cratedb
[tap-github]: https://hub.meltano.com/extractors/tap-github/
[target-jsonl]: https://hub.meltano.com/loaders/target-jsonl/
51 changes: 51 additions & 0 deletions framework/singer-meltano/github-to-cratedb/meltano.yml
@@ -0,0 +1,51 @@
# A Meltano project is just a directory on your filesystem containing text-based files.
# At a minimum, a Meltano project must contain a project file named `meltano.yml`,
# which contains your project configuration, and tells Meltano that a particular
# directory is a Meltano project.
---
version: 1
default_environment: dev
send_anonymous_usage_stats: false
project_id: f14797b9-9d1c-414c-851c-c91e08ddbc2e

environments:
- name: dev
- name: staging
- name: prod

plugins:

  # Configure data source.
  # In Singer jargon, it is an "extractor", wrapped into a "tap".
  extractors:

    - name: tap-github
      variant: cratedb
      namespace: cratedb
      pip_url: git+https://github.com/crate-workbench/tap-github.git@cratedb
      # Note: Configure your GitHub repository here.
      config:
        start_date: '2023-12-01'
        repositories:
          - crate-workbench/cratedb-toolkit

  # Configure data sinks.
  # In Singer jargon, it is a "loader", wrapped into a "target".
  loaders:

    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl

    - name: target-cratedb
      namespace: cratedb
      variant: cratedb
      # Acquire from PyPI.
      pip_url: meltano-target-cratedb
      # Acquire from GitHub.
      # pip_url: git+https://github.com/crate-workbench/meltano-target-cratedb.git

      # Note: Configure your database server and credentials here.
      config:
        sqlalchemy_url: crate://crate@localhost/
        add_record_metadata: true