Release 0.1.2 (!59)
AJ Steers committed Apr 12, 2021
2 parents 6b3f940 + 1ebf474 commit 22b7536
Showing 19 changed files with 579 additions and 38 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -16,6 +16,22 @@ The next few lines form the template for unreleased changes.
### Fixed
-->

## v0.1.2

Fixes bug in state handling, adds improvements to documentation.

### Documentation

- Streamlined Dev Guide (!56)
- Added Code Samples page, including dynamic schema discovery examples (#33, !56)
- Added links to external sdk-based taps (#32, !56)
- Added static/dynamic property documentation (#86, !56)
- Added "implementation" docs for debugging and troubleshooting (#71, !41)

### Fixed

- Fixes bug in `Stream.get_starting_timestamp()` using incorrect state key (#94, !58)

## v0.1.1

Documentation and cookiecutter template improvements.
6 changes: 3 additions & 3 deletions docs/CONTRIBUTING.md
@@ -16,12 +16,12 @@
- The project-wide max line length is `89`.
- In the future we will add support for linting
[pre-commit hooks](https://gitlab.com/meltano/singer-sdk/-/issues/12) as well.
4. Set intepreter to match poetry's virtualenv:
4. Set interpreter to match poetry's virtualenv:
- Run `poetry install` from the project root.
- Run `poetry shell` and copy the path from command output.
- In VS Code, run `Python: Select intepreter` and paste the intepreter path when prompted.
- In VS Code, run `Python: Select interpreter` and paste the interpreter path when prompted.

## Workspace Develoment Strategies for Singer SDK
## Workspace Development Strategies for Singer SDK

### Universal Code Formatting

2 changes: 1 addition & 1 deletion docs/cli_commands.md
@@ -2,7 +2,7 @@

## Enabling CLI Execution

Poetry allows you to test command line invokation direction in the virtualenv using the
Poetry allows you to test command line invocation directly in the virtualenv using the
prefix `poetry run`.

- Note: CLI mapping is performed in `pyproject.toml` and shims are recreated during `poetry install`:
172 changes: 172 additions & 0 deletions docs/code_samples.md
@@ -0,0 +1,172 @@
# Code Samples

Below you will find a collection of code samples which can be used for inspiration.

## Project Samples

Below are full project samples, contributed by members in the community. Use these for inspiration
or to get more information on what an SDK-based tap will look like.

- [tap-bamboohr by Auto IDM](https://gitlab.com/autoidm/tap-bamboohr)
- [tap-confluence by @edgarrmondragon](https://github.com/edgarrmondragon/tap-confluence)
- [tap-investing by @DouweM](https://gitlab.com/DouweM/tap-investing)
- [tap-parquet by AJ](https://github.com/dataops-tk/tap-parquet)
- [tap-powerbi-metadata by Slalom](https://github.com/dataops-tk/tap-powerbi-metadata)

To add your project to this list, please
[submit an issue](https://gitlab.com/meltano/meltano/-/issues/new?issue%5Bassignee_id%5D=&issue%5Bmilestone_id%5D=).

## Reusable Code Snippets

These are code samples taken from other projects. Use these as a reference if you get stuck.

### A simple Tap class definition with two streams

```python
class TapCountries(Tap):
    """Sample tap for Countries GraphQL API. This tap has no
    config options and does not require authentication.
    """
    name = "tap-countries"
    config_jsonschema = PropertiesList([]).to_dict()

    def discover_streams(self) -> List[Stream]:
        """Return a list containing the two stream types."""
        return [
            CountriesStream(tap=self),
            ContinentsStream(tap=self),
        ]
```

### Define a simple GraphQL-based stream with schema defined in a file

```python
class ContinentsStream(GraphQLStream):
    """Continents stream from the Countries API."""

    name = "continents"
    primary_keys = ["code"]
    replication_key = None  # Incremental bookmarks not needed

    # Read JSON Schema definition from a text file:
    schema_filepath = SCHEMAS_DIR / "continents.json"

    # GraphQL API endpoint and query text:
    url_base = "https://countries.trevorblades.com/"
    query = """
        continents {
            code
            name
        }
        """
```

### Dynamically discovering `schema` for a stream

Here is an example which parses schema from a CSV file:

```python
from typing import List

from singer_sdk import Stream
from singer_sdk.typing import PropertiesList, Property, StringType

FAKECSV = """
Header1,Header2,Header3
val1,val2,val3
val1,val2,val3
val1,val2,val3
"""

class ParquetStream(Stream):
    @property
    def schema(self):
        """Dynamically detect the json schema for the stream.
        This is evaluated prior to any records being retrieved.
        """
        properties: List[Property] = []
        # Strip the leading newline so the first line is the header row
        for header in FAKECSV.strip().split("\n")[0].split(","):
            # Assume string type for all fields
            properties.append(Property(header, StringType()))
        return PropertiesList(*properties).to_dict()
```

Here is another example from the Parquet tap. This sample uses a
custom `get_jsonschema_type()` function to return the data type.

```python
class ParquetStream(Stream):
    """Stream class for Parquet streams."""

    #...

    @property
    def schema(self) -> dict:
        """Dynamically detect the json schema for the stream.
        This is evaluated prior to any records being retrieved.
        """
        properties: List[Property] = []
        # Get a schema object using the parquet and pyarrow libraries
        parquet_schema = pq.ParquetFile(self.filepath).schema_arrow

        # Loop through each column in the schema object
        for i in range(len(parquet_schema.names)):
            # Get the column name
            name = parquet_schema.names[i]
            # Translate from the Parquet type to a JSON Schema type
            dtype = get_jsonschema_type(str(parquet_schema.types[i]))

            # Add the new property to our list
            properties.append(Property(name, dtype))

        # Return the list as a JSON Schema dictionary object
        return PropertiesList(*properties).to_dict()
```
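
The `get_jsonschema_type()` helper is not shown in the sample above. As a rough illustration only, here is a minimal sketch of what such a mapping could look like. It is an assumption, not the code from tap-parquet: it returns plain JSON Schema dictionaries so that it runs without the SDK installed, whereas the sample above passes the result to `Property`, which would need the SDK's own type helper objects instead. The set of recognized type names is deliberately incomplete.

```python
def get_jsonschema_type(arrow_type: str) -> dict:
    """Map a pyarrow type name (such as "int64") to a JSON Schema fragment.

    Hypothetical helper for illustration; the mapping is not exhaustive.
    """
    if arrow_type.startswith(("int", "uint")):
        return {"type": ["integer", "null"]}
    if arrow_type.startswith(("float", "double", "decimal")):
        return {"type": ["number", "null"]}
    if arrow_type == "bool":
        return {"type": ["boolean", "null"]}
    if arrow_type.startswith("timestamp"):
        return {"type": ["string", "null"], "format": "date-time"}
    if arrow_type.startswith("date"):
        return {"type": ["string", "null"], "format": "date"}
    # Fall back to string for anything unrecognized.
    return {"type": ["string", "null"]}
```

Falling back to a nullable string keeps discovery from failing on unfamiliar column types, at the cost of losing type information for those columns.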

### Initialize a collection of tap streams with differing types

```python
class TapCountries(Tap):
    # ...
    def discover_streams(self) -> List[Stream]:
        """Return a list containing one each of the two stream types."""
        return [
            CountriesStream(tap=self),
            ContinentsStream(tap=self),
        ]
```

Or equivalently:

```python

# Declare list of types here at the top of the file
STREAM_TYPES = [
    CountriesStream,
    ContinentsStream,
]

class TapCountries(Tap):
    # ...
    def discover_streams(self) -> List[Stream]:
        """Return a list with one each of all defined stream types."""
        return [
            stream_type(tap=self)
            for stream_type in STREAM_TYPES
        ]
```

### Run the standard built-in tap tests

```python
# Import the tests
from singer_sdk.testing import get_standard_tap_tests

# Import our tap class
from tap_parquet.tap import TapParquet

SAMPLE_CONFIG = {
    # ...
}

def test_sdk_standard_tap_tests():
    """Run the built-in tap tests from the SDK."""
    tests = get_standard_tap_tests(TapParquet, config=SAMPLE_CONFIG)
    for test in tests:
        test()
```
72 changes: 66 additions & 6 deletions docs/dev_guide.md
@@ -19,20 +19,80 @@ Create taps with `singer-sdk` requires overriding just two or three classes:
- `OAuthJWTAuthenticator` - This class performs a JWT (JSON Web Token) authentication
flow.

### Detailed Class Reference

For a detailed reference, please see the [SDK Reference Guide](./reference.md)

### Singer SDK Implementation Details

For more detailed information about the Singer SDK implementation, please see the
[Singer SDK Implementation Details](./implementation/README.md) section.

## Building a New Tap

The best way to get started is by building a new project from the
[cookiecutter tap template](../cookiecutter/tap-template).

## Detailed Class Reference
## Additional Resources

For a detailed reference, please see the [SDK Reference Guide](./reference.md)
### Code Samples

For a list of code samples solving a variety of different scenarios, please see our [Code Samples](./code_samples.md) page.

## CLI Samples
### CLI Samples

For a list of sample CLI commands you can run, [click here](./cli_commands.md).

## Singer SDK Implementation Details
## Python Tip: Two Ways to Define Properties

For more detailed information about the Singer SDK implementation, please see the
[Singer SDK Implementation Details](./implementation/README.md) section.
In Python, properties within classes like Stream and Tap can generally be overridden
in two ways: _statically_ or _dynamically_. For instance, `primary_keys` and
`replication_key` should be declared statically if their values are known ahead of time
(during development), and they should be declared dynamically if they vary from one
environment to another or if they can change at runtime.

### Static example

Here's a simple example of static definitions based on the
[cookiecutter template](../cookiecutter/tap-template/). This example defines the
primary key and replication key as fixed values which will not change.

```python
class SimpleSampleStream(Stream):
    primary_keys = ["id"]
    replication_key = None
```

### Dynamic property example

Here is a similar example except that the same properties are calculated dynamically based
on user-provided inputs:

```python
class DynamicSampleStream(Stream):
    @property
    def primary_keys(self):
        """Return primary key dynamically based on user inputs."""
        return self.config["primary_key"]

    @property
    def replication_key(self):
        """Return replication key dynamically based on user inputs."""
        result = self.config.get("replication_key")
        if not result:
            self.logger.warning("Danger: could not find replication key!")
        return result
```

Note that the first static example was more concise while this second example is more extensible.

### In summary

- Use the static syntax whenever you are dealing with stream properties that won't change
and use dynamic syntax whenever you need to calculate the stream's properties or discover them dynamically.
- For those new to Python, note that the dynamic syntax is identical to declaring a function or method, with
the one difference of having the `@property` decorator directly above the method definition. This one change
tells Python that you want to be able to access the method as a property (as in `pk = stream.primary_keys`)
instead of as a callable function (as in `pk = stream.primary_keys()`).
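
To make the last point concrete, the following sketch uses plain Python classes with no SDK dependency. The class names and the `config` dictionary are hypothetical stand-ins; the point is only that calling code reads a class attribute and a `@property` in exactly the same way.

```python
class StaticExample:
    # Plain class attribute: the value is fixed when the class is written.
    primary_keys = ["id"]


class DynamicExample:
    def __init__(self, config: dict) -> None:
        self.config = config

    @property
    def primary_keys(self) -> list:
        # Computed on every access from the (hypothetical) config dict.
        return self.config["primary_keys"]


static = StaticExample()
dynamic = DynamicExample({"primary_keys": ["order_id", "line_no"]})

# Both are read the same way, with no parentheses.
print(static.primary_keys)
print(dynamic.primary_keys)
```

Because the access syntax is identical, a stream can start with the static form and switch to the dynamic form later without changes to the code that reads the value.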

For more examples, please see the [Code Samples](./code_samples.md) page.
17 changes: 17 additions & 0 deletions docs/implementation/README.md
@@ -1,3 +1,20 @@
# Singer SDK Implementation Details

This section documents certain behaviors and expectations of the Singer SDK framework.

1. [CLI](./cli.md)
2. [Discovery](./discovery.md)
3. [Metadata](./discovery.md)
4. [Metrics](./discovery.md)
5. [State](./state.md)

## How to use the implementation reference material

_**Note:** You should not need to master all of the details here in order
to build your tap, and the behaviors described here should be automatic
and/or intuitive. For general guidance on tap development, please instead refer to our
[Dev Guide](../dev_guide.md)._

The specifications provided in this section are documented primarily to support
advanced use cases, behavior overrides, backwards compatibility with legacy taps,
debugging unexpected behaviors, or contributing back to the SDK itself.
20 changes: 17 additions & 3 deletions docs/implementation/cli.md
@@ -2,18 +2,18 @@

The Singer SDK automatically adds Tap CLI handling.

## Handling Plugin Config
## Configuration (`--config`)

The SDK supports one or more `--config` inputs when run from the CLI.

- If one of the supplied inputs is `--config ENV` (or `--config=ENV` according to the user's preference), the environment variable parsing rules will be applied to ingest config values from environment variables.
- One or more files can also be sent to `--config`. If multiple files are sent, they will be processed in sequential order.
If one or more files conflict for a given setting, the latter provided files will override earlier provided files.
- This behavior allows to you easily inject environment overrides by adding `--config=path/to/overrides.json` at the end of the CLI command text.
- This behavior allows you to easily inject environment overrides by adding `--config=path/to/overrides.json` at the end of the CLI command text.
- If `--config=ENV` is set and one or more files conflict with an environment variable setting, the environment variable setting will always have precedence, regardless of ordering.
- One benefit of this approach is that credentials and other secrets can be stored completely separately from general settings: either by having two distinct `config.json` files or by using environment variables for secure settings and `config.json` files for the rest.
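
The precedence rules above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SDK's internal code: `merge_configs` is a hypothetical helper, and the environment-derived values are passed in as an ordinary dictionary.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def merge_configs(paths: List[str], env_values: Dict[str, str]) -> dict:
    """Merge config files in order, then apply environment values on top."""
    merged: dict = {}
    for path in paths:
        # Later files overwrite keys set by earlier files.
        merged.update(json.loads(Path(path).read_text()))
    # Environment values win regardless of where ENV appeared on the CLI.
    merged.update(env_values)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "config.json"
    overrides = Path(tmp) / "overrides.json"
    base.write_text(json.dumps({"username": "dev", "start_date": "2021-01-01"}))
    overrides.write_text(json.dumps({"start_date": "2021-04-01"}))
    result = merge_configs([str(base), str(overrides)], {"username": "from-env"})

print(result)
```

Here `start_date` comes from the later file and `username` comes from the environment, which matches the ordering described in the list above.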

## Parsing Config from Environment Variables
### Parsing Environment Variables

When `--config=ENV` is specified, the SDK will automatically capture and pass along any
values from environment variables which match the exact name of a setting, along with a
@@ -22,3 +22,17 @@ prefix determined by the plugin name.
> For example: For a sample plugin named `tap-my-example` and settings named "username" and "access_key", the SDK will automatically scrape
> the settings from environment variables `TAP_MY_EXAMPLE_USERNAME` and
> `TAP_MY_EXAMPLE_ACCESS_KEY`, if they exist.
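
A minimal sketch of this naming rule, assuming the prefix is simply the plugin name uppercased with dashes replaced by underscores (which is consistent with the example above). The function `config_from_env` is hypothetical and is not part of the SDK.

```python
import os
from typing import Dict, Iterable, Mapping


def config_from_env(
    plugin_name: str,
    setting_names: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Collect settings from variables named `<PLUGIN_PREFIX>_<SETTING>`."""
    prefix = plugin_name.upper().replace("-", "_") + "_"
    found = {}
    for setting in setting_names:
        env_name = prefix + setting.upper()
        # Only exact name matches are picked up; missing ones are skipped.
        if env_name in environ:
            found[setting] = environ[env_name]
    return found


# A fake environment keeps the example independent of the real one.
fake_env = {"TAP_MY_EXAMPLE_USERNAME": "alice", "UNRELATED": "x"}
print(config_from_env("tap-my-example", ["username", "access_key"], fake_env))
```

Only `username` is returned here because `TAP_MY_EXAMPLE_ACCESS_KEY` is not set in the fake environment.
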
## Input Catalog (`--catalog`)

If provided, an input catalog will be ingested and passed along to the tap during
initialization. This is called 'input_catalog' to distinguish from the discovered catalog.
If applicable, the provided input catalog will be merged with the
[discovered catalog](./discovery.md) in the `Tap.apply_catalog()` method. Developers can
override this method to achieve custom handling.

### Input Catalog Stream Selection

The most common use case for providing a catalog input is for field selection.
_**The SDK does not yet implement the stream selection feature**_ and we are tracking that
future development work [here](https://gitlab.com/meltano/singer-sdk/-/issues/7).
22 changes: 22 additions & 0 deletions docs/implementation/discovery.md
@@ -0,0 +1,22 @@
# [Singer SDK Implementation Details](./README.md) - Catalog Discovery

All taps developed using the SDK will automatically support `discovery` as a base
capability, which is the process of generating and emitting a catalog that describes the
available streams and stream types.

The catalog generated is automatically populated by a small number of developer inputs. Most
importantly:

- `Tap.discover_streams()` - Should return a list of available "discovered" streams.
- `Stream.schema` or `Stream.schema_filepath` - The JSON Schema definition of each stream,
provided either directly as a Python `dict` or indirectly as a `.json` filepath.
- `Stream.primary_keys` - a list of strings indicating the primary key(s) of the stream.
- `Stream.replication_key` - a single string indicating the name of the stream's replication
key (if applicable).
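
As a rough sketch of how these inputs could feed a catalog, the example below assembles a Singer-style catalog dictionary from a list of stream descriptions. `StreamInfo` and `build_catalog` are hypothetical stand-ins rather than SDK classes, and the output shows only a few commonly used catalog fields; the SDK's real output likely contains more, such as metadata entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamInfo:
    """Hypothetical stand-in for the stream attributes listed above."""

    name: str
    schema: dict
    primary_keys: List[str]
    replication_key: Optional[str] = None


def build_catalog(streams: List[StreamInfo]) -> dict:
    """Assemble a Singer-style catalog dictionary from stream attributes."""
    entries = []
    for stream in streams:
        entry = {
            "tap_stream_id": stream.name,
            "stream": stream.name,
            "schema": stream.schema,
            "key_properties": stream.primary_keys,
        }
        # Only include a replication key when the stream defines one.
        if stream.replication_key:
            entry["replication_key"] = stream.replication_key
        entries.append(entry)
    return {"streams": entries}


continents = StreamInfo(
    name="continents",
    schema={"type": "object", "properties": {"code": {"type": "string"}}},
    primary_keys=["code"],
)
catalog = build_catalog([continents])
print(catalog["streams"][0]["key_properties"])
```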

## See Also

- See the [Dev Guide](../dev_guide.md) and [Code Samples](../code_samples.md) for more
information on working with dynamic stream schemas.
- [Singer Spec: Discovery (meltano.com)](https://meltano.com/docs/singer-spec.html#discovery-mode)
- [Singer Spec: Discovery (singer-io)](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md)