Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: String format is not validated in configuration #1451

Open
edgarrmondragon opened this issue Feb 23, 2023 · 6 comments
Open

bug: String format is not validated in configuration #1451

edgarrmondragon opened this issue Feb 23, 2023 · 6 comments
Labels

Comments

@edgarrmondragon
Copy link
Collaborator

Singer SDK Version

0.21.0

Python Version

3.10

Bug scope

Configuration (settings parsing, validation, etc.)

Operating System

Any

Description

A config schema like

# samples/config_format/tap.py

from singer_sdk import Tap

class MyTap(Tap):
    name = "my-tap"
    config_jsonschema = {
        "type": "object",
        "properties": {
            "start_date": {
                "type": "string",
                "format": "date-time",
                "description": "The date and time to start syncing data from.",
            },
        },
    }

    def discover_streams(self):
        return []

if __name__ == "__main__":
    MyTap().cli()

doesn't result in a validation error when passed a string in an incorrect format

$ MY_TAP_START_DATE='notadate' poetry run python samples/config_format/tap.py --config ENV
2023-02-22 18:28:02,960 | INFO     | my-tap               | my-tap v[could not be detected], Meltano SDK v0.21.0
2023-02-22 18:28:02,960 | INFO     | my-tap               | Parsing env var for settings config...
2023-02-22 18:28:02,961 | INFO     | singer_sdk.configuration._dict_config | Parsing 'start_date' config from env variable 'MY_TAP_START_DATE'.
2023-02-22 18:28:02,961 | INFO     | my-tap               | Config validation passed with 0 warnings.

That means that the bad string is caught too late, when the start_date needs to be parsed for incremental replication.

Backwards compatibility

We could just enable format checks by updating the validator instance in

validator = JSONSchemaValidator(config_jsonschema)
validator.validate(self._config)

The problem with doing that, even if we don't add the format extras, is that we'd break for example any users passing 2021-01-01 as an instance of date-time.

One option would be to make singer_sdk.typing.DateTimeType be dumped as a union:

{
  "anyOf": [
    {"type": "string", "format": "date-time"},
    {"type": "string", "format": "date"}
  ]
}

Code

No response

@edgarrmondragon edgarrmondragon added kind/Bug Something isn't working valuestream/SDK labels Feb 23, 2023
@aaronsteers
Copy link
Contributor

aaronsteers commented Feb 23, 2023

@edgarrmondragon - I hand't really thought through this before you logged here. Regarding failing early on config inputs, I'm 100% in agreement that this makes sense. A few caveats below on implementation challenges and backwards compatibility.

  1. I confirmed that date-time format fails validation of date-only values here: https://www.jsonschemavalidator.net/s/NpWURgml
  2. That said, many/most targets can (and do) ingest date-only values into date-time typed properties and nodes. Some simple targets may also have difficult parsing anyOf when constructing the target table.
  3. I'd be inclined to add a new DateOrDateTypeType type for use in applications like the canonical start_date, possibly defined using a new AnyOfType helper, as in Property("start_date", type=AnyOfType(DateType, DataTypeType, required=false).
    • This approach has a benefit of adding the desired "config validation" behavior (with convenient shorthand), while not breaking any sync behaviors that may be working fine today.
  4. I'm also in favor of always validating format constraints of config, just not for record data.

Regarding config validation versus record

If a tap is upgraded to the latest version of the SDK, the added check could cause certain config inputs to fail - but this would happen presumably at a time when the user is upgrading the tap or target on their side. Whether expecting to have to retest or not, a user could update 2022-01-01 to 2022-01-01T00:00:00 or similar.

Because our config declarations do not need to be instantiated by the target as columns, I don't worry about anyOf being handled by targets.

Re: start_date config specifically

If we're focusing just about the start_date config input, I think the best path is to allow either format; I think many Singer getting started guides even give samples with date-only sample values - and it would not be good for these examples to fail.

@edgarrmondragon
Copy link
Collaborator Author

2. That said, many/most targets can (and do) ingest date-only values into date-time typed properties and nodes. Some simple targets may also have difficult parsing anyOf when constructing the target table.

@aaronsteers Agreed. Dumping DateTimeType as an anyOf would be problematic for most targets.


I'm also in favor of always validating format constraints of config, just not for record data.

Me too, but my only concern with that would be packages that neither pin a specific SDK version, nor have an upper bound set to a existing release. We also don't enforce that best practice in our templates:

(might be good to revisit https://gitlab.com/meltano/sdk/-/merge_requests/208)

In those cases, a simple meltano install could break taps at the config validation stage in the next sync.


If we're focusing just about the start_date config input, I think the best path is to allow either format; I think many Singer getting started guides even give samples with date-only sample values - and it would not be good for these examples to fail.

Yeah, that's what motivated my anyOf proposal for DateTimeType above but I had not considered that would cause issues when used in stream schemas. Do you think allowing both formats without forcing users to update to Property("start_date", type=AnyOfType(DateType, DataTypeType, required=false) is a requirement?

@edgarrmondragon
Copy link
Collaborator Author

See #1457 for a related issue in record data validation. It seems like we're already validating format:

self._validator = Draft7Validator(schema, format_checker=FormatChecker())

@stale
Copy link

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@edgarrmondragon
Copy link
Collaborator Author

Still relevant

@stale stale bot removed the stale label Jul 21, 2023
Copy link

stale bot commented Jul 20, 2024

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants