[SIP-15A] Proposal for inferring temporal formatting and parsing #18385
john-bodley
started this conversation in
Ideas
Replies: 1 comment
-
Issue-Label Bot is automatically applying the label Links: app homepage, dashboard and code for this bot. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
[SIP-15A] Proposal for inferring temporal formatting and parsing
Motivation
In SIP-15 we surfaced examples were lexicographical sorting could result in incorrect time intervals when the optional configuration fields weren't specified. Initially the proposal was to ensure all temporal fields were cast to a timestamp and filter comparisons were between the appropriate timestamp (or similar) types, i.e., for Presto:
Additionally if all temporal fields were transformed to a timestamp the time grains would also work correctly (the current logic assumes that said type can be successfully cast to a timestamp which is not always the case).
@mistercrunch correctly pointed out out that not all query planners would be able to rewrite these filters to take advantage of indexes (if present) resulting in sub-performant queries and thus the status quo of ensuring the left-hand-side (LHS) of the filter comparison remains unchanged and that the right-hand-side (RHS) formats the Python
datetime
object appropriately, i.e.,There are two problems with the current functionality:
One must explicitly define the format of the temporal column for non date/date-time like types, i.e., strings and numbers as there are multiple temporal encodings, e.g.
%Y-%m-%d
,%Y-%d-%m
, epoch timestamp (in seconds), epoch timestamp (in milliseconds), etc.The time grains incorrectly assume that the temporal column can simply be cast to timestamp (or equivalent) type.
Really this can be seen as two conversions:
datetime
object into the appropriate database type for filtering.Note that the
convert_dttm
handles both of these already for date/date-time like types so the problem really lies with string like and numeric types which have temporal encoding.Proposed Change
We propose the following to address the two problems outlined above:
Format Inferencing
Rather than having to explicitly define the format for all non date/date-time temporal columns it would be great to infer this from a sample. There are a few Python libraries (
arrow
,datetutil
,maya
, etc.) which can parse non a priori defined date-time formats, i.e.,Sadly none of these libraries will provide the underlying format though there are ways of inferring it. It's also worth pointing out that a single value could be expressed by multiple formats, e.g.
2018-01-01
could be either%Y-%m-%d
or%Y-%d-%m
. Taking a sample of values should help further restrict the set of plausible formats.Regarding epoch timestamps which can be defined via integers (representing seconds or milliseconds) or floats, to differentiate between these one could use basis logic like www.epochconverter.com where if there are less than 12 digits the timestamp is assumed to be in seconds, 12 - 14 as milliseconds, and 15+ as microseconds (see here for detail and the rare occurrences where this fails).
We propose the following solution:
Add a database specific column name/type to format mapping for non-explicit temporal columns, i.e., we use the
ds
column of typeVARCHAR
with the%Y-%m-%d
format to represent a date-stamp.Whenever a SQLAlchemy column is marked as temporal and the column type is not explicitly a temporal type and no formatting is present then:
Rather than using a free-form text box the
python_date_format
field should represent a selector with the various types where either the mapped or best format is selected. This allows users to override the format if the inferencing was incorrect. Why a drop-down and not pre-populated free-form text? Mostly because some formats are not overly human readable and have an example/more detailed description would help.Deprecate the
database_expression
field. This logic should be obtainable via i) using thepython_date_format
field, ii) using a custom expression, or iii) ensuring the type mapping exists in the engine spec.Time Grains
Given that the format of the type is already inferred we simply need to provide at the engine level in db_engine_specs.py functionality (by ways of a SQL expression) to map from a string or numerical type to a timestamp which will then be wrapped inside of the SQL expressions representing the
time_grain_functions
, i.e., for the example of a date encoded as a string, for the Presto engine we would use theDATE_PARSE(string, format)
UDF. Note there is already some logic here regarding converting various types to a timestamp.For reference here's a few patterns for date-time formats:
It seems SQLAlchemy doesn't provide any abstraction and thus the only viable solution may be to explicitly define a mapping of the various date-time formats similar to this.
ISO 8601
For string like temporal columns they must adhere to the ISO 8601. The reason being is strings use lexicographical ordering thus we need to ensure the representation coincides with the chronological ordering which is the case for the ISO 8601 format.
For example say you had dates of the form
%m/%d/%Y
(MM/DD/YYYY in ISO 8601 syntax) and we were fixed on not converting types to timestamps then the date filtering would fail,as opposed to dates of the form
%Y-%m-%d
(YYYY-MM-DD in ISO 8601 syntax),Note if a string column doesn't adhere to the ISO 8601 format one will have to use a SQL expression in order to convert the column to either a date or timestamp (and possibly forgo leveraging the index), i.e., for Presto this would be defined as:
type
:TIMESTAMP
expression
:DATE_PARSE(ds, '%m/%d/%Y')
New or Changed Public Interfaces
N/A.
New dependencies
Depends on what Python package we use for inferring the date-time format.
Migration Plan and Compatibility
A migration would be needed to:
table_columns.python_date_format
column.Note the migration and change to type interfering should be rolled out in conjunction with SIP-15 given that by remedying the lexicographical sorting issue outlined in SIP-15 we would fundamentally be changing the time filters for existing charts. Please refer to here for more details.
Rejected Alternatives
See SIP-15.
to: @agrawaldevesh @betodealmeida @mistercrunch @michellethomas @villebro
Beta Was this translation helpful? Give feedback.
All reactions