
Inquiry and Suggestion Regarding Time Partition Validation Against Server Time #752

Closed
mrchypark opened this issue Apr 9, 2024 · 6 comments · Fixed by #792

@mrchypark

Dear Development Team,

I hope this message finds you well. I am reaching out to discuss a particular aspect of the validate_time_partition function that has recently come to my attention. This function plays a critical role in validating the time partitioning of incoming data against the server's current time. The relevant code snippet for this functionality is as follows:

if parsed_timestamp.date() == Utc::now().naive_utc().date()
    && parsed_timestamp.hour() == Utc::now().naive_utc().hour()
    && parsed_timestamp.minute() == Utc::now().naive_utc().minute() {
    Ok(true)
} else {
    Err(anyhow!(format!(
        "field {} and server time are not same",
        time_partition.unwrap()
    )))
}

The rationale behind this stringent time comparison intrigues me, especially in the context of a recent endeavor I embarked on. My goal is to incorporate the date into stream names and bulk-insert historical data accordingly. The introduction of time partitioning seemed like a beacon of possibility for this project. However, I encountered an unexpected challenge due to the function's requirement for the exact alignment of data timestamps with the server's current time.

Understanding the importance of this validation process is crucial for me. If it serves a specific purpose or security measure, I would greatly appreciate an explanation to better comprehend its necessity. On the other hand, if this strict comparison is not integral to the system's functionality or integrity, might I suggest either removing this constraint or introducing a toggleable mode? Such a modification could enhance flexibility, especially for scenarios like database migrations, where aligning every data point with the exact server time is not feasible.
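For illustration only, here is a minimal sketch of what such a toggleable mode could look like, assuming a hypothetical allow_historical flag and a simplified function signature (this is not the project's actual API, just one way to express the suggestion):

use anyhow::anyhow;
use chrono::{NaiveDateTime, Timelike, Utc};

// Hypothetical: `allow_historical` would be driven by a per-stream setting or header.
fn validate_time_partition(
    parsed_timestamp: NaiveDateTime,
    time_partition: &str,
    allow_historical: bool,
) -> anyhow::Result<bool> {
    if allow_historical {
        // Bulk/historical ingestion mode: accept any parseable timestamp.
        return Ok(true);
    }
    // Snapshot server time once instead of calling Utc::now() per component.
    let now = Utc::now().naive_utc();
    if parsed_timestamp.date() == now.date()
        && parsed_timestamp.hour() == now.hour()
        && parsed_timestamp.minute() == now.minute()
    {
        Ok(true)
    } else {
        Err(anyhow!(
            "field {} and server time are not same",
            time_partition
        ))
    }
}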

The potential for this feature to significantly impact the parsability and migration of existing databases is substantial. It could open up new avenues for efficiently managing historical data, which is paramount for organizations looking to evolve their data infrastructure.

In light of this, I kindly ask for your consideration of this matter. An option to adjust the time validation requirement would be immensely beneficial for developers facing similar challenges. Your support in this regard would not only facilitate our current project but also enrich the tool's adaptability for a wide range of use cases.

Thank you very much for your time and understanding. I look forward to your feedback and am open to discussing this further if needed.

Warm regards,

@nitisht nitisht self-assigned this Apr 9, 2024
@nitisht
Member

nitisht commented Apr 9, 2024

Thanks for the detailed info about your question @mrchypark. We did try this without the check you're referring to, but we found issues while testing. I'll explain the challenges in detail in a separate comment.

@sp0cket

sp0cket commented Apr 26, 2024

This validation will cause significant problems during log collection. For example, if Fluent Bit is configured to report logs once per second, logs generated in the last second of a minute but received at the 0th second of the next minute will fail the time validation. In extreme cases, if logs are uploaded once per minute, all log timestamps will fail validation under this rule.

@nikhilsinhaparseable
Contributor

Hi @mrchypark and @sp0cket

Thanks for your valuable comments on this topic.
We had an internal discussion and came up with the solution below; please provide feedback on it -

  1. An optional header X-P-Time-Partition needs to be sent in the log creation API to enable time partitioning
  2. If provided, all events in the batch will be validated at ingestion time as follows (a minimal sketch of these checks follows this list) -
    a. if the time-partition field is missing from any event in the batch, reject the whole batch
    b. if the time-partition field in any event in the batch cannot be parsed as a valid datetime (e.g. 2024-04-04T00:41:00.000Z), reject the whole batch
    c. if the time-partition field in any event in the batch is more than a month older than the server timestamp, reject the whole batch
  3. Once all these validations pass, ingestion can start
  4. Since events can arrive out of order, each event in the batch is ingested individually and uploaded to storage with a prefix derived from its time-partition field
  5. Hence, if the time-partition header is provided, the prefix will be created from the time-partition field in the event; otherwise, the prefix will be created from the server time
  6. We can also allow partitioning on fields other than the time field with another header X-P-Custom-Partition, which can take comma-separated field names from the log event
  7. The custom partition fields can be integer or string types (and not more than 7-10 characters in length, just to keep the prefixes as short as possible)
  8. The server cannot validate the data sent for these fields, so the user has to choose them wisely (e.g. do not partition on id fields; instead choose a field with a small number of distinct values in the application, such as status_code, os, level etc.)
  9. A maximum of 3 such fields can be provided in the custom partition header
  10. Since no ordering validation can be made for out-of-order events, multiple parquet files can be created under the same prefix,
    which can reduce query performance because of possible sorting issues between the parquet files
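A minimal sketch of the batch checks in step 2, assuming events arrive as serde_json::Value objects and the time-partition field holds an RFC 3339 string; function and error messages here are illustrative, not the actual implementation:

use anyhow::anyhow;
use chrono::{DateTime, Duration, Utc};
use serde_json::Value;

fn validate_batch(events: &[Value], time_partition: &str) -> anyhow::Result<()> {
    let now = Utc::now();
    for event in events {
        // (a) reject the whole batch if the field is missing in any event
        let raw = event
            .get(time_partition)
            .and_then(Value::as_str)
            .ok_or_else(|| anyhow!("field {} missing in event", time_partition))?;
        // (b) reject if the value cannot be parsed as a valid datetime
        let ts: DateTime<Utc> = raw
            .parse()
            .map_err(|_| anyhow!("field {} is not a valid datetime", time_partition))?;
        // (c) reject if the value is more than a month older than server time
        if now.signed_duration_since(ts) > Duration::days(30) {
            return Err(anyhow!(
                "field {} is more than a month older than server time",
                time_partition
            ));
        }
    }
    Ok(())
}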

Example with custom partition -
If X-P-Custom-Partition = Os, StatusCode

Events will be partitioned with below prefixes -

date=2024-04-30\hr=05\minute=03\Os=Linux\StatusCode=200\domain.port.parquet
date=2024-04-30\hr=05\minute=03\Os=Linux\StatusCode=401\domain.port.parquet
date=2024-04-30\hr=05\minute=03\Os=Windows\StatusCode=500\domain.port.parquet
date=2024-04-30\hr=05\minute=03\Os=Windows\StatusCode=400\domain.port.parquet
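For reference, a prefix like the ones above could be derived roughly as follows; this is an illustrative sketch using '/' as the separator, not the server's actual prefix-building code:

use chrono::{DateTime, Utc};

// Build a storage prefix from the event's time-partition value plus custom partition fields.
fn storage_prefix(ts: DateTime<Utc>, custom: &[(&str, &str)]) -> String {
    let mut prefix = format!(
        "date={}/hr={}/minute={}",
        ts.format("%Y-%m-%d"),
        ts.format("%H"),
        ts.format("%M")
    );
    for (field, value) in custom {
        prefix.push_str(&format!("/{}={}", field, value));
    }
    prefix
}

// storage_prefix(ts, &[("Os", "Linux"), ("StatusCode", "200")])
// -> "date=2024-04-30/hr=05/minute=03/Os=Linux/StatusCode=200"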

@mrchypark
Author

Hi, thank you for sharing your proposal.

There are two main reasons why user-provided time partitioning is necessary for me:

  1. Migration of existing data
  2. Reliable storage for data collection that may experience delays

I provide solutions involving IoT devices in remote environments, where data transmission delays can range from a few minutes to several days. During such delays, edge devices store the data and then transmit all accumulated data once connectivity issues are resolved.

Regarding the second scenario, I find that a one-month time constraint could be appropriate depending on the situation.

While the first scenario might be a bit challenging, it seems feasible to move forward.

I also appreciate the addition of custom partitioning. Congratulations on completing the distributed version. I am a fan of the parseable approach. I look forward to more great projects in the future.

@sp0cket

sp0cket commented May 3, 2024

Our situation is similar to @mrchypark , with data migration and latency issues as well. In our scenario, data migration is primarily used to collect offline vehicle fault logs (similar to a black box on an airplane) and aggregate them into a central database. The one-month limitation has been a significant improvement, but I'm not sure if there might be some exceptional cases. I'm wondering if there's a possibility to provide some compatibility for relatively old data, even at the expense of performance, to address these extreme scenarios.

nitisht pushed a commit that referenced this issue May 5, 2024
This PR updates the time partition logic to check that the time partition
field value is not more than a month old, and to partition based on the
time partition field. Also, if a static schema is provided, it checks that
the static schema contains the time partition field at the time of log creation

This PR is partly related to #752
@nikhilsinhaparseable
Contributor

@sp0cket we can extend the one-month time constraint by making this number configurable.
The user has to provide an additional header X-P-Time-Partition-Limit at the time of log creation; its value is expected to be an unsigned integer ending in 'd', e.g. 90d (for 90 days).
If not provided, a default constraint of 30 days will be applied; otherwise, the user can use this header to allow logs older than 30 days to be ingested.
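A minimal sketch of parsing such a header value, assuming the proposed format of an unsigned integer followed by 'd'; the function name is illustrative only:

// Parse a value like "90d" into a number of days, defaulting to 30 when the header is absent.
fn parse_time_partition_limit(header: Option<&str>) -> Result<u64, String> {
    match header {
        None => Ok(30), // default constraint of 30 days
        Some(value) => {
            let days = value
                .strip_suffix('d')
                .ok_or_else(|| format!("expected a value ending in 'd', got {value}"))?;
            days.parse::<u64>()
                .map_err(|_| format!("expected an unsigned integer before 'd', got {value}"))
        }
    }
}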

nikhilsinhaparseable added a commit to nikhilsinhaparseable/parseable that referenced this issue May 5, 2024
An additional header X-P-Time-Partition-Limit is to be provided,
with a value of an unsigned integer ending in 'd',
e.g. 90d for 90 days.
If not provided, a default constraint of 30 days will be applied.
Using this, the user can ingest logs older than 30 days as well.

fixes parseablehq#752
nitisht pushed a commit that referenced this issue May 6, 2024
An additional header X-P-Time-Partition-Limit is to be provided,
with a value of an unsigned integer ending in 'd', e.g. 90d for 90 days.
If not provided, a default constraint of 30 days will be applied.
Using this, the user can ingest logs older than 30 days as well.

fixes #752