Inquiry and Suggestion Regarding Time Partition Validation Against Server Time #752
Comments
Thanks for the detailed info about your question, @mrchypark. We did try this without the check you're referring to, but we found issues while testing. I'll explain the challenges in detail in a separate comment.
This validation will cause significant problems during log collection. For example, if Fluent Bit is configured to flush logs once per second, logs stamped in the last second of the previous minute will fail time validation at the 0th second of each minute. In extreme cases, if logs are uploaded once per minute, all log times will fail validation under this rule.
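The edge case described above can be sketched as follows. This is an assumed, simplified check for illustration (not Parseable's actual code): if events are validated against the server clock at minute granularity, a log stamped in the last second of minute N but flushed at second 0 of minute N+1 is rejected.

```rust
// Assumed strict check for illustration: event and server time must
// fall in the same minute (times given as seconds since some epoch).
fn same_minute(event_secs: u64, server_secs: u64) -> bool {
    event_secs / 60 == server_secs / 60
}

fn main() {
    let event = 119; // stamped at 00:01:59
    let server = 120; // flushed at 00:02:00, one second later
    // Under a same-minute rule, this event would fail validation.
    assert!(!same_minute(event, server));
}
```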
Hi @mrchypark and @sp0cket Thanks for your valuable comments on this topic.
Example with custom partitions - events will be partitioned with prefixes like: date=2024-04-30/hr=05/minute=03/Os=Linux/StatusCode=200/domain.port.parquet
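A rough sketch of how such a prefix could be assembled from the event's time plus user-chosen custom partition fields. The function name and signature are assumptions for illustration, not Parseable's actual API; only the prefix shape comes from the example above.

```rust
/// Hypothetical helper: build an object-store partition prefix from an
/// event's date/hour/minute plus custom key=value partition fields,
/// mirroring the example prefix in the comment above.
fn partition_prefix(date: &str, hr: u8, minute: u8, custom: &[(&str, &str)]) -> String {
    let mut prefix = format!("date={date}/hr={hr:02}/minute={minute:02}");
    for (key, value) in custom {
        prefix.push_str(&format!("/{key}={value}"));
    }
    prefix
}

fn main() {
    let p = partition_prefix("2024-04-30", 5, 3, &[("Os", "Linux"), ("StatusCode", "200")]);
    println!("{p}");
    // date=2024-04-30/hr=05/minute=03/Os=Linux/StatusCode=200
}
```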
Hi, thank you for sharing your proposal. There are two main reasons why user-provided time partitioning is necessary for me:
I provide solutions involving IoT devices in remote environments, where data transmission delays can range from a few minutes to several days. During such delays, edge devices store the data and then transmit all accumulated data once connectivity issues are resolved. Regarding the second scenario, I find that a one-month time constraint could be appropriate depending on the situation. While the first scenario might be a bit challenging, it seems feasible to move forward. I also appreciate the addition of custom partitioning. Congratulations on completing the distributed version. I am a fan of the parseable approach. I look forward to more great projects in the future. |
Our situation is similar to @mrchypark's, with data migration and latency issues as well. In our scenario, data migration is primarily used to collect offline vehicle fault logs (similar to an airplane's black box) and aggregate them into a central database. The one-month limitation is a significant improvement, but I'm not sure whether there might still be exceptional cases. I'm wondering if it would be possible to provide some compatibility for relatively old data, even at the expense of performance, to address these extreme scenarios.
This PR updates the time partition logic to check that the time partition field value is no more than a month old, then partitions based on the time partition field. Also, if a static schema is provided, it checks that the static schema contains the time partition field at the time of log stream creation. This PR is partly related to #752.
@sp0cket we can extend the one-month time constraint by making this number configurable.
An additional header X-P-Time-Partition-Limit can be provided, with a value of an unsigned integer ending in 'd', e.g. 90d for 90 days. If not provided, the default constraint of 30 days is applied. Using this, users can ingest logs older than 30 days as well. Fixes #752
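Parsing that header value could be sketched as follows. The function name and error handling are assumptions; only the value format (an unsigned integer ending in 'd', defaulting to 30 days when absent) comes from the PR description above.

```rust
/// Hypothetical sketch: parse the X-P-Time-Partition-Limit header value
/// ("90d" -> 90 days), falling back to the 30-day default when absent.
fn parse_time_partition_limit(value: Option<&str>) -> Result<u64, String> {
    match value {
        None => Ok(30), // default constraint of 30 days
        Some(v) => {
            let days = v
                .strip_suffix('d')
                .ok_or_else(|| format!("expected value ending in 'd', got {v}"))?;
            days.parse::<u64>()
                .map_err(|e| format!("invalid day count {days}: {e}"))
        }
    }
}

fn main() {
    assert_eq!(parse_time_partition_limit(Some("90d")), Ok(90));
    assert_eq!(parse_time_partition_limit(None), Ok(30));
    assert!(parse_time_partition_limit(Some("90")).is_err());
}
```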
Dear Development Team,
I hope this message finds you well. I am reaching out to discuss a particular aspect of the `validate_time_partition` function that has recently come to my attention. This function plays a critical role in validating the time partitioning of incoming data against the server's current time. The relevant code snippet for this functionality is as follows:

The rationale behind this stringent time comparison intrigues me, especially in the context of a recent endeavor I embarked on. My goal is to incorporate the date into stream names and bulk-insert historical data accordingly. The introduction of time partitioning seemed like a beacon of possibility for this project. However, I encountered an unexpected challenge due to the function's requirement that data timestamps align exactly with the server's current time.
Understanding the importance of this validation process is crucial for me. If it serves a specific purpose or security measure, I would greatly appreciate an explanation to better comprehend its necessity. On the other hand, if this strict comparison is not integral to the system's functionality or integrity, might I suggest either removing this constraint or introducing a toggleable mode? Such a modification could enhance flexibility, especially for scenarios like database migrations, where aligning every data point with the exact server time is not feasible.
The potential for this feature to significantly impact the parsability and migration of existing databases is substantial. It could open up new avenues for efficiently managing historical data, which is paramount for organizations looking to evolve their data infrastructure.
In light of this, I kindly ask for your consideration of this matter. An option to adjust the time validation requirement would be immensely beneficial for developers facing similar challenges. Your support in this regard would not only facilitate our current project but also enrich the tool's adaptability for a wide range of use cases.
Thank you very much for your time and understanding. I look forward to your feedback and am open to discussing this further if needed.
Warm regards,