[SUPPORT] Difference in accuracy between the results from Spark and Hive #12370
Comments
Thanks for reporting this issue. I am able to reproduce it. Please allow some time to provide a solution. Meanwhile, could you please check the timezone of the machine where you are running the Hive shell from the terminal (for example with timedatectl)? When I check the parquet file data, the timestamp is stored with microsecond precision:
_hoodie_commit_time: string
_hoodie_commit_seqno: string
_hoodie_record_key: string
_hoodie_partition_path: string
_hoodie_file_name: string
id: int32
time: timestamp[us, tz=UTC]
----
_hoodie_commit_time: [["20241202103846073"]]
_hoodie_commit_seqno: [["20241202103846073_0_0"]]
_hoodie_record_key: [["20241202103846073_0_0"]]
_hoodie_partition_path: [[""]]
_hoodie_file_name: [["730743c3-73b3-473b-82f8-fa242d7e78b4-0_0-13-56_20241202103846073.parquet"]]
id: [[1]]
time: [[2024-11-28 12:00:00.123456Z]]
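For reference, a minimal sketch of how a dump like the one above can be produced with pyarrow; the file path is an assumption and should point at one of the table's data files.

# Inspect the schema and data of a Hudi parquet file with pyarrow.
import pyarrow.parquet as pq

path = "/tmp/hudi_table/730743c3-73b3-473b-82f8-fa242d7e78b4-0_0-13-56_20241202103846073.parquet"  # assumed path
table = pq.read_table(path)
print(table.schema)          # shows: time: timestamp[us, tz=UTC]
print(table.column("time"))  # shows the microsecond value as written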
Thanks for the reply.
The difference in the time value is expected behavior, because Spark does not parse the timestamp according to UTC. The main issue is the difference in precision: is there a way to make Hive output timestamps accurate to microseconds?
The main issue is that Hive converts the timestamp value from UTC to your machine's timezone. Could you please set the timezone to UTC and see if it works?
export HS2_OPTS="-Duser.timezone=$HS2_USER_TZ -Dhive.local.time.zone=$HIVE_LOCAL_TZ"
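To illustrate the conversion being described, here is a small sketch; the Asia/Shanghai zone is only an example, not the reporter's actual machine timezone.

# The parquet value is stored in UTC with microsecond precision; a server that
# renders it in its local timezone shifts the wall-clock time.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

utc_value = datetime(2024, 11, 28, 12, 0, 0, 123456, tzinfo=timezone.utc)
print(utc_value)                                        # 2024-11-28 12:00:00.123456+00:00
print(utc_value.astimezone(ZoneInfo("Asia/Shanghai")))  # 2024-11-28 20:00:00.123456+08:00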
It works, but the precision is still wrong. Is there a way to make Hive output timestamps accurate to microseconds?
If you create a plain Hive table with microsecond timestamps, it works, but on the Hudi side it does not. I will discuss this with the team internally and create a bug report.
Created Hudi Jira - https://issues.apache.org/jira/browse/HUDI-8677
Tips before filing an issue
Have you gone through our FAQs?
The URL is invalid and the page is not found :(
Join the mailing list to engage in conversations and get faster support at [email protected].
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
When I use Spark to write a table with a timestamp column, there is a difference in timestamp precision between the results queried with Spark and those queried with Hive.
Is this behavior expected, and are there any plans to improve it in the future?
To Reproduce
Steps to reproduce the behavior:
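A minimal sketch of a write that exercises this situation; the table name, base path, and Hudi options below are assumptions, not the reporter's exact setup.

# Write a Hudi table with a microsecond-precision timestamp from PySpark and
# read it back through Spark; querying the same table through Hive is where
# the precision difference was observed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, lit

spark = (SparkSession.builder
         .appName("hudi-timestamp-precision")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = (spark.createDataFrame([(1,)], ["id"])
      .withColumn("time", to_timestamp(lit("2024-11-28 12:00:00.123456"))))

(df.write.format("hudi")
   .option("hoodie.table.name", "ts_tbl")                        # assumed name
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "time")
   .mode("overwrite")
   .save("/tmp/ts_tbl"))                                         # assumed path

spark.read.format("hudi").load("/tmp/ts_tbl").select("id", "time").show(truncate=False)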
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version : hudi-0.15
Spark version : 3.4.2
Hive version : 3.1.3
Hadoop version : 3.1
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) :
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.