
[SUPPORT] Difference in accuracy between the results from Spark and Hive #12370

Open
suxiaogang223 opened this issue Nov 29, 2024 · 6 comments
Labels
data-consistency phantoms, duplicates, write skew, inconsistent snapshot priority:critical production down; pipelines stalled; Need help asap. schema-and-data-types

Comments

@suxiaogang223

suxiaogang223 commented Nov 29, 2024

Tips before filing an issue

  • Have you gone through our FAQs?
    The URL is invalid and the page is not found :(

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

When I use Spark to write a table with a timestamp column, there is a difference in precision between the results read from Spark and from Hive.

create table test_timestamp(id int, time timestamp) using hudi;
insert into test_timestamp values (1, timestamp('2024-11-28 12:00:00.123456'));

select time from test_timestamp;
-- result from Spark
2024-11-28 12:00:00.123456

-- result from Hive
2024-11-28 04:00:00.123
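One plausible reading of the two results, under an assumption: if the Hive shell runs in Asia/Shanghai (UTC+8, as the timedatectl output later in this thread shows), then the 8-hour shift corresponds to a timezone conversion and the shorter fraction to millisecond truncation. A minimal sketch that reproduces the Hive output from the Spark output (illustrative only, not Hudi or Hive code):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The value Spark displays, interpreted as Asia/Shanghai local time (an assumption)
spark_value = datetime(2024, 11, 28, 12, 0, 0, 123456, tzinfo=ZoneInfo("Asia/Shanghai"))

# Shift to UTC: 12:00 at +0800 becomes 04:00 UTC
utc_value = spark_value.astimezone(timezone.utc)

# Truncate microseconds to milliseconds, as the Hive result suggests
truncated = utc_value.replace(microsecond=utc_value.microsecond // 1000 * 1000)

print(truncated.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])  # 2024-11-28 04:00:00.123
```

Both effects are needed to get from one value to the other: the timezone shift alone explains the hours, and the truncation alone explains the missing `456`.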

Is this behavior expected, and are there any plans to improve it in the future?
To Reproduce

Steps to reproduce the behavior: run the SQL above from Spark, then query the same table from Hive.

Expected behavior

Spark and Hive return the same timestamp value at the same precision.

Environment Description

  • Hudi version : 0.15

  • Spark version : 3.4.2

  • Hive version : 3.1.3

  • Hadoop version : 3.1

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :


@rangareddy

rangareddy commented Dec 2, 2024

Hi @suxiaogang223

Thanks for reporting this issue. I am able to reproduce it. Please allow some time for me to provide a solution.

Meanwhile, could you please check the timezone of the terminal where you are running the Hive shell?

timedatectl

When I check the parquet file data, the timestamp is stored in microseconds:

_hoodie_commit_time: string
_hoodie_commit_seqno: string
_hoodie_record_key: string
_hoodie_partition_path: string
_hoodie_file_name: string
id: int32
time: timestamp[us, tz=UTC]
----
_hoodie_commit_time: [["20241202103846073"]]
_hoodie_commit_seqno: [["20241202103846073_0_0"]]
_hoodie_record_key: [["20241202103846073_0_0"]]
_hoodie_partition_path: [[""]]
_hoodie_file_name: [["730743c3-73b3-473b-82f8-fa242d7e78b4-0_0-13-56_20241202103846073.parquet"]]
id: [[1]]
time: [[2024-11-28 12:00:00.123456Z]]
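For context on the `timestamp[us, tz=UTC]` type in the dump above: a Parquet timestamp[us] column stores microseconds since the Unix epoch, so the full fractional part written by Spark is present in the file and the loss must happen at read/display time. A small sketch of that encoding (illustrative, not the actual Parquet reader):

```python
from datetime import datetime, timedelta, timezone

# timestamp[us, tz=UTC]: microseconds since the Unix epoch
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
value = datetime(2024, 11, 28, 12, 0, 0, 123456, tzinfo=timezone.utc)
micros_since_epoch = (value - epoch) // timedelta(microseconds=1)

# The sub-second part survives at full precision in the stored integer
print(micros_since_epoch % 1_000_000)  # 123456
```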

@ad1happy2go ad1happy2go added schema-and-data-types priority:critical production down; pipelines stalled; Need help asap. data-consistency phantoms, duplicates, write skew, inconsistent snapshot labels Dec 3, 2024
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Dec 3, 2024
@suxiaogang223
Author

> Thanks for reporting this issue. I am able to reproduce it. […] This inconsistency happened due to the Hive query engine.

Thanks for the reply.
The result is:

timedatectl
               Local time: Wed 2024-12-04 11:40:17 CST
           Universal time: Wed 2024-12-04 03:40:17 UTC
                 RTC time: Wed 2024-12-04 03:40:16
                Time zone: Asia/Shanghai (CST, +0800)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

The time difference is expected behavior, because Spark does not render the timestamp in UTC; the main issue is the difference in precision. Is there a way to make Hive output timestamps accurate to microseconds?
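On the precision side alone: truncating a microsecond timestamp to millisecond resolution, as the Hive output shows, silently drops the last three fractional digits. A tiny illustration:

```python
from datetime import datetime

t = datetime(2024, 11, 28, 12, 0, 0, 123456)

# Millisecond resolution keeps only the first three fractional digits
t_ms = t.replace(microsecond=t.microsecond // 1000 * 1000)

print((t - t_ms).microseconds)  # 456 microseconds are lost
```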

@rangareddy

Hi @suxiaogang223

The main issue is that Hive converts the timestamp value from UTC to your machine's timezone. Could you please set the timezone to UTC and see if it works?

export HS2_OPTS="-Duser.timezone=$HS2_USER_TZ -Dhive.local.time.zone=$HIVE_LOCAL_TZ"

https://community.cloudera.com/t5/Support-Questions/Can-we-change-default-hive-hbase-timestamp-from-UTC-to-other/m-p/336202

@suxiaogang223
Author

> The main issue is that Hive converts the timestamp value from UTC to your machine's timezone. Could you please set the timezone to UTC and see if it works?

It works, but the precision is still wrong. Is there a way to make Hive output timestamps accurate to microseconds?

@rangareddy

Hi @suxiaogang223

If you create a plain Hive table with microsecond timestamps, it works, but on the Hudi side it does not. I will discuss this with the team internally and create a bug report.

@rangareddy

Created Hudi Jira - https://issues.apache.org/jira/browse/HUDI-8677

@ad1happy2go ad1happy2go moved this from ⏳ Awaiting Triage to 🏁 Triaged in Hudi Issue Support Dec 9, 2024