Describe the problem you faced
Hey team,
Here's a link to the thread on the Apache Hudi Slack channel where I posted this issue:
https://apache-hudi.slack.com/archives/C4D716NPQ/p1731532187806959
I'm running a PySpark script in AWS Glue ETL. It reads from a Postgres database table via a JDBC connection and writes the DataFrame to Hudi. The DataFrame contains 7 columns; three of them are type Long with logicalType "timestamp-micros".
I set "hoodie.parquet.outputtimestamptype" to "TIMESTAMP_MILLIS" in the hoodie config, and added this in the Spark config as well:
conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
but the output still shows "timestamp-micros" for field3, field4 and field7.
I tried converting the columns to timestamp-millis by manually setting the schema and generating a new DataFrame from it. I also tried casting to milliseconds within the timestamp, and this does not work either:
new_df = new_df.withColumn("field3", to_timestamp(col("field3"), 'yyyy-MM-dd HH:mm:ss.SSS'))
It truncates the field's values from microseconds to milliseconds, e.g.
2007-03-11 15:46:41.540000 -----> 2007-03-11 15:46:41.5400
but it does not convert the Parquet datatype of those columns (a sketch of this attempt follows below).
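For illustration, here is a minimal PySpark sketch of that truncation attempt (the session setup, value, and column name are illustrative, not taken from the Glue job). It shows that the values lose their microsecond digits while the column stays Spark's TimestampType, so the logical type written downstream is unaffected:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, to_timestamp

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")

# Illustrative single-column frame with a microsecond-precision value.
df = (spark.createDataFrame([("2007-03-11 15:46:41.540040",)], ["field3"])
      .withColumn("field3", to_timestamp(col("field3"))))

# Round-trip through a millisecond string: the values are truncated,
# but the column is still Spark's TimestampType afterwards.
truncated = df.withColumn(
    "field3", to_timestamp(date_format(col("field3"), "yyyy-MM-dd HH:mm:ss.SSS")))
truncated.printSchema()  # field3: timestamp -- the type did not change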
Does the setting "hoodie.parquet.outputtimestamptype" just not work? Is it not possible to output timestamp-milliseconds with Spark?
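For reference, the Spark setting itself does seem to be honored by Spark's native Parquet writer; writing the same frame outside Hudi appears to produce timestamp-millis in the footer (a sketch continuing from the snippet above, with an illustrative output path):

# Plain (non-Hudi) parquet write with spark.sql.parquet.outputTimestampType
# set to TIMESTAMP_MILLIS: the written footer carries timestamp-millis,
# so the behavior reported here looks specific to the Hudi write path.
truncated.write.mode("overwrite").parquet("/tmp/plain_parquet_millis")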
To Reproduce
Steps to reproduce the behavior:
Ranga Reddy on the channel attempted to recreate this issue by defining a DataFrame schema with the TimestampType class and inserting rows with microsecond-precision timestamps. hoodie.parquet.outputtimestamptype was set to TIMESTAMP_MILLIS, but when the table was written to Hudi the logicalType of the timestamp column was still TIMESTAMP_MICROS, even though both the schema and the outputtimestamptype setting were in place.
Run Ranga's code sample:
import org.apache.spark.SparkConf
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

object TestTimeStamp extends App {

  val name = this.getClass.getSimpleName.replace("$", "")
  val sparkConf = new SparkConf().setAppName(name).setIfMissing("spark.master", "local[2]")
  val spark = SparkSession.builder.appName(name).config(sparkConf)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .getOrCreate()

  val tableName = name
  val basePath = f"file:///tmp/warehouse/$tableName"

  // Schema with a TimestampType column; the rows carry microsecond precision.
  val schema = StructType(Array(
    StructField("field1", IntegerType, nullable = false),
    StructField("field2", StringType, nullable = true),
    StructField("field3", TimestampType, nullable = false)
  ))

  val data = Seq(
    Row(1, "A", java.sql.Timestamp.valueOf("2023-10-01 10:00:00.540040")),
    Row(2, "B", java.sql.Timestamp.valueOf("2023-10-01 11:30:00.240030")),
    Row(3, "C", java.sql.Timestamp.valueOf("2023-10-01 12:45:00.140022"))
  )

  // Ask Spark's parquet writer for millisecond timestamps (set both ways,
  // to rule out one of them being ignored).
  spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
  spark.sql("SET spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS")

  val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

  // Hudi write options, including the timestamp type override under test
  val hudiOptions = Map(
    "hoodie.table.name" -> tableName,
    "hoodie.datasource.write.recordkey.field" -> "field1",
    "hoodie.datasource.write.precombine.field" -> "field2",
    "hoodie.parquet.outputtimestamptype" -> "TIMESTAMP_MILLIS"
    //"hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled" -> "true"
  )

  // Write the DataFrame to Hudi
  df.write.format("hudi").options(hudiOptions).mode("overwrite").save(basePath)

  df.show(truncate = false)
  spark.read.format("hudi").load(basePath).show(false)

  spark.stop()
}
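To check which logical type actually lands in the data files, the Parquet footer can be read directly, for example with pyarrow (a sketch, assuming pyarrow is available; the path matches the basePath above, and the glob may need adjusting to Hudi's file layout):

import glob
import pyarrow.parquet as pq

# Locate a data file under the Hudi table's base path (exact layout may vary).
files = glob.glob("/tmp/warehouse/TestTimeStamp/**/*.parquet", recursive=True)
schema = pq.read_schema(files[0])
# With the behavior reported here, this prints timestamp[us], not timestamp[ms].
print(schema.field("field3").type)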
Expected behavior
I expect the parquet schema for field3 to be TIMESTAMP_MILLIS instead of TIMESTAMP_MICROS. Instead, the schema that is actually written looks like this:
"fields" : [ {
"name" : "field1",
"type" : "integer"
}, {
"name" : "field2",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "field3",
"type" : [ "null", {
"type" : "long",
"logicalType" : "timestamp-micros"
} ],
"default" : null
}
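For comparison, the field3 entry I expect would carry the millisecond logical type instead (only the logicalType differs):

{
  "name" : "field3",
  "type" : [ "null", {
    "type" : "long",
    "logicalType" : "timestamp-millis"
  } ],
  "default" : null
}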
Environment Description
Hudi version : Hudi/AWS Bundle 0.14
Spark version : 3.3
Hive version : Not sure
Hadoop version : N/A
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No
Stacktrace
No stacktrace, just the output described above.