[SUPPORT] java.util.NoSuchElementException: FileID <some-uuid> of partition path tenant=xxxxx/date=YYYYMMDD #12298
@hgudladona This should already be fixed by #9879. Can you try with this patch?
This patch requires us to migrate to the 1.x beta release, which we are not ready to do yet. Any chance this can be backported to 0.14.x? Also, can you kindly explain how this remediates our situation? Are file groups outside of the active timeline treated as uncommitted with this patch?
@ad1happy2go This patch will still not solve this problem. If you follow the code path:
If a file slice's base instant time is less than firstNonSavepointCommit, it is treated as completed even though it is not in the active timeline, which is pretty similar to the current behavior. Kindly go through the scenario I mentioned one more time and suggest whether this is the right patch.
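For illustration, here is a minimal, self-contained sketch of the completion check being debated. The class, method, and variable names (isTreatedAsCommitted, completedActiveInstants) are hypothetical stand-ins, not Hudi's actual internals; the one assumption is that instant times compare lexicographically as timestamps:

```java
// Hypothetical sketch of the completion check described above; the names are
// illustrative stand-ins, not Hudi's actual internals.
import java.util.Set;

public class FileSliceVisibilitySketch {

  /**
   * A file slice is reported as committed if its base instant is a completed
   * instant in the active timeline, OR if its base instant sorts before
   * firstNonSavepointCommit (i.e. it has presumably been archived).
   */
  static boolean isTreatedAsCommitted(String baseInstantTime,
                                      Set<String> completedActiveInstants,
                                      String firstNonSavepointCommit) {
    if (completedActiveInstants.contains(baseInstantTime)) {
      return true; // visible and completed in the active timeline
    }
    // Instant times compare lexicographically as yyyyMMddHHmmss timestamps.
    // Anything older than the first non-savepoint commit is assumed committed,
    // even though the active timeline can no longer vouch for it.
    return baseInstantTime.compareTo(firstNonSavepointCommit) < 0;
  }

  public static void main(String[] args) {
    // An instant absent from the active timeline but older than
    // firstNonSavepointCommit is still reported as committed: prints "true".
    System.out.println(isTreatedAsCommitted(
        "20240101000000", Set.of("20240301000000"), "20240201000000"));
  }
}
```

Under this logic, a file slice whose base instant has aged out of the active timeline is still reported as committed, which is the gap the comment points at.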
@nsivabalan could you please help with this?
Taking a look.
I am not sure I follow the use case fully. From what I gauge, this is the situation: the table has multi-writers enabled, different writers could write to different partitions, and the partitioning column is dynamically derived from the input data. So writer1 tries to write to partitionX, and we do not have any completed commits in the active timeline containing any data for the partition of interest. But the stacktrace shows otherwise: if the record was routed to a MergeHandle, it means we already had some data written to partitionX, hence the file group was chosen as a small file, and eventually we ended up with a HoodieMergeHandle. Let me poke around one theory, though the possibility of it is very unlikely, since the commit from writer2 is not committed at all. I have to go through the code to see if we have any edge cases around this.
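To spell out the inference in that comment: reaching HoodieMergeHandle presupposes that small-file planning found committed data in the partition. Below is a hedged, self-contained sketch of that routing decision; SmallFile, chooseSmallFile, and the size threshold are illustrative, not Hudi's actual code:

```java
// Illustrative sketch (not Hudi's actual code) of the routing the comment
// reasons about: inserts assigned to an existing small file take a merge
// path, which presupposes committed data in the partition.
import java.util.List;
import java.util.Optional;

public class SmallFileRoutingSketch {

  record SmallFile(String fileId, long sizeBytes) {}

  /** Pick an existing small file to pad with new inserts, if any qualifies. */
  static Optional<SmallFile> chooseSmallFile(List<SmallFile> smallFiles, long maxFileSizeBytes) {
    return smallFiles.stream()
        .filter(f -> f.sizeBytes() < maxFileSizeBytes)
        .findFirst();
  }

  public static void main(String[] args) {
    // The file-group ID below is the one from the report.
    List<SmallFile> smallFilesInPartition =
        List.of(new SmallFile("eef3ab7f-dc8a-40ec-856f-99010184d9f1-1", 4_000_000L));

    // A hit routes the bucket to a merge-style handle; a miss creates a new
    // file group. Reaching the merge path therefore implies the planning
    // stage saw committed data for the partition.
    chooseSmallFile(smallFilesInPartition, 120L * 1024 * 1024)
        .ifPresentOrElse(
            f -> System.out.println("merge into existing file group " + f.fileId()),
            () -> System.out.println("create a new file group"));
  }
}
```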
Let me clarify. This is not a multi-writer setup; we only have one writer job, with cleans running async. The scenario can be described like this: Hope this helps.
@hgudladona I took a pass at reproducing the issue by manually applying this operation but wasn't able to reproduce it. Do you have any code, a reproducible script, or any other artifacts that can help me reproduce this issue?
Hello @ad1happy2go, could you kindly describe the test setup? How large was the timeline? What was the partition structure? What was the value for
Unfortunately I don't have any reproducible script, but as we described above, I know the exact scenario in which this happens.
Describe the problem you faced
Intermittent java.util.NoSuchElementException when writing to partitions that are out of order and not covered by the active timeline.
To Reproduce
We have a Hudi job reading from Kafka and writing to S3, with partitions dynamically derived from certain columns in the records, in the format tenant=xxxxx/date=YYYYMMDD. Under certain situations, when the partition the new data is written into is not in the active timeline (late-arriving data), there seems to be a mismatch between the file group decided in the "Getting small files from partitions" stage and the "Doing partition and writing data" stage.
Let's say a file group with ID 'eef3ab7f-dc8a-40ec-856f-99010184d9f1-1' is decided as a small file in the "Getting small files from partitions" stage and passed on to the "Doing partition and writing data" stage to INSERT new data and create a new base file for it. That stage fails with the exception below and brings down the streamer job.
However, the streamer job succeeds in two situations:
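To make the reported failure mode concrete, here is a hedged sketch of the lookup pattern that surfaces as the exception. The types and the viewOfPartition map are illustrative stand-ins, not Hudi's internals; only the error message mirrors the one in the report:

```java
// Hedged sketch of the failure mode: the write stage re-resolves the file
// group chosen during small-file planning, and an empty lookup surfaces as
// java.util.NoSuchElementException. Types and the viewOfPartition map are
// illustrative; only the message mirrors the reported exception.
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

public class FileGroupLookupSketch {

  static String latestBaseFile(Map<String, String> viewOfPartition, String fileId) {
    return Optional.ofNullable(viewOfPartition.get(fileId))
        .orElseThrow(() -> new NoSuchElementException(
            "FileID " + fileId + " of partition path tenant=xxxxx/date=YYYYMMDD does not exist."));
  }

  public static void main(String[] args) {
    // Planning saw the file group, but the writing stage's view (refreshed at
    // a different point in time) no longer contains it -> exception.
    latestBaseFile(Map.of(), "eef3ab7f-dc8a-40ec-856f-99010184d9f1-1");
  }
}
```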
Expected behavior
We expect no mismatch between the views of the "Getting small files from partitions" and "Doing partition and writing data" stages when writing to a partition that is not actively tracked in the active timeline.
Environment Description
Hudi version : 0.14.1
Spark version : 3.4.x
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : yes
Additional context
auto.offset.reset: latest
bootstrap.servers: kafka-brokers
group.id: hudi-ingest-some-group
hoodie.archive.async: true
hoodie.archive.automatic: true
hoodie.auto.adjust.lock.configs: true
hoodie.base.path: s3a://some-base-path
hoodie.clean.async: true
hoodie.cleaner.hours.retained: 36
hoodie.cleaner.parallelism: 600
hoodie.cleaner.policy: KEEP_LATEST_BY_HOURS
hoodie.cleaner.policy.failed.writes: LAZY
hoodie.clustering.async.enabled: false
hoodie.combine.before.insert: false
hoodie.copyonwrite.insert.auto.split: false
hoodie.datasource.fetch.table.enable: true
hoodie.datasource.hive_sync.database: hudi_events_v1
hoodie.datasource.hive_sync.mode: hms
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.hive_sync.partition_fields: tenant,date
hoodie.datasource.hive_sync.table: some-table
hoodie.datasource.hive_sync.table_properties: projection.date.type=date|projection.date.format=yyyyMMdd|projection.date.range=19700101,99990101|projection.tenant.type=integer|projection.tenant.range=-1,8675309|projection.enabled=true
hoodie.datasource.meta_sync.condition.sync: true
hoodie.datasource.sync_tool.single_instance: true
hoodie.datasource.write.hive_style_partitioning: true
hoodie.datasource.write.keygenerator.class: com.some-class-prefix.KeyGenerator
hoodie.datasource.write.operation: insert
hoodie.datasource.write.partitionpath.field: tenant:SIMPLE,date:SIMPLE
hoodie.datasource.write.precombine.field: event_time_usec
hoodie.datasource.write.reconcile.schema: false
hoodie.datasource.write.recordkey.field: resource_id
hoodie.deltastreamer.kafka.source.maxEvents: 75000000
hoodie.deltastreamer.schemaprovider.registry.url: http://schema-registry.some-suffix:8085
hoodie.deltastreamer.source.kafka.enable.commit.offset: true
hoodie.deltastreamer.source.kafka.topic: some-topic
hoodie.deltastreamer.source.schema.subject: some-topic-value
hoodie.fail.on.timeline.archiving: false
hoodie.filesystem.view.incr.timeline.sync.enable: true
hoodie.filesystem.view.remote.timeout.secs: 2
hoodie.insert.shuffle.parallelism: 1600
hoodie.memory.merge.max.size: 2147483648
hoodie.metadata.enable: false
hoodie.metrics.on: true
hoodie.metrics.reporter.metricsname.prefix:
hoodie.metrics.reporter.prefix.tablename: false
hoodie.metrics.reporter.type: DATADOG
hoodie.parquet.compression.codec: zstd
hoodie.streamer.source.kafka.minPartitions: 450
hoodie.table.name: <>
hoodie.table.partition.fields: tenant,date
hoodie.table.type: MERGE_ON_READ
hoodie.write.concurrency.mode: OPTIMISTIC_CONCURRENCY_CONTROL
hoodie.write.lock.dynamodb.billing_mode: PROVISIONED
hoodie.write.lock.dynamodb.endpoint_url: https://dynamodb.us-east-2.amazonaws.com/
hoodie.write.lock.dynamodb.partition_key: some-key
hoodie.write.lock.dynamodb.region: us-east-2
hoodie.write.lock.dynamodb.table: HudiLocker
hoodie.write.lock.provider: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.markers.type: DIRECT
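As an aside, the listing above uses standard Java properties syntax (the "key: value" separator is accepted by java.util.Properties), so it can be loaded and inspected with plain Java; the file name hudi-ingest.properties below is hypothetical:

```java
// Minimal sketch: load a properties file like the listing above and read the
// keys most relevant to this report. Plain java.util.Properties, not Hudi's
// own config loader; the file name is hypothetical.
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class LoadHudiProps {
  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    try (FileInputStream in = new FileInputStream("hudi-ingest.properties")) {
      props.load(in); // accepts both "key=value" and "key: value"
    }
    // Settings central to the scenario: async cleaning with lazy failed-write
    // cleaning, and the metadata table disabled.
    System.out.println(props.getProperty("hoodie.clean.async"));                  // true
    System.out.println(props.getProperty("hoodie.cleaner.policy.failed.writes")); // LAZY
    System.out.println(props.getProperty("hoodie.metadata.enable"));              // false
  }
}
```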
Additional logs
Stacktrace