[SUPPORT] - Cannot Ingest Protobuf records using Hudi Streamer #12301
Comments
It seems that this was a known issue (#11598) in v0.15.0, documented here, with PR fixes (#11373, #11660) by @the-other-tim-brown that have been merged to master. I have tested with a JAR compiled from the master branch and I'm still running into issues.
Do you have any suggestions on how to move past this? Thank you.
@remeajayi2022 For the latest issue you are seeing, it looks like there may be conflicting Confluent versions on your classpath. I can see that the …
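For reference, one quick way to check for duplicate Confluent classes is to scan every jar on the classpath; this is only a sketch, and the paths are placeholders:

```bash
# Sketch: scan Spark's jar directory plus any extra jars passed to the job
# for copies of the Confluent schema registry client classes.
# $SPARK_HOME/jars is Spark's standard jar directory; adjust paths as needed.
for jar in "$SPARK_HOME"/jars/*.jar /path/to/extra/jars/*.jar; do
  if unzip -l "$jar" 2>/dev/null | grep -q 'io/confluent/kafka/schemaregistry'; then
    echo "Confluent classes found in: $jar"
  fi
done
```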
Thank you so much for taking the time to help with this earlier! I appreciate your insights. Following your suggestion, I've removed the kafka-protobuf-provider and kafka-json-schema-provider jars that I had previously; they were a higher version and the cause of those compatibility issues. However, I'm now running into a …
When I run the job, I see the following error: … My understanding is that the … There may still be a missing dependency or configuration issue that I haven't accounted for. Do you have any suggestions on how I could resolve these errors? Am I possibly overlooking a required library or some specific classpath setup? Thanks again for your time and support; it's greatly appreciated!
@remeajayi2022 You need to include the jars for the providers. They should be set to use version …
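As a rough sketch of what that looks like on the command line (the jar file names are placeholders, and 5.5.0 is the version mentioned in the next comment, not a confirmed requirement):

```bash
# Sketch: pass the Confluent provider jars explicitly via --jars.
# Verify the version against the Confluent version your Hudi build targets;
# the remaining Streamer arguments are omitted here.
spark-submit \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  --jars kafka-protobuf-provider-5.5.0.jar,kafka-json-schema-provider-5.5.0.jar \
  hudi-utilities-bundle.jar
```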
Thanks, @the-other-tim-brown, for your response. I apologize for not including this earlier, but I had added the Protobuf and JSON provider jars (version 5.5.0) in previous runs, and this resulted in a Protobuf compatibility error.
@remeajayi2022 I will try to build a sample jar bundle with what we're using internally to help unblock you this weekend. Apologies for the delay.
@remeajayi2022 Looking at my own deployment, I have replaced the protobuf jar provided by Spark. The version provided by Spark 3.1.3 is …
Hi @the-other-tim-brown, thanks for your comment. Can you tell me which jars you included and the Spark version for your deployment? I only used the Protobuf jar provided by Confluent, and I don't know which one you are referring to above.
@remeajayi2022 There is a protobuf jar included in the Spark runtime already; you would need to remove that jar from the …
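A minimal sketch of that swap for a self-managed Spark install, assuming $SPARK_HOME/jars is where the runtime jars live (the replacement version below is illustrative only):

```bash
# Sketch: replace Spark's bundled protobuf-java with a newer release.
# Only possible where you control the Spark runtime; the version is illustrative.
rm "$SPARK_HOME"/jars/protobuf-java-*.jar
cp protobuf-java-3.25.1.jar "$SPARK_HOME"/jars/
```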
Does your deployment utilize Dataproc? I can't really modify the Spark runtime since it's a managed service. The highest supported Spark version on Dataproc is v3.5.1, which uses v3.23.4 of Protobuf. I've tested the job with this Spark version, and as expected from your explanation there are still Protobuf compatibility issues. I did try to force the job to use Hudi's protobuf version with this: … But this resulted in a different … Do you mind providing more details about the setup that worked for you so I can replicate it?
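The exact setting tried above is not shown in the thread; one common way to force user-supplied jars ahead of the runtime's copies (an assumption about what was attempted, not a confirmed fix) is Spark's userClassPathFirst flags:

```bash
# Sketch: prefer user-supplied jars over the Spark runtime's copies.
# These are standard Spark properties, but they can surface new linkage
# errors when runtime classes still reference their own protobuf version.
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  hudi-utilities-bundle.jar
```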
@remeajayi2022 I cannot reveal too much detail about the Onehouse managed solution for Hudi, unfortunately. If you run with Spark 3.5.1, can you try omitting your own version of protobuf on the path to see if this works? I think the 3.20+ range should work, but you cannot have two versions on the path.
I'm trying to ingest from a ProtoKafka source using Hudi Streamer but am encountering an issue.
The stack trace points to a misconfigured schema registry URL. However, the same URL works for Hudi Streamer jobs ingesting from AvroKafka sources. When I ping the schema registry URL using curl, it correctly returns the schema.
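For concreteness, that curl check against the Confluent Schema Registry REST API looks like this (the registry URL and subject name are placeholders):

```bash
# Sketch: fetch the latest registered schema for a topic's value subject.
curl http://schema-registry:8081/subjects/my-topic-value/versions/latest
```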
Additional Context
This setup works when the source is AvroKafka. In the Spark job, I also tried setting hoodie.streamer.schemaprovider.proto.class.name and hoodie.streamer.source.kafka.proto.value.deserializer.class=org.apache.kafka.common.serialization.ByteArrayDeserializer. I don't think these are required, but their presence/absence did not resolve this error. A sketch of these properties is shown below.
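For reference, a minimal sketch of the Proto-related Streamer properties involved; the key names come from this thread and Hudi's hoodie.streamer config namespace, while the topic, registry URL, and proto class values are placeholders:

```bash
# Sketch: Proto-related Hudi Streamer properties written to a props file.
# All values are placeholders; only the key names are taken from the thread
# and Hudi's streamer config namespace.
cat > proto-kafka.properties <<'EOF'
hoodie.streamer.source.kafka.topic=my_topic
hoodie.streamer.schemaprovider.registry.url=http://schema-registry:8081
hoodie.streamer.schemaprovider.proto.class.name=com.example.MyProtoMessage
hoodie.streamer.source.kafka.proto.value.deserializer.class=org.apache.kafka.common.serialization.ByteArrayDeserializer
EOF
```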
Environment Details
Hudi version: v0.15.0
Spark version: 3.1.3
Scala version: 2.12
Google Dataproc version: 2.0.125-debian10
Spark Submit Command and Protobuf Configuration
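The original command was not captured in this thread; the following is a representative sketch only, with placeholder jar names, paths, bucket, and table names:

```bash
# Sketch only: a representative Hudi Streamer launch for a ProtoKafkaSource.
# All paths and names are placeholders; this is not the reporter's command.
spark-submit \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  --jars kafka-protobuf-provider-5.5.0.jar,kafka-json-schema-provider-5.5.0.jar \
  hudi-utilities-bundle_2.12-0.15.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.ProtoKafkaSource \
  --props proto-kafka.properties \
  --target-base-path gs://my-bucket/my_table \
  --target-table my_table
```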
Steps to Reproduce
I’d appreciate any insights into resolving this issue.
Is there an alternative or a workaround for configuring the Protobuf schema?
Am I missing any configuration settings?
Thank you for your help!