Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for org.openx.data.jsonserde.JsonSerDe in Hive connector #5638

Closed
findepi opened this issue Oct 21, 2020 · 21 comments
Closed

Add support for org.openx.data.jsonserde.JsonSerDe in Hive connector #5638

findepi opened this issue Oct 21, 2020 · 21 comments
Milestone

Comments

@findepi
Copy link
Member

findepi commented Oct 21, 2020

Checklist

@pettyjamesm
Copy link
Member

Beware of rcongiu/Hive-JSON-Serde#228 if you’re going to handle timestamps with that serde.

@findepi
Copy link
Member Author

findepi commented Oct 21, 2020

Some other previous past reports of problems with openx json serde -- https://prestosql.slack.com/archives/CGB0QHWSW/p1595400446310500

@electrum
Copy link
Member

We recommend using org.apache.hive.hcatalog.data.JsonSerDe instead, as it is fully supported in Presto.

@JhonJohnatan
Copy link

Guys, we really need this issue.

@GalDayan
Copy link

GalDayan commented Nov 3, 2020

Hey guys.

I don't think it's an enhancement. I think it really mandatory.

Is there anything I can do to advance the issue?

@tooptoop4
Copy link
Contributor

would u like to code up a PR? @GalDayan

@GalDayan
Copy link

GalDayan commented Nov 3, 2020

would u like to code up a PR? @GalDayan

Sure. I would love to help.
What is the procedure?

@tooptoop4
Copy link
Contributor

@GalDayan whip up something that works on your local first can deal with the paperwork later

@findepi
Copy link
Member Author

findepi commented Nov 10, 2020

Beware of rcongiu/Hive-JSON-Serde#228 if you’re going to handle timestamps with that serde.

@pettyjamesm is there any other code version we should be using instead?

@pettyjamesm
Copy link
Member

@findepi none that I’m aware of. We’ve maintained a version with some other thread safety fixes (eg: unsynchronized static map access: https://github.com/rcongiu/Hive-JSON-Serde/blob/5c7ca56d084c6b507dcf78ad16b66e01c518201f/json-serde/src/main/java/org/openx/data/jsonserde/objectinspector/JsonObjectInspectorFactory.java#L32) but doubt we’ll be taking on the task of maintaining an open source fork, which is what seems to be what’s proposed here.

@electrum electrum removed the enhancement New feature or request label Dec 8, 2020
@JhonJohnatan
Copy link

JhonJohnatan commented Dec 29, 2020

Any updates? @electrum

@chickenPopcorn
Copy link

Also having the same issue when upgrading with 341+. We have tables in the data lake populated using openx JSON serde. It will very difficult for us to migrate to a different serde.
I think almost anyone who migrates from Athena will encounter this issue https://docs.aws.amazon.com/athena/latest/ug/json-serde.html

@findepi
Copy link
Member Author

findepi commented Feb 3, 2021

We looked into this and it turns out the OpenX serde is incompatibility licensed and cannot be included here.
(The code itself is Apache licensed, but it bundles code that's not & is not compatible)

We discussed this with @martint @electrum and myself, and it seems the best course of action is to reimplement the OpenX serde, avoiding the problematic dependency, and also avoiding the problematic dependency on Hive APIs (which would help avoid things like #5602).

I am not aware about anyone working on this currently.

@ruturaj04
Copy link

hey guys,

is there an ETA on support for openx serde with trino?

even we are facing a similar issues as pointed out by @chickenPopcorn here

would be great to see this support made available :)

@ruturaj04
Copy link

ruturaj04 commented Nov 15, 2021

@findepi @martint @electrum

any update/ETA on this request/issue?

we would really love to move to trino but cannot do the migration due to this issue

if you could point me to the right files and throw some idea on the changes maybe I could put in some time to fix this from my end and raise a PR

@erindru
Copy link

erindru commented Aug 2, 2022

I managed to get a Trino query to return data from a table using org.openx.data.jsonserde.JsonSerDe created in Athena.

I have no idea how broken this method is / if there are concurrency issues, but it did return data :)

Basically, it just requires getting the org.openx.data.jsonserde.JsonSerDe class on the classpath. Since each plugin uses an isolated classloader, I just chucked it in the Hive plugin folder, eg:

$ curl -o ./plugin/hive/json-serde-1.3.8-jar-with-dependencies.jar http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-serde-1.3.8-jar-with-dependencies.jar

@pettyjamesm
Copy link
Member

I have no idea how broken this method is / if there are concurrency issues

The publicly available openx serde is _very _ broken when used from within Trino, because it isn’t written in a way that tolerates separate threads reading data concurrently on the same JVM. Athena has many, many patches to try to fix the issues we have found over time and it seems like we’re always finding more.

When I’ve tried contributing some of those patches upstream, the PRs have sat open without review comments or being merged for years. The project seems entirely unmaintained, but I can assure you that if you’re using the publicly available artifact- you will get incorrect results. Some of those will be silent and others can cause loud failures. I do not recommend using it.

@erindru
Copy link

erindru commented Aug 3, 2022

Understood, good to know!

I've come to the conclusion that re-using Athena objects in Trino is generally unsupported (views dont work either).

Unfortunately the recommended SerDe for jsonlines files (org.apache.hive.hcatalog.data.JsonSerDe) forces you to define any nested JSON as <struct>'s while the OpenX SerDe just returns nested JSON as a string which was our main reason for using it.

Basically, this: https://stackoverflow.com/questions/58606743/hive-table-with-nested-json-as-string-value

@findepi
Copy link
Member Author

findepi commented Aug 3, 2022

@pettyjamesm @erindru ,
@MiguelWeezardo made some fixes to the serde code and made them available at https://github.com/starburstdata/hive-json-serde
cc @anusudarsan

@MiguelWeezardo
Copy link
Member

Starburst OpenX serde fork has some fixes regarding concurrent parsing of timestamps which was using thread-unsafe classes in the original code. There are multiple features added as well, such as compatibility with Hive 3 APIs. Check out the develop branch.

@findepi
Copy link
Member Author

findepi commented Mar 29, 2023

Fixed by @dain in #16073

@findepi findepi closed this as completed Mar 29, 2023
@findepi findepi added this to the 411 milestone Mar 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

10 participants