Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-4069: Remove Reader String Cache from Generic Datum Reader #3194

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

belugabehr
Copy link
Contributor

What is the purpose of the change

  • This pull request improves file read performance by removing an unnecessary caching level, fixing AVRO-4069.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@github-actions github-actions bot added the Java Pull Requests for Java binding label Oct 1, 2024
@belugabehr belugabehr requested a review from martin-g October 2, 2024 03:04
@belugabehr
Copy link
Contributor Author

@martin-g Thanks so much for your support on these PRs. Here's another one that needs attention that has a positive performance impact.

Copy link
Member

@martin-g martin-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked at the Git commit / JIRA that introduced this cache ?
I guess it has been added for a reason.

Now you remove one of the tests without good reason (at least I don't see it) and without adding better tests.
https://issues.apache.org/jira/browse/AVRO-3531 talks about a real world scenario where the code without the cache caused issues.

@@ -33,27 +31,9 @@ public class TestGenericDatumReader {

private static final Random r = new Random(System.currentTimeMillis());

@Test
void readerCache() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you delete the test ?
Looking at it I think it should still work. You just need to remove the ctor parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is testing the thread-safe nature of the code. This code should not be considered thread-safe and therefore should be removed.

@KalleOlaviNiemitalo
Copy link
Contributor

stringClassCache was first added in a3b38eb, as part of AVRO-1268 implementation.

AVRO-3531 was a thread safety bug. Its fix #1719 changed the types of the caches and moved them into static class ReaderCache, but did not add any new caches.

I don't know whether stringClassCache is beneficial now; but AVRO-3531 doesn't look like a reason to keep it.

@belugabehr
Copy link
Contributor Author

@KalleOlaviNiemitalo @martin-g

Thank you for the correspondence.

I'm taking another look at this. It seems less than ideal that there is synchronized code paths for fast-reading Avro data. I am not sure why this Collection was ever updated to be synchronized as this code is inherently not thread-safe. The GenericDatumReader read() method accepts a Decoder. Decoders are certainly not thread safe as they typically have unsynchronized source reads and internal buffer manipulation.

If we can relax this requirement then performance is measurably better as a read path will have no synchronization overhead.

@belugabehr belugabehr force-pushed the belugabehr/AVRO-4069 branch 2 times, most recently from 3f2cc32 to 4ff72b2 Compare January 4, 2025 19:15
// For some of the more common classes, implement specific routines.
// For more complex classes, use reflection.
if (c == Integer.class) {
return Integer.parseInt(s, 10);

Check notice

Code scanning / CodeQL

Missing catch of NumberFormatException Note

Potential uncaught 'java.lang.NumberFormatException'.
IdentitySchemaKey key = (IdentitySchemaKey) obj;
return this == key || this.schema == key.schema;
if (c == Long.class) {
return Long.parseLong(s, 10);

Check notice

Code scanning / CodeQL

Missing catch of NumberFormatException Note

Potential uncaught 'java.lang.NumberFormatException'.
public ReaderCache(Function<Schema, Class> findStringClass) {
this.findStringClass = findStringClass;
if (c == Float.class) {
return Float.parseFloat(s);

Check notice

Code scanning / CodeQL

Missing catch of NumberFormatException Note

Potential uncaught 'java.lang.NumberFormatException'.
final Function<String, Object> ctor = stringCtorCache.computeIfAbsent(c, this::buildFunction);
return ctor.apply(s);
if (c == Double.class) {
return Double.parseDouble(s);

Check notice

Code scanning / CodeQL

Missing catch of NumberFormatException Note

Potential uncaught 'java.lang.NumberFormatException'.
@belugabehr belugabehr force-pushed the belugabehr/AVRO-4069 branch from 4ff72b2 to 5b2fdf6 Compare February 2, 2025 04:39
@belugabehr
Copy link
Contributor Author

belugabehr commented Feb 2, 2025

I am taking another pass at this one. The performance improvements are just too significant to pass up (11%+ in my local benchmarks). Here are my benchmarks:

# JMH version: 1.35
# VM version: JDK 17.0.13, OpenJDK 64-Bit Server VM, 17.0.13+11-Ubuntu-2ubuntu124.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java

# This is from AVRO-4069 (the worst result I measured for this branch)
Result "com.szymon_kaluza.protobuf_avro.benchmark.AvroBench.serializeAndDeserializeSmall":
  889717.049 ±(99.9%) 9103.527 ops/s [Average]
  (min, avg, max) = (860194.914, 889717.049, 909429.461), stdev = 13625.731
  CI (99.9%): [880613.522, 898820.576] (assumes normal distribution)
  
  
# This is from master (the best result I measured for this branch)
Result "com.szymon_kaluza.protobuf_avro.benchmark.AvroBench.serializeAndDeserializeSmall":
  796084.662 ±(99.9%) 12190.142 ops/s [Average]
  (min, avg, max) = (764448.008, 796084.662, 850984.160), stdev = 18245.632
  CI (99.9%): [783894.521, 808274.804] (assumes normal distribution)

The reason I even investigated this is because Visual VM profiling was lighting up on IdentityHashMap usage. By migrating to HashMap, I was able to achieve most of the performance gains. Also, another thing that I hadn't considered before is that IdentityHashMap uses object reference-equality for insertions and lookups. This means that the Map can look different for each instantiation of Avro because obviously the reference values will be different every time it runs. This made benchmarking kind of tricky and unreliable as well. If the keys just happened to have a lot of collisions, then performance would suffer, alternatively, benchmarking improved if the keys happen to be laid out more evenly in the bucket space. Yet another reason to simply get off this implementation.

There is one caveat in the JavaDocs:

For many JRE implementations and operation mixes, this class will yield better performance than HashMap (which uses chaining rather than linear-probing).

I get around this by exploiting the fact that the number of items in the map is fixed as instantiation. Therefore, I use a Map size sufficiently large such that there are no key collisions and therefore no chaining involved.

@belugabehr
Copy link
Contributor Author

One thing to note is that I had to change the Map type. The class key uses reference based equals/hashcode and so had to convert to String to get a consistent key type.

@belugabehr
Copy link
Contributor Author

@martin-g

@martin-g
Copy link
Member

martin-g commented Feb 3, 2025

Sorry, I am not familiar with this code to be able to tell whether the changes are good or not.

@belugabehr
Copy link
Contributor Author

@martin-g Thanks for the pass. Any ideas who else might be available to review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants