AVRO-4069: Remove Reader String Cache from Generic Datum Reader #3194

belugabehr · 2024-10-01T03:50:55Z

What is the purpose of the change

This pull request improves file read performance by removing an unnecessary caching level, fixing AVRO-4069.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

belugabehr · 2024-10-08T23:17:45Z

@martin-g Thanks so much for your support on these PRs. Here's another one that needs attention that has a positive performance impact.

martin-g

Have you looked at the Git commit / JIRA that introduced this cache ?
I guess it has been added for a reason.

Now you remove one of the tests without good reason (at least I don't see it) and without adding better tests.
https://issues.apache.org/jira/browse/AVRO-3531 talks about a real world scenario where the code without the cache caused issues.

martin-g · 2024-10-09T06:39:30Z

lang/java/avro/src/test/java/org/apache/avro/generic/TestGenericDatumReader.java

@@ -33,27 +31,9 @@ public class TestGenericDatumReader {

  private static final Random r = new Random(System.currentTimeMillis());

-  @Test
-  void readerCache() {


Why do you delete the test ?
Looking at it I think it should still work. You just need to remove the ctor parameter.

I believe this is testing the thread-safe nature of the code. This code should not be considered thread-safe and therefore should be removed.

KalleOlaviNiemitalo · 2024-10-09T10:53:08Z

stringClassCache was first added in a3b38eb, as part of AVRO-1268 implementation.

AVRO-3531 was a thread safety bug. Its fix #1719 changed the types of the caches and moved them into static class ReaderCache, but did not add any new caches.

I don't know whether stringClassCache is beneficial now; but AVRO-3531 doesn't look like a reason to keep it.

belugabehr · 2025-01-04T02:59:05Z

@KalleOlaviNiemitalo @martin-g

Thank you for the correspondence.

I'm taking another look at this. It seems less than ideal that there is synchronized code paths for fast-reading Avro data. I am not sure why this Collection was ever updated to be synchronized as this code is inherently not thread-safe. The GenericDatumReader read() method accepts a Decoder. Decoders are certainly not thread safe as they typically have unsynchronized source reads and internal buffer manipulation.

If we can relax this requirement then performance is measurably better as a read path will have no synchronization overhead.

lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java

+    // For some of the more common classes, implement specific routines.
+    // For more complex classes, use reflection.
+    if (c == Integer.class) {
+      return Integer.parseInt(s, 10);


lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java

-      IdentitySchemaKey key = (IdentitySchemaKey) obj;
-      return this == key || this.schema == key.schema;
+    if (c == Long.class) {
+      return Long.parseLong(s, 10);


lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java

-    public ReaderCache(Function<Schema, Class> findStringClass) {
-      this.findStringClass = findStringClass;
+    if (c == Float.class) {
+      return Float.parseFloat(s);


lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java

-      final Function<String, Object> ctor = stringCtorCache.computeIfAbsent(c, this::buildFunction);
-      return ctor.apply(s);
+    if (c == Double.class) {
+      return Double.parseDouble(s);


belugabehr · 2025-02-02T04:51:33Z

I am taking another pass at this one. The performance improvements are just too significant to pass up (11%+ in my local benchmarks). Here are my benchmarks:

# JMH version: 1.35
# VM version: JDK 17.0.13, OpenJDK 64-Bit Server VM, 17.0.13+11-Ubuntu-2ubuntu124.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java

# This is from AVRO-4069 (the worst result I measured for this branch)
Result "com.szymon_kaluza.protobuf_avro.benchmark.AvroBench.serializeAndDeserializeSmall":
  889717.049 ±(99.9%) 9103.527 ops/s [Average]
  (min, avg, max) = (860194.914, 889717.049, 909429.461), stdev = 13625.731
  CI (99.9%): [880613.522, 898820.576] (assumes normal distribution)
  
  
# This is from master (the best result I measured for this branch)
Result "com.szymon_kaluza.protobuf_avro.benchmark.AvroBench.serializeAndDeserializeSmall":
  796084.662 ±(99.9%) 12190.142 ops/s [Average]
  (min, avg, max) = (764448.008, 796084.662, 850984.160), stdev = 18245.632
  CI (99.9%): [783894.521, 808274.804] (assumes normal distribution)

The reason I even investigated this is because Visual VM profiling was lighting up on IdentityHashMap usage. By migrating to HashMap, I was able to achieve most of the performance gains. Also, another thing that I hadn't considered before is that IdentityHashMap uses object reference-equality for insertions and lookups. This means that the Map can look different for each instantiation of Avro because obviously the reference values will be different every time it runs. This made benchmarking kind of tricky and unreliable as well. If the keys just happened to have a lot of collisions, then performance would suffer, alternatively, benchmarking improved if the keys happen to be laid out more evenly in the bucket space. Yet another reason to simply get off this implementation.

There is one caveat in the JavaDocs:

For many JRE implementations and operation mixes, this class will yield better performance than HashMap (which uses chaining rather than linear-probing).

https://docs.oracle.com/javase/8/docs/api/java/util/IdentityHashMap.html

I get around this by exploiting the fact that the number of items in the map is fixed as instantiation. Therefore, I use a Map size sufficiently large such that there are no key collisions and therefore no chaining involved.

belugabehr · 2025-02-02T04:59:02Z

One thing to note is that I had to change the Map type. The class key uses reference based equals/hashcode and so had to convert to String to get a consistent key type.

belugabehr · 2025-02-02T04:59:08Z

@martin-g

martin-g · 2025-02-03T08:12:53Z

Sorry, I am not familiar with this code to be able to tell whether the changes are good or not.

belugabehr · 2025-03-08T18:29:24Z

@martin-g Thanks for the pass. Any ideas who else might be available to review?

github-actions bot added the Java Pull Requests for Java binding label Oct 1, 2024

belugabehr requested a review from martin-g October 2, 2024 03:04

martin-g reviewed Oct 9, 2024

View reviewed changes

belugabehr force-pushed the belugabehr/AVRO-4069 branch 2 times, most recently from 3f2cc32 to 4ff72b2 Compare January 4, 2025 19:15

github-advanced-security bot found potential problems Jan 4, 2025

View reviewed changes

belugabehr added 2 commits February 1, 2025 23:31

AVRO-4069: Remove Reader String Cache from Generic Datum Reader

17da90b

AVRO-4069: Remove Reader String Cache from Generic Datum Reader

5b2fdf6

belugabehr force-pushed the belugabehr/AVRO-4069 branch from 4ff72b2 to 5b2fdf6 Compare February 2, 2025 04:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AVRO-4069: Remove Reader String Cache from Generic Datum Reader #3194

AVRO-4069: Remove Reader String Cache from Generic Datum Reader #3194

belugabehr commented Oct 1, 2024

belugabehr commented Oct 8, 2024

martin-g left a comment

martin-g Oct 9, 2024

belugabehr Jan 4, 2025

KalleOlaviNiemitalo commented Oct 9, 2024

belugabehr commented Jan 4, 2025

belugabehr commented Feb 2, 2025 •

edited

Loading

belugabehr commented Feb 2, 2025

belugabehr commented Feb 2, 2025

martin-g commented Feb 3, 2025

belugabehr commented Mar 8, 2025

AVRO-4069: Remove Reader String Cache from Generic Datum Reader #3194

Are you sure you want to change the base?

AVRO-4069: Remove Reader String Cache from Generic Datum Reader #3194

Conversation

belugabehr commented Oct 1, 2024

What is the purpose of the change

Verifying this change

Documentation

belugabehr commented Oct 8, 2024

martin-g left a comment

Choose a reason for hiding this comment

martin-g Oct 9, 2024

Choose a reason for hiding this comment

belugabehr Jan 4, 2025

Choose a reason for hiding this comment

KalleOlaviNiemitalo commented Oct 9, 2024

belugabehr commented Jan 4, 2025

belugabehr commented Feb 2, 2025 • edited Loading

belugabehr commented Feb 2, 2025

belugabehr commented Feb 2, 2025

martin-g commented Feb 3, 2025

belugabehr commented Mar 8, 2025

belugabehr commented Feb 2, 2025 •

edited

Loading