Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Character encoding issues with refactored BufferedTokenizerExt #16694

Open
donoghuc opened this issue Nov 19, 2024 · 0 comments
Open

Character encoding issues with refactored BufferedTokenizerExt #16694

donoghuc opened this issue Nov 19, 2024 · 0 comments

Comments

@donoghuc
Copy link
Member

With the addition of https://github.com/elastic/logstash/pull/16482/commits it is possible that character encodings can be improperly handled leading to corrupted data.

Logstash information:
The affected (released) versions are:

  • 8.15.4

Reproduction

The issue can be demonstrated by making the following changes and performing the small reproduction case in a repl:

diff --git a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
index 2c36370af..7bd9e2e03 100644
--- a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
+++ b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
@@ -79,9 +79,25 @@ public class BufferedTokenizerExt extends RubyObject {
     @SuppressWarnings("rawtypes")
     public RubyArray extract(final ThreadContext context, IRubyObject data) {
         final RubyArray entities = data.convertToString().split(delimiter, -1);
+        // Debug before addAll
+        System.out.println("\n=== Before addAll ===");
+        for (int i = 0; i < entities.size(); i++) {
+            RubyString entity = (RubyString)entities.eltInternal(i);
+            System.out.println("Entity " + i + ":");
+            System.out.println("  Bytes: " + java.util.Arrays.toString(entity.getBytes()));
+            System.out.println("  Encoding: " + entity.getEncoding());
+        }
         if (!bufferFullErrorNotified) {
             input.clear();
             input.addAll(entities);
+            // Debug after addAll
+            System.out.println("\n=== After addAll ===");
+            for (int i = 0; i < input.size(); i++) {
+                RubyString stored = (RubyString)input.eltInternal(i);
+                System.out.println("Stored " + i + ":");
+                System.out.println("  Bytes: " + java.util.Arrays.toString(stored.getBytes()));
+                System.out.println("  Encoding: " + stored.getEncoding());
+            }
         } else {
             // after a full buffer signal
             if (input.isEmpty()) {
irb(main):001:0> line = LogStash::Plugin.lookup("codec", "line").new
=> <LogStash::Codecs::Line id=>"line_7fe29211-65b2-4931-985b-3ff04b227a90", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">
irb(main):002:0> buftok = FileWatch::BufferedTokenizer.new
=> #<FileWatch::BufferedTokenizer:0x350ce9db>
irb(main):003:0> buftok.extract("\xA3".force_encoding("ISO8859-1"))
irb(main):004:0> buftok.flush.bytes

=== Before addAll ===
Entity 0:
  Bytes: [-93]
  Encoding: ISO-8859-1

=== After addAll ===
Stored 0:
  Bytes: [-62, -93]
  Encoding: UTF-8
=> [194, 163]

We expect a Single byte [163] (£ in ISO-8859-1) but we observe instead Double-encoded bytes [194, 163] (UTF-8 representation of £).

Source of the bug
RubyArray.add (invoked by addAll) invokes a conversion JavaUtil.convertJavaToUsableRubyObject(metaClass.runtime, element) which invokes a StringConverter which creates a new unicode string at which appears to be the source of the extra encoding.

additional information

package org.logstash.common;

import org.jruby.RubyArray;
import org.jruby.RubyString;
import org.jruby.runtime.ThreadContext;
import org.jruby.runtime.builtin.IRubyObject;
import org.junit.Before;
import org.junit.Test;
import org.logstash.RubyUtil;

import static org.junit.Assert.assertEquals;
import static org.logstash.RubyUtil.RUBY;

@SuppressWarnings("rawtypes")
public class BoomTest {

    private IRubyObject rubyInput;

    private static void assertEqualsBytes(byte[] expected, byte[] actual) {
        assertEquals(expected.length, actual.length);
        for (int i = 0; i < expected.length; i++) {
            assertEquals(expected[i], actual[i]);
        }
    }

    private ThreadContext context;

    private static RubyString NEW_LINE = (RubyString) RubyUtil.RUBY.newString("\n").
            freeze(RubyUtil.RUBY.getCurrentContext());

    @Before
    public void setUp() {
        context = RUBY.getCurrentContext();
        RubyString rubyString = RubyString.newString(RUBY, new byte[]{(byte) 0xA3});
        rubyInput = rubyString.force_encoding(context, RUBY.newString("ISO8859-1"));
    }

    @Test
    public void testEncodingIsPreservedOutside() {
        final RubyArray entities = rubyInput.convertToString().split(NEW_LINE, -1);

        // shift the first directly from entities, doesn't apply any charset conversion
        RubyString head = (RubyString) entities.shift(context);

        assertEqualsBytes(new byte[]{(byte) 0xA3}, head.getBytes());
    }

    @Test
    public void testEncodingIsPreservedOutsideAfterAdding() {
        final RubyArray entities = rubyInput.convertToString().split(NEW_LINE, -1);

        // adding all entities and shifting the first from this secondary accumulator does some charset conversion
        RubyArray input = RubyUtil.RUBY.newArray();
        input.addAll(entities);
        RubyString head = (RubyString) input.shift(context);

        assertEqualsBytes(new byte[]{(byte) 0xA3}, head.getBytes());
    }
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant