Character encoding issues with refactored `BufferedTokenizerExt` #16694

donoghuc · 2024-11-19T20:42:52Z

With the addition of https://github.com/elastic/logstash/pull/16482/commits it is possible that character encodings can be improperly handled leading to corrupted data.

Logstash information:
The affected (released) versions are:

8.15.4

Reproduction

The issue can be demonstrated by making the following changes and performing the small reproduction case in a repl:

diff --git a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
index 2c36370af..7bd9e2e03 100644
--- a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
+++ b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
@@ -79,9 +79,25 @@ public class BufferedTokenizerExt extends RubyObject {
     @SuppressWarnings("rawtypes")
     public RubyArray extract(final ThreadContext context, IRubyObject data) {
         final RubyArray entities = data.convertToString().split(delimiter, -1);
+        // Debug before addAll
+        System.out.println("\n=== Before addAll ===");
+        for (int i = 0; i < entities.size(); i++) {
+            RubyString entity = (RubyString)entities.eltInternal(i);
+            System.out.println("Entity " + i + ":");
+            System.out.println("  Bytes: " + java.util.Arrays.toString(entity.getBytes()));
+            System.out.println("  Encoding: " + entity.getEncoding());
+        }
         if (!bufferFullErrorNotified) {
             input.clear();
             input.addAll(entities);
+            // Debug after addAll
+            System.out.println("\n=== After addAll ===");
+            for (int i = 0; i < input.size(); i++) {
+                RubyString stored = (RubyString)input.eltInternal(i);
+                System.out.println("Stored " + i + ":");
+                System.out.println("  Bytes: " + java.util.Arrays.toString(stored.getBytes()));
+                System.out.println("  Encoding: " + stored.getEncoding());
+            }
         } else {
             // after a full buffer signal
             if (input.isEmpty()) {

irb(main):001:0> line = LogStash::Plugin.lookup("codec", "line").new
=> <LogStash::Codecs::Line id=>"line_7fe29211-65b2-4931-985b-3ff04b227a90", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">
irb(main):002:0> buftok = FileWatch::BufferedTokenizer.new
=> #<FileWatch::BufferedTokenizer:0x350ce9db>
irb(main):003:0> buftok.extract("\xA3".force_encoding("ISO8859-1"))
irb(main):004:0> buftok.flush.bytes

=== Before addAll ===
Entity 0:
  Bytes: [-93]
  Encoding: ISO-8859-1

=== After addAll ===
Stored 0:
  Bytes: [-62, -93]
  Encoding: UTF-8
=> [194, 163]

We expect a Single byte [163] (£ in ISO-8859-1) but we observe instead Double-encoded bytes [194, 163] (UTF-8 representation of £).

Source of the bug
RubyArray.add (invoked by addAll) invokes a conversion JavaUtil.convertJavaToUsableRubyObject(metaClass.runtime, element) which invokes a StringConverter which creates a new unicode string at which appears to be the source of the extra encoding.

additional information

A test has been raised to demonstrate the bug: [Test] add test to verify that BufferedTokenizerExt doesn't apply enc… #16690
Another example has been submitted showing the behavior outside the tokenizer code:

package org.logstash.common;

import org.jruby.RubyArray;
import org.jruby.RubyString;
import org.jruby.runtime.ThreadContext;
import org.jruby.runtime.builtin.IRubyObject;
import org.junit.Before;
import org.junit.Test;
import org.logstash.RubyUtil;

import static org.junit.Assert.assertEquals;
import static org.logstash.RubyUtil.RUBY;

@SuppressWarnings("rawtypes")
public class BoomTest {

    private IRubyObject rubyInput;

    private static void assertEqualsBytes(byte[] expected, byte[] actual) {
        assertEquals(expected.length, actual.length);
        for (int i = 0; i < expected.length; i++) {
            assertEquals(expected[i], actual[i]);
        }
    }

    private ThreadContext context;

    private static RubyString NEW_LINE = (RubyString) RubyUtil.RUBY.newString("\n").
            freeze(RubyUtil.RUBY.getCurrentContext());

    @Before
    public void setUp() {
        context = RUBY.getCurrentContext();
        RubyString rubyString = RubyString.newString(RUBY, new byte[]{(byte) 0xA3});
        rubyInput = rubyString.force_encoding(context, RUBY.newString("ISO8859-1"));
    }

    @Test
    public void testEncodingIsPreservedOutside() {
        final RubyArray entities = rubyInput.convertToString().split(NEW_LINE, -1);

        // shift the first directly from entities, doesn't apply any charset conversion
        RubyString head = (RubyString) entities.shift(context);

        assertEqualsBytes(new byte[]{(byte) 0xA3}, head.getBytes());
    }

    @Test
    public void testEncodingIsPreservedOutsideAfterAdding() {
        final RubyArray entities = rubyInput.convertToString().split(NEW_LINE, -1);

        // adding all entities and shifting the first from this secondary accumulator does some charset conversion
        RubyArray input = RubyUtil.RUBY.newArray();
        input.addAll(entities);
        RubyString head = (RubyString) input.shift(context);

        assertEqualsBytes(new byte[]{(byte) 0xA3}, head.getBytes());
    }
}

The text was updated successfully, but these errors were encountered:

donoghuc added bug status:needs-triage labels Nov 19, 2024

donoghuc mentioned this issue Nov 20, 2024

remove .travis.yml customization logstash-plugins/logstash-input-tcp#226

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Character encoding issues with refactored `BufferedTokenizerExt` #16694

Character encoding issues with refactored `BufferedTokenizerExt` #16694

donoghuc commented Nov 19, 2024

Character encoding issues with refactored BufferedTokenizerExt #16694

Character encoding issues with refactored BufferedTokenizerExt #16694

Comments

donoghuc commented Nov 19, 2024

Character encoding issues with refactored `BufferedTokenizerExt` #16694

Character encoding issues with refactored `BufferedTokenizerExt` #16694