[SPARK-57419][SQL] Read and infer JSON schema from tar archives by akshatshenoi-db · Pull Request #56480 · apache/spark

akshatshenoi-db · 2026-06-12T18:58:10Z

What changes were proposed in this pull request?

SPARK-57135 added reading CSV files packed in tar archives (.tar/.tar.gz/.tgz) and SPARK-57321 added schema inference for them, both gated by spark.sql.files.archive.reader.enabled. This extends the same capability to the JSON data source.

When the flag is enabled, the V1 JSON data source reads a tar archive as if it were a directory of its entries: each entry is streamed through ArchiveReader (never unpacked to disk) and parsed exactly like a standalone JSON file, for both line-delimited and multi-line JSON (JsonDataSource.readArchive/readStream). Schema inference reads every archive entry together with any loose files in a single JsonInferSchema pass (inferWithArchives), so the inferred schema matches a directory read of the same files. The whole archive is one non-splittable unit (JsonFileFormat.isSplitable returns false), and a corrupt/missing archive is skipped as a unit under ignoreCorruptFiles/ignoreMissingFiles. The DSv2 reader cannot read archives, so JsonTable passes supportsArchiveScan = false and refuses to infer a schema for archive inputs (raising UNABLE_TO_INFER_SCHEMA).

Unlike CSV, JSON needs no per-entry header handling (records are self-describing, so one parser serves every entry) and no mergeSchema-style branching (JsonInferSchema already merges record types by field name across all inputs, so one pass is itself the union).

This also unifies the archive test suites: the format-agnostic inference and complex-type tests are hoisted into ArchiveReadSuiteBase behind capability hooks (supportsSchemaInference, supportsComplexTypes), so CSV, JSON, and future archive formats share them instead of each duplicating them.

Why are the changes needed?

To let JSON ingestion read tar archives without unpacking them to disk, matching the CSV behavior already in Spark.

Does this PR introduce any user-facing change?

Yes. With spark.sql.files.archive.reader.enabled=true (default false), the JSON data source can read and infer schemas from .tar/.tar.gz/.tgz files.

How was this patch tested?

New JSONTarArchiveReadSuite (mixing JSONArchiveReadBase with the shared ArchiveReadSuiteBase and TarArchiveReadBase), plus the hoisted shared inference and complex-type tests now also exercised by the CSV suites.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

…chema inference

inferWithArchives now decodes each record with the configured `encoding` (re-encoding to UTF-8) so archive inference matches the scan and a directory read for non-UTF-8 inputs, and samples records via JsonUtils.sample so `samplingRatio` is honored as in the loose-file infer paths. JsonUtils.sample's RDD overload is generalized to RDD[T] to serve both the multiLine and archive record types. Adds a UTF-16 archive-vs-directory inference parity test.

Co-authored-by: Isaac

cloud-fan

1 blocking, 2 non-blocking, 1 nit.
Faithful port of the CSV archive pattern with nicely unified shared tests; one narrow parity bug in no-encoding multiLine inference should be fixed before merge.

Correctness (2)

JsonDataSource.scala:159: no-encoding inference parses strictly as UTF-8 while the scan and directory reads auto-detect — UTF-16 multiLine docs infer corrupt with archives present — see inline
JsonDataSource.scala:148: comment claims records are copied but fromBytes wraps the reused line buffer — see inline

Suggestions (1)

JSONArchiveReadBase.scala:193: no test covers a malformed record inside an archive entry (_corrupt_record parity) — see inline

Nits: 1 minor item (see inline comments).

cloud-fan · 2026-06-12T22:05:50Z

+      // kept and parsed as UTF-8 by `CreateJacksonParser.utf8String`.
+      def toRecord(bytes: Array[Byte], length: Int): UTF8String = encoding match {
+        case Some(enc) => UTF8String.fromString(new String(bytes, 0, length, enc))
+        case None => UTF8String.fromBytes(bytes, 0, length)


When encoding is unset, this parses records strictly as UTF-8: CreateJacksonParser.utf8String wraps an InputStreamReader with StandardCharsets.UTF_8. But the archive scan path (MultiLineJsonDataSource.readStream -> CreateJacksonParser.inputStream) and a directory read's inference are byte-based, so Jackson auto-detects UTF-16/UTF-32 there. A multiLine UTF-16 document with no encoding option therefore infers a corrupt-record-only schema whenever an archive is among the inputs, whereas a directory read of the same files infers the real schema. Loose files alongside the archive are affected too, since every input routes through this pass. The new encoding test only covers the explicit-encoding case.

I'd carry records as Array[Byte] and choose the parser the way CreateJacksonParser.text/internalRow do: factory.createParser(bytes, 0, len) when encoding is empty (byte-based, auto-detects, matching the scan), getStreamDecoder when set. That also subsumes the buffer-copy issue flagged at line 148.
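
To make the divergence concrete, here is a minimal JDK-only sketch of why forcing UTF-8 garbles a UTF-16 document while a byte-based parser can still recover it. `CharsetSniffSketch` and `sniff` are hypothetical names, not Jackson or Spark APIs; the heuristic only follows the zero-byte pattern idea from RFC 4627 and ignores BOMs and UTF-32, which a real byte-based parser also handles.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object CharsetSniffSketch {
  // Hypothetical helper: guess the charset of a JSON document from the zero-byte
  // pattern of its first four bytes (JSON starts with ASCII characters).
  def sniff(bytes: Array[Byte]): Charset =
    if (bytes.length < 4) StandardCharsets.UTF_8
    else (bytes(0) == 0, bytes(1) == 0, bytes(2) == 0, bytes(3) == 0) match {
      case (true, false, true, false) => StandardCharsets.UTF_16BE
      case (false, true, false, true) => StandardCharsets.UTF_16LE
      case _ => StandardCharsets.UTF_8
    }

  val doc = "{\"id\": 1}"
  val utf16: Array[Byte] = doc.getBytes(StandardCharsets.UTF_16LE)

  // Strict UTF-8 decoding keeps every zero byte as a NUL character, so the text
  // is no longer the original JSON document.
  val forcedUtf8: String = new String(utf16, StandardCharsets.UTF_8)

  // Decoding with the sniffed charset recovers the original document.
  val autoDetected: String = new String(utf16, sniff(utf16))
}
```

The same bytes therefore yield a parseable document on a byte-based path and garbage on a path that commits to UTF-8 up front, which is the parity gap described above.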

cloud-fan · 2026-06-12T22:05:50Z

+    val encoding = parsedOptions.encoding
+    val ignoreCorruptFiles = parsedOptions.ignoreCorruptFiles
+    val ignoreMissingFiles = parsedOptions.ignoreMissingFiles
+    // Each input is streamed lazily; records are copied into fresh `UTF8String`s (the line reader


This comment overstates what the no-encoding branch does: UTF8String.fromBytes(bytes, 0, length) wraps the array without copying, and line.getBytes is lineIterator's single reused Text buffer. It works today only because sampling and inference consume each record before the next readLine overwrites the buffer — any future buffering step downstream (a cache, batched sampling) would silently corrupt records. Copying for real (line.copyBytes()) is cheap next to the parse cost and makes the comment true.
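
The hazard can be reproduced without Spark. The sketch below is an illustration under stated assumptions: `ReusingReader` is a hypothetical stand-in for a line reader that refills one backing buffer, and `View` stands in for a record that wraps that buffer without copying. Neither is the actual Hadoop or Spark class.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays
import scala.collection.mutable.ListBuffer

object BufferReuseSketch {
  // Hypothetical reader that overwrites the same buffer on every readLine().
  final class ReusingReader(lines: Seq[String]) {
    val buffer = new Array[Byte](64) // lines in this sketch must fit in 64 bytes
    var length = 0
    private val it = lines.iterator
    def readLine(): Boolean =
      if (!it.hasNext) false
      else {
        val bytes = it.next().getBytes(UTF_8)
        System.arraycopy(bytes, 0, buffer, 0, bytes.length)
        length = bytes.length
        true
      }
  }

  // A record that only points at an array; it owns no bytes unless given a copy.
  final case class View(bytes: Array[Byte], length: Int) {
    def text: String = new String(bytes, 0, length, UTF_8)
  }

  // Collects all records first and decodes them afterwards, as a downstream
  // cache or batched sampler would.
  def readAll(lines: Seq[String], copy: Boolean): List[String] = {
    val reader = new ReusingReader(lines)
    val buffered = ListBuffer.empty[View]
    while (reader.readLine()) {
      val bytes = if (copy) Arrays.copyOf(reader.buffer, reader.length) else reader.buffer
      buffered += View(bytes, reader.length)
    }
    buffered.toList.map(_.text)
  }

  val lines = Seq("{\"id\":1}", "{\"id\":2}")
  val wrapped: List[String] = readAll(lines, copy = false) // every record shows the last line
  val copied: List[String] = readAll(lines, copy = true)   // records stay intact
}
```

With wrapping, the records are correct only if each one is consumed before the next read, which is exactly the implicit ordering the comment should not rely on.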

cloud-fan · 2026-06-12T22:05:50Z

+
+  // ----- JSON-specific read tests --------------------------------------------
+
+  test("JSON: entries with differing fields union like a directory") {


Both readStream implementations wire new FailureSafeParsers, and the multiLine corrupt-record echo differs from readFile's (pre-buffered bytes vs. file re-read), but no test puts a malformed JSON record inside an archive entry. A small assertArchiveMatchesDir case with one bad record would pin down _corrupt_record/permissive-mode parity with a directory read for both modes.

cloud-fan · 2026-06-12T22:05:50Z

+
+  /**
+   * Whether this format can represent nested/complex types (struct/array/map). Gates the shared
+   * complex-type round-trip test; CSV and text leave it false, JSON/Avro/Parquet/XML override true.


Only JSON overrides this — no Avro/Parquet/XML archive traits exist yet.

Suggested change

* complex-type round-trip test; CSV and text leave it false, JSON/Avro/Parquet/XML override true.

* complex-type round-trip test; CSV and text leave it false, JSON overrides it to true.

…match the scan

Addresses review feedback on the JSON archive schema-inference pass:

- Carry each inference record as its raw bytes (Array[Byte]) instead of a UTF8String, and parse it the way the scan does: a byte-array parser auto-detects the charset when no `encoding` is set (so a UTF-16/UTF-32 document is read correctly rather than forced through UTF-8), and a stream decoder applies an explicit `encoding`. Previously the no-`encoding` branch wrapped bytes in a UTF-8 InputStreamReader, so a multiLine non-UTF-8 document with an archive among the inputs inferred a corrupt-record-only schema, diverging from a directory read. Adds CreateJacksonParser.bytes (a public byte-array pair mirroring internalRow, since getStreamDecoder is private there).
- Copy each record into a fresh array for real (Text.copyBytes / readAllBytes) rather than wrapping the line reader's reused buffer, so records stay valid independent of when they are consumed.

Tests: a malformed record inside an archive entry now asserts _corrupt_record / permissive-mode parity with a directory read in both the line-delimited and multiLine modes; a multiLine UTF-16 archive with no `encoding` option now auto-detects the charset and matches a directory read.

Also fixes a stale comment in ArchiveReadSuiteBase that referenced not-yet-existing Avro/Parquet/XML archive traits.

cloud-fan

4 addressed, 0 remaining, 2 new. (2 = 2 newly introduced, 0 late catches.)
The fix commit resolves every round-1 finding exactly as recommended; one narrow charset-parity edge and a stale doc line remain.

Correctness (1)

JsonDataSource.scala:185: multiLine + an encoding outside CharsetProvider's allow-list fails archive inference with INVALID_PARAMETER_VALUE.CHARSET while the scan and a directory read's inference accept it — see inline

Nits: 1 minor item (see inline comments).

Verification

Traced all charset modes through the new Array[Byte] pipeline against the scan and loose-file inference paths: no-encoding line-delimited and multiLine now auto-detect like the scan (UTF-16 pinned by the new test); explicit allow-listed encodings share the scan's decoder machinery; explicit non-allow-listed line-delimited matches the scan's rejection. The only divergent cell is the multiLine non-allow-listed case flagged inline. Also verified Text.copyBytes() returns a fresh exact-length array, closing the round-1 buffer-reuse hazard.

cloud-fan · 2026-06-12T22:37:08Z

+    // a directory read rather than always reading every one.
+    val sampled = JsonUtils.sample(records, parsedOptions)
+    val recordParser = encoding
+      .map(enc => CreateJacksonParser.bytes(enc, _: JsonFactory, _: Array[Byte]))


One charset cell still diverges from the scan: multiLine with an encoding outside CharsetProvider's allow-list (e.g. windows-1252, which JSONOptionsInRead allows in multiLine mode). bytes(enc, ...) decodes via CharsetProvider.newDecoder, which rejects such charsets unless spark.sql.legacy.javaCharsets is set, and with isReadFile=true the exception propagates out of JsonInferSchema.infer — so archive inference fails with INVALID_PARAMETER_VALUE.CHARSET while the multiLine scan (CreateJacksonParser.inputStream, a raw InputStreamReader) and a directory read's inference accept the same files. Line-delimited is unaffected: its scan path uses the same getStreamDecoder.

I'd build the multiLine parser as CreateJacksonParser.inputStream(enc, factory, new ByteArrayInputStream(record)) so each mode matches its own scan path exactly. Non-blocking: the failure is loud and the charset combination is rare.

cloud-fan · 2026-06-12T22:37:08Z

  /**
-   * Sample JSON RDD as configured by `samplingRatio`.
+   * Sample a JSON record RDD as configured by `samplingRatio`. Generic over the record type so the
+   * multiLine path (`RDD[PortableDataStream]`) and the archive inference path (`RDD[UTF8String]`)


Went stale with the switch to raw bytes:

Suggested change

* multiLine path (`RDD[PortableDataStream]`) and the archive inference path (`RDD[UTF8String]`)

* multiLine path (`RDD[PortableDataStream]`) and the archive inference path (`RDD[Array[Byte]]`)

…andling

Addresses follow-up review feedback:

- Pick the archive inference parser per mode so each matches its own scan's charset handling. multiLine now parses via CreateJacksonParser.inputStream (an InputStreamReader of the explicit `encoding`, or auto-detect when unset), mirroring MultiLineJsonDataSource's scan, so a multiLine `encoding` that JSONOptionsInRead allows but CharsetProvider does not (e.g. windows-1252) is accepted like a directory read instead of failing with INVALID_PARAMETER_VALUE.CHARSET. Line-delimited keeps the byte-array / stream-decoder path, which already mirrors TextInputJsonDataSource.
- Test: a multiLine windows-1252 archive infers the same schema as a directory read.
- Fix a stale comment in JsonUtils.sample that still referenced RDD[UTF8String] after the switch to RDD[Array[Byte]].

…nd correct its rationale

The previous commit added a test asserting that multiLine archive inference accepts windows-1252 like a directory read, justified by the claim that the byte-array parser's stream decoder rejects charsets the InputStreamReader path accepts. That premise is wrong: `parsedOptions.encoding` is validated eagerly by `CharsetProvider.forName`, which gates on the same VALID_CHARSETS allow-list that `getStreamDecoder` uses, so windows-1252 (not in the list, with `spark.sql.legacy.javaCharsets` off) throws INVALID_PARAMETER_VALUE.CHARSET at option-parse time -- on both the archive and the directory read -- before any parser is built. The test would fail, and no charset can pass option validation yet be rejected by the decoder.

Remove the test and rewrite the `recordParser` comment to the accurate reason. The code itself is kept: multiLine inference parses via `CreateJacksonParser.inputStream`, mirroring the multiLine scan's InputStreamReader -- including its lenient handling of bytes malformed in the charset, where the strict stream decoder would instead fail inference. The multiLine `inputStream(enc)` path stays covered by the UTF-16 encoding test.

cloud-fan

2 addressed, 0 remaining, 3 new. (3 = 1 newly introduced, 2 late catches — my misses from earlier rounds.)
The fix resolves both round-2 findings exactly as recommended; the charset matrix is now fully scan-parity. What remains is one test-strengthening suggestion and two doc nits.

Suggestions (1)

JSONArchiveReadBase.scala:218: the windows-1252 fixture is pure ASCII, so the test can't detect the encoding being silently ignored (only it being rejected) — see inline

Nits: 2 minor items (see inline comments).

Verification

Traced the new per-mode recordParser against each scan path: multiLine now uses the same CreateJacksonParser.inputStream calls as MultiLineJsonDataSource (plain InputStreamReader for an explicit encoding, byte auto-detect otherwise), closing the round-2 windows-1252 divergence, and line-delimited keeps the bytes/stream-decoder path that already matched. JsonInferSchema.infer is invoked with isReadFile = multiLine, matching each mode's loose-file inference, so error classification is also identical between archive and directory inference.

cloud-fan · 2026-06-12T23:19:06Z

+    assertArchiveMatchesDir(
+      Seq(
+        "a.json" -> jsonBytes("{\"id\":1,\"name\":\"Alice\"}\n{\"id\":2,\"name\":\"Bob\"}\n"),
+        // No "name" field: the schema's "name" column must read back as null for this entry.


The fixture is pure ASCII, which windows-1252 and UTF-8 encode identically — so if the explicit encoding were silently ignored (the auto-detect branch), the test would still pass. A non-ASCII windows-1252 byte (0xE9) is malformed standalone UTF-8, so it makes that regression observable: the archive side would infer a corrupt-record-only schema and fail the parity assert.

Suggested change

// No "name" field: the schema's "name" column must read back as null for this entry.

val bytes = "{\n \"id\": 1,\n \"name\": \"Jos\u00e9\"\n}".getBytes("windows-1252")
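
The byte-level claim behind this suggestion can be checked with the JDK alone. The sketch below is illustrative only: `Windows1252FixtureSketch` and `isValidUtf8` are hypothetical names, and the two sample documents are made up for the demonstration rather than taken from the test suite.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction, StandardCharsets}

object Windows1252FixtureSketch {
  val cp1252: Charset = Charset.forName("windows-1252")

  // True when the bytes decode cleanly under a strict, error-reporting UTF-8 decoder.
  def isValidUtf8(bytes: Array[Byte]): Boolean = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try { decoder.decode(ByteBuffer.wrap(bytes)); true }
    catch { case _: CharacterCodingException => false }
  }

  val ascii = "{\"id\": 1, \"name\": \"Alice\"}"
  val accented = "{\"id\": 1, \"name\": \"Jos\u00e9\"}"

  // Pure ASCII encodes to the same bytes in both charsets, so it cannot tell them apart.
  val asciiIdentical: Boolean =
    java.util.Arrays.equals(ascii.getBytes(cp1252), ascii.getBytes(StandardCharsets.UTF_8))

  // The windows-1252 byte for the accented letter is 0xE9, a UTF-8 lead byte that is
  // not followed by continuation bytes here, so strict UTF-8 decoding rejects it.
  val accentedIsValidUtf8: Boolean = isValidUtf8(accented.getBytes(cp1252))
}
```

An ASCII-only fixture passes whether or not the `encoding` option is applied, while the accented fixture only parses when the option actually reaches the decoder.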

cloud-fan · 2026-06-12T23:19:06Z

+
+  /**
+   * Schema [[format]] infers from `paths` under [[readOptions]] ++ [[inferenceOptions]] (plus
+   * `extraOptions`). Loading several paths reads them as one fileset, exactly as a directory read.


Reads as an incomplete comparison — the sibling comments spell out the verb ("as a directory read does"). The line is at 99 chars, hence the rewrap.

Suggested change

* `extraOptions`). Loading several paths reads them as one fileset, exactly as a directory read.

* `extraOptions`). Loading several paths reads them as one fileset, exactly as a directory

* read does.

cloud-fan · 2026-06-12T23:19:06Z

+ * schema-inference and complex-type tests (see `supportsSchemaInference`/`supportsComplexTypes`),
+ * and adds the JSON-specific tests with no format-agnostic analogue: NullType canonicalization,
+ * field-union/null-in-loose merging, and multi-line documents. Reusable across archive formats: a
+ * `JSON<Archive>Read` suite mixes this in alongside the archive-format trait.


JSON<Archive>Read doesn't expand to the real suite name — the tar suite is JSONTarArchiveReadSuite, and TarArchiveReadBase documents the <format>TarArchiveReadSuite pattern.

Suggested change

* `JSON<Archive>Read` suite mixes this in alongside the archive-format trait.

* `JSON<Container>ArchiveReadSuite` mixes this in alongside the archive-format trait.

Class doc named a `JSON<Archive>Read` suite that does not exist; use the real `JSON<Container>ArchiveReadSuite` pattern (e.g. JSONTarArchiveReadSuite). Complete the inferredSchema doc comparison ("exactly as a directory read does") to match its siblings.

…hemaMerge; default capability hooks on

Address review feedback on the archive test traits:

- Default the test capability hooks to true so a subclass opts OUT where a capability does not apply, rather than forgetting to opt in: flip supportsSchemaInference and supportsComplexTypes to default true, and add a supportsSchemaMerge hook (also default true) for formats that union inputs by field name (JSON; not CSV's positional/header model).
- Hoist the format-agnostic tests into ArchiveReadSuiteBase: the same-schema "merges archive + loose == directory" parity (under supportsSchemaInference) and the differing-field union read/inference tests (under supportsSchemaMerge), rewritten via encodeFile so any field-name format inherits them.
- CSV opts out of complex-types and schema-merge; JSON keeps all three on.
- Remove the now-duplicated per-format copies and a redundant comment.
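
The opt-out pattern described in this commit can be sketched as follows. Only the three hook names come from the commit message; `ArchiveReadSuiteSketch`, `sharedTests`, `JsonSketch`, and `CsvSketch` are hypothetical simplifications, not the real ScalaTest traits.

```scala
object CapabilityHooksSketch {
  // Hypothetical, simplified stand-in for the shared base trait: every hook
  // defaults to true, so a format opts out only where a capability does not apply.
  trait ArchiveReadSuiteSketch {
    def supportsSchemaInference: Boolean = true
    def supportsComplexTypes: Boolean = true
    def supportsSchemaMerge: Boolean = true

    // Names of the shared tests this format would inherit.
    def sharedTests: Seq[String] =
      Seq(
        supportsSchemaInference -> "archive + loose == directory",
        supportsComplexTypes -> "complex-type round trip",
        supportsSchemaMerge -> "differing-field union"
      ).collect { case (true, name) => name }
  }

  // JSON keeps all three capabilities on.
  object JsonSketch extends ArchiveReadSuiteSketch

  // CSV opts out of complex types and field-name schema merge.
  object CsvSketch extends ArchiveReadSuiteSketch {
    override def supportsComplexTypes: Boolean = false
    override def supportsSchemaMerge: Boolean = false
  }
}
```

Defaulting to true means a new format that forgets to override anything runs every shared test and fails loudly where it lacks a capability, rather than silently skipping coverage.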

…ng whole records

inferWithArchives now hands each archive entry / loose file InputStream straight to the JSON parser -- multi-line via CreateJacksonParser.inputStream, line-delimited via CreateJacksonParser.bytes -- through a perInput helper that mirrors CSV's inferWithArchives, so a document is parsed incrementally rather than read into a byte array first. Also drops redundant explanatory comments in the archive test traits.

…line readStream

…types

The multi-line archive inference path samples RDD[InputStream], not RDD[Array[Byte]]; only the line-delimited path is RDD[Array[Byte]].

Co-authored-by: Isaac

cloud-fan

3 addressed, 0 remaining, 2 new. (2 = 2 newly introduced, 0 late catches.)
All three round-3 items are resolved and the charset matrix stays scan-parity. The two new items are on the round-4 streaming change: one non-blocking design note and one doc nit.

Design / architecture (1)

JsonDataSource.scala:183: multi-line inference now hands infer an RDD[InputStream] of live views over the shared tar cursor; correct only because infer fully consumes each before advancing. Undocumented invariant the line-delimited and CSV paths don't carry (both materialize) — recommend documenting it. See inline.

Nits: 1 minor item (see inline comments).

Verification

Traced the multi-line streaming path: JsonInferSchema.infer consumes each entry stream via flatMap { tryWithResource(parse) }.reduceOption, fully parsing each document before the readEntries iterator advances the tar cursor (which skips unread bytes), and inferWithArchives is its only consumer — so the live CloseShieldInputStream views stay valid. RDD[InputStream] never crosses a serialization boundary (one narrow-dependency stage). Charset parity holds: inference and the scan both build the parser via CreateJacksonParser.inputStream. Default sampling is a no-op; under <0.99 the sampler's filter discards a dropped stream and the next advance skips it at the tar level.

cloud-fan · 2026-06-15T19:04:55Z

+        // Each input/entry is one JSON document: hand its stream straight to the parser
+        // (`CreateJacksonParser.inputStream`, matching MultiLineJsonDataSource and its charset
+        // auto-detect) so the document is parsed incrementally rather than buffered.
+        val docs = perInput(in => Iterator.single(in))


These RDD[InputStream] elements are live CloseShieldInputStream views over the single shared TarArchiveInputStream cursor — each valid only until readEntries advances to the next entry (getNextEntry skips the prior entry's unread bytes). This is correct today only because JsonInferSchema.infer fully consumes each stream (flatMap { tryWithResource(parse) }.reduceOption) before pulling the next, and inferWithArchives is its only consumer.

The line-delimited sibling (RDD[Array[Byte]]) and the CSV analogue (RDD[Array[String]]) both materialize their records, so they don't carry this constraint. If infer's per-partition consumption ever changes to buffer / look ahead / parallelize, this path would silently read from an advanced cursor and infer a wrong schema (no exception; the one-entry fixtures here can't observe it). Non-blocking, and the streaming is a worthwhile memory optimization — but worth a line in the perInput comment documenting that the consumer must fully consume each stream before the iterator advances.
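
A JDK-only sketch of the invariant, under stated assumptions: `Cursor` is a hypothetical stand-in for the shared tar stream (advancing skips the previous entry's unread bytes), and `View` stands in for a close-shielded view that holds no bytes of its own. Neither mirrors the real commons-compress or Spark classes in detail.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets.UTF_8

object SharedCursorSketch {
  // Hypothetical shared cursor over entries stored back to back.
  final class Cursor(data: Array[Byte]) {
    private val in = new ByteArrayInputStream(data)
    private var remaining = 0
    def nextEntry(size: Int): Unit = {
      in.skip(remaining.toLong) // unread bytes of the previous entry are dropped
      remaining = size
    }
    def read(): Int = if (remaining == 0) -1 else { remaining -= 1; in.read() }
  }

  // A live view: it always reads whatever entry the cursor currently points at.
  final class View(cursor: Cursor) extends InputStream {
    override def read(): Int = cursor.read()
  }

  def entries(docs: Seq[String]): Iterator[InputStream] = {
    val cursor = new Cursor(docs.mkString.getBytes(UTF_8))
    docs.iterator.map { doc =>
      cursor.nextEntry(doc.getBytes(UTF_8).length)
      new View(cursor)
    }
  }

  def slurp(in: InputStream): String =
    new String(Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray, UTF_8)

  val docs = Seq("{\"a\":1}", "{\"b\":2}")

  // Each view is fully consumed before the iterator advances: contents are correct.
  val streamed: List[String] = entries(docs).map(slurp).toList

  // Views are buffered first and read later: the first view silently reads the
  // second entry, and the second view finds nothing left.
  val buffered: List[String] = entries(docs).toList.map(slurp)
}
```

No exception is raised in the buffered case, which is why the constraint is worth stating next to the code that produces the views.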

cloud-fan · 2026-06-15T19:04:55Z

+    val ignoreMissingFiles = parsedOptions.ignoreMissingFiles
+
+    // Applies `perEntry` to each input -- once per archive entry, once for a loose file -- skipping
+    // a whole input on corrupt/missing input when the ignore flags are set. The entry/file stream


Repeated "input" reads awkwardly.

Suggested change

// a whole input on corrupt/missing input when the ignore flags are set. The entry/file stream

// a whole input when it is corrupt/missing and the ignore flags are set. The entry/file stream

…nference streams

Reword the perInput skip comment and document that multiLine inference emits live views over the shared TarArchiveInputStream cursor, so the consumer must fully consume each element before the iterator advances.

cloud-fan · 2026-06-16T05:36:23Z

thanks, merging to master/4.x!

Closes #56480 from akshatshenoi-db/archive-json.

Authored-by: akshatshenoi-db <akshat.shenoi@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6507faa)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

akshatshenoi-db added 2 commits June 12, 2026 18:13

cloud-fan reviewed Jun 12, 2026

akshatshenoi-db added 2 commits June 12, 2026 22:41

cloud-fan approved these changes Jun 12, 2026

akshatshenoi-db added 5 commits June 12, 2026 23:23

[SPARK-57419][SQL] Restore ByteArrayInputStream import used by multi-…

540153b

[SPARK-57419][SQL] Correct sample[T] doc to name both archive record …

eca63e9

cloud-fan approved these changes Jun 15, 2026

cloud-fan closed this in 6507faa Jun 16, 2026

