#415 Implement the basic version of EBCDIC writer #775

Merged (6 commits, Aug 19, 2025)

Conversation

yruslan (Collaborator) commented Aug 19, 2025

Closes #415

Summary by CodeRabbit

  • New Features

    • Writer support to export Spark DataFrames as fixed-length COBOL/EBCDIC files (Overwrite supported).
    • New pluggable record combiner and raw-binary output path to produce byte-oriented outputs.
  • Bug Fixes

    • Multi-copybook parameter handling clarified and made mutually exclusive.
    • SaveMode behaviors enforced: Append rejected; ErrorIfExists and Ignore handled as documented.
  • Documentation

    • Added experimental EBCDIC writer usage, limitations and performance notes (duplicated block).
  • Tests

    • Integration suite validating EBCDIC output and save-mode behaviors.


coderabbitai bot commented Aug 19, 2025

Walkthrough

Adds an experimental Spark DataFrame writer for fixed-length COBOL/EBCDIC records (writer API, combiner abstraction, basic combiner implementation, raw binary output format, and tests); relocates EncoderSelector into an encoding package with a buffer prefill to 0x40; updates parser imports; adjusts multi-copybook parameter parsing; and appends duplicate README writer documentation blocks.
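
For orientation, here is a minimal usage sketch of the new writer, following the pattern exercised in FixedLengthEbcdicWriterSuite; the format name "cobol" and the copybook_contents option are taken from the tests in this PR, while the output path and copybook are illustrative:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("ebcdic-writer-example").getOrCreate()
import spark.implicits._

// A two-field fixed-length copybook; every record will be 2 + 5 = 7 bytes long.
val copybookContents =
  """       01  RECORD.
    |           05  A    PIC X(2).
    |           05  B    PIC X(5).
    |""".stripMargin

val df = Seq(("01", "ABC"), ("02", "DEF")).toDF("A", "B")

// Writes raw EBCDIC bytes (no headers, no separators) into part files under the target path.
df.write
  .format("cobol")
  .mode(SaveMode.Overwrite)
  .option("copybook_contents", copybookContents)
  .save("/tmp/ebcdic-out")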

Changes

Cohort / File(s) Summary
Docs: EBCDIC Writer (duplicate blocks)
README.md
Adds two "EBCDIC Writer (experimental)" blocks with usage, limitations, and a performance analysis; content is duplicated/overlapping.
EncoderSelector package move & encoding prefill
cobol-parser/.../encoding/EncoderSelector.scala
Moves EncoderSelector to encoding package and pre-fills encoding buffer with 0x40 (EBCDIC space) before encoding.
Parser / AST import updates
cobol-parser/.../antlr/ParserVisitor.scala, .../ast/Primitive.scala, .../asttransform/NonTerminalsAdder.scala, .../test/.../BinaryExtractorSpec.scala
Adjusts imports to the new EncoderSelector location and consolidates import grouping; test imports updated accordingly.
Reader params: multi-copybook parsing
cobol-parser/.../reader/parameters/CobolParametersParser.scala
Splits copybook inputs into copybookPathOpt: Option[String] and copybookPaths: Seq[String], enforces mutual exclusivity, and changes absent multi-copybook default to empty Seq.
Spark writer: DataSource integration
spark-cobol/.../source/DefaultSource.scala
Implements writer path by adding CreatableRelationProvider and createRelation(...); handles SaveMode semantics, loads copybook, derives CobolSchema, selects combiner, builds RDD[Bytes] and writes via Hadoop output.
Writer API & implementations
spark-cobol/.../writer/RecordCombiner.scala, .../RecordCombinerSelector.scala, .../BasicRecordCombiner.scala
Adds RecordCombiner trait, RecordCombinerSelector (returns BasicRecordCombiner), and BasicRecordCombiner that validates AST and encodes DataFrame rows into fixed-length byte arrays.
Raw binary output format
spark-cobol/.../writer/RawBinaryOutputFormat.scala
Adds RawBinaryOutputFormat that writes BytesWritable payloads as raw bytes (configurable extension) to output files.
Writer tests
spark-cobol/src/test/.../writer/FixedLengthEbcdicWriterSuite.scala
Adds tests validating fixed-length EBCDIC output, field order/null handling, and SaveMode behaviors (Append/ErrorIfExists/Ignore).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User as Spark Job
  participant DS as DefaultSource (writer)
  participant CCL as CopybookContentLoader
  participant CS as CobolSchema
  participant RCS as RecordCombinerSelector
  participant RC as BasicRecordCombiner
  participant H as RawBinaryOutputFormat

  User->>DS: createRelation(sqlContext, SaveMode, params, df)
  DS->>CCL: load copybook (path or copybook_contents)
  CCL-->>DS: copybook text
  DS->>CS: build CobolSchema (AST, sizes)
  CS-->>DS: CobolSchema
  DS->>RCS: selectCombiner(schema, readerParams)
  RCS-->>DS: BasicRecordCombiner
  DS->>RC: combine(df, schema, readerParams)
  RC-->>DS: RDD[Array[Byte]]
  alt SaveMode = Overwrite / Create
    DS->>H: saveAsNewAPIHadoopFile(path, NullWritable, BytesWritable)
    H-->>User: bytes written (part files)
  else SaveMode = Append
    DS-->>User: IllegalArgumentException (append unsupported)
  else SaveMode = Ignore / ErrorIfExists
    DS-->>User: no-op or IllegalArgumentException
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Assessment against linked issues

Objective / Explanation
  • Implement Spark DataFrame writer for Cobol datafiles (#415)
  • Support configurable endianness and code page output (#415): only a core EBCDIC encoder is used; no exposed code page or endianness configuration.
  • Support writing record formats F, FB, V, VB (#415): the writer implements fixed-length (F) output only.
  • Derive copybook from Spark schema and per-field overrides (#415): no automatic copybook derivation, no emitted copybook, and no per-field override mechanism present.

Assessment against linked issues: Out-of-scope changes

Code Change / Explanation
  • Import change in the parser that may remove the EncoderSelector import (cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala): adjusting parser imports is unrelated to the writer objectives and may cause compilation issues if other files aren't updated consistently.
  • Reader parameter parsing change for the multi-copybook default (cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala): modifies reader-side parameter signature/behavior, which is not required by the writer feature in #415 and affects reader configuration.

Possibly related PRs

Poem

I nibble bytes and hop with cheer,
From rows to EBCDIC I steer.
I pad with 0x40 and set each line,
Fixed-length fields all neat and fine.
One day I'll append — but for now, I write clear. 🥕🐇

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 01e8709 and 6013341.

📒 Files selected for processing (8)
  • README.md (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (3 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1 hunks)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (2 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1 hunks)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
  • README.md
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Spark 3.5.5 on Scala 2.13.16
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: Spark 3.5.5 on Scala 2.12.20
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)

github-actions bot commented Aug 19, 2025

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.66% 🍏
Files changed 100% 🍏

File Coverage
NonTerminalsAdder.scala 100% 🍏
EncoderSelector.scala 98.99% 🍏
ParserVisitor.scala 89.49% 🍏
Primitive.scala 74.95% 🍏


github-actions bot commented Aug 19, 2025

JaCoCo code coverage report - 'spark-cobol'

File Coverage [89.69%] 🍏
RecordCombinerSelector.scala 100% 🍏
RawBinaryOutputFormat.scala 98% 🍏
DefaultSource.scala 93.9% 🍏
BasicRecordCombiner.scala 77.71%
Total Project Coverage 80.48% 🍏

coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limits.

🔭 Outside diff range comments (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (2)

39-55: Use the provided code page and charset; avoid hardcoding CodePageCommon

getStringEncoder ignores both ebcdicCodePage and asciiCharset and always uses CodePageCommon. This makes the API misleading and prevents future extension from working as expected.

  • Leverage ebcdicCodePage’s ASCII→EBCDIC mapping instead of the hardcoded one.
  • If you keep ASCII unimplemented (None), that’s fine for now, but do use asciiCharset whenever you convert Unicode string input to bytes.

Proposed change:

   private def getStringEncoder(encoding: Encoding,
                                ebcdicCodePage: CodePage,
                                asciiCharset: Charset,
                                fieldLength: Int
                               ): Option[Encoder] = {
     encoding match {
       case EBCDIC =>
-        val encoder = (a: Any) => {
-          encodeEbcdicString(a.toString, CodePageCommon.asciiToEbcdicMapping, fieldLength)
-        }
+        val encoder = (a: Any) => {
+          // Honor selected code page for ASCII→EBCDIC conversion
+          encodeEbcdicString(a.toString, ebcdicCodePage.asciiToEbcdicMapping, fieldLength)
+        }
         Option(encoder)
       case ASCII =>
         None
       case _ =>
         None
     }
   }

65-77: Pad PIC X fields with EBCDIC spaces (0x40) instead of zeros; avoid data corruption

The current implementation initializes the output buffer with zeros, which will overwrite existing space-filled fields with 0x00 bytes on write. That is not a valid filler for EBCDIC text fields and can break downstream consumers. It also risks tests passing only because decoders trim control bytes.

  • Initialize the buffer with 0x40 (EBCDIC space).
  • Optional but recommended: obtain ASCII bytes via asciiCharset (if you extend the signature), otherwise keep as-is until ASCII support is added.

Minimal fix (keeps the current signature):

   def encodeEbcdicString(string: String, conversionTable: Array[Byte], length: Int): Array[Byte] = {
     require(length >= 0, s"Field length cannot be negative, got $length")

     var i = 0
     val buf = new Array[Byte](length)
+    // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
+    java.util.Arrays.fill(buf, 0x40.toByte)

     while (i < string.length && i < length) {
       val asciiByte = string(i).toByte
       buf(i) = conversionTable((asciiByte + 256) % 256)
       i = i + 1
     }
     buf
   }

Follow-up: consider switching to string.getBytes(asciiCharset) (and passing the charset down) for robust non-ASCII handling.

🧹 Nitpick comments (15)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)

255-257: Define precedence/validation between 'copybook' and 'copybooks'

Right now both options can be set simultaneously with no explicit rule. Consider validating exclusivity (or documenting a clear precedence) to avoid ambiguous configurations.

Apply this diff to enforce mutual exclusivity and avoid double lookups:

-    val cobolParameters = CobolParameters(
-      getParameter(PARAM_COPYBOOK_PATH, params),
-      copybookPaths,
+    val copybookPathOpt = getParameter(PARAM_COPYBOOK_PATH, params)
+    if (copybookPathOpt.nonEmpty && copybookPaths.nonEmpty) {
+      throw new IllegalArgumentException(s"Options '$PARAM_COPYBOOK_PATH' and '$PARAM_MULTI_COPYBOOK_PATH' cannot be used together.")
+    }
+
+    val cobolParameters = CobolParameters(
+      copybookPathOpt,
+      copybookPaths,

If mutual exclusivity already exists elsewhere, point me to it and I’ll align this check or drop the suggestion.


249-253: Normalize and clean ‘multiCopybookPath’ by trimming and dropping empty entries

Verified that no downstream logic depends on retaining empty-string entries in the multiCopybookPath sequence (e.g., there are no .headOption or size checks that assume a length ≥ 1 for Seq(“”)). To make parsing of comma-separated copybook paths robust, update the splitter to trim whitespace and filter out blank tokens.

• File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
• Lines: ~249–253

Apply this diff:

-    val copybookPaths = params.get(PARAM_MULTI_COPYBOOK_PATH) match {
-      case Some(paths) => paths.split(',').toSeq
-      case None => Seq.empty[String]
-    }
+    val copybookPaths = params.get(PARAM_MULTI_COPYBOOK_PATH) match {
+      case Some(paths) =>
+        paths
+          .split(',')
+          .iterator
+          .map(_.trim)
+          .filter(_.nonEmpty)
+          .toSeq
+      case None => Seq.empty[String]
+    }
+    // Optional: deduplicate if the same path may appear multiple times
+    // val copybookPaths = copybookPaths.distinct

This change ensures inputs like:
  • "a, b , ,c" ⇒ Seq("a", "b", "c")
  • "copybooks=" or an all-whitespace string ⇒ Seq.empty

README.md (1)

1666-1693: Document padding/truncation semantics and code page scope for the writer

Good to see a clear, scoped introduction of the experimental writer. Two clarifications will save users time:

  • State how PIC X(n) values are handled: truncated when longer than n and right-padded when shorter. Specify the padding byte (EBCDIC space 0x40).
  • Reiterate that only the default/common EBCDIC mapping is used for output today, and that per-field or alternate code pages are not honored yet (options will be ignored).

This avoids surprises when bytes are inspected or mainframe readers expect space-filled fields.
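
For illustration, assuming the common ASCII→EBCDIC mapping (CodePageCommon), the documented padding/truncation behaviour for a PIC X(5) field would look like this; the byte values shown are standard EBCDIC code points:

// "ABC" into PIC X(5)      -> C1 C2 C3 40 40  (right-padded with EBCDIC spaces, 0x40)
// "ABCDEFG" into PIC X(5)  -> C1 C2 C3 C4 C5  (truncated to the field length)
val expectedPadded: Array[Byte]    = Array(0xC1, 0xC2, 0xC3, 0x40, 0x40).map(_.toByte)
val expectedTruncated: Array[Byte] = Array(0xC1, 0xC2, 0xC3, 0xC4, 0xC5).map(_.toByte)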

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala (1)

26-26: Import refactor looks correct; EncoderSelector remains in scope via encoding._

The move of EncoderSelector to the encoding package is reflected by the wildcard import. No functional issues expected.

For clarity and to avoid future shadowing, consider explicit imports:
import za.co.absa.cobrix.cobol.parser.encoding.{Encoding, EncoderSelector}

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (1)

208-219: Add an assertion to verify EBCDIC space-fill behavior on setFieldValueByName

Once the encoder pads with 0x40 (EBCDIC spaces), it’s worth asserting that the trailing bytes in the field are 0x40 after setting a shorter value to prevent regressions. This will catch accidental zero-filling.

Example (outside this hunk, for illustration):

// After setting "NEWNAME" into X(10)
val fieldOffset = /* compute offset of SHORT-NAME within the record */
val fieldLen    = 10
val trailing    = bytes.slice(fieldOffset + "NEWNAME".length, fieldOffset + fieldLen)
assert(trailing.forall(_ == 0x40.toByte))
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/NonTerminalsAdder.scala (1)

76-77: Symmetric decode/encode wiring for virtual fields is fine; confirm EBCDIC-only encoding expectation.

EncoderSelector.getEncoder returns None for ASCII. Given the current writer throws when encode is empty (see BasicRecordCombiner), this effectively constrains non-terminals to EBCDIC for writing. If that’s intentional for the first iteration, we’re good; otherwise, plan ASCII support or a clearer error message upstream.

Would you like a follow-up PR that adds ASCII string encoders and a targeted error message when a copybook field is not encodable?

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)

43-46: Optional: avoid forcing the “.dat” extension (or make it configurable).

Hardcoding “.dat” is fine for tests, but you might prefer:

  • inherit default extension: getDefaultWorkFile(context, ""), or
  • make extension configurable via a job conf (e.g., cobol.writer.output.extension).

Low priority; no functional impact.

-    val path: Path = getDefaultWorkFile(context, ".dat")
+    val extension = context.getConfiguration.get("cobol.writer.output.extension", "")
+    val path: Path = getDefaultWorkFile(context, extension)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombiner.scala (1)

24-26: Trait definition is clear; consider making it Serializable.

Implementations are likely to be captured in Spark closures. Declaring the trait Serializable helps avoid accidental serialization issues for stateful implementations later.

-trait RecordCombiner {
+trait RecordCombiner extends Serializable {
   def combine(df: DataFrame, cobolSchema: CobolSchema, readerParameters: ReaderParameters): RDD[Array[Byte]]
 }
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (2)

41-48: Minor: coalesce after ordering to avoid an extra shuffle.

repartition(1).orderBy("A") shuffles twice. For a single-file deterministic output, orderBy("A").coalesce(1) reduces shuffling.

-        df.repartition(1)
-          .orderBy("A")
+        df.orderBy("A")
+          .coalesce(1)
           .write

63-67: Optional: derive expected bytes using the library’s code page mapping to avoid magic constants.

Hardcoded hex is fine for a golden test. Alternatively, using CodePageCommon.asciiToEbcdicMapping to build expected reduces maintenance if mappings evolve.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (2)

24-33: Doc nit: fix minor typo.

“ot output record” → “of the output record”.

-   * @param cobolSchema      The COBOL schema ot output record.
+   * @param cobolSchema      The COBOL schema of the output record.

34-36: Selector stub is fine; consider early routing for record formats.

As support expands (FB, VB, V, RDW/BDW), this selector can examine readerParameters.recordFormat and isRecordSequence/rdw to pick an appropriate combiner (e.g., length-prefixed writer vs. fixed-length). Not required now.
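
A hypothetical sketch of such routing, for illustration only; the RecordFormat import path and value names are assumptions about the parser-side API, and the object name is made up (it would live next to RecordCombiner in the writer package):

import za.co.absa.cobrix.cobol.parser.recordformats.RecordFormat  // assumed location of the record format enum
import za.co.absa.cobrix.cobol.reader.parameters.ReaderParameters
import za.co.absa.cobrix.spark.cobol.schema.CobolSchema

object RecordCombinerRouter {
  def selectCombiner(cobolSchema: CobolSchema, readerParameters: ReaderParameters): RecordCombiner = {
    readerParameters.recordFormat match {
      case RecordFormat.FixedLength => new BasicRecordCombiner   // F: the only format implemented today
      case other =>
        throw new IllegalArgumentException(
          s"Record format '$other' is not supported by the EBCDIC writer yet; only fixed-length output is implemented.")
    }
  }
}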

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (2)

92-94: No fallback when copybook is absent (contradicts PR objective of deriving from DataFrame schema).

This path unconditionally loads copybook content and will throw if none is provided. The PR objectives mention deriving a default copybook from the Spark schema; consider a fallback here when the copybook is not specified.

I can help scaffold:

  • A minimal Schema→Copybook derivation for primitive types used by BasicRecordCombiner.
  • A parameter flag (e.g., derive_copybook=true) to opt-in while experimental.
    Would you like me to propose an implementation?
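
A heavily simplified sketch of what such a Schema→Copybook derivation could look like; deriveCopybook is a hypothetical helper, and the PIC clauses and default lengths are illustrative rather than a proposal for the actual mapping:

import org.apache.spark.sql.types._

// Hypothetical helper: derive a flat, fixed-length copybook from a Spark schema.
// Only a few primitive types are handled; real support would need per-field length overrides.
def deriveCopybook(schema: StructType, recordName: String = "RECORD"): String = {
  val fieldLines = schema.fields.map { field =>
    val pic = field.dataType match {
      case StringType             => "PIC X(50)"        // default string length (illustrative)
      case IntegerType | LongType => "PIC S9(18)"
      case d: DecimalType         => s"PIC S9(${d.precision - d.scale})V9(${d.scale})"
      case other                  => throw new IllegalArgumentException(s"Unsupported Spark type: $other")
    }
    f"           05  ${field.name.toUpperCase}%-20s $pic."
  }
  (s"       01  $recordName." +: fieldLines).mkString("\n")
}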

95-101: Consider logging writer configuration for traceability.

Reader path logs build info; writer path does not log anything. Adding a concise log (path, record format, code page) aids debugging.

Example:

logger.info(
  s"Writing COBOL data to '$path' with recordFormat=${readerParameters.recordFormat}, " +
  s"isEbcdic=${readerParameters.isEbcdic}, codePage=${readerParameters.ebcdicCodePage}"
)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1)

65-90: Schema validation should ignore fillers to reduce false negatives.

Depending on parser settings, filler fields may still be present in the AST. Requiring them in the DataFrame can be onerous. Consider excluding fillers from both the not-found and unsupported checks.

Example adjustment:

val children = ast.children.filter {
  case p: Primitive => !p.isFiller
  case g: Group     => !g.isFiller
  case _            => true
}

Then use children instead of ast.children for the subsequent validations and mapping.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between f899488 and 6f7a75a.

📒 Files selected for processing (13)
  • README.md (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/NonTerminalsAdder.scala (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1 hunks)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (3 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombiner.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1 hunks)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (9)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/NonTerminalsAdder.scala (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala (6)
  • parser (114-168)
  • parser (170-179)
  • parser (181-190)
  • parser (193-254)
  • parser (256-277)
  • DecoderSelector (30-352)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/FloatingPointFormat.scala (1)
  • FloatingPointFormat (19-44)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1)
  • EncoderSelector (24-79)
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala (6)
  • parser (114-168)
  • parser (170-179)
  • parser (181-190)
  • parser (193-254)
  • parser (256-277)
  • DecoderSelector (30-352)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/Encoding.scala (1)
  • EBCDIC (23-23)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1)
  • EncoderSelector (24-79)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/fixtures/BinaryFileFixture.scala (1)
  • withTempDirectory (71-78)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1)
  • EncoderSelector (24-79)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombiner.scala (3)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1)
  • cobrix (132-139)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)
  • schema (110-110)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1)
  • combine (26-63)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala (6)
  • parser (114-168)
  • parser (170-179)
  • parser (181-190)
  • parser (193-254)
  • parser (256-277)
  • DecoderSelector (30-352)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (5)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (3)
  • cobrix (132-139)
  • CobolSchema (52-300)
  • CobolSchema (302-325)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/copybook/CopybookContentLoader.scala (2)
  • CopybookContentLoader (27-60)
  • load (29-42)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)
  • RawBinaryOutputFormat (41-58)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (2)
  • RecordCombinerSelector (22-38)
  • selectCombiner (34-36)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1)
  • combine (26-63)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (1)
  • get (47-50)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)
  • CobolParameters (67-109)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)
  • Primitive (42-140)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1)
  • ReaderParameters (79-133)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)
  • schema (110-110)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (1)
  • setPrimitiveField (230-248)
🔇 Additional comments (9)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala (1)

858-860: Encoder wiring for primitives is consistent with current scope

Binding decoders and (optional) encoders at AST construction time keeps downstream logic simple. Given current encoder constraints (PIC X only), the Option is appropriate.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)

20-22: Import path update is correct and preserves behavior

Switching EncoderSelector import to encoding aligns with the package move; no runtime changes. The encode type remains Optional as required by current encoder capabilities.

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (1)

23-25: Test imports updated correctly to match EncoderSelector package move

The test continues to construct Primitive with both decoder and (optional) encoder. Imports align with the refactor.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/NonTerminalsAdder.scala (1)

23-26: Import path update looks correct; no behavioral change.

Switching EncoderSelector to the encoding package and grouping the import with Encoding is consistent with the refactor. DecoderSelector remains from decoders; types resolve cleanly.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)

41-58: LGTM: minimal, correct raw binary writer.

  • Writes exactly value.getLength bytes without separators.
  • Uses task work file via getDefaultWorkFile; close is handled.

This is appropriate for fixed-length and “self-delimiting” variable-length records (headers embedded).

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombiner.scala (1)

24-26: Critical: Fix field-to-column mapping in BasicRecordCombiner to prevent data corruption

The current implementation in BasicRecordCombiner.scala builds
sparkFieldPositions: Seq[Int] by mapping each cobolField to its DataFrame column index, but then uses those same indices to look up both the value in the Row and the corresponding cobolFields(index). If the DataFrame columns aren’t in the exact order of the copybook, values will be written into the wrong fields.

Locations:

  • File: spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
  • Method: combine, around lines 38–57

Suggested fix:

  1. Build a mapping of (cobolField, sparkIndex) pairs
  2. Iterate over that mapping when encoding, so each primitive is set on the correct cobolField.

Illustrative patch:

-    // current: Seq of just spark column indexes
-    val sparkFieldPositions = cobolFields.map { cobolField =>
+    // map each cobolField to its DataFrame index
+    val fieldMappings: Seq[(Primitive, Int)] = cobolFields.map { cobolField =>
       val fieldName = cobolField.name.toLowerCase
       val position = sparkFields.indexOf(fieldName)
       if (position < 0) {
         throw new IllegalArgumentException(s"Field '${cobolField.name}' not found in DataFrame schema.")
       }
-      position
+      (cobolField, position)
     }
@@
-    df.rdd.map { row =>
+    df.rdd.map { row =>
       val ar = new Array[Byte](size)
-
-      sparkFieldPositions.foreach { index =>
-        val fieldStr = row.get(index)
-        val cobolField = cobolFields(index)
-        cobolSchema.copybook.setPrimitiveField(cobolField, ar, fieldStr, 0)
-      }
+      fieldMappings.foreach { case (cobolField, sparkIndex) =>
+        val value = row.get(sparkIndex)
+        cobolSchema.copybook.setPrimitiveField(cobolField, ar, value, 0)
+      }
       ar
     }

This ensures each COBOL field is always paired with its correct Spark column, regardless of column ordering.

Likely an incorrect or invalid review comment.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)

34-75: End-to-end writer test is solid and readable.

  • Uses a simple copybook and deterministic ordering.
  • Validates raw bytes against expected EBCDIC, including 0x00 padding.

Nice coverage for the first iteration of the writer.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (2)

30-37: Good guard: verify encoder presence per field.

Early validation for missing encode is appropriate and will prevent silent corruption.
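
A hedged sketch of what such a per-field guard looks like; that encode is an Option on Primitive is taken from this review, while the helper name requireEncoders is made up:

import za.co.absa.cobrix.cobol.parser.ast.Primitive

// Fail fast if any copybook field lacks an encoder (currently only EBCDIC string fields have one).
def requireEncoders(cobolFields: Seq[Primitive]): Unit =
  cobolFields.foreach { field =>
    if (field.encode.isEmpty)
      throw new IllegalArgumentException(
        s"Field '${field.name}' has no encoder and cannot be written; only EBCDIC string fields are supported at the moment.")
  }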


92-100: getAst selection is reasonable for a single-root group scenario.

This keeps the writer focused on flat, primitive-only records for now.

Comment on lines 76 to 90
    mode match {
      case SaveMode.Overwrite =>
        val outputPath = new Path(path)
        val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
        val fs = outputPath.getFileSystem(hadoopConf)
        if (fs.exists(outputPath)) {
          fs.delete(outputPath, true)
        }
      case SaveMode.Append =>
        throw new IllegalArgumentException(
          s"Save mode '$mode' is not supported by the 'spark-cobol' data source at the moment. " +
            "Please use 'Overwrite' mode to write data to a file or folder."
        )
      case _ =>
    }

💡 Verification agent

🧩 Analysis chain

Honor SaveMode.ErrorIfExists and Ignore semantics.

Currently, only Overwrite is implemented and Append is rejected. ErrorIfExists and Ignore default to a no-op, which can lead to accidental overwrites or backend errors when the path exists.

Suggested minimal handling for the remaining modes:

     mode match {
       case SaveMode.Overwrite =>
         val outputPath = new Path(path)
         val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
         val fs = outputPath.getFileSystem(hadoopConf)
         if (fs.exists(outputPath)) {
           fs.delete(outputPath, true)
         }
       case SaveMode.Append =>
         throw new IllegalArgumentException(
           s"Save mode '$mode' is not supported by the 'spark-cobol' data source at the moment. " +
             "Please use 'Overwrite' mode to write data to a file or folder."
         )
+      case SaveMode.ErrorIfExists =>
+        val outputPath = new Path(path)
+        val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
+        val fs = outputPath.getFileSystem(hadoopConf)
+        if (fs.exists(outputPath)) {
+          throw new IllegalArgumentException(
+            s"Path '$path' already exists. SaveMode.ErrorIfExists prevents overwriting."
+          )
+        }
+      case SaveMode.Ignore =>
+        val outputPath = new Path(path)
+        val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
+        val fs = outputPath.getFileSystem(hadoopConf)
+        if (fs.exists(outputPath)) {
+          // Skip writing entirely and return the relation
+          val outSqlContext = sqlContext
+          return new BaseRelation {
+            override val sqlContext: SQLContext = outSqlContext
+            override def schema: StructType = data.schema
+          }
+        }
       case _ =>
     }

Please verify behavior with existing tests, and add writer-mode test cases if missing (Overwrite, ErrorIfExists, Ignore on existing path).


🏁 Script executed:

#!/bin/bash
# Locate the SaveMode match in DefaultSource.scala
echo ">>> mode match in DefaultSource.scala <<<"
rg -n "mode match" -C3 spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala

# Show all SaveMode references across the codebase
echo -e "\n>>> All SaveMode references in spark-cobol <<<"
rg -n "SaveMode\." -C2 spark-cobol

# Find any existing tests for SaveMode behaviors
echo -e "\n>>> Tests referencing SaveMode or DefaultSource <<<"
rg -nE "SaveMode|DefaultSource" -C2 -g "src/test/scala/**"

Length of output: 3832


🏁 Script executed:

#!/bin/bash
# Retrieve the signature of the createRelation method in DefaultSource.scala
echo ">>> createRelation signature in DefaultSource.scala <<<"
rg -n "def createRelation" -C2 spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala

Length of output: 1051


🏁 Script executed:

#!/bin/bash
# Show the writer createRelation method including the entire mode match and subsequent code
sed -n '60,120p' spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala

Length of output: 2443


Implement SaveMode.ErrorIfExists and SaveMode.Ignore in the writer
Currently, any mode other than Overwrite or Append falls through and the write proceeds unguarded, violating Spark’s semantics for ErrorIfExists and Ignore.

Please update createRelation in DefaultSource.scala as follows:

   mode match {
     case SaveMode.Overwrite =>
       val outputPath = new Path(path)
       val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
       val fs = outputPath.getFileSystem(hadoopConf)
       if (fs.exists(outputPath)) {
         fs.delete(outputPath, true)
       }
     case SaveMode.Append =>
       throw new IllegalArgumentException(
         s"Save mode '$mode' is not supported by the 'spark-cobol' data source at the moment. " +
           "Please use 'Overwrite' mode to write data to a file or folder."
       )
+    case SaveMode.ErrorIfExists =>
+      val outputPath = new Path(path)
+      val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
+      val fs = outputPath.getFileSystem(hadoopConf)
+      if (fs.exists(outputPath)) {
+        throw new IllegalArgumentException(
+          s"Path '$path' already exists; SaveMode.ErrorIfExists prevents overwriting."
+        )
+      }
+    case SaveMode.Ignore =>
+      val outputPath = new Path(path)
+      val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
+      val fs = outputPath.getFileSystem(hadoopConf)
+      if (fs.exists(outputPath)) {
+        // Skip the write entirely
+        return new BaseRelation {
+          override val sqlContext: SQLContext = sqlContext
+          override def schema: StructType = data.schema
+        }
+      }
     case _ =>
   }

– Add new unit/integration tests covering:
• SaveMode.ErrorIfExists on existing path (should throw)
• SaveMode.Ignore on existing path (should no-op and return schema only)
• Verify Overwrite still deletes and writes as before

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
around lines 76 to 90, the writer currently only handles Overwrite and Append
and falls through for other SaveModes; implement explicit handling for
SaveMode.ErrorIfExists and SaveMode.Ignore. For ErrorIfExists, check if the
output path exists and throw an IllegalStateException (or
IllegalArgumentException) if it does; for Ignore, if the path exists simply
return without writing (no-op) and ensure the relation/schema is returned; keep
existing Overwrite behavior. Add unit/integration tests to assert that
ErrorIfExists throws when path exists, Ignore no-ops and returns schema when
path exists, and Overwrite still deletes and writes as before.

Comment on lines 52 to 63
    df.rdd.map { row =>
      val ar = new Array[Byte](size)

      sparkFieldPositions.foreach { index =>
        val fieldStr = row.get(index)
        val cobolField = cobolFields(index)
        cobolSchema.copybook.setPrimitiveField(cobolField, ar, fieldStr, 0)
      }

      ar
    }
  }
coderabbitai bot commented Aug 19, 2025

💡 Verification agent

🧩 Analysis chain

Null-handling for DataFrame values is unspecified (encoding may crash).

row.get(dfIdx) can be null; most encoders are unlikely to accept null and will throw. Decide on a policy:

  • Fail fast with a descriptive error, or
  • Provide defaults (spaces for alphanumeric, zeros for numerics), configurable via a parameter.

Minimal fail-fast guard:

cobolFieldToDfIndex.foreach { case (cobolField, dfIdx) =>
  if (row.isNullAt(dfIdx)) {
    throw new IllegalArgumentException(s"Null value encountered for field '${cobolField.name}'.")
  }
  cobolSchema.copybook.setPrimitiveField(cobolField, ar, row.get(dfIdx), 0)
}

If you prefer defaults, I can propose a small helper that inspects cobolField.dataType and supplies appropriate fillers.


🏁 Script executed:

#!/bin/bash
# Check for any existing null‐handling in BasicRecordCombiner.scala
rg -n 'isNullAt' -C2 spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala

Length of output: 110


🏁 Script executed:

#!/bin/bash
# Locate the definition of setPrimitiveField to verify its null-handling behavior
rg -n "def setPrimitiveField" -C3 -g '*.scala'

Length of output: 847


Specify null-handling policy for DataFrame values

Currently in BasicRecordCombiner.scala rows with nulls will crash inside setPrimitiveField (via encode(null)). You need to choose one of:

  • Fail fast with a clear error when a required column is null
  • Supply defaults (spaces for alphanumeric, zeros for numerics), configurable per use case

Minimal fail-fast guard to insert in your map function:

df.rdd.map { row =>
  val ar = new Array[Byte](size)

  sparkFieldPositions.foreach { index =>
    if (row.isNullAt(index)) {
      val field = cobolFields(index)
      throw new IllegalArgumentException(
        s"Null value encountered for field '${field.name}' (index=$index)")
    }
    val value = row.get(index)
    val field = cobolFields(index)
    cobolSchema.copybook.setPrimitiveField(field, ar, value, 0)
  }

  ar
}

If you prefer defaults, implement a helper that inspects field.dataType and returns a filled value when row.isNullAt(index) is true.

🤖 Prompt for AI Agents
In
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
around lines 52-63, rows with null column values currently reach
setPrimitiveField and crash via encode(null); add a null-handling guard: before
calling row.get(index) check row.isNullAt(index) and either throw an
IllegalArgumentException that includes the field name and index (fail-fast) or,
if you prefer configurable defaults, call a helper that returns a default value
(spaces for alphanumeric, zeros for numerics) based on field.dataType and use
that instead; ensure the null check happens inside the sparkFieldPositions loop
so setPrimitiveField never receives null.

yruslan (Collaborator, Author) commented:

Fixed. Null values are allowed, so no exception is thrown. Zero bytes are left in those record fields by simply skipping setting the values.

coderabbitai bot replied:

@yruslan Perfect! Your solution aligns perfectly with the established Cobrix design philosophy. As noted in our knowledge base, the COBOL writer uses zero bytes (0x00) for null values across all field types to maintain consistency with the reader side and ensure round-trip fidelity.

By skipping the setPrimitiveField call for null values, you're leveraging the default zero-byte initialization of the byte array, which is exactly the intended behavior. This approach is much cleaner than explicitly setting values and maintains the design consistency throughout the library.


🧠 Learnings used
Learnt from: yruslan
PR: AbsaOSS/cobrix#775
File: spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala:0-0
Timestamp: 2025-08-19T09:12:31.838Z
Learning: In Cobrix COBOL writer, null values are encoded as zero bytes (0x00) for all field types, including PIC X fields, rather than using traditional COBOL padding (EBCDIC spaces 0x40 for PIC X). This design decision prioritizes consistency with the reader side, which already treats zero bytes as nulls, ensuring round-trip fidelity between reading and writing operations.

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (2)

81-108: SaveMode semantics implemented correctly (Overwrite, Append guard, ErrorIfExists, Ignore).

This addresses the earlier gap where ErrorIfExists/Ignore fell through. The behavior is now explicit and aligned with Spark expectations.


102-106: Good no-op handling for SaveMode.Ignore and fix for sqlContext capture.

Returning a lightweight BaseRelation and capturing outSqlContext avoids the earlier self-recursion pitfall.

🧹 Nitpick comments (3)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (2)

72-76: Verify writer-time sanity checks don’t require reader-only fields.

Calling CobolParametersValidator.checkSanity(cobolParameters) in writer mode may inadvertently require sourcePaths (a reader concern). If parse() doesn't map parameters("path") into sourcePaths, this will throw even though the writer only needs the copybook content or path.

  • If safe today, consider future-proofing with a writer-specific validation that only checks copybook presence.

Apply this diff locally if you introduce a writer-specific sanity check:

-    CobolParametersValidator.checkSanity(cobolParameters)
+    CobolParametersValidator.checkWriterSanity(cobolParameters)

And add this helper (outside this file) to CobolParametersValidator:

def checkWriterSanity(params: CobolParameters): Unit = {
  if (params.copybookPath.isEmpty && params.copybookContent.isEmpty && params.multiCopybookPath.isEmpty) {
    throw new IllegalArgumentException("Either copybook_path, copybook_contents, or copybooks should be specified for writing.")
  }
}

123-126: Minor consistency nit: prefer val for sqlContext override.

You use val above (Line 103) and def here. Use val in both places for consistency and marginally less overhead.

-    new BaseRelation {
-      override def sqlContext: SQLContext = outSqlContext
-      override def schema: StructType = data.schema
-    }
+    new BaseRelation {
+      override val sqlContext: SQLContext = outSqlContext
+      override def schema: StructType = data.schema
+    }
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)

166-187: Strengthen SaveMode.Ignore test to assert no-op on re-write.

Currently you only assert the path exists. Consider asserting no new part files were created or file contents are unchanged.

Example refinement:

-        df.write
+        // First write creates the output
+        df.write
           .format("cobol")
           .mode(SaveMode.Ignore)
           .option("copybook_contents", copybookContents)
           .save(path.toString)

-        df.write
+        // Capture state before second write
+        val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
+        val before = fs.listStatus(path).count(_.getPath.getName.startsWith("part-"))
+
+        // Second write should be a no-op
+        df.write
           .format("cobol")
           .mode(SaveMode.Ignore)
           .option("copybook_contents", copybookContents)
           .save(path.toString)
 
-        val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
-        assert(fs.exists(path), "Output directory should exist")
+        assert(fs.exists(path), "Output directory should exist")
+        val after = fs.listStatus(path).count(_.getPath.getName.startsWith("part-"))
+        assert(after == before, "SaveMode.Ignore should not create additional part files")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 6f7a75a and 1facde1.

📒 Files selected for processing (3)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (3 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
🧰 Additional context used
🧬 Code Graph Analysis (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (9)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (3)
  • cobrix (132-139)
  • CobolSchema (52-300)
  • CobolSchema (302-325)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
  • CobolParametersParser (39-991)
  • parse (225-299)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)
  • CobolParameters (67-109)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (2)
  • schema (78-80)
  • source (122-129)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/copybook/CopybookContentLoader.scala (2)
  • CopybookContentLoader (27-60)
  • load (29-42)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)
  • RawBinaryOutputFormat (41-58)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (2)
  • RecordCombinerSelector (22-38)
  • selectCombiner (34-36)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala (2)
  • CobolParametersValidator (32-116)
  • checkSanity (34-42)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1)
  • combine (26-65)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/fixtures/BinaryFileFixture.scala (1)
  • withTempDirectory (71-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Spark 3.5.5 on Scala 2.13.16
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: Spark 3.5.5 on Scala 2.12.20
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)

115-121: LGTM on Hadoop raw binary write path.

Mapping to (NullWritable, BytesWritable) and using RawBinaryOutputFormat is a clean, separator-free approach for COBOL data files.
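
For reference, the write step named in the walkthrough (saveAsNewAPIHadoopFile with NullWritable/BytesWritable and RawBinaryOutputFormat) boils down to something like the sketch below; the exact wiring inside DefaultSource may differ:

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.rdd.RDD
import za.co.absa.cobrix.spark.cobol.writer.RawBinaryOutputFormat

// Turn each encoded record into a (key, value) pair and let the raw binary
// output format write the value bytes back to back, with no separators.
def writeRawBinary(records: RDD[Array[Byte]], path: String): Unit = {
  records
    .map(bytes => (NullWritable.get(), new BytesWritable(bytes)))
    .saveAsNewAPIHadoopFile(
      path,
      classOf[NullWritable],
      classOf[BytesWritable],
      classOf[RawBinaryOutputFormat])
}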

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1)

46-51: Honor selected code page + make encoder null-safe.

Two issues in the EBCDIC branch:

  • The provided ebcdicCodePage argument is ignored; a fixed CodePageCommon mapping is used instead. This breaks configurable code page output.
  • a.toString will NPE on nulls; for PIC X fields nulls should serialize as space-filled.

Proposed fix:

  • Use the mapping from the provided code page.
  • Treat null as an empty string.
       case EBCDIC =>
-        val encoder = (a: Any) => {
-          encodeEbcdicString(a.toString, CodePageCommon.asciiToEbcdicMapping, fieldLength)
-        }
+        val encoder = (a: Any) => {
+          val s = if (a == null) "" else a.toString
+          // Use the configured code page mapping rather than the default
+          val table = ebcdicCodePage.asciiToEbcdicMapping
+          encodeEbcdicString(s, table, fieldLength, asciiCharset)
+        }
         Option(encoder)

Note: This relies on CodePage exposing asciiToEbcdicMapping. If the API differs, wire the appropriate accessor here.

🧹 Nitpick comments (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)

259-263: Mutual exclusivity check is correct; refine error message and consider guarding for missing copybook source

  • Throwing when both options are provided is the right call.
  • Nit: make the message more actionable for users.
  • Optional: also guard the case when none of copybook, copybooks, or copybook_contents is provided (if not enforced elsewhere).

Apply this message tweak:

-      throw new IllegalArgumentException(s"Options '$PARAM_COPYBOOK_PATH' and '$PARAM_MULTI_COPYBOOK_PATH' cannot be used together.")
+      throw new IllegalArgumentException(
+        s"Options '$PARAM_COPYBOOK_PATH' (single path) and '$PARAM_MULTI_COPYBOOK_PATH' (comma-separated list) are mutually exclusive. Use only one."
+      )

If you’d like, I can add a small guard right below this check to enforce that at least one copybook source is present:

// Place right after the mutual-exclusivity check
val hasCopybookSource =
  copybookPathOpt.isDefined ||
  copybookPaths.nonEmpty ||
  params.get(PARAM_COPYBOOK_CONTENTS).nonEmpty

if (!hasCopybookSource) {
  throw new IllegalArgumentException(
    s"One of '$PARAM_COPYBOOK_PATH', '$PARAM_MULTI_COPYBOOK_PATH', or '$PARAM_COPYBOOK_CONTENTS' must be provided."
  )
}
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (4)

39-44: Fix docstring: this returns an encoder, not a decoder.

Also clarify that the choice is based on the desired output encoding.

-  /** Gets a decoder function for a string data type. Encoder is chosen depending on whether input encoding is EBCDIC or ASCII */
+  /** Gets an encoder function for a string data type. The encoder is chosen depending on whether the output encoding is EBCDIC or ASCII. */

51-55: Consider supporting ASCII output for symmetry.

Currently returns None for ASCII. If ASCII output is a valid target, implement an ASCII encoder (left-justified, 0x20-filled), analogous to EBCDIC.

-      case ASCII =>
-        None
+      case ASCII =>
+        val encoder = (a: Any) => {
+          val s = if (a == null) "" else a.toString
+          val buf = new Array[Byte](fieldLength)
+          java.util.Arrays.fill(buf, 0x20.toByte) // ASCII space
+          val bytes = s.getBytes(asciiCharset)
+          var i = 0
+          while (i < bytes.length && i < fieldLength) {
+            buf(i) = bytes(i)
+            i += 1
+          }
+          buf
+        }
+        Option(encoder)

23-23: Narrow the import or use fully qualified Arrays.

Prefer importing java.util.Arrays directly (or using the fully qualified name) rather than importing the java.util package just to qualify Arrays.

-import java.util
+import java.util.Arrays

And adjust the fill call accordingly:

-    util.Arrays.fill(buf, 0x40.toByte)
+    Arrays.fill(buf, 0x40.toByte)

66-81: Improve encodeEbcdicString to honor a configurable charset and compute the fill byte from the conversion table

To make string→EBCDIC encoding correct for non-ASCII characters and avoid hard-coding 0x40, update the method as follows:

  • Add a charset: Charset = StandardCharsets.US_ASCII parameter (defaulting to US-ASCII for backward compatibility)
  • Derive the EBCDIC “space” from the conversion table at ASCII 0x20 instead of using a literal
  • Convert the input via string.getBytes(charset) and index the table with & 0xFF instead of the (asciiByte + 256) % 256 arithmetic
  • No changes to existing call sites are needed thanks to the default parameter
--- a/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
+++ b/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
@@ -63,8 +63,14 @@ object EncoderSelector {
-  def encodeEbcdicString(string: String, conversionTable: Array[Byte], length: Int): Array[Byte] = {
+  def encodeEbcdicString(
+    string: String,
+    conversionTable: Array[Byte],
+    length: Int,
+    charset: Charset = StandardCharsets.US_ASCII
+  ): Array[Byte] = {
     require(length >= 0, s"Field length cannot be negative, got $length")

-    val buf = new Array[Byte](length)
-    util.Arrays.fill(buf, 0x40.toByte)
+    val buf = new Array[Byte](length)
+    // PIC X fields are space-filled on the mainframe. Derive EBCDIC “space” from ASCII 0x20.
+    val ebcdicSpace: Byte = conversionTable(0x20)
+    java.util.Arrays.fill(buf, ebcdicSpace)

-    var i = 0
-    while (i < string.length && i < length) {
-      val asciiByte = string(i).toByte
-      buf(i) = conversionTable((asciiByte + 256) % 256)
+    val asciiBytes = string.getBytes(charset)
+    var i = 0
+    while (i < asciiBytes.length && i < length) {
+      buf(i) = conversionTable(asciiBytes(i) & 0xFF)
       i += 1
     }
     buf

• The only existing call site needs no change – the default charset kicks in.
• CodePageCommon.asciiToEbcdicMapping already provides the required mapping array.
• This preserves existing behavior for ASCII data and correctly handles non-ASCII via the chosen Charset.
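
To make the proposed signature concrete, here is a small usage sketch. The import paths and object-style access are assumptions based on the project layout, and the four-argument call only compiles once the charset parameter above is added:

import java.nio.charset.StandardCharsets

import za.co.absa.cobrix.cobol.parser.encoding.EncoderSelector
import za.co.absa.cobrix.cobol.parser.encoding.codepage.CodePageCommon

val table = CodePageCommon.asciiToEbcdicMapping

// "AB" into a PIC X(4) field: 0xC1 0xC2 followed by two EBCDIC spaces (0x40)
val padded = EncoderSelector.encodeEbcdicString("AB", table, 4)

// Only the first two characters of "ABCD" fit into a PIC X(2) field
val truncated = EncoderSelector.encodeEbcdicString("ABCD", table, 2)

// Non-ASCII input with an explicit charset (requires the new charset parameter)
val latin1 = EncoderSelector.encodeEbcdicString("é", table, 1, StandardCharsets.ISO_8859_1)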

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1facde1 and ff5de8f.

📒 Files selected for processing (8)
  • README.md (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (2 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1 hunks)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (2 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1 hunks)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala
  • README.md
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
🧰 Additional context used
🧬 Code Graph Analysis (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/datatype/AlphaNumeric.scala (1)
  • AlphaNumeric (28-36)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (1)
  • get (47-50)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)
  • CobolParameters (67-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Spark 3.5.5 on Scala 2.12.20
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: Spark 3.5.5 on Scala 2.13.16
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)

265-267: Constructor arguments aligned with the updated CobolParameters signature

The new (copybookPathOpt, copybookPaths, ...) ordering matches the case class definition and intent. Looks good.


249-257: Review multi-copybook default behavior and consider deduplication

  • The parsing logic (split → trim → filter(_.nonEmpty)) is solid.

  • Behavior change: when PARAM_MULTI_COPYBOOK_PATH is absent, multiCopybookPath is now Seq.empty (not Seq("")). Downstream checks like
    if (params.copybookPath.isEmpty && params.copybookContent.isEmpty && params.multiCopybookPath.isEmpty)
    in CobolParametersValidator.scala (line 39) will now treat “no paths provided” as empty and throw the “Either copybook path or copybook content must be specified” exception.
    → Verify this is intended or adjust the validator to handle the legacy case.

  • Optional refactor: deduplicate user-provided paths to avoid accidental repeats:

         paths.split(',')
           .iterator
           .map(_.trim)
           .filter(_.nonEmpty)
           .toSeq
    +      .distinct

     (Applying distinct after toSeq keeps the chain compiling on Scala 2.11/2.12, where Iterator has no distinct.)

Key locations to review:

  • cobol-parser/src/main/scala/…/CobolParametersParser.scala (multi-copybook parsing)
  • spark-cobol/src/main/scala/…/CobolParametersValidator.scala:39 (empty-sequence check)
  • spark-cobol/src/main/scala/…/CopybookContentLoader.scala:40 (mapping over multiCopybookPath)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (2)

17-17: Package relocation looks good.

The move to the encoding package is consistent with the new writer abstractions. No issues spotted here.


72-74: Space prefill is correct; ensure it stays code-page aware.

Filling with EBCDIC space is aligned with PIC X semantics. After adopting the conversion table–derived space byte (see refactor above), the behavior will remain correct even for non-default EBCDIC code pages.

If you choose not to adopt the conversion-table approach, please confirm your supported EBCDIC code pages all use 0x40 for space.
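
A lightweight guard for that assumption could look like the sketch below; the writerCodePages list is hypothetical, since only the common code page is wired into the writer today:

import za.co.absa.cobrix.cobol.parser.encoding.codepage.CodePageCommon

// Hypothetical guard: as code pages gain ASCII->EBCDIC tables for writing,
// register them here so the 0x40 space-prefill assumption is checked explicitly.
val writerCodePages: Seq[(String, Array[Byte])] = Seq(
  "common" -> CodePageCommon.asciiToEbcdicMapping
)

writerCodePages.foreach { case (name, table) =>
  assert(table(0x20) == 0x40.toByte, s"Code page '$name' does not encode space as 0x40")
}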

@yruslan yruslan force-pushed the feature/415-implement-cobol-writer branch from ff5de8f to 01e8709 Compare August 19, 2025 09:24

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
README.md (3)

1671-1677: Improve the usage snippet for clarity (import SaveMode, be explicit about F format)

Adding the import helps copy-pasteability, and being explicit about record_format = F reinforces the current limitation.

Apply this diff:

+import org.apache.spark.sql.SaveMode
+
 df.write
   .format("cobol")
   .mode(SaveMode.Overwrite)
+  .option("record_format", "F")
   .option("copybook_contents", copybookContents)
   .save("/some/output/path")

1681-1688: Document schema/column requirements upfront

Readers will benefit from knowing how the writer maps DataFrame columns to copybook fields.


Consider adding a bullet:

 ### Current Limitations
 The writer is still in its early stages and has several limitations:
+- DataFrame schema must be flat and 1:1 with the copybook: column names should match copybook field names and be provided in the same order.

1690-1691: Clarify writer vs reader code page support to avoid confusion

The project supports many EBCDIC code pages for reading elsewhere in the README. Rephrase this bullet to explicitly scope the limitation to the writer.

-- Only the core EBCDIC encoder is supported; specific EBCDIC code pages are not yet available.
+- Writer currently supports only the core EBCDIC encoder ("common"). Specific CCSIDs and per-field code pages are supported for reading, but not yet for writing.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ff5de8f and 01e8709.

📒 Files selected for processing (8)
  • README.md (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (3 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1 hunks)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (2 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1 hunks)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
🧰 Additional context used
🧬 Code Graph Analysis (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (1)
  • get (47-50)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)
  • CobolParameters (67-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: Spark 3.5.5 on Scala 2.13.16
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
  • GitHub Check: Spark 3.5.5 on Scala 2.12.20
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (3)
README.md (1)

1666-1670: Good, clear introduction for the experimental writer

The positioning and scope are clear. Nice job calling out that this is experimental and writer-specific.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)

266-269: Constructor wiring matches the new CobolParameters API

Passing (copybookPathOpt, copybookPaths, ...) aligns with the updated case class signature. No issues spotted here.


249-257: Defaulting multiCopybookPath to empty Seq is safe

  • Ripgrep searches for .head, apply(0), or length-based checks around multiCopybookPath returned no matches.
  • In CopybookContentLoader.scala (line 40), we only use parameters.multiCopybookPath.map(...), which gracefully handles an empty Seq.
  • In CobolParametersValidator.scala (line 39), we explicitly check multiCopybookPath.isEmpty (alongside other params) to enforce that at least one copybook is provided.

No downstream code assumes a non-empty placeholder, so changing the default from Seq("") to Seq.empty has no adverse effects.

@yruslan yruslan force-pushed the feature/415-implement-cobol-writer branch from 01e8709 to 6013341 Compare August 19, 2025 10:38
@yruslan yruslan merged commit 7f63354 into master Aug 19, 2025
7 checks passed
@yruslan yruslan deleted the feature/415-implement-cobol-writer branch August 19, 2025 10:45