#415 Implement the basic version of EBCDIC writer #775
Walkthrough

Adds an experimental Spark DataFrame writer for fixed-length COBOL/EBCDIC records (writer API, combiner abstraction, basic combiner implementation, raw binary output format, and tests); relocates EncoderSelector into an encoding package with a buffer prefill to 0x40; updates parser imports; adjusts multi-copybook parameter parsing; and appends duplicate README writer documentation blocks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor User as Spark Job
participant DS as DefaultSource (writer)
participant CCL as CopybookContentLoader
participant CS as CobolSchema
participant RCS as RecordCombinerSelector
participant RC as BasicRecordCombiner
participant H as RawBinaryOutputFormat
User->>DS: createRelation(sqlContext, SaveMode, params, df)
DS->>CCL: load copybook (path or copybook_contents)
CCL-->>DS: copybook text
DS->>CS: build CobolSchema (AST, sizes)
CS-->>DS: CobolSchema
DS->>RCS: selectCombiner(schema, readerParams)
RCS-->>DS: BasicRecordCombiner
DS->>RC: combine(df, schema, readerParams)
RC-->>DS: RDD[Array[Byte]]
alt SaveMode = Overwrite / Create
DS->>H: saveAsNewAPIHadoopFile(path, NullWritable, BytesWritable)
H-->>User: bytes written (part files)
else SaveMode = Append
DS-->>User: IllegalArgumentException (append unsupported)
else SaveMode = Ignore / ErrorIfExists
DS-->>User: no-op or IllegalArgumentException
    end
```
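For orientation, here is a minimal usage sketch of the writer described above (the copybook, data, and output path are illustrative; the format name and options follow the README section added by this PR):

```scala
// Writing a DataFrame as fixed-length EBCDIC records using the experimental writer.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("ebcdic-writer-example").getOrCreate()
import spark.implicits._

// Copybook describing the fixed-length record layout (illustrative).
val copybookContents =
  """       01  RECORD.
    |           05  A    PIC X(1).
    |           05  B    PIC X(5).
    |""".stripMargin

val df = Seq(("1", "Tesla"), ("2", "Apple")).toDF("A", "B")

df.write
  .format("cobol")                                // data source provided by spark-cobol
  .mode(SaveMode.Overwrite)                       // only Overwrite is supported initially
  .option("copybook_contents", copybookContents)  // copybook drives field sizes and order
  .save("/tmp/ebcdic_out")
```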
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limits.
🔭 Outside diff range comments (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (2)
Lines 39-55: Use the provided code page and charset; avoid hardcoding CodePageCommon

getStringEncoder ignores both ebcdicCodePage and asciiCharset and always uses CodePageCommon. This makes the API misleading and prevents future extension from working as expected.
- Leverage ebcdicCodePage’s ASCII→EBCDIC mapping instead of the hardcoded one.
- If you keep ASCII unimplemented (None), that’s fine for now, but do use asciiCharset whenever you convert Unicode string input to bytes.
Proposed change:
```diff
   private def getStringEncoder(encoding: Encoding,
                                ebcdicCodePage: CodePage,
                                asciiCharset: Charset,
                                fieldLength: Int): Option[Encoder] = {
     encoding match {
       case EBCDIC =>
-        val encoder = (a: Any) => {
-          encodeEbcdicString(a.toString, CodePageCommon.asciiToEbcdicMapping, fieldLength)
-        }
+        val encoder = (a: Any) => {
+          // Honor selected code page for ASCII→EBCDIC conversion
+          encodeEbcdicString(a.toString, ebcdicCodePage.asciiToEbcdicMapping, fieldLength)
+        }
         Option(encoder)
       case ASCII =>
         None
       case _ =>
         None
     }
   }
```
Lines 65-77: Pad PIC X fields with EBCDIC spaces (0x40) instead of zeros; avoid data corruption

The current implementation initializes the output buffer with zeros, which will overwrite existing space-filled fields with 0x00 bytes on write. That is not a valid filler for EBCDIC text fields and can break downstream consumers. It also risks tests passing only because decoders trim control bytes.
- Initialize the buffer with 0x40 (EBCDIC space).
- Optional but recommended: obtain ASCII bytes via asciiCharset (if you extend the signature), otherwise keep as-is until ASCII support is added.
Minimal fix (keeps the current signature):
```diff
 def encodeEbcdicString(string: String, conversionTable: Array[Byte], length: Int): Array[Byte] = {
   require(length >= 0, s"Field length cannot be negative, got $length")
   var i = 0
   val buf = new Array[Byte](length)
+  // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
+  java.util.Arrays.fill(buf, 0x40.toByte)
   while (i < string.length && i < length) {
     val asciiByte = string(i).toByte
     buf(i) = conversionTable((asciiByte + 256) % 256)
     i = i + 1
   }
   buf
 }
```

Follow-up: consider switching to string.getBytes(asciiCharset) (and passing the charset down) for robust non-ASCII handling.
🧹 Nitpick comments (15)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
Lines 255-257: Define precedence/validation between 'copybook' and 'copybooks'

Right now both options can be set simultaneously with no explicit rule. Consider validating exclusivity (or documenting a clear precedence) to avoid ambiguous configurations.
Apply this diff to enforce mutual exclusivity and avoid double lookups:
```diff
-    val cobolParameters = CobolParameters(
-      getParameter(PARAM_COPYBOOK_PATH, params),
-      copybookPaths,
+    val copybookPathOpt = getParameter(PARAM_COPYBOOK_PATH, params)
+    if (copybookPathOpt.nonEmpty && copybookPaths.nonEmpty) {
+      throw new IllegalArgumentException(s"Options '$PARAM_COPYBOOK_PATH' and '$PARAM_MULTI_COPYBOOK_PATH' cannot be used together.")
+    }
+
+    val cobolParameters = CobolParameters(
+      copybookPathOpt,
+      copybookPaths,
```

If mutual exclusivity already exists elsewhere, point me to it and I'll align this check or drop the suggestion.
Lines 249-253: Normalize and clean 'multiCopybookPath' by trimming and dropping empty entries

Verified that no downstream logic depends on retaining empty-string entries in the multiCopybookPath sequence (e.g., there are no .headOption or size checks that assume a length ≥ 1 for Seq("")). To make parsing of comma-separated copybook paths robust, update the splitter to trim whitespace and filter out blank tokens.

• File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
• Lines: ~249–253

Apply this diff:
```diff
-    val copybookPaths = params.get(PARAM_MULTI_COPYBOOK_PATH) match {
-      case Some(paths) => paths.split(',').toSeq
-      case None        => Seq.empty[String]
-    }
+    val copybookPaths = params.get(PARAM_MULTI_COPYBOOK_PATH) match {
+      case Some(paths) =>
+        paths
+          .split(',')
+          .iterator
+          .map(_.trim)
+          .filter(_.nonEmpty)
+          .toSeq
+      case None        => Seq.empty[String]
+    }
+    // Optional: deduplicate if the same path may appear multiple times
+    // val copybookPaths = copybookPaths.distinct
```

This change ensures inputs like:
– "a, b , ,c" ⇒ Seq("a","b","c")
– "copybooks=" or an all-whitespace string ⇒ Seq.empty
README.md (1)
Lines 1666-1693: Document padding/truncation semantics and code page scope for the writer

Good to see a clear, scoped introduction of the experimental writer. Two clarifications will save users time:
- State how PIC X(n) values are handled: truncated when longer than n and right-padded when shorter. Specify the padding byte (EBCDIC space 0x40).
- Reiterate that only the default/common EBCDIC mapping is used for output today, and that per-field or alternate code pages are not honored yet (options will be ignored).
This avoids surprises when bytes are inspected or mainframe readers expect space-filled fields.
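If the documentation spells this out, a tiny illustration helps. Assuming the common EBCDIC mapping (where 'J' = 0xD1, 'O' = 0xD6, 'H' = 0xC8, 'N' = 0xD5 and space = 0x40), a PIC X(6) field under the behavior proposed here would look like:

```scala
// Illustrative only: PIC X(6) semantics suggested in this comment.
// Shorter values are right-padded with EBCDIC spaces (0x40); longer values are truncated.
val shortValue = "JOHN"      // -> 0xD1 0xD6 0xC8 0xD5 0x40 0x40
val longValue  = "JOHNATHAN" // -> only the first 6 characters are written ("JOHNAT")
```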
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala (1)
Line 26: Import refactor looks correct; EncoderSelector remains in scope via encoding._

The move of EncoderSelector to the encoding package is reflected by the wildcard import. No functional issues expected.

For clarity and to avoid future shadowing, consider explicit imports:

```scala
import za.co.absa.cobrix.cobol.parser.encoding.{Encoding, EncoderSelector}
```
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (1)
Lines 208-219: Add an assertion to verify EBCDIC space-fill behavior on setFieldValueByName

Once the encoder pads with 0x40 (EBCDIC spaces), it's worth asserting that the trailing bytes in the field are 0x40 after setting a shorter value to prevent regressions. This will catch accidental zero-filling.

Example (outside this hunk, for illustration):

```scala
// After setting "NEWNAME" into X(10)
val fieldOffset = /* compute offset of SHORT-NAME within the record */
val fieldLen = 10
val trailing = bytes.slice(fieldOffset + "NEWNAME".length, fieldOffset + fieldLen)
assert(trailing.forall(_ == 0x40.toByte))
```

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/NonTerminalsAdder.scala (1)
Lines 76-77: Symmetric decode/encode wiring for virtual fields is fine; confirm EBCDIC-only encoding expectation.

EncoderSelector.getEncoder returns None for ASCII. Given the current writer throws when encode is empty (see BasicRecordCombiner), this effectively constrains non-terminals to EBCDIC for writing. If that's intentional for the first iteration, we're good; otherwise, plan ASCII support or a clearer error message upstream.

Would you like a follow-up PR that adds ASCII string encoders and a targeted error message when a copybook field is not encodable?
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)
Lines 43-46: Optional: avoid forcing the “.dat” extension (or make it configurable).

Hardcoding “.dat” is fine for tests, but you might prefer to:
- inherit the default extension: getDefaultWorkFile(context, ""), or
- make the extension configurable via a job conf (e.g., cobol.writer.output.extension).

Low priority; no functional impact.

```diff
-    val path: Path = getDefaultWorkFile(context, ".dat")
+    val extension = context.getConfiguration.get("cobol.writer.output.extension", "")
+    val path: Path = getDefaultWorkFile(context, extension)
```

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombiner.scala (1)
Lines 24-26: Trait definition is clear; consider making it Serializable.

Implementations are likely to be captured in Spark closures. Declaring the trait Serializable helps avoid accidental serialization issues for stateful implementations later.

```diff
-trait RecordCombiner {
+trait RecordCombiner extends Serializable {
   def combine(df: DataFrame, cobolSchema: CobolSchema, readerParameters: ReaderParameters): RDD[Array[Byte]]
 }
```

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (2)
Lines 41-48: Minor: coalesce after ordering to avoid an extra shuffle.

repartition(1).orderBy("A") shuffles twice. For a single-file deterministic output, orderBy("A").coalesce(1) reduces shuffling.

```diff
-    df.repartition(1)
-      .orderBy("A")
+    df.orderBy("A")
+      .coalesce(1)
       .write
```
Lines 63-67: Optional: derive expected bytes using the library's code page mapping to avoid magic constants.

Hardcoded hex is fine for a golden test. Alternatively, using CodePageCommon.asciiToEbcdicMapping to build expected reduces maintenance if mappings evolve.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (2)
Lines 24-33: Doc nit: fix minor typo. “ot output record” → “of the output record”.

```diff
- * @param cobolSchema The COBOL schema ot output record.
+ * @param cobolSchema The COBOL schema of the output record.
```
Lines 34-36: Selector stub is fine; consider early routing for record formats.

As support expands (FB, VB, V, RDW/BDW), this selector can examine readerParameters.recordFormat and isRecordSequence/rdw to pick an appropriate combiner (e.g., length-prefixed writer vs. fixed-length). Not required now.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (2)
Lines 92-94: No fallback when copybook is absent (contradicts PR objective of deriving from DataFrame schema).

This path unconditionally loads copybook content and will throw if none is provided. The PR objectives mention deriving a default copybook from the Spark schema; consider a fallback here when the copybook is not specified.

I can help scaffold:
- A minimal Schema→Copybook derivation for primitive types used by BasicRecordCombiner (see the sketch after this comment).
- A parameter flag (e.g., derive_copybook=true) to opt-in while experimental.

Would you like me to propose an implementation?
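A rough sketch of what such a derivation could look like (a hypothetical helper, not part of this PR; the PIC sizes and the RECORD group name are arbitrary illustrative defaults):

```scala
// Hypothetical Schema -> copybook derivation for flat schemas with primitive types only.
// Field names are upper-cased; PIC clauses below are illustrative defaults, not the PR's behavior.
import org.apache.spark.sql.types._

def deriveCopybook(schema: StructType, defaultStringLength: Int = 50): String = {
  val fieldLines = schema.fields.map { field =>
    val pic = field.dataType match {
      case StringType  => s"PIC X($defaultStringLength)"
      case IntegerType => "PIC 9(9)"
      case LongType    => "PIC 9(18)"
      case ShortType   => "PIC 9(4)"
      case other       => throw new IllegalArgumentException(s"Unsupported type for copybook derivation: $other")
    }
    f"           05  ${field.name.toUpperCase}%-20s $pic."
  }
  ("       01  RECORD." +: fieldLines).mkString("\n")
}
```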
Lines 95-101: Consider logging writer configuration for traceability.

Reader path logs build info; writer path does not log anything. Adding a concise log (path, record format, code page) aids debugging.

Example:

```scala
logger.info(
  s"Writing COBOL data to '$path' with recordFormat=${readerParameters.recordFormat}, " +
  s"isEbcdic=${readerParameters.isEbcdic}, codePage=${readerParameters.ebcdicCodePage}"
)
```

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1)
Lines 65-90: Schema validation should ignore fillers to reduce false negatives.

Depending on parser settings, filler fields may still be present in the AST. Requiring them in the DataFrame can be onerous. Consider excluding fillers from both the not-found and unsupported checks.

Example adjustment:

```scala
val children = ast.children.filter {
  case p: Primitive => !p.isFiller
  case g: Group     => !g.isFiller
  case _            => true
}
```

Then use children instead of ast.children for the subsequent validations and mapping.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
- README.md (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/NonTerminalsAdder.scala (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1 hunks)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (3 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombiner.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (9)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/NonTerminalsAdder.scala (3)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala (6): parser (114-168), parser (170-179), parser (181-190), parser (193-254), parser (256-277), DecoderSelector (30-352)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/FloatingPointFormat.scala (1): FloatingPointFormat (19-44)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1): EncoderSelector (24-79)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (3)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala (6): parser (114-168), parser (170-179), parser (181-190), parser (193-254), parser (256-277), DecoderSelector (30-352)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/Encoding.scala (1): EBCDIC (23-23)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1): EncoderSelector (24-79)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
  - spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/fixtures/BinaryFileFixture.scala (1): withTempDirectory (71-78)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1): EncoderSelector (24-79)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombiner.scala (3)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1): cobrix (132-139)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1): schema (110-110)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1): combine (26-63)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala (1)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala (6): parser (114-168), parser (170-179), parser (181-190), parser (193-254), parser (256-277), DecoderSelector (30-352)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (5)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (3): cobrix (132-139), CobolSchema (52-300), CobolSchema (302-325)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/copybook/CopybookContentLoader.scala (2): CopybookContentLoader (27-60), load (29-42)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1): RawBinaryOutputFormat (41-58)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (2): RecordCombinerSelector (22-38), selectCombiner (34-36)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1): combine (26-63)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (1): get (47-50)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1): CobolParameters (67-109)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (4)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1): Primitive (42-140)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1): ReaderParameters (79-133)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1): schema (110-110)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (1): setPrimitiveField (230-248)
🔇 Additional comments (9)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr/ParserVisitor.scala (1)
Lines 858-860: Encoder wiring for primitives is consistent with current scope

Binding decoders and (optional) encoders at AST construction time keeps downstream logic simple. Given current encoder constraints (PIC X only), the Option is appropriate.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)
Lines 20-22: Import path update is correct and preserves behavior

Switching the EncoderSelector import to encoding aligns with the package move; no runtime changes. The encode type remains Optional as required by current encoder capabilities.
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (1)
Lines 23-25: Test imports updated correctly to match EncoderSelector package move

The test continues to construct Primitive with both decoder and (optional) encoder. Imports align with the refactor.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/NonTerminalsAdder.scala (1)
Lines 23-26: Import path update looks correct; no behavioral change.

Switching EncoderSelector to the encoding package and grouping the import with Encoding is consistent with the refactor. DecoderSelector remains from decoders; types resolve cleanly.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)
Lines 41-58: LGTM: minimal, correct raw binary writer.
- Writes exactly value.getLength bytes without separators.
- Uses the task work file via getDefaultWorkFile; close is handled.

This is appropriate for fixed-length and “self-delimiting” variable-length records (headers embedded).
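For readers unfamiliar with Hadoop output formats, a minimal sketch of this pattern (an illustrative stand-alone class, not the PR's actual RawBinaryOutputFormat) looks like this:

```scala
// Minimal sketch of a Hadoop output format that writes each BytesWritable value verbatim,
// with no separators between records (fixed-length records need no delimiter).
import org.apache.hadoop.fs.{FSDataOutputStream, Path}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

class RawBytesOutputFormat extends FileOutputFormat[NullWritable, BytesWritable] {
  override def getRecordWriter(context: TaskAttemptContext): RecordWriter[NullWritable, BytesWritable] = {
    val path: Path = getDefaultWorkFile(context, ".dat")
    val out: FSDataOutputStream = path.getFileSystem(context.getConfiguration).create(path, false)

    new RecordWriter[NullWritable, BytesWritable] {
      override def write(key: NullWritable, value: BytesWritable): Unit =
        out.write(value.getBytes, 0, value.getLength) // only the valid prefix of the backing array

      override def close(context: TaskAttemptContext): Unit = out.close()
    }
  }
}
```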
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombiner.scala (1)
Lines 24-26: Critical: Fix field-to-column mapping in BasicRecordCombiner to prevent data corruption

The current implementation in BasicRecordCombiner.scala builds sparkFieldPositions: Seq[Int] by mapping each cobolField to its DataFrame column index, but then uses those same indices to look up both the value in the Row and the corresponding cobolFields(index). If the DataFrame columns aren't in the exact order of the copybook, values will be written into the wrong fields.

Locations:
- File: spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
- Method: combine, around lines 38–57

Suggested fix:
- Build a mapping of (cobolField, sparkIndex) pairs.
- Iterate over that mapping when encoding, so each primitive is set on the correct cobolField.

Illustrative patch:

```diff
-    // current: Seq of just spark column indexes
-    val sparkFieldPositions = cobolFields.map { cobolField =>
+    // map each cobolField to its DataFrame index
+    val fieldMappings: Seq[(Primitive, Int)] = cobolFields.map { cobolField =>
       val fieldName = cobolField.name.toLowerCase
       val position = sparkFields.indexOf(fieldName)
       if (position < 0) {
         throw new IllegalArgumentException(s"Field '${cobolField.name}' not found in DataFrame schema.")
       }
-      position
+      (cobolField, position)
     }
@@
-    df.rdd.map { row =>
+    df.rdd.map { row =>
       val ar = new Array[Byte](size)
-
-      sparkFieldPositions.foreach { index =>
-        val fieldStr = row.get(index)
-        val cobolField = cobolFields(index)
-        cobolSchema.copybook.setPrimitiveField(cobolField, ar, fieldStr, 0)
-      }
+      fieldMappings.foreach { case (cobolField, sparkIndex) =>
+        val value = row.get(sparkIndex)
+        cobolSchema.copybook.setPrimitiveField(cobolField, ar, value, 0)
+      }
       ar
     }
```

This ensures each COBOL field is always paired with its correct Spark column, regardless of column ordering.
Likely an incorrect or invalid review comment.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
34-75
: End-to-end writer test is solid and readable.
- Uses a simple copybook and deterministic ordering.
- Validates raw bytes against expected EBCDIC, including 0x00 padding.
Nice coverage for the first iteration of the writer.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (2)
Lines 30-37: Good guard: verify encoder presence per field.

Early validation for missing encode is appropriate and will prevent silent corruption.

Lines 92-100: getAst selection is reasonable for a single-root group scenario.

This keeps the writer focused on flat, primitive-only records for now.
```scala
mode match {
  case SaveMode.Overwrite =>
    val outputPath = new Path(path)
    val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
    val fs = outputPath.getFileSystem(hadoopConf)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true)
    }
  case SaveMode.Append =>
    throw new IllegalArgumentException(
      s"Save mode '$mode' is not supported by the 'spark-cobol' data source at the moment. " +
      "Please use 'Overwrite' mode to write data to a file or folder."
    )
  case _ =>
}
```
💡 Verification agent
🧩 Analysis chain
Honor SaveMode.ErrorIfExists and Ignore semantics.
Currently, only Overwrite is implemented and Append is rejected. ErrorIfExists and Ignore default to a no-op, which can lead to accidental overwrites or backend errors when the path exists.
Suggested minimal handling for the remaining modes:
mode match {
case SaveMode.Overwrite =>
val outputPath = new Path(path)
val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
val fs = outputPath.getFileSystem(hadoopConf)
if (fs.exists(outputPath)) {
fs.delete(outputPath, true)
}
case SaveMode.Append =>
throw new IllegalArgumentException(
s"Save mode '$mode' is not supported by the 'spark-cobol' data source at the moment. " +
"Please use 'Overwrite' mode to write data to a file or folder."
)
+ case SaveMode.ErrorIfExists =>
+ val outputPath = new Path(path)
+ val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
+ val fs = outputPath.getFileSystem(hadoopConf)
+ if (fs.exists(outputPath)) {
+ throw new IllegalArgumentException(
+ s"Path '$path' already exists. SaveMode.ErrorIfExists prevents overwriting."
+ )
+ }
+ case SaveMode.Ignore =>
+ val outputPath = new Path(path)
+ val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
+ val fs = outputPath.getFileSystem(hadoopConf)
+ if (fs.exists(outputPath)) {
+ // Skip writing entirely and return the relation
+ val outSqlContext = sqlContext
+ return new BaseRelation {
+ override val sqlContext: SQLContext = outSqlContext
+ override def schema: StructType = data.schema
+ }
+ }
case _ =>
}
Please verify behavior with existing tests, and add writer-mode test cases if missing (Overwrite, ErrorIfExists, Ignore on existing path).
🏁 Script executed:
#!/bin/bash
# Locate the SaveMode match in DefaultSource.scala
echo ">>> mode match in DefaultSource.scala <<<"
rg -n "mode match" -C3 spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
# Show all SaveMode references across the codebase
echo -e "\n>>> All SaveMode references in spark-cobol <<<"
rg -n "SaveMode\." -C2 spark-cobol
# Find any existing tests for SaveMode behaviors
echo -e "\n>>> Tests referencing SaveMode or DefaultSource <<<"
rg -nE "SaveMode|DefaultSource" -C2 -g "src/test/scala/**"
Length of output: 3832
🏁 Script executed:
#!/bin/bash
# Retrieve the signature of the createRelation method in DefaultSource.scala
echo ">>> createRelation signature in DefaultSource.scala <<<"
rg -n "def createRelation" -C2 spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
Length of output: 1051
🏁 Script executed:
#!/bin/bash
# Show the writer createRelation method including the entire mode match and subsequent code
sed -n '60,120p' spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
Length of output: 2443
Implement SaveMode.ErrorIfExists and SaveMode.Ignore in the writer
Currently, any mode other than Overwrite or Append falls through and the write proceeds unguarded, violating Spark’s semantics for ErrorIfExists and Ignore.
Please update createRelation in DefaultSource.scala as follows:
mode match {
case SaveMode.Overwrite =>
val outputPath = new Path(path)
val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
val fs = outputPath.getFileSystem(hadoopConf)
if (fs.exists(outputPath)) {
fs.delete(outputPath, true)
}
case SaveMode.Append =>
throw new IllegalArgumentException(
s"Save mode '$mode' is not supported by the 'spark-cobol' data source at the moment. " +
"Please use 'Overwrite' mode to write data to a file or folder."
)
+ case SaveMode.ErrorIfExists =>
+ val outputPath = new Path(path)
+ val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
+ val fs = outputPath.getFileSystem(hadoopConf)
+ if (fs.exists(outputPath)) {
+ throw new IllegalArgumentException(
+ s"Path '$path' already exists; SaveMode.ErrorIfExists prevents overwriting."
+ )
+ }
+ case SaveMode.Ignore =>
+ val outputPath = new Path(path)
+ val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
+ val fs = outputPath.getFileSystem(hadoopConf)
+ if (fs.exists(outputPath)) {
+ // Skip the write entirely
+ return new BaseRelation {
+ override val sqlContext: SQLContext = sqlContext
+ override def schema: StructType = data.schema
+ }
+ }
case _ =>
}
– Add new unit/integration tests covering:
• SaveMode.ErrorIfExists on existing path (should throw)
• SaveMode.Ignore on existing path (should no-op and return schema only)
• Verify Overwrite still deletes and writes as before
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala around lines 76 to 90, the writer currently only handles Overwrite and Append
and falls through for other SaveModes; implement explicit handling for
SaveMode.ErrorIfExists and SaveMode.Ignore. For ErrorIfExists, check if the
output path exists and throw an IllegalStateException (or
IllegalArgumentException) if it does; for Ignore, if the path exists simply
return without writing (no-op) and ensure the relation/schema is returned; keep
existing Overwrite behavior. Add unit/integration tests to assert that
ErrorIfExists throws when path exists, Ignore no-ops and returns schema when
path exists, and Overwrite still deletes and writes as before.
```scala
    df.rdd.map { row =>
      val ar = new Array[Byte](size)

      sparkFieldPositions.foreach { index =>
        val fieldStr = row.get(index)
        val cobolField = cobolFields(index)
        cobolSchema.copybook.setPrimitiveField(cobolField, ar, fieldStr, 0)
      }

      ar
    }
  }
```
💡 Verification agent
🧩 Analysis chain
Null-handling for DataFrame values is unspecified (encoding may crash).
row.get(dfIdx) can be null; most encoders are unlikely to accept null and will throw. Decide on a policy:
- Fail fast with a descriptive error, or
- Provide defaults (spaces for alphanumeric, zeros for numerics), configurable via a parameter.
Minimal fail-fast guard:
cobolFieldToDfIndex.foreach { case (cobolField, dfIdx) =>
if (row.isNullAt(dfIdx)) {
throw new IllegalArgumentException(s"Null value encountered for field '${cobolField.name}'.")
}
cobolSchema.copybook.setPrimitiveField(cobolField, ar, row.get(dfIdx), 0)
}
If you prefer defaults, I can propose a small helper that inspects cobolField.dataType and supplies appropriate fillers.
🏁 Script executed:
#!/bin/bash
# Check for any existing null‐handling in BasicRecordCombiner.scala
rg -n 'isNullAt' -C2 spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
Length of output: 110
🏁 Script executed:
#!/bin/bash
# Locate the definition of setPrimitiveField to verify its null-handling behavior
rg -n "def setPrimitiveField" -C3 -g '*.scala'
Length of output: 847
Specify null-handling policy for DataFrame values
Currently in BasicRecordCombiner.scala rows with nulls will crash inside setPrimitiveField (via encode(null)). You need to choose one of:
- Fail fast with a clear error when a required column is null
- Supply defaults (spaces for alphanumeric, zeros for numerics), configurable per use case
Minimal fail-fast guard to insert in your map function:
df.rdd.map { row =>
val ar = new Array[Byte](size)
sparkFieldPositions.foreach { index =>
if (row.isNullAt(index)) {
val field = cobolFields(index)
throw new IllegalArgumentException(
s"Null value encountered for field '${field.name}' (index=$index)")
}
val value = row.get(index)
val field = cobolFields(index)
cobolSchema.copybook.setPrimitiveField(field, ar, value, 0)
}
ar
}
If you prefer defaults, implement a helper that inspects field.dataType and returns a filled value when row.isNullAt(index) is true.
🤖 Prompt for AI Agents
In spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala around lines 52-63, rows with null column values currently reach
setPrimitiveField and crash via encode(null); add a null-handling guard: before
calling row.get(index) check row.isNullAt(index) and either throw an
IllegalArgumentException that includes the field name and index (fail-fast) or,
if you prefer configurable defaults, call a helper that returns a default value
(spaces for alphanumeric, zeros for numerics) based on field.dataType and use
that instead; ensure the null check happens inside the sparkFieldPositions loop
so setPrimitiveField never receives null.
Fixed. Null values are allowed so an exception won't be thrown. Zero bytes will be left for these record fields by just skipping setting the values.
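A minimal sketch of that approach (illustrative, not the PR's exact code; names follow the review discussion above):

```scala
// Leave zero bytes (0x00) in place for null DataFrame values by skipping the field
// instead of encoding it; the byte array is zero-initialized by the JVM.
import org.apache.spark.sql.Row
import za.co.absa.cobrix.cobol.parser.Copybook
import za.co.absa.cobrix.cobol.parser.ast.Primitive

def encodeRow(row: Row,
              copybook: Copybook,
              fieldMappings: Seq[(Primitive, Int)],
              recordSize: Int): Array[Byte] = {
  val ar = new Array[Byte](recordSize)
  fieldMappings.foreach { case (cobolField, sparkIndex) =>
    if (!row.isNullAt(sparkIndex)) {
      copybook.setPrimitiveField(cobolField, ar, row.get(sparkIndex), 0)
    }
    // else: the field's bytes stay 0x00, matching how the reader treats nulls
  }
  ar
}
```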
@yruslan Perfect! Your solution aligns perfectly with the established Cobrix design philosophy. As noted in our knowledge base, the COBOL writer uses zero bytes (0x00) for null values across all field types to maintain consistency with the reader side and ensure round-trip fidelity.
By skipping the setPrimitiveField call for null values, you're leveraging the default zero-byte initialization of the byte array, which is exactly the intended behavior. This approach is much cleaner than explicitly setting values and maintains the design consistency throughout the library.
🧠 Learnings used
Learnt from: yruslan
PR: AbsaOSS/cobrix#775
File: spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala:0-0
Timestamp: 2025-08-19T09:12:31.838Z
Learning: In Cobrix COBOL writer, null values are encoded as zero bytes (0x00) for all field types, including PIC X fields, rather than using traditional COBOL padding (EBCDIC spaces 0x40 for PIC X). This design decision prioritizes consistency with the reader side, which already treats zero bytes as nulls, ensuring round-trip fidelity between reading and writing operations.
Actionable comments posted: 2
♻️ Duplicate comments (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (2)
Lines 81-108: SaveMode semantics implemented correctly (Overwrite, Append guard, ErrorIfExists, Ignore).

This addresses the earlier gap where ErrorIfExists/Ignore fell through. The behavior is now explicit and aligned with Spark expectations.

Lines 102-106: Good no-op handling for SaveMode.Ignore and fix for sqlContext capture.

Returning a lightweight BaseRelation and capturing outSqlContext avoids the earlier self-recursion pitfall.
🧹 Nitpick comments (3)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (2)
Lines 72-76: Verify writer-time sanity checks don't require reader-only fields.

Calling CobolParametersValidator.checkSanity(cobolParameters) in writer mode may inadvertently require sourcePaths (reader concern). If parse() doesn't map parameters("path") into sourcePaths, this will throw even though writer only needs copybook content/path.
- If safe today, consider future-proofing with a writer-specific validation that only checks copybook presence.

Apply this diff locally if you introduce a writer-specific sanity check:

```diff
-    CobolParametersValidator.checkSanity(cobolParameters)
+    CobolParametersValidator.checkWriterSanity(cobolParameters)
```

And add this helper (outside this file) to CobolParametersValidator:

```scala
def checkWriterSanity(params: CobolParameters): Unit = {
  if (params.copybookPath.isEmpty && params.copybookContent.isEmpty && params.multiCopybookPath.isEmpty) {
    throw new IllegalArgumentException("Either copybook_path, copybook_contents, or copybooks should be specified for writing.")
  }
}
```
Lines 123-126: Minor consistency nit: prefer val for sqlContext override.

You use val above (Line 103) and def here. Use val in both places for consistency and marginally less overhead.

```diff
-      new BaseRelation {
-        override def sqlContext: SQLContext = outSqlContext
-        override def schema: StructType = data.schema
-      }
+      new BaseRelation {
+        override val sqlContext: SQLContext = outSqlContext
+        override def schema: StructType = data.schema
+      }
```

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
Lines 166-187: Strengthen SaveMode.Ignore test to assert no-op on re-write.

Currently you only assert the path exists. Consider asserting no new part files were created or file contents are unchanged.

Example refinement:

```diff
-    df.write
+    // First write creates the output
+    df.write
       .format("cobol")
       .mode(SaveMode.Ignore)
       .option("copybook_contents", copybookContents)
       .save(path.toString)

-    df.write
+    // Capture state before second write
+    val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
+    val before = fs.listStatus(path).count(_.getPath.getName.startsWith("part-"))
+
+    // Second write should be a no-op
+    df.write
       .format("cobol")
       .mode(SaveMode.Ignore)
       .option("copybook_contents", copybookContents)
       .save(path.toString)

-    val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
-    assert(fs.exists(path), "Output directory should exist")
+    assert(fs.exists(path), "Output directory should exist")
+    val after = fs.listStatus(path).count(_.getPath.getName.startsWith("part-"))
+    assert(after == before, "SaveMode.Ignore should not create additional part files")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (3 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
🧰 Additional context used
🧬 Code Graph Analysis (2)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (9)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (3): cobrix (132-139), CobolSchema (52-300), CobolSchema (302-325)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2): CobolParametersParser (39-991), parse (225-299)
  - cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1): CobolParameters (67-109)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (2): schema (78-80), source (122-129)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/copybook/CopybookContentLoader.scala (2): CopybookContentLoader (27-60), load (29-42)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1): RawBinaryOutputFormat (41-58)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (2): RecordCombinerSelector (22-38), selectCombiner (34-36)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala (2): CobolParametersValidator (32-116), checkSanity (34-42)
  - spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1): combine (26-65)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
  - spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/fixtures/BinaryFileFixture.scala (1): withTempDirectory (71-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.5 on Scala 2.13.16
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.5.5 on Scala 2.12.20
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)
Lines 115-121: LGTM on Hadoop raw binary write path.

Mapping to (NullWritable, BytesWritable) and using RawBinaryOutputFormat is a clean, separator-free approach for COBOL data files.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1)
Lines 46-51: Honor selected code page + make encoder null-safe.

Two issues in the EBCDIC branch:
- The provided ebcdicCodePage argument is ignored; a fixed CodePageCommon mapping is used instead. This breaks configurable code page output.
- a.toString will NPE on nulls; for PIC X fields nulls should serialize as space-filled.
Proposed fix:
- Use the mapping from the provided code page.
- Treat null as an empty string.
```diff
   case EBCDIC =>
-    val encoder = (a: Any) => {
-      encodeEbcdicString(a.toString, CodePageCommon.asciiToEbcdicMapping, fieldLength)
-    }
+    val encoder = (a: Any) => {
+      val s = if (a == null) "" else a.toString
+      // Use the configured code page mapping rather than the default
+      val table = ebcdicCodePage.asciiToEbcdicMapping
+      encodeEbcdicString(s, table, fieldLength, asciiCharset)
+    }
     Option(encoder)
```

Note: This relies on CodePage exposing asciiToEbcdicMapping. If the API differs, wire the appropriate accessor here.
🧹 Nitpick comments (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)
Lines 259-263: Mutual exclusivity check is correct; refine error message and consider guarding for missing copybook source
- Throwing when both options are provided is the right call.
- Nit: make the message more actionable for users.
- Optional: also guard the case when none of copybook, copybooks, or copybook_contents is provided (if not enforced elsewhere).

Apply this message tweak:

```diff
-      throw new IllegalArgumentException(s"Options '$PARAM_COPYBOOK_PATH' and '$PARAM_MULTI_COPYBOOK_PATH' cannot be used together.")
+      throw new IllegalArgumentException(
+        s"Options '$PARAM_COPYBOOK_PATH' (single path) and '$PARAM_MULTI_COPYBOOK_PATH' (comma-separated list) are mutually exclusive. Use only one."
+      )
```

If you'd like, I can add a small guard right below this check to enforce that at least one copybook source is present:

```scala
// Place right after the mutual-exclusivity check
val hasCopybookSource = copybookPathOpt.isDefined ||
  copybookPaths.nonEmpty ||
  params.get(PARAM_COPYBOOK_CONTENTS).nonEmpty

if (!hasCopybookSource) {
  throw new IllegalArgumentException(
    s"One of '$PARAM_COPYBOOK_PATH', '$PARAM_MULTI_COPYBOOK_PATH', or '$PARAM_COPYBOOK_CONTENTS' must be provided."
  )
}
```

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (4)
Lines 39-44: Fix docstring: this returns an encoder, not a decoder.

Also clarify that the choice is based on the desired output encoding.

```diff
- /** Gets a decoder function for a string data type. Encoder is chosen depending on whether input encoding is EBCDIC or ASCII */
+ /** Gets an encoder function for a string data type. The encoder is chosen depending on whether the output encoding is EBCDIC or ASCII. */
```
Lines 51-55: Consider supporting ASCII output for symmetry.

Currently returns None for ASCII. If ASCII output is a valid target, implement an ASCII encoder (left-justified, 0x20-filled), analogous to EBCDIC.

```diff
-      case ASCII =>
-        None
+      case ASCII =>
+        val encoder = (a: Any) => {
+          val s = if (a == null) "" else a.toString
+          val buf = new Array[Byte](fieldLength)
+          java.util.Arrays.fill(buf, 0x20.toByte) // ASCII space
+          val bytes = s.getBytes(asciiCharset)
+          var i = 0
+          while (i < bytes.length && i < fieldLength) {
+            buf(i) = bytes(i)
+            i += 1
+          }
+          buf
+        }
+        Option(encoder)
```
Line 23: Narrow the import or use fully qualified Arrays.

Avoid importing the whole java.util package.

```diff
-import java.util
+import java.util.Arrays
```

And adjust the fill call accordingly:

```diff
-    util.Arrays.fill(buf, 0x40.toByte)
+    Arrays.fill(buf, 0x40.toByte)
```
Lines 66-81: Improve encodeEbcdicString to honor a configurable charset and compute the fill byte from the conversion table

To make string→EBCDIC encoding correct for non-ASCII characters and avoid hard-coding 0x40, update the method as follows:
- Add a charset: Charset = StandardCharsets.US_ASCII parameter (defaulting to US-ASCII for backward compatibility)
- Derive the EBCDIC “space” from the conversion table at ASCII 0x20 instead of using a literal
- Convert the input via string.getBytes(charset) and index with & 0xFF (no more (% 256))
- No changes to existing call sites are needed thanks to the default parameter

```diff
--- a/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
+++ b/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
@@ -63,8 +63,14 @@ object EncoderSelector {
-  def encodeEbcdicString(string: String, conversionTable: Array[Byte], length: Int): Array[Byte] = {
+  def encodeEbcdicString(string: String,
+                         conversionTable: Array[Byte],
+                         length: Int,
+                         charset: Charset = StandardCharsets.US_ASCII): Array[Byte] = {
     require(length >= 0, s"Field length cannot be negative, got $length")
-    val buf = new Array[Byte](length)
-    util.Arrays.fill(buf, 0x40.toByte)
+    val buf = new Array[Byte](length)
+    // PIC X fields are space-filled on the mainframe. Derive EBCDIC “space” from ASCII 0x20.
+    val ebcdicSpace: Byte = conversionTable(0x20)
+    java.util.Arrays.fill(buf, ebcdicSpace)
-    while (i < string.length && i < length) {
-      val asciiByte = string(i).toByte
-      buf(i) = conversionTable((asciiByte + 256) % 256)
+    val asciiBytes = string.getBytes(charset)
+    var i = 0
+    while (i < asciiBytes.length && i < length) {
+      buf(i) = conversionTable(asciiBytes(i) & 0xFF)
       i += 1
     }
     buf
```

• Only one existing call site needs no change – the default charset kicks in.
• CodePageCommon.asciiToEbcdicMapping already provides the required mapping array.
• This preserves existing behavior for ASCII data and correctly handles non-ASCII via the chosen Charset.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- README.md (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1 hunks)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (2 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala
- README.md
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
🧰 Additional context used
🧬 Code Graph Analysis (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/datatype/AlphaNumeric.scala (1)
AlphaNumeric
(28-36)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (1)
get
(47-50)cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)
CobolParameters
(67-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.5 on Scala 2.12.20
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.5.5 on Scala 2.13.16
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
265-267
: Constructor arguments aligned with the updated CobolParameters signature. The new
(copybookPathOpt, copybookPaths, ...)
ordering matches the case class definition and intent. Looks good.
249-257
: Review multi-copybook default behavior and consider deduplication
The parsing logic (
split
→trim
→filter(_.nonEmpty)
) is solid.Behavior change: when
PARAM_MULTI_COPYBOOK_PATH
is absent,multiCopybookPath
is nowSeq.empty
(notSeq("")
). Downstream checks like
if (params.copybookPath.isEmpty && params.copybookContent.isEmpty && params.multiCopybookPath.isEmpty)
in CobolParametersValidator.scala (line 39) will now treat “no paths provided” as empty and throw the “Either copybook path or copybook content must be specified” exception.
→ Verify this is intended or adjust the validator to handle the legacy case.Optional refactor: deduplicate user-provided paths to avoid accidental repeats:
paths.split(',') .iterator .map(_.trim) .filter(_.nonEmpty) - .toSeq + .distinct + .toSeqKey locations to review:
- cobol-parser/src/main/scala/…/CobolParametersParser.scala (multi-copybook parsing)
- spark-cobol/src/main/scala/…/CobolParametersValidator.scala:39 (empty-sequence check)
- spark-cobol/src/main/scala/…/CopybookContentLoader.scala:40 (mapping over
multiCopybookPath
)cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (2)
17-17
: Package relocation looks good. The move to the encoding package is consistent with the new writer abstractions. No issues spotted here.
72-74
: Space prefill is correct; ensure it stays code-page aware. Filling with EBCDIC space is aligned with PIC X semantics. After adopting the conversion table–derived space byte (see refactor above), the behavior will remain correct even for non-default EBCDIC code pages.
If you choose not to adopt the conversion-table approach, please confirm your supported EBCDIC code pages all use 0x40 for space.
Actionable comments posted: 2
🧹 Nitpick comments (3)
README.md (3)
1671-1677
: Improve the usage snippet for clarity (import SaveMode, be explicit about F format). Adding the import helps copy-pasteability, and being explicit about
record_format = F
reinforces the current limitation. Apply this diff:
```diff
+import org.apache.spark.sql.SaveMode
+
 df.write
   .format("cobol")
   .mode(SaveMode.Overwrite)
+  .option("record_format", "F")
   .option("copybook_contents", copybookContents)
   .save("/some/output/path")
```

Lines 1681-1688: Document schema/column requirements upfront

Readers will benefit from knowing how the writer maps DataFrame columns to copybook fields. Consider adding a bullet:

```diff
 ### Current Limitations
 The writer is still in its early stages and has several limitations:
+- DataFrame schema must be flat and 1:1 with the copybook: column names should match copybook field names and be provided in the same order.
```
1690-1691
: Clarify writer vs reader code page support to avoid confusion. The project supports many EBCDIC code pages for reading elsewhere in the README. Rephrase this bullet to explicitly scope the limitation to the writer.
```diff
-- Only the core EBCDIC encoder is supported; specific EBCDIC code pages are not yet available.
+- Writer currently supports only the core EBCDIC encoder ("common"). Specific CCSIDs and per-field code pages are supported for reading, but not yet for writing.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- README.md (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (3 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1 hunks)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala (2 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
🧰 Additional context used
🧬 Code Graph Analysis (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/Parameters.scala (1)
get
(47-50)cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)
CobolParameters
(67-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.5.5 on Scala 2.13.16
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 3.5.5 on Scala 2.12.20
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (3)
README.md (1)
1666-1670
: Good, clear introduction for the experimental writerThe positioning and scope are clear. Nice job calling out that this is experimental and writer-specific.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (2)
266-269
: Constructor wiring matches the new CobolParameters APIPassing
(copybookPathOpt, copybookPaths, ...)
aligns with the updated case class signature. No issues spotted here.
249-257
: Defaulting multiCopybookPath to empty Seq is safe
- Ripgrep searches for
.head
,apply(0)
, or length-based checks aroundmultiCopybookPath
returned no matches.- In CopybookContentLoader.scala (line 40), we only use
parameters.multiCopybookPath.map(...)
, which gracefully handles an empty Seq.- In CobolParametersValidator.scala (line 39), we explicitly check
multiCopybookPath.isEmpty
(alongside other params) to enforce that at least one copybook is provided.No downstream code assumes a non-empty placeholder, so changing the default from
Seq("")
toSeq.empty
has no adverse effects.
Closes #415

Summary by CodeRabbit: New Features, Bug Fixes, Documentation, Tests.