BH Compliant Text Reader

Paul Rogers edited this page Jan 15, 2018 · 1 revision

Drill provides a variety of text readers: CSV, TSV, PSV and so on. As it turns out, they are all variations on the "compliant text reader". (It is compliant with RFC 4180.) To demonstrate the new scan framework and the result set loader, this project upgraded the compliant reader.

Text Format Plugin

The TextFormatPlugin extends the Easy format plugin to define the compliant text reader. Each specific reader (CSV, etc.) is defined by a specific set of plugin options for the compliant plugin.
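The idea that each text format is just a different option set over one parser can be sketched as follows. This is a minimal illustration, not Drill's code: the class names `TextOptionsSketch` and `LineSplitterSketch`, their fields, and the single-line `split()` method are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical option set; the names are illustrative, not Drill's.
final class TextOptionsSketch {
  final char delimiter;
  final char quote;

  TextOptionsSketch(char delimiter, char quote) {
    this.delimiter = delimiter;
    this.quote = quote;
  }

  static TextOptionsSketch csv() { return new TextOptionsSketch(',', '"'); }
  static TextOptionsSketch tsv() { return new TextOptionsSketch('\t', '"'); }
  static TextOptionsSketch psv() { return new TextOptionsSketch('|', '"'); }
}

// One parser serves every format; only the options differ.
final class LineSplitterSketch {
  static List<String> split(String line, TextOptionsSketch opts) {
    List<String> fields = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (inQuotes) {
        if (c == opts.quote) {
          // A doubled quote inside a quoted field is a literal quote (RFC 4180).
          if (i + 1 < line.length() && line.charAt(i + 1) == opts.quote) {
            field.append(opts.quote);
            i++;
          } else {
            inQuotes = false;
          }
        } else {
          field.append(c);
        }
      } else if (c == opts.quote) {
        inQuotes = true;
      } else if (c == opts.delimiter) {
        fields.add(field.toString());
        field.setLength(0);
      } else {
        field.append(c);
      }
    }
    // The last field has no trailing delimiter, so add it explicitly.
    fields.add(field.toString());
    return fields;
  }
}
```

For example, `LineSplitterSketch.split("a|b", TextOptionsSketch.psv())` and `LineSplitterSketch.split("a,b", TextOptionsSketch.csv())` run the same code path and differ only in the delimiter supplied by the options.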

Key changes included:

  • Use the new EasyFormatConfig to configure the plugin.
  • Implement the scanBatchCreator() method to create the required scan framework.
  • Remove methods associated with the prior text record reader class.

Scan Batch Creator

The scan framework is assembled using the new TextScanBatchCreator nested class. Primary tasks:

  • Create a columns-aware file scan framework.
  • Determine if file headers are to be provided by the reader.
  • Specify that the null type is VarChar. (Text readers can never produce nullable INT columns, so VarChar is a better guess. Missing values will be empty, consistent with the fact that text files don't support NULLs.)
  • For backward compatibility, specify to use the Drill 1.11 position for partitions. (This line allows existing QA tests to pass. Once the check is committed, this line should be removed and QA tests rebased accordingly.)

Reader Creator

As has been noted, the new scan framework creates readers as needed, rather than up front as in the legacy version. The text format plugin must provide a class that creates a reader on request. For simplicity, the text format plugin itself implements the FileReaderCreator interface and its makeBatchReader() method to create the actual batch reader.
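The "create readers as needed" pattern can be sketched in isolation. The interfaces and classes below (`BatchReaderSketch`, `ReaderFactorySketch`, `LazyScanSketch`, `PluginSketch`) are hypothetical stand-ins invented for this example; they do not reproduce the signatures of Drill's FileReaderCreator or makeBatchReader().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a batch reader: returns false when it has no more batches.
interface BatchReaderSketch {
  boolean next();
}

// Hypothetical stand-in for the creator interface: builds one reader on request.
interface ReaderFactorySketch {
  BatchReaderSketch makeReader(String fileName);
}

// Walks the file list and asks the factory for a reader only when the
// previous reader is exhausted.
final class LazyScanSketch {
  private final Iterator<String> files;
  private final ReaderFactorySketch factory;
  private BatchReaderSketch current;

  LazyScanSketch(List<String> files, ReaderFactorySketch factory) {
    this.files = files.iterator();
    this.factory = factory;
  }

  // Returns true if a batch was produced, false when all files are done.
  boolean nextBatch() {
    while (true) {
      if (current == null) {
        if (!files.hasNext()) {
          return false;
        }
        current = factory.makeReader(files.next());
      }
      if (current.next()) {
        return true;
      }
      current = null; // Exhausted: the next reader is created on the next pass.
    }
  }
}

// The "plugin" itself plays the factory role, mirroring the design choice above.
final class PluginSketch implements ReaderFactorySketch {
  final List<String> created = new ArrayList<>();

  @Override
  public BatchReaderSketch makeReader(String fileName) {
    created.add(fileName);
    // Each toy reader yields exactly one batch.
    return new BatchReaderSketch() {
      private boolean done;

      @Override
      public boolean next() {
        if (done) {
          return false;
        }
        done = true;
        return true;
      }
    };
  }
}
```

In this sketch no reader exists until the first call to `nextBatch()`, and the reader for the second file is created only after the first one reports that it has no more data.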

Compliant Text Batch Reader

The CompliantTextBatchReader class replaces the prior TextRecordReader class to do the work of reading a batch using the result set loader.

The changes to this class were pretty straightforward:

  • Rip out the code that implemented direct memory access to write to vectors.
  • Replace the code with calls to the result set loader.
  • Change the code to read records until the result set loader reports that it is full (rather than reading a fixed number of records).
  • Clean up some error handling.
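The switch from "read a fixed number of records" to "read until the loader says it is full" can be sketched as below. `RowLoaderSketch` is a hypothetical, much simplified loader that counts characters against a budget; it is not the result set loader's API, and the real loader tracks vector memory rather than string lengths.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical loader: accepts rows until a size budget is used up.
final class RowLoaderSketch {
  private final int sizeLimit;
  private int sizeUsed;
  private final List<String> rows = new ArrayList<>();

  RowLoaderSketch(int sizeLimit) {
    this.sizeLimit = sizeLimit;
  }

  boolean isFull() {
    return sizeUsed >= sizeLimit;
  }

  void writeRow(String row) {
    rows.add(row);
    sizeUsed += row.length();
  }

  // Hands back the finished batch and resets for the next one.
  List<String> harvest() {
    List<String> batch = new ArrayList<>(rows);
    rows.clear();
    sizeUsed = 0;
    return batch;
  }
}

final class ReadLoopSketch {
  // Fills one batch: stops when the loader is full or the input ends.
  static List<String> readBatch(Iterator<String> input, RowLoaderSketch loader) {
    while (!loader.isFull() && input.hasNext()) {
      loader.writeRow(input.next());
    }
    return loader.harvest();
  }
}
```

With a limit of 6 and the input rows `aaaa`, `bb`, `cccccc`, `d`, successive calls to `readBatch()` produce batches of two rows, one row, and one row: the row count per batch follows the data size rather than a fixed record count.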

The FieldVarCharOutput class handles the case in which the file provides headers. It was modified to use the result set loader to write each column.

The RepeatedVarCharOutput class handles the case of using the columns[] array, by writing to a VarChar array using the result set loader.
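The difference between the two output modes can be sketched with plain collections. `OutputModesSketch` and its methods are hypothetical and only model the shape of the resulting row; they do not show how FieldVarCharOutput or RepeatedVarCharOutput write to vectors. Dropping fields beyond the header width is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OutputModesSketch {
  // Header mode: each field lands in the column named by the header row.
  static Map<String, String> toNamedColumns(List<String> header, List<String> fields) {
    Map<String, String> row = new LinkedHashMap<>();
    for (int i = 0; i < header.size(); i++) {
      // Missing trailing fields become empty strings, not nulls.
      row.put(header.get(i), i < fields.size() ? fields.get(i) : "");
    }
    return row;
  }

  // columns[] mode: all fields go into one array-valued column.
  static Map<String, List<String>> toColumnsArray(List<String> fields) {
    Map<String, List<String>> row = new LinkedHashMap<>();
    row.put("columns", new ArrayList<>(fields));
    return row;
  }
}
```

The same parsed fields therefore yield either several VarChar-like columns (one per header name) or a single array column named `columns`, depending on whether headers are in use.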

Status

Frankly, the revised implementation seems to work fine. A prior version of the code (before adding the JSON reader) passed all the Drill unit tests and the MapR pre-commit tests. This is one part of the project that can be considered done.
