Is your feature request related to a problem or challenge?
DataFusion currently uses arrow-json's `LineDelimitedReader`, which is optimized for NDJSON (newline-delimited JSON). When we encounter data sources that provide a single JSON array (e.g., `[{...}, {...}]`), parsing fails.
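For concreteness, the two shapes differ like this (illustrative records):

```
NDJSON (what the line-delimited reader handles today):
{"a": 1}
{"a": 2}

JSON array (what this issue proposes to support):
[{"a": 1}, {"a": 2}]
```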
Describe the solution you'd like
Add a `format_array` option to `JsonOptions` to support reading JSON array format:

```sql
CREATE EXTERNAL TABLE my_table
STORED AS JSON
OPTIONS ('format.format_array' 'true')
LOCATION 'path/to/array.json';
```
- Backward compatible: existing code continues to work unchanged (the default remains line-delimited)
- Explicit control: users specify which format their data uses
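A minimal sketch of what the proposed option could look like on the Rust side. The struct and field names here are assumptions mirroring the proposal, not the actual DataFusion `JsonOptions` definition:

```rust
/// Hypothetical sketch of the proposed option (field name is an
/// assumption from this issue, not the real DataFusion API).
#[derive(Debug, Clone, Default)]
struct JsonOptions {
    /// When true, the input is a single JSON array rather than NDJSON.
    format_array: bool,
    // ... other existing options elided
}

fn main() {
    // Defaults stay line-delimited, preserving backward compatibility.
    let default_opts = JsonOptions::default();
    assert!(!default_opts.format_array);

    // Users opt in to array format explicitly.
    let array_opts = JsonOptions { format_array: true, ..Default::default() };
    println!("{array_opts:?}");
}
```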
Implementation approach
Since arrow-json's `ReaderBuilder` only supports line-delimited JSON directly, the implementation:
- Parses the JSON array `[{...}, {...}]` with `serde_json`
- Converts it to NDJSON format for arrow-json's `ReaderBuilder` to process
Note: JSON array format does not support range-based file scanning (`repartition_file_scans`), since the entire array must be read to parse correctly.
Describe alternatives you've considered
- Auto-detection of format: Rejected due to potential errors with large files and added complexity
- Waiting for arrow-json native support: No timeline for this feature upstream
Additional context