Skip to content

Add cargo-fuzz / OSS-Fuzz coverage for DataFusion's parser and analyzer #21998

@youichi-uda

Description

@youichi-uda

Add cargo-fuzz / OSS-Fuzz coverage for DataFusion's parser and analyzer

Background

DataFusion already has excellent property-based fuzzers for runtime correctness — sort_fuzz, aggregation_fuzzer, join_fuzz, and friends in datafusion/core/tests/fuzz_cases/. They generate random queries against valid data and cross-check results, which is exactly the right tool for catching aggregation/sort/join semantic bugs.

What we don't yet have is byte-level (libFuzzer / cargo-fuzz) coverage. That's a different attack surface: feeding arbitrary bytes into entry points like DFParser::parse_sql exercises the SQL tokenizer/parser, the dialect-extension code, and any error-recovery paths the parser walks while building the AST. These are the kinds of bugs that show up as panics or pathological allocations rather than wrong results.

This is the same delta arrow-rs#5332 is closing on the format-parser side; DataFusion benefits from byte-level fuzzing for its own reasons.

Concrete proposal

  1. Add per-crate fuzz/ directories with cargo-fuzz harnesses, starting with the smallest blast radius and growing outward as we land. Two are already prototyped on my fork at fuzz/initial-harnesses:

    • datafusion/sql/fuzz/parse_sqlDFParser::parse_sql over UTF-8-validated &[u8]. 22 s sanity run reaches 3,520 edges / 12,788 features / 2,342 corpus entries from an empty start, RSS peak 117 MiB, no crashes.
    • datafusion/sql/fuzz/parse_exprDFParser::parse_sql_into_expr (different production set). 22 s sanity reaches 3,980 edges / 14,500 features / 2,492 corpus entries at 9,090 runs/s, no crashes.

    Then in subsequent PRs, depending on what shakes out: a LogicalPlanBuilder harness in datafusion/expr, a SessionContext::sql end-to-end harness, and possibly per-format datasource readers (csv/json/parquet/arrow/avro), though those overlap with the Fuzz tests for Arrow/Parquet arrow-rs#5332 work and we'd want to dedupe.

  2. Wire ClusterFuzzLite into GitHub Actions so PR-time fuzz + nightly batch + corpus pruning run automatically. Catches regressions before they merge instead of relying on me to remember to run cargo-fuzz locally.

  3. Open the OSS-Fuzz integration PR (projects/datafusion/) once the upstream fuzz/ lands and the harnesses are stable. Mirrors the layout I'm proposing for arrow-rs in google/oss-fuzz. Coverage target ≥80% on harness-touched modules.

Why this is worth doing

  • DataFusion is widely used as a SQL engine inside other projects (InfluxDB, Ballista, Coralogix, DataFusion Comet, ...). A panic or pathological allocation in DFParser::parse_sql can DoS any service that exposes a SQL endpoint to untrusted input.
  • The SQL parser walks an attacker-controlled token stream through DataFusion's extensions on top of sqlparser-rs. The sqlparser crate itself does not appear to be fuzzed in OSS-Fuzz, so DataFusion sits behind two unfuzzed parser layers right now.
  • Coverage cost is modest: cargo-fuzz harnesses are typically tens of lines each, and the OSS-Fuzz/ClusterFuzzLite plumbing is a one-time setup.

Status

I'm prototyping this in my fork at masumi-ryugo/datafusion@fuzz/initial-harnesses and have a working parse_sql harness building under nightly cargo-fuzz. Happy to send the upstream PR once the layout question is settled.

Two questions for the maintainers (cc @alamb @andygrove @comphead @ozankabak):

  1. Layout preference: per-crate fuzz/ dirs (gitoxide-style — independent buildable units, easy to add per-crate seed corpora) vs. a single top-level fuzz/ workspace? I'd lean per-crate but happy to do whatever you prefer.

  2. primary_contact / auto_ccs for OSS-Fuzz project.yaml: OSS-Fuzz wants Google-account emails on file for crash notifications. Who from the PMC would want to be on that list? I can default to a single contact if a PMC alias isn't preferred.

xref apache/arrow-rs#5332 (sibling work for arrow-rs side, same author, same OSS-Fuzz integration pattern).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions