Prefer `FileIO` over the PyArrow `FileSystem` #2115

Fokko · 2025-06-18T10:15:42Z

Rationale for this change

This is problematic if you try to implement your own FileIO. Then Streams are opened both through the FileIO and the FileSystem directly.

Are these changes tested?

Yes, existing tests.

Are there any user-facing changes?

No, but I think this makes the code esthetically also more pleasing by removing complexity.

Numbers

A while ago I did some inspection of the calls being made to S3, so just to be sure that we don't alter anything, I've collected some stats using a small "benchmark" locally:

def test_fokko(session_catalog: RestCatalog):
    parquet_file = "/Users/fokko.driesprong/Downloads/yellow_tripdata_2024-01.parquet"

    from pyarrow import parquet as pq

    df = pq.read_table(parquet_file)

    try:
        session_catalog.drop_table("default.taxi")
    except Exception:
        pass

    tbl = session_catalog.create_table("default.taxi", schema=df.schema)

    with tbl.update_spec() as tx:
        tx.add_field("tpep_pickup_datetime", "hour")

    tbl.append(df)

    rounds = []
    for _ in range(22):
        start = round(time.time() * 1000)
        assert len(tbl.scan().to_arrow()) == 2964624
        stop = round(time.time() * 1000)
        rounds.append(stop - start)

    print(f"Took: {sum(rounds) / len(rounds)} ms on average")

Main:
Took: 1715.1818181818182 ms on average

> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  697.9  701µs     153µs     1.6ms     463µs     838µs     ↑159B ↓712K  ↑108K ↓485M  0       
s3.HeadObject               73 (27.7%)  661.6  192µs     107µs     735µs     177µs     719µs     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  335.4  8.2ms     1.9ms     17.5ms    8.2ms     17.5ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  335.4  6.2ms     2.1ms     14.2ms    6.1ms     14.1ms    ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  335.4  18.4ms    5.1ms     38.8ms    18.4ms    38.8ms    ↑1.4M        ↑469M        0       
s3.PutObject                3 (1.1%)    27.2   5.4ms     3.4ms     8.8ms     5.3ms     8.8ms     ↑2.8K        ↑75K         0

Branch:
Took: 1723.1818181818182 ms on average

> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  696.3  927µs     171µs     4.5ms     610µs     3.5ms     ↑159B ↓712K  ↑108K ↓484M  0       
s3.HeadObject               73 (27.7%)  660.1  222µs     109µs     1.2ms     205µs     1.2ms     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  334.6  4.4ms     1.2ms     14.2ms    4.4ms     14.2ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  334.6  4.3ms     1.2ms     15ms      4.3ms     15ms      ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  334.6  14.5ms    2.6ms     30.7ms    14.5ms    30.7ms    ↑1.4M        ↑468M        0       
s3.PutObject                3 (1.1%)    27.1   6.6ms     2.8ms     10.4ms    6.5ms     10.3ms    ↑2.8K        ↑75K         0

kevinjqliu

LGTM!

Prefer FileIO over the PyArrow FileSystem

3fddfa6

kevinjqliu approved these changes Jun 18, 2025

View reviewed changes

kevinjqliu merged commit f71806e into apache:main Jun 18, 2025
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prefer `FileIO` over the PyArrow `FileSystem` #2115

Prefer `FileIO` over the PyArrow `FileSystem` #2115

Fokko commented Jun 18, 2025 •

edited

Loading

Uh oh!

kevinjqliu left a comment

Uh oh!

Uh oh!

Uh oh!

Prefer FileIO over the PyArrow FileSystem #2115

Prefer FileIO over the PyArrow FileSystem #2115

Conversation

Fokko commented Jun 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

Numbers

Uh oh!

kevinjqliu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Prefer `FileIO` over the PyArrow `FileSystem` #2115

Prefer `FileIO` over the PyArrow `FileSystem` #2115

Fokko commented Jun 18, 2025 •

edited

Loading