Skip to content

Prefer FileIO over the PyArrow FileSystem #2115

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 18, 2025

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Jun 18, 2025

Rationale for this change

This is problematic if you try to implement your own FileIO. Then Streams are opened both through the FileIO and the FileSystem directly.

Are these changes tested?

Yes, existing tests.

Are there any user-facing changes?

No, but I think this makes the code esthetically also more pleasing by removing complexity.

Numbers

A while ago I did some inspection of the calls being made to S3, so just to be sure that we don't alter anything, I've collected some stats using a small "benchmark" locally:

def test_fokko(session_catalog: RestCatalog):
    parquet_file = "/Users/fokko.driesprong/Downloads/yellow_tripdata_2024-01.parquet"

    from pyarrow import parquet as pq

    df = pq.read_table(parquet_file)

    try:
        session_catalog.drop_table("default.taxi")
    except Exception:
        pass

    tbl = session_catalog.create_table("default.taxi", schema=df.schema)

    with tbl.update_spec() as tx:
        tx.add_field("tpep_pickup_datetime", "hour")

    tbl.append(df)

    rounds = []
    for _ in range(22):
        start = round(time.time() * 1000)
        assert len(tbl.scan().to_arrow()) == 2964624
        stop = round(time.time() * 1000)
        rounds.append(stop - start)

    print(f"Took: {sum(rounds) / len(rounds)} ms on average")

Main:
Took: 1715.1818181818182 ms on average

> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  697.9  701µs     153µs     1.6ms     463µs     838µs     ↑159B ↓712K  ↑108K ↓485M  0       
s3.HeadObject               73 (27.7%)  661.6  192µs     107µs     735µs     177µs     719µs     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  335.4  8.2ms     1.9ms     17.5ms    8.2ms     17.5ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  335.4  6.2ms     2.1ms     14.2ms    6.1ms     14.1ms    ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  335.4  18.4ms    5.1ms     38.8ms    18.4ms    38.8ms    ↑1.4M        ↑469M        0       
s3.PutObject                3 (1.1%)    27.2   5.4ms     3.4ms     8.8ms     5.3ms     8.8ms     ↑2.8K        ↑75K         0  

Branch:
Took: 1723.1818181818182 ms on average

> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  696.3  927µs     171µs     4.5ms     610µs     3.5ms     ↑159B ↓712K  ↑108K ↓484M  0       
s3.HeadObject               73 (27.7%)  660.1  222µs     109µs     1.2ms     205µs     1.2ms     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  334.6  4.4ms     1.2ms     14.2ms    4.4ms     14.2ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  334.6  4.3ms     1.2ms     15ms      4.3ms     15ms      ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  334.6  14.5ms    2.6ms     30.7ms    14.5ms    30.7ms    ↑1.4M        ↑468M        0       
s3.PutObject                3 (1.1%)    27.1   6.6ms     2.8ms     10.4ms    6.5ms     10.3ms    ↑2.8K        ↑75K         0  

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kevinjqliu kevinjqliu merged commit f71806e into apache:main Jun 18, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants