Consider renaming "Arrow" case? #229

jorgecarleitao · 2021-07-01T06:07:24Z

Arrow is a specification for an in-memory format. Both Polars and cuDF use Arrow as their in-memory format.

The current "Arrow" case is in fact mostly dplyr, since the large-scale operations such as group by are done by that R package.

Would it make sense to rename the "Arrow" case to "dplyr-arrow" or something?

jangorecki · 2021-07-01T07:41:39Z

Thank you for the suggestion. It is arrow as it is now. It will automatically use more of arrow over time, without any adjustments (assuming that the arrow API will not be changing). People were asking for arrow to be added, therefore it has been added as it is now.

jorgecarleitao · 2021-07-01T09:12:56Z

Thank you for your quick reply.

I do not really understand the argument: arrow is not a query engine, and it does not implement a "group by" or a "join": a group by is an operation, not a format.

In my opinion it is misleading to write "Arrow", as it gives the impression that the Arrow format is very slow, when Arrow is not even something that "runs"; it is like saying that "parquet" or "ORC" are slow.

Making the case use more or less arrow automatically does not make it more or less Arrow; query engines and formats are fundamentally different notions.

Disambiguating which specific query engine implementation is being benchmarked helps users decide what technology to adopt, and portraying Arrow as a query engine only adds confusion to this decision-making process.

I would consider reopening this issue, at the very least to offer some time for other members of the community to weigh in.

jangorecki · 2021-07-01T17:24:21Z

Arrow will implement groupby and join in the future. Also note there is an explanation at the bottom of the report page mentioning that. Arrow is a query engine (or will be); it uses the feather format.

lorentzenchr · 2021-07-06T10:44:46Z

If it in fact uses dplyr, why not call it "dplyr" then? Maybe "dplyr on arrow table" and the other one "dplyr on tibble"?

jangorecki · 2021-07-06T10:58:15Z

Because it is arrow as it is now. And it will automatically use more of the arrow engine over time, once that is ready. The fallback to dplyr is built into the arrow package; I did not set any of this up. Note that the fallback doesn't yet work for join.

jangorecki · 2021-07-06T11:00:33Z

The reason for adding arrow was not to show that it is as slow as dplyr but to address requests from the community. People wanted to know where arrow is now. Therefore it has been added as it is now (but it will automatically use the arrow engine more, as the fallback happens inside arrow).

jorgecarleitao · 2021-07-06T11:08:43Z

@jangorecki, can you describe what the "arrow engine" is? Where is its source code, and what are its capabilities?

lorentzenchr · 2021-07-06T11:25:01Z

I'm a user not involved in either dplyr or arrow, and the current state of the names is confusing to me. Maybe we should trust @jorgecarleitao with this, as he is a PMC member of the Apache Arrow project.

As a user, I also enjoy those benchmarks very much!

jorisvandenbossche · 2021-07-06T11:43:02Z

> People wanted to know where arrow is now. Therefore it has been added as it is now

The main issue is that there is no such thing as "arrow" or "arrow engine" when it comes to benchmarking libraries. There is a single Arrow specification, but then there are many implementations of this specification, with varying degrees of scope and performance.

The dplyr interface provided by the R arrow package is one such implementation, but e.g. Polars or DataFusion are just as much "arrow" while using an entirely different implementation that doesn't share any code with the R arrow package.

So I think renaming the current dplyr benchmark from "arrow" to the suggested "dplyr-arrow" (or "arrow-dplyr") makes sense.

thatcort · 2022-05-13T01:11:23Z

I think it's great that Arrow is included, but please include other implementations. For example, it would be great to compare DataFusion to dplyr on Arrow, and similarly compare Ballista to Spark.

wjones127 · 2022-06-21T17:11:24Z

The Arrow C++ engine (that supports R arrow's dplyr functionality) has now been named Acero to differentiate it from other Arrow-based engines. We can rename the benchmark to that.

jangorecki · 2022-06-22T04:48:06Z

@wjones127 thank you for that info. I agree about renaming. Unfortunately I am not the maintainer anymore, and you have to contact h2o support about any changes in this project.

jangorecki closed this as completed Jul 1, 2021

jangorecki reopened this Jul 1, 2021

ghuls mentioned this issue Nov 29, 2021

pyarrow supports groupby operations now. #237

Open

eitsupi mentioned this issue Nov 22, 2023

Update name for Arrow R package duckdblabs/db-benchmark#66

Closed
