Dataset size #145

Open
bkmgit opened this issue Dec 9, 2022 · 8 comments

Comments

@bkmgit
Contributor

bkmgit commented Dec 9, 2022

It may be helpful to indicate the size of datasets that can be used with RedAmber and which operations will be supported.
For a comparison with other dataframes, see Table 3 in Towards Scalable Dataframe Systems and
https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray

@heronshoes
Contributor

I think it is too early for RedAmber to answer this question fully. However, it is important for users to know what scale of data it can handle and what features it has compared to other data frames. Thanks!

1. data size

RedAmber is an in-memory, single-threaded, non-streaming, eager-execution data frame written in Ruby (a dynamic language).
It does not compare favorably with data frames that focus on scalability and execution speed.
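
To make the eager versus lazy distinction concrete, here is a minimal sketch in plain Ruby. It uses only the standard `Enumerator::Lazy` and does not touch RedAmber or Arrow; the method name `first_squares` is made up for illustration.

```ruby
# Plain Ruby only; this does not use RedAmber or Arrow.

# Returns [first three squares, number of times the block body ran].
def first_squares(range, lazy:)
  calls = 0
  source = lazy ? range.lazy : range
  result = source.map { |x| calls += 1; x * x }.first(3)
  [result, calls]
end

p first_squares(1..1_000, lazy: false) # eager: squares every element first
p first_squares(1..1_000, lazy: true)  # lazy: squares only what first(3) pulls
```

The eager call does the work for all 1,000 elements before taking three of them, while the lazy call does the work only for the elements it actually returns. A lazy data frame would apply the same idea to whole query plans.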

Still, I am trying to find out how much data it can handle using https://github.com/h2oai/db-benchmark . Please let me know if you know a better dataset for checking scalability. (The benchmark is written in R and is not convenient to use.)

2. possible operations

The references you gave me are helpful. I would like to make a comparison chart.
At this point, I can easily come up with the following:

  • Lazy execution: possible in the future since Arrow has a mechanism (Acero).
  • Parallel execution: Next step after establishing a basic API.
    Grouping is a good match for parallel execution and Ruby's iterators, so I would like to work on it first.
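
As a rough sketch of why grouping suits both iterators and parallel execution: the groups are independent, so each one can be aggregated on its own. The example below is plain Ruby with a hypothetical `group_sum` helper and a hypothetical row layout; it is not RedAmber's `group` API. On CRuby the global VM lock keeps these threads from running Ruby code in parallel, so this only shows the shape of the approach.

```ruby
# Plain Ruby only; the row layout and method name are hypothetical.

# Split rows by `key`, sum `column` within each group on its own thread,
# then combine the per-group results into one hash.
def group_sum(rows, key:, column:)
  groups = rows.group_by { |row| row[key] }
  threads = groups.map do |group_key, members|
    Thread.new { [group_key, members.sum { |row| row[column] }] }
  end
  threads.map(&:value).to_h # Thread#value waits for the thread and returns its result
end

rows = [
  { species: "Adelie", mass: 3750 },
  { species: "Gentoo", mass: 5000 },
  { species: "Adelie", mass: 3800 },
]
p group_sum(rows, key: :species, column: :mass)
```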

By the way, I think the data frame library that RedAmber should be most compared to is Polars. What do you think?

@bkmgit
Contributor Author

bkmgit commented Dec 9, 2022

Polars seems to use threads. A comparison chart would be helpful. Perhaps it could indicate the features you wish to add, and possibly compare with other data frame implementations. Arrow has Flight https://github.com/apache/arrow/tree/master/ruby/red-arrow-flight and UCX can run on distributed memory, so larger datasets might be possible.

@bkmgit
Contributor Author

bkmgit commented Dec 13, 2022

RedAmber could be added to the db-benchmark (h2oai/db-benchmark#250), and then we can look for larger datasets.

@heronshoes
Contributor

Comparing features between RedAmber, dplyr/tidyr and pandas

This is a comparison of basic features between RedAmber and other major DataFrame libraries. It compares only the method 'verbs' and ignores parameters and options.

Remarks:

  1. dataframe represents 2D data containers such as DataFrame, tibble or Table.
  2. vector represents 1D data containers such as Vector, Series or Column.

Comments or suggestions are welcome!

Select columns (variables)

| Features | RedAmber | tidyverse | pandas |
|---|---|---|---|
| Select columns as a dataframe | pick, drop, [] | dplyr::select, dplyr::select_if | [], loc[], iloc[], drop, select_dtypes |
| Select a column as a vector | [], v | dplyr::pull | [], loc[], iloc[] |
| Move columns to a new position | pick, [] | relocate | [], reindex, loc[], iloc[] |

Select rows (records, observations)

| Features | RedAmber | tidyverse | pandas |
|---|---|---|---|
| Select rows that meet logical criteria as a dataframe | slice, remove, [] | dplyr::filter | [], filter, query, loc[] |
| Select rows by position as a dataframe | slice, remove, [] | dplyr::slice | iloc[], drop |
| Move rows to a new position | slice, [] | dplyr::filter, dplyr::slice | reindex, loc[], iloc[] |

Update columns / create new columns

| Features | RedAmber | tidyverse | pandas |
|---|---|---|---|
| Update existing columns | assign | dplyr::mutate | assign, []= |
| Create new columns | assign, assign_left | dplyr::mutate | apply |
| Compute new columns, drop others | new | transmute | (dfply:)transmute |
| Rename columns | rename | dplyr::rename, dplyr::rename_with, purrr::set_names | rename, set_axis |
| Sort dataframe | sort | dplyr::arrange | sort_values |

Reshape dataframe

| Features | RedAmber | tidyverse | pandas |
|---|---|---|---|
| Gather columns into rows (create a longer dataframe) | to_long | tidyr::pivot_longer | melt |
| Spread rows into columns (create a wider dataframe) | to_wide | tidyr::pivot_wider | pivot |
| Transpose a wide dataframe | transpose | transpose, t | transpose, T |
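
For readers unfamiliar with the long/wide terminology, the two reshape directions can be sketched in plain Ruby on arrays of hashes. The helpers below share their names with RedAmber's `to_long` and `to_wide` only for readability; they are hypothetical stand-ins, and their signatures and output column names (`:name`, `:value`) are assumptions.

```ruby
# Plain Ruby only; `to_long` and `to_wide` here are hypothetical helpers,
# not RedAmber's methods of the same name.

# Gather every column except `id` into (id, name, value) rows.
def to_long(rows, id:)
  rows.flat_map do |row|
    row.reject { |k, _| k == id }.map { |k, v| { id => row[id], name: k, value: v } }
  end
end

# Spread (id, name, value) rows back into one row per id.
def to_wide(long_rows, id:)
  long_rows.group_by { |row| row[id] }.map do |id_value, members|
    members.each_with_object({ id => id_value }) { |m, wide| wide[m[:name]] = m[:value] }
  end
end

wide = [{ year: 2021, a: 1, b: 2 }, { year: 2022, a: 3, b: 4 }]
long = to_long(wide, id: :year)
p long
p to_wide(long, id: :year) == wide # round trip restores the wide rows
```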

Grouping

| Features | RedAmber | tidyverse | pandas |
|---|---|---|---|
| Grouping | group, group.summarize | dplyr::group_by %>% dplyr::summarise | groupby.agg |

Combine dataframes or tables

| Features | RedAmber | tidyverse | pandas |
|---|---|---|---|
| Combine additional columns | merge, bind_cols | dplyr::bind_cols | concat |
| Combine additional rows | concatenate, concat, bind_rows | dplyr::bind_rows | concat |
| Inner join | join, inner_join | dplyr::inner_join | merge |
| Full join | join, full_join, outer_join | dplyr::full_join | merge |
| Left join | join, left_join | dplyr::left_join | merge |
| Right join | join, right_join | dplyr::right_join | merge |
| Semi join | join, semi_join | dplyr::semi_join | [isin] |
| Anti join | join, anti_join | dplyr::anti_join | [isin] |
| Collect rows that appear in x or y | union | dplyr::union | merge |
| Collect rows that appear in both x and y | intersect | dplyr::intersect | merge |
| Collect rows that appear in x but not y | difference, setdiff | dplyr::setdiff | merge |
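
Semi and anti joins are the least familiar rows here, and the bracketed `isin` in the pandas column hints at what they are: filters on key membership rather than true merges. A minimal plain-Ruby sketch follows, with hypothetical helper names and row layout; it is not RedAmber's `join` implementation.

```ruby
# Plain Ruby only; helper names and row layout are hypothetical.
require "set"

# Semi join: rows of `left` whose key appears in `right`.
def semi_join(left, right, key:)
  keys = right.map { |row| row[key] }.to_set
  left.select { |row| keys.include?(row[key]) }
end

# Anti join: rows of `left` whose key does not appear in `right`.
def anti_join(left, right, key:)
  keys = right.map { |row| row[key] }.to_set
  left.reject { |row| keys.include?(row[key]) }
end

x = [{ id: 1, v: "a" }, { id: 2, v: "b" }, { id: 3, v: "c" }]
y = [{ id: 2, w: true }, { id: 3, w: false }, { id: 4, w: true }]
p semi_join(x, y, key: :id)
p anti_join(x, y, key: :id)
```

Neither result gains columns from `y`, which is the difference from an inner join.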

@bkmgit
Contributor Author

bkmgit commented Dec 21, 2022

This is helpful. Thanks. You may also want to compare with Julia, where such a comparison is part of the documentation.

@bkmgit
Contributor Author

bkmgit commented Dec 21, 2022

I can create a pull request with this if it is of interest.

@heronshoes
Contributor

Yes. It would be nice to have this as part of the documentation in the source tree. I am happy to accept requests for modifications.

@bkmgit
Contributor Author

bkmgit commented Feb 22, 2023
