New user feedback from an R veteran #295

dgkf opened this issue Jan 18, 2020 · 1 comment
dgkf commented Jan 18, 2020

Following a discussion with @davidanthoff in the Julia Slack, who referred me back to Query.jl during a conversation about some changes to DataFramesMeta.jl, I offered to give some new-user feedback.

Background

I've got ~3 years of fairly intensive R development experience. R is the preferred language at my workplace, and in using it there I've developed a deep understanding of the tidyverse packages; many of my software opinions come from digging through the internals of those packages, so my feedback reflects a heavily dplyr-centric view of data handling. I have about three months of Julia experience, during which I went on a DataFramesMeta.jl refactoring bender. I'm still learning about the Julia ecosystem every day, and there are certainly large gaps in my ecosystem and technical knowledge. These are my initial impressions of Query.jl, specifically the "standalone query commands."

Query.jl Getting Started "Standalone Query Commands"

First glance syntax impressions

  • I like the native |> usage as it makes me feel like I can interweave these macros with other packages or my own lambdas quite easily.
  • The _ feels weird to me. I'm aware of Lazy.jl and its @_ macro, but in the context of a DataFrame it feels clunky to prefix column names with it. On the other hand, it's cool to have direct access to (I assume) the whole DataFrame (or Row?). edit: I realized later on that this isn't coming from Lazy.jl, but is reimplemented, I think to allow for __ usage. The duplication here gives me some mild code smell vibes.
  • I like the agnostic approach to data, but I'm skeptical that it can be both a rich syntax for operating on tabular data while also remaining agnostic.
  • Having to "collect" the query result back into a DataFrame at the tail end of a pipeline feels a bit weird. It would be nice if it defaulted to being endomorphic when printing to the console, perhaps only coercing the first n elements so as not to fail on large data.
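For context, here's a minimal sketch of the collect-at-the-end pattern I mean, assuming Query.jl's standalone macros behave as in its Getting Started guide:

```julia
using Query, DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.])

# The query itself is lazy; nothing is materialized yet.
q = df |> @filter(_.age > 30) |> @map({_.name, _.age})

# Materializing back into a DataFrame is an explicit final step.
result = q |> DataFrame
```

My wish above is essentially that `q` printed to the console could look like `result` without the explicit final step.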

Operators

@map
  • The { ... } feels very uncomfortable to me. It somewhat erodes my trust in idiomatic Julia syntax. It took me quite a while to hunt down this expansion in helper_namedtuples_replacement, and I think I get what's going on now. As far as I can tell this is done to avoid dispatching on a multi-argument function call. My gut feeling is that there must be a way to get around this.
  • The @groupby(...) |> @map(...) example had me confused for a bit since it uses mean(_.b) to somehow calculate a mean across multiple rows, yet I could only access elements rowwise otherwise. I'm still getting my bearings here, but I had to really reevaluate my assumptions to digest this one.
  • I haven't figured out how to do columnwise operations. For example, doing (lag(a) .+ a .+ lead(a)) ./ 3 (a crude running average). The closest I've come is by "grouping" everything and doing grouped operations, though that results in a single row of arrays.
    df |> @groupby(1) |> @map({a = _.b .* _.a})
  • The macro expansion seems to only make an exception for anonymous functions (expr.head == :->), but doesn't accept unary function objects. Accepting those would be really nice when you have a complicated function that you don't want to write out inside a data processing step, or that you want to reuse.
    @map(df, x -> x)  # returns dataframe
    @map(df, identity)  # returns array of "identity"
    
  • A bit pedantic, but the documentation for each verb says that it takes an anonymous function, when the expression is not (at least before macro expansion) a function; e.g. _^2 is not itself a function. I know it gets expanded out to an anonymous function, but it might be nice to acknowledge that these _-style lambdas are also accepted.
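To illustrate that last point, here is my understanding of the expansion (a sketch of observed behavior, not the package's documented contract):

```julia
using Query

# These two should be equivalent after macro expansion:
[1, 2, 3] |> @map(_^2) |> collect       # _-style lambda
[1, 2, 3] |> @map(x -> x^2) |> collect  # explicit anonymous function
# both should yield [1, 4, 9]
```

So `_^2` is sugar that the macro rewrites into `x -> x^2` before anything runs, which is why the docs' "anonymous function" wording is technically a step removed from what you type.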
@filter
  • @filter feels quite intuitive. This is a place where the rowwise behavior really shines and the filtering operations read really nicely.
  • Just trying to push the limits here, if I wanted to filter on rows where any Number columns are >40
    df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
    df |> @filter(any(v > 40 for v=_ if typeof(v) <: Real)) |> DataFrame
    This feels a bit clunky, but it seems like it's definitely not the intended use case. It's good to know it's at least possible.
@groupby
  • I tried to learn more about the element_selector using ?@groupby but documentation was minimal.
  • I errored when I passed it a lambda function as the element_selector, but I was able to evaluate an expression with a _. This functionality seems to introduce a lot of complexity to "groupby" while being functionally equivalent to the more readable @groupby(...) |> @map(...).
@orderby_* and @thenby_*
  • These feel verbose, but also I really like how readable they are once composed
  • My instinct from a dplyr perspective is that these must be expressible more succinctly, but the current form is certainly clear.
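For reference, the composed form I'm describing looks roughly like this (column names invented for illustration; this assumes the @orderby/@thenby_descending macros work as documented):

```julia
using Query, DataFrames

df = DataFrame(a=[2, 1, 2], b=[3, 2, 1])

# Verbose, but reads top-to-bottom like a sentence:
df |>
    @orderby(_.a) |>
    @thenby_descending(_.b) |>
    DataFrame
```

Compare dplyr's `arrange(df, a, desc(b))`, which packs the same intent into one call.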
@groupjoin, @join
  • The join operations' use of __ makes it feel like the new syntax is getting a bit heavy.
  • Creating the new columns inside of a join also feels like the function scope is a bit too big. My preference would again be to break out a map call.
  • The outer_selector and inner_selector language is a bit confusing as "outer" and "inner" are typically used to describe the overlap of the join, not the source dataset.
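To make the naming complaint concrete, here is my reading of the @join signature (data invented for illustration; if I've misread the argument order, that rather proves the point):

```julia
using Query, DataFrames

people = DataFrame(id=[1, 2], name=["John", "Sally"])
pets   = DataFrame(owner=[1, 2], pet=["Rex", "Whiskers"])

# _ refers to the "outer" (left) source and __ to the "inner"
# (right) source, which collides with SQL's inner/outer join terms.
people |>
    @join(pets, _.id, _.owner, {_.name, __.pet}) |>
    DataFrame
```

Something like left_selector/right_selector would avoid the collision with the SQL sense of inner and outer.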
@mapmany
  • I think I'd need more experience querying other data types to weigh on this one, but at first glance it seems useful for the Dict example. I'm sure with heavily nested data structures - perhaps something read out of a .json file or something like that - this would be really useful.
@take and @drop
  • Love me some good functional basics. Glad to see these bases are covered for a lazy collection.
@unique
  • Another staple, especially for filtering unique rows.
@select
  • This one feels very dplyry to me (and that's a good thing - I think it has some fantastic select syntax).
  • I like the use of the ! operator. I forgot Julia natively composes it with functions. edit: upon further investigation it looks like these are handled via macro expansion and converted into "not_*" versions of each function. I'd prefer to see it lean on Base Julia where possible.
  • The macro handling of specific functions by name ("startswith", "endswith", "occursin") always feels a bit clunky to me, and means that it's less composable with outside functions.
@rename
  • Reads very clearly. I love the Pair operator specifically for the rename function. I can never remember which thing gets renamed to what in the dplyr world (rename(a = b)) and this is just so clear.
  • I don't think there's a way around it, but the symbol notation introduces another way to refer to columns. Between _.a, __.a, and now :a.
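To make the three spellings concrete (as I understand them):

```julia
using Query, DataFrames

df = DataFrame(a=[1, 2], b=[3, 4])

df |> @rename(:a => :renamed) |> DataFrame  # :symbol style, in @rename/@select
df |> @map({x = _.a})         |> DataFrame  # _.column style, in most verbs
# __.column appears only in joins, to refer to the second source.
```

Each style is sensible in isolation; it's just three notations for "a column" that a new user has to keep straight.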
@mutate
  • Now this looks familiar! For me the jury is still out on the rowwise operations. Rowwise is good 90% of the time, but those 10% can look really nasty without some syntax to support it. Common tasks like renormalizing data to standard deviations around a mean value are very simple transformations conceptually that get really muddy without columnwise transforms.
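A sketch of the contrast I mean (assuming @mutate works as in the docs): the rowwise case is clean, while standardizing a column needs the column's mean and standard deviation, which aren't reachable from a single row, so I had to precompute them outside the query.

```julia
using Query, DataFrames, Statistics

df = DataFrame(a=[1.0, 2.0, 3.0])

# Rowwise: easy and readable.
df |> @mutate(b = _.a * 2) |> DataFrame

# Columnwise (z-score): the aggregates live outside the pipeline.
m, s = mean(df.a), std(df.a)
df |> @mutate(z = (_.a - m) / s) |> DataFrame
```

In dplyr the second case is just `mutate(df, z = (a - mean(a)) / sd(a))`, which is the 10% I miss.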

Impressions after toying around

  • The lazy data processing is quite cool. I really like that the query result only evaluates a head to print.
    @time 1:1e8 |> @map(_ * 2)   # ~10x faster
    # vs
    @time 1:1e8 |> @map(_ * 2) |> collect
    

After digging through some source code

  • The query expansion macros and helpers source code feel very overwhelming and difficult to contribute to.
  • Looking at query_expression_translation_phase_4 specifically, the individual handling of specific macros by name makes me worry that the package would be quite difficult to extend. In retrospect, I do recall looking into Query.jl when diving into Julia's macro system, and I think the complexity here was a bit daunting, leading me to look into DataFramesMeta.jl instead. I found its macros more approachable as a starting point for learning.
  • Even if long at times, I appreciate the clarity of the internal function names. Even if the code is sometimes complex, it was usually interpretable because of how fluently the function names read.

Closing thoughts

Really cool package! The versatility to process data agnostically is really ambitious, and it seems like you've brought it to a pretty polished state. @select, @mutate and @filter definitely have that "dplyr feel".

Syntactically the _ and __ feel a bit weird, but that might just take some getting used to. I think some sort of syntax for accessing columns of data would be nice, but I can't imagine how that would look, or what would be the performance costs of doing so in a lazily evaluated query engine.

Most importantly, the package feels nice to use. It feels pretty snappy. I'm always comforted knowing that data is being lazily evaluated and minimally computed to print only 10 rows out to the console. It's nice knowing I'm not going to kill my session by accidentally trying to compute something on tens of millions of records.

@dgkf dgkf changed the title New user feedback from a R veteran New user feedback from an R veteran Jan 18, 2020
@rleyvasal

I would also like to see more intuitive syntax in Query.jl, especially for referring to the names of columns.

The _. prefix could potentially be folded into the macro, since the macro already expects a column name in that position.

for example:
cars |> @groupby(_.Origin) |> @map(key(_))

would be more readable as:
cars |> @groupby(Origin) |> @map(key(_))
