Following a discussion with @davidanthoff in the Julia Slack about some changes to DataFramesMeta.jl, in which he referred me back to Query.jl, I offered to give some new user feedback.
Background
I've got ~3 years of pretty intensive R developer experience. At my workplace, R is the preferred language, and in using it there I've developed a pretty deep understanding of the tidyverse set of packages. A lot of my software opinions stem from digging through the internals of those packages, so all of my feedback comes through a heavily dplyr-centric data-handling worldview. I have about 3 months of Julia experience, during which I went on a DataFramesMeta.jl refactoring bender. I'm still learning every day about the Julia ecosystem, and there are certainly huge gaps in my ecosystem and technical knowledge. These are my initial impressions of Query.jl, and specifically the "standalone query commands."
Query.jl Getting Started "Standalone Query Commands"
First glance syntax impressions
I like the native |> usage as it makes me feel like I can interweave these macros with other packages or my own lambdas quite easily.
The _ feels weird to me. I'm aware of Lazy.jl and its @_ macro, but in the context of a DataFrame it feels clunky to preface column names with it. On the other hand, it's cool to have direct access to (I assume) the whole DataFrame (or Row?). edit: I realized later on that this isn't coming from Lazy.jl, but is reimplemented - I think to allow for :__ usage. The duplication here gives me some mild code smell vibes.
I like the agnostic approach to data, but I'm skeptical that it can be both a rich syntax for operating on tabular data while also remaining agnostic.
Having to "collect" the query result back into a DataFrame at the tail end of a pipeline feels a bit weird. It would be nice if it defaulted to being endomorphic when it prints to the console, perhaps only trying to coerce the first n elements so as not to fail on large data.
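For concreteness, this is the kind of pipeline I mean. A minimal sketch with a toy DataFrame (the data and column names are my own invention); the trailing DataFrame call is the "collect" step that materializes the lazy query:

```julia
using Query, DataFrames

df = DataFrame(name = ["John", "Sally", "Kirk"],
               age = [23.0, 42.0, 59.0],
               children = [3, 5, 2])

# The query itself is lazy; piping into DataFrame at the tail
# end materializes the result back into a table.
result = df |>
    @filter(_.age > 30) |>
    @map({_.name, _.children}) |>
    DataFrame
```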
Operators
@map
The { ... } feels very uncomfortable to me. It somewhat erodes my trust in idiomatic Julia syntax. It took me quite a while to hunt down this expansion in helper_namedtuples_replacement and I think I get what's going on now. As far as I can tell this is done to avoid dispatching on a multi-argument function call. My gut feeling is that there must be a way to get around this.
The @groupby(...) |> @map(...) example had me confused for a bit since it uses mean(_.b) to somehow calculate a mean across multiple rows, yet I could only access elements rowwise otherwise. I'm still getting my bearings here, but I had to really reevaluate my assumptions to digest this one.
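To illustrate the mental-model shift, here's a minimal sketch (toy data and column names are my own): after @groupby, the _ inside @map refers to a whole group, so field access yields a vector rather than a single cell.

```julia
using Query, DataFrames, Statistics

df = DataFrame(a = [1, 1, 2, 2], b = [10.0, 20.0, 30.0, 40.0])

# After @groupby, `_` is an entire group, so `_.b` is the vector
# of b values within the group and mean(_.b) reduces across rows.
res = df |>
    @groupby(_.a) |>
    @map({key = key(_), b_mean = mean(_.b)}) |>
    DataFrame
```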
I haven't figured out how to do columnwise operations. For example, doing (lag(a) .+ a .+ lead(a)) ./ 3 (a crude running average). The closest I've come is by "grouping" everything and doing grouped operations, though that results in a single row of arrays.
df |> @groupby(1) |> @map({a = _.b .* _.a})
The macro expansion seems to make an exception only for anonymous functions (expr.head == :->), but doesn't accept unary function objects. Accepting those would be really nice in situations where you have a complicated function that you don't want to write out inside a data processing step, or that you want to reuse.
@map(df, x -> x) # returns dataframe
@map(df, identity) # returns array of "identity"
A bit pedantic, but the documentation for each verb says that it takes an anonymous function, yet the expression is not (at least before macro expansion) a function; e.g. _^2 is not a function. I know it gets expanded out to an anonymous function, but it might be nice to acknowledge that these verbs also accept _-style lambdas.
@filter
@filter feels quite intuitive. This is a place where the rowwise behavior really shines and the filtering operations read really nicely.
Just trying to push the limits here: what if I wanted to filter on rows where any Number column is > 40?
This feels a bit clunky, but it seems like it's definitely not the intended use case. It's good to know it's at least possible.
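For the record, this is the closest sketch I can offer (hedged: I haven't verified that the macro expansion tolerates the nested lambda, and values(_) relies on rows being NamedTuples; the toy data is my own):

```julia
using Query, DataFrames

df = DataFrame(a = [10, 50, 30], b = ["x", "y", "z"], c = [5, 5, 60])

# Keep rows where any Number-typed field exceeds 40. The inner
# lambda is ordinary Julia; only the outer `_` is Query's row.
res = df |>
    @filter(any(v -> v isa Number && v > 40, values(_))) |>
    DataFrame
```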
@groupby
I tried to learn more about the element_selector using ?@groupby but documentation was minimal.
I errored when I passed it a lambda function as the element_selector, but I was able to evaluate an expression with a _. This functionality seems to introduce a lot of complexity to @groupby while being functionally equivalent to the more readable @groupby(...) |> @map(...).
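A sketch of the equivalence I mean (hedged: I'm assuming the element selector is the second positional argument of the standalone @groupby, as its docstring parameter name suggests; toy data is my own):

```julia
using Query, DataFrames

df = DataFrame(a = [1, 1, 2], b = [10, 20, 30])

# With an element selector, each group holds just the b values...
q1 = df |> @groupby(_.a, _.b) |> @map({key = key(_), b = sum(_)}) |> DataFrame

# ...which should be equivalent to grouping whole rows and
# projecting inside the follow-up @map.
q2 = df |> @groupby(_.a) |> @map({key = key(_), b = sum(_.b)}) |> DataFrame
```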
@orderby_* and @thenby_*
These feel verbose, but I really like how readable they are once composed.
My instinct from a dplyr perspective is that these must be expressible more succinctly, but the current form is certainly clear.
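For example, a composed sort reads almost like a sentence (toy data is my own):

```julia
using Query, DataFrames

df = DataFrame(a = [2, 1, 1], b = [2, 2, 1])

# Primary ascending sort on a, with a descending tiebreak on b.
res = df |>
    @orderby(_.a) |>
    @thenby_descending(_.b) |>
    DataFrame
```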
@groupjoin, @join
The join operations' use of __ makes it feel like the new syntax is getting a bit heavy.
Creating the new columns inside of a join also feels like the function scope is a bit too big. My preference would again be to break out a map call.
The outer_selector and inner_selector language is a bit confusing as "outer" and "inner" are typically used to describe the overlap of the join, not the source dataset.
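To make the selector roles concrete, here's the shape of a standalone @join, mirroring the documented example (here _ is the outer/first source and __ the inner/joined one; the data is a toy sketch):

```julia
using Query, DataFrames

df1 = DataFrame(a = [1, 2, 3], b = [1.0, 2.0, 3.0])
df2 = DataFrame(c = [2, 4, 2], d = ["John", "Jim", "Sally"])

# Match rows where df1.a == df2.c; the last argument constructs
# the output row from both sides, which is the scope I'd rather
# see broken out into a separate @map.
res = df1 |>
    @join(df2, _.a, __.c, {_.a, _.b, __.d}) |>
    DataFrame
```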
@mapmany
I think I'd need more experience querying other data types to weigh on this one, but at first glance it seems useful for the Dict example. I'm sure with heavily nested data structures - perhaps something read out of a .json file or something like that - this would be really useful.
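A sketch of the fan-out behavior on a nested structure (my own toy data): for each key => vector pair, the first selector picks the nested collection and the result selector emits one row per element.

```julia
using Query, DataFrames

source = Dict(:a => [1, 2, 3], :b => [4, 5])

# Each Dict entry is a Pair; `_.second` selects the nested vector
# and `__` is bound to each of its elements in turn.
res = source |>
    @mapmany(_.second, {key = _.first, value = __}) |>
    DataFrame
```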
@take and @drop
Love me some good functional basics. Glad to see these bases are covered for a lazy collection.
@unique
Another staple, especially for filtering unique rows.
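In combination these compose nicely, e.g. lazy pagination plus row deduplication (toy data is my own):

```julia
using Query, DataFrames

df = DataFrame(a = [1, 1, 2, 2, 3], b = [1, 1, 2, 3, 4])

# Skip the first row, then keep the next two, all lazily.
page = df |> @drop(1) |> @take(2) |> DataFrame

# Keep only fully distinct rows.
distinct = df |> @unique() |> DataFrame
```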
@select
This one feels very dplyry to me (and that's a good thing - I think it has some fantastic select syntax).
I like the use of the ! operator. I forgot Julia natively composes it with functions. edit: upon further investigation, it looks like these are handled via macro expansion and converted into "not_*" versions of each function. I think I'd prefer to see it lean on Base Julia where possible.
The macro handling of specific functions by name ("startswith", "endswith", "occursin") always feels a bit clunky to me, and means that it's less composable with outside functions.
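The select flavors side by side, as a sketch (hedged: I'm writing negation with the ! operator discussed above, and the name-predicate forms are the ones the macro special-cases; toy data is my own):

```julia
using Query, DataFrames

df = DataFrame(fruit = ["apple", "banana"], amount = [2, 10], price = [1.2, 0.4])

df |> @select(:fruit, :price) |> DataFrame   # keep columns by name
df |> @select(!:amount) |> DataFrame         # drop a column by name
df |> @select(startswith("a")) |> DataFrame  # name predicate handled by the macro
```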
@rename
Reads very clearly. I love the Pair operator specifically for the rename function. I can never remember which thing gets renamed to what in the dplyr world (rename(a = b)) and this is just so clear.
I don't think there's a way around it, but the symbol notation introduces yet another way to refer to columns: between _.a, __.a, and now :a.
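The clarity in question, sketched with my own toy data:

```julia
using Query, DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

# The Pair reads left to right: old name => new name.
res = df |> @rename(:a => :id) |> DataFrame
```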
@mutate
Now this looks familiar! For me the jury is still out on the rowwise operations. Rowwise is good 90% of the time, but those 10% can look really nasty without some syntax to support it. Common tasks like renormalizing data to standard deviations around a mean value are very simple transformations conceptually that get really muddy without columnwise transforms.
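For instance, a z-score rescaling: rowwise @mutate handles the per-row arithmetic, but the columnwise mean and standard deviation have to be computed up front, outside the query (hedged: I'm assuming the macro body can close over outer locals like mu and sigma; toy data is my own).

```julia
using Query, DataFrames, Statistics

df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0])

# Columnwise statistics computed eagerly, outside the lazy query...
mu, sigma = mean(df.b), std(df.b)

# ...then used inside a rowwise transform.
res = df |> @mutate(b_z = (_.b - mu) / sigma) |> DataFrame
```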
Impressions after toying around
The lazy data processing is quite cool. I really like that the query result only evaluates a head to print.
The query expansion macros and helpers source code feel very overwhelming and difficult to contribute to.
Looking at query_expression_translation_phase_4 specifically, the individual handling for specific macros by name makes me concerned that the package would be quite difficult to extend. In retrospect, I do recall looking into Query.jl when diving into Julia's macro system, and I think the complexity here was a bit daunting, leading me to look into DataFramesMeta.jl instead. I found its macros more approachable as a starting point for learning.
Even if long at times, I appreciate the clarity of the internal function names. The code is sometimes complex, but it was usually interpretable because of how fluently the function names read.
Closing thoughts
Really cool package! The versatility to process data agnostically is really ambitious and it seems like you've brought it to a pretty polished state. @select, @mutate and @filter definitely have that "dplyr feel".
Syntactically the _ and __ feel a bit weird, but that might just take some getting used to. I think some sort of syntax for accessing columns of data would be nice, but I can't imagine how that would look, or what would be the performance costs of doing so in a lazily evaluated query engine.
Most importantly, the package feels nice to use. It feels pretty snappy. I'm always comforted knowing that data is being lazily evaluated and minimally computed to print only 10 rows out to console. It's nice knowing I'm not going to kill my session by accidentally trying to compute something on tens of millions of records.
dgkf changed the title from "New user feedback from a R veteran" to "New user feedback from an R veteran" on Jan 18, 2020.