Experiments with metaprogramming tools for DataFrames (and maybe other Julia objects that hold variables). Goals are to improve performance and provide more convenient syntax.
In earlier versions of DataFrames, expressions were used in indexing
and in the with and within functions. This approach had several
deficiencies. Performance was poor. The functions relied on eval,
which caused several issues, most notably that results were different
when used in the REPL than when used inside a function.
@with allows DataFrame columns to be referenced as symbols like :colX
in expressions. If an expression is wrapped in ^(expr), expr gets
passed through untouched. Here are some examples:
using DataArrays, DataFrames
using DataFramesMeta
df = DataFrame(x = 1:3, y = [2, 1, 2])
x = [2, 1, 0]
@with(df, :y + 1)
@with(df, :x + x) # the two x's are different
x = @with df begin
    res = 0.0
    for i in 1:length(:x)
        res += :x[i] * :y[i]
    end
    res
end
@with(df, df[:x .> 1, ^(:y)]) # The ^ means leave the :y alone
This works for Associative types, too:
y = 3
d = {:s => 3, :y => 44, :d => 5}
@with(d, :s + :y + y)
@with is the fundamental macro used by the other metaprogramming
utilities.
Select rows and/or columns. This is an alternative to getindex.
@ix(df, :x .> 1)
@ix(df, :x .> x) # again, the x's are different
@ix(df, :A .> 1, [:B, :A])
Select row subsets.
@where(df, :x .> 1)
@where(df, :x .> x)
Column selections and transformations. Also works with Associative types.
@select(df, :x, :y, :z)
@select(df, x2 = 2 * :x, :y, :z)
Add additional columns based on keyword arguments. This is available in both function and macro versions, with the macro version allowing direct reference to columns using the colon syntax:
transform(df, newCol = cos(df[:x]), anotherCol = df[:x]^2 + 3*df[:x] + 4)
@transform(df, newCol = cos(:x), anotherCol = :x^2 + 3*:x + 4)
@transform works for associative types, too.
A number of functions for operations on DataFrames have been defined. Here is a table of equivalents for Hadley's dplyr and common LINQ functions.
Julia        dplyr           LINQ
---------------------------------------------
@where       filter          Where
@transform   mutate          Select (?)
@by                          GroupBy
@groupby     group_by
@based_on    summarise/do
@orderby     arrange         OrderBy
@select      select          Select
Chaining operations is a useful way to manipulate data. There are several ways to do this. This is still in flux in base Julia (JuliaLang/julia#5571). Here is one option from Lazy.jl by Mike Innes:
x_thread = @> begin
    df
    @transform(y = 10 * :x)
    @where(:a .> 2)
    @by(:b, meanX = mean(:x), meanY = mean(:y))
    @orderby(:meanX)
    @select(:meanX, :meanY, var = :b)
end
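For comparison, base Julia's |> operator can already chain single-argument functions, at the cost of wrapping each step in an anonymous function. The sketch below is only an illustration of that trade-off: it uses a plain vector instead of a DataFrame, and the variable names are made up for the example.

```julia
# Chaining with Base's |> operator on a plain vector (not a DataFrame).
# Each step must be a one-argument function, hence the parenthesized lambdas.
xs = [1, 2, 3, 4, 5, 6]

result = xs |>
    (v -> filter(x -> x > 2, v)) |>   # keep elements greater than 2
    (v -> map(x -> 10x, v)) |>        # scale every remaining element
    sum                               # reduce to a single number
```

The parentheses around each lambda matter: without them, the body of the first anonymous function would absorb the rest of the pipeline. Macro-based chaining like @> avoids this noise.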
The following operations are now included:
- where(g, d -> mean(d[:a]) > 0) and @where(g, mean(:a) > 0) -- Filter groups based on the given criteria. Returns a GroupedDataFrame.
- orderby(g, d -> mean(d[:a])) and @orderby(g, mean(:a)) -- Sort groups based on the given criteria. Returns a GroupedDataFrame.
- DataFrame(g) -- Convert groups back to a DataFrame with the same group orderings. Should this be convert(DataFrame, g) instead?
- DataFrames.based_on(g, d -> DataFrame(z = [mean(d[:a])])) and @based_on(g, z = mean(:a)) -- Summarize results within groups. Returns a DataFrame.
- transform(g, d -> y = d[:a] - mean(d[:a])) and @transform(g, y = :a - mean(:a)) -- Transform a DataFrame based on operations within a group. Returns a DataFrame.
You can also index on GroupedDataFrames. g[1] is the first group,
returned as a SubDataFrame. g[[1,4,5]] or
g[[true, false, true, false, false]] return subsets of groups as a
GroupedDataFrame. You can also iterate over GroupedDataFrames.
The most general split-apply-combine approach is based on map.
map(fun, g) returns a GroupApplied object with keys and vals. This
can be used with combine. {This functionality is not all fleshed out
and could use more work.}
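The split-apply-combine pattern itself can be sketched without any DataFrame machinery. The following is a minimal illustration in plain Julia 1.x, not the GroupApplied/combine implementation: split_rows and mean_sketch are hypothetical helpers, and two plain vectors stand in for the columns :b and :a.

```julia
# Split: map each distinct label to the row indices where it occurs.
function split_rows(labels::AbstractVector)
    groups = Dict{eltype(labels), Vector{Int}}()
    for (i, k) in enumerate(labels)
        push!(get!(() -> Int[], groups, k), i)
    end
    return groups
end

# Hypothetical stand-in for mean, to keep the sketch dependency-free.
mean_sketch(v) = sum(v) / length(v)

b = ["u", "v", "u", "v", "u"]        # grouping column
a = [1.0, 2.0, 3.0, 4.0, 8.0]        # value column

groups = split_rows(b)                                             # split
applied = Dict(k => mean_sketch(a[idx]) for (k, idx) in groups)    # apply
# Combine: one row per group, in sorted key order.
combined = [(b = k, mean_a = applied[k]) for k in sort(collect(keys(applied)))]
```

Here the intermediate Dict plays roughly the role of the keys and vals held between the apply and combine steps.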
@with works by parsing the expression body for all columns indicated
by symbols (e.g. :colA). Then, a function is created that wraps the
body and passes the columns as function arguments. This function is
then called. Operations are efficient because:
- A pseudo-anonymous function is defined, so types are stable.
- Columns are passed as references, eliminating DataFrame indexing.
All of the other macros are based on @with.
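The steps above can be sketched in a few lines. This is a minimal illustration, not the package's actual implementation: @with_sketch and replace_syms! are hypothetical names, a Dict keyed by symbols stands in for a DataFrame, and the code targets Julia 1.x syntax, which is newer than the examples elsewhere in this README.

```julia
# Walk an expression, replacing each quoted symbol (:name) with a generated
# argument name and recording the column => argument mapping in `found`.
replace_syms!(x, found::Dict{Symbol,Symbol}) = x   # literals, bare symbols, line numbers
function replace_syms!(q::QuoteNode, found::Dict{Symbol,Symbol})
    q.value isa Symbol || return q
    return get!(() -> gensym(q.value), found, q.value)
end
function replace_syms!(e::Expr, found::Dict{Symbol,Symbol})
    # ^(expr) is the escape hatch: return the inner expression untouched.
    if e.head == :call && length(e.args) == 2 && e.args[1] == :^
        return e.args[2]
    end
    return Expr(e.head, Any[replace_syms!(a, found) for a in e.args]...)
end

macro with_sketch(d, body)
    found = Dict{Symbol,Symbol}()
    newbody = replace_syms!(body, found)
    cols = collect(keys(found))
    # Build (function (args...) newbody end)(d[:col1], d[:col2], ...).
    # Note: `d` is evaluated once per referenced column in this sketch.
    fun = Expr(:function, Expr(:tuple, (found[c] for c in cols)...),
               Expr(:block, newbody))
    lookups = (Expr(:ref, d, QuoteNode(c)) for c in cols)
    return esc(Expr(:call, fun, lookups...))
end

d = Dict(:x => [1, 2, 3], :y => [2, 1, 2])
x = [2, 1, 0]
r1 = @with_sketch(d, :y .+ 1)   # [3, 2, 3]
r2 = @with_sketch(d, :x + x)    # [3, 3, 3]; the two x's are different
```

Because the body runs inside a real function whose arguments are the column vectors, the compiler can specialize on the column types, which is where the efficiency claims above come from.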
A CompositeDataFrame is an AbstractDataFrame built using Composite types. The advantages of this are:
- Accessing columns df[:colA] is more type stable, so code should be faster (without @with tricks). There is still the function boundary to worry about.
- You can access single columns directly using df.colA.
- All indexing operations can be done currently.
Some downsides include:
- As an abuse of the type system, creating a new type for each change to a CompositeDataFrame may waste memory.
- You cannot change the structure of a CompositeDataFrame once created. You have to treat it (almost) like an immutable object. For example, to add a column, you need to do something like:
transform(df, newcol = df.colA + 5)
An advantage of this is that the API becomes more functional. All manipulations of the CompositeDataFrame return a new object. Normally, this doesn't create much more memory.
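The trade-offs described above can be seen in miniature with a NamedTuple of vectors. This is only an analogy chosen for the sketch (Julia 1.x), not how CompositeDataFrame is implemented: the NamedTuple stands in for the composite type, and merge stands in for transform.

```julia
# A NamedTuple of column vectors as a stand-in for a CompositeDataFrame.
df = (colA = [1, 2, 3], colB = ["a", "b", "c"])

# "Adding a column" builds a new object instead of mutating the old one.
df2 = merge(df, (newcol = df.colA .+ 5,))

same_type = typeof(df) == typeof(df2)    # false: the new object has a new type
shared    = df2.colA === df.colA         # true: existing columns are not copied
```

The new object has a different type, which is the type-system cost noted above, while the existing column vectors are shared rather than copied.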
Everything here is experimental.
Right now, here's my judgement on the advantages of this approach:
- The approach is quite expressive and flexible.
- Use of macros improves run-time efficiency.
- The API is relatively consistent.
- I have not run into any show-stoppers like we had with expression-based indexing.
- The code is relatively concise.
The main disadvantages are:
- The syntax is a little noisy with all of the @something macro calls. {This is my main gripe.}
- As with most macros, there's a certain amount of magic going on.
Right now, @with works for both AbstractDataFrames and Associative
types. @ix really only works for AbstractDataFrames. Because
macros are not type specific, it would be nice to make these
metaprogramming tools as general as possible.
Instead of :colA to refer to a member of the type, another option is
to use *colA or ^colA or something else that isn't defined in
Julia (but can be parsed). Then, it'd be easier to mix use of symbols
with column references. :colA is most consistent in that it has (I
think) the tightest precedence, so you don't have to worry about using
parentheses.
From the user's point of view, it'd be nice to swap the
"dereferencing", so in @with(df, colA + :outsideVariable), colA is
a column, and :outsideVariable is an external variable. That is
quite difficult to do, though. You have to parse the expression tree
and replace all quoted variables with the "right thing". Here's an
example showing some of the difficulties:
@with df begin
    y = 1 + x + :z      # z is supposed to be an outside variable; x is a column
    fun(x) = x + 1      # don't want to substitute this x
    fun(y + x)          # don't want to substitute this y
end
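To make the difficulty concrete, here is a small sketch (Julia 1.x) that collects every bare symbol from the body of the example above. The helper collect_bare! is hypothetical and exists only for this illustration.

```julia
# Recursively collect every bare (unquoted) symbol in an expression.
collect_bare!(acc::Set{Symbol}, x) = acc                    # literals, QuoteNodes, line numbers
collect_bare!(acc::Set{Symbol}, s::Symbol) = push!(acc, s)
function collect_bare!(acc::Set{Symbol}, e::Expr)
    foreach(a -> collect_bare!(acc, a), e.args)
    return acc
end

ex = Meta.parse("""
begin
    y = 1 + x + :z
    fun(x) = x + 1
    fun(y + x)
end
""")

bare = collect_bare!(Set{Symbol}(), ex)
```

The resulting set mixes the column x with the local y, the function name fun, and the operator +, and nothing in the syntax tree alone says which of them should be treated as a column. With the :colA convention, the quoted symbols are unambiguous by construction.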
For performance, we should check to see if this can play nicely with
@devectorize for use on columns.