Merge two DataFrames only to missing values #2243

Open

chedieck opened this issue May 11, 2020 · 36 comments

@chedieck

Suppose I have a DataFrame called df_missing:

6×3 DataFrame
│ Row │ id    │ val     │ other │
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ missing │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ missing │ 3     │

and I also have another dataset, called df_completion:

2×2 DataFrame
│ Row │ id    │ val   │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │ 
│ 2   │ 8     │ 13    │ 

So my suggestion is: couldn't there be a more straightforward way to replace the missing values using the common id of the two dataframes, thus creating df_full?

6×3 DataFrame
│ Row │ id    │ val     │ other │
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ 5       │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ 13      │ 3     │

The two approaches that currently seem best are:
julia> df_missing[in(df_completion.id).(df_missing.id), :val] = df_completion.val
or

julia> df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true);

julia> df_full.val = map(df_full.val, df_full.val_1) do a, b
       ismissing(a) ? b : a
       end;

julia> select!(df_full, Not(:val_1))

besides, of course, creating a loop. Maybe there could be a method for merging two DataFrames in this way? I think it's a very common problem.

@bkamins
Member

bkamins commented May 11, 2020

I do not think you can do any better than joining and then combining the results. In particular, in general (if it were to become a standard library function) you would have to define what should happen if either of the data frames has a duplicate entry in :id, what should happen if the right data frame has some entry while the left data frame also contains it but it is not missing, etc.

In general I believe that there are so many possible options that it is hard to define an API for such a function; an appropriate join should simply be used. Note that the last operation you mention can be written as df_full.val .= coalesce.(df_full.val, df_full.val_1), which will be fast and update df_full.val in place.
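For reference, a minimal end-to-end sketch on the example data from this issue (assuming only that DataFrames.jl is loaded; note that leftjoin does not guarantee the original row order):

using DataFrames

df_missing = DataFrame(id = [5, 2, 1, 4, 6, 8],
                       val = [1, missing, 3, 8, 2, missing],
                       other = [4, 4, 3, 4, 4, 3])
df_completion = DataFrame(id = [2, 8], val = [5, 13])

df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true)
df_full.val .= coalesce.(df_full.val, df_full.val_1)  # fill only the missing values, in place
select!(df_full, Not(:val_1))                         # drop the helper column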

@pdeffebach
Contributor

Stata has a good API for this via update

update: update missing values of same-named variables in master with values
from using

I think Bogumil's idea for this is elegant, and is good for 1.0. But I think following Stata on this could solve the coordination problem for what a good option is.

My proposal would be

leftjoin(df1, df2, on = :id, update = :val)

This would perform the coalesce automatically. update could be a column identifier or a vector thereof. Such columns would be exempt from makeunique = true.

A quick Google search shows that data.table has something similar to this, but dplyr does not.

@bkamins
Member

bkamins commented May 11, 2020

What does Stata do when:

  • duplicate entries in df2 are present? Throw an error?
  • duplicate entries in df1 key are present and for one of them the value is non-missing, while for the other the value is missing? Is this also an error? (or maybe the on column must be unique in general also in df1 in this case)?

@pdeffebach
Contributor

  • duplicate entries in df2 are present? Throw an error?

Interestingly, this will add both observations

. use df_completion_dups, clear

. list

     +----------+
     | id   val |
     |----------|
  1. |  2   220 |
  2. |  2   240 |
     +----------+

. use df_missing_normal.dta, clear

. qui merge 1:m id using df_completion_dups, update

. list

     +--------------------------------+
     | id   val                _merge |
     |--------------------------------|
  1. |  1   100       master only (1) |
  2. |  2   220   missing updated (4) |
  3. |  2   240   missing updated (4) |
     +--------------------------------+
  • duplicate entries in df1 key are present and for one of them the value is non-missing, while for the other the value is missing? Is this also an error? (or maybe the on column must be unique in general also in df1 in this case)?

You can specify whether or not the matches need to be unique. merge 1:1 will throw an error if there are any duplicates, merge m:1 (many to 1) will throw an error if the RHS (called using) has duplicates etc.

The default behavior in this case is also to add new rows. See below.

. use df_completion_dups, clear

. list

     +----------+
     | id   val |
     |----------|
  1. |  2   220 |
  2. |  2   240 |
     +----------+

. use df_missing_normal.dta, clear

. qui merge 1:m id using df_completion_dups, update

. list

     +--------------------------------+
     | id   val                _merge |
     |--------------------------------|
  1. |  1   100       master only (1) |
  2. |  2   220   missing updated (4) |
  3. |  2   240   missing updated (4) |
     +--------------------------------+

I will have to think more about the mapping of Stata's 1:m syntax to our joins and whether or not the behavior I propose would be intuitive.

@bkamins
Member

bkamins commented May 11, 2020

OK - let us try to work out consistent design requirements (like we did for the select/transform/combine combo) and then we can implement it.

(and I will be commenting from a "defensive position" 😄)

In particular I think that this operation is most intuitive if it is also allowed to be done "in place" (but we need an API for this).

@pdeffebach
Contributor

Thanks. I will try to work on a proposal that is similar to Stata but matches our current infrastructure. This is yet another opportunity to combine the best of Stata and dplyr.

@Nosferican
Contributor

Nosferican commented May 14, 2020

I guess I am not that late to the party. Here is the approach from SQL:

UPDATE A
SET a = B.x, b = B.y
FROM B
WHERE A.id = B.id

@bkamins
Member

bkamins commented May 14, 2020

Some kind of functionality like this, in general an "update", would be really nice to add. We just need to find a general design that is flexible and future-proof (i.e. to have a Swiss-army-knife function like e.g. combine, not just a function that does only one specific thing).

As proposed above maybe we can add to leftjoin an update kwarg which takes a:

  • pair (:left => :right)
  • single value as a shorthand for :val => :val
  • vector of the above (if more updates are required in one operation)
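
A hypothetical illustration of these three forms (the update keyword does not exist in DataFrames.jl; the column names are purely illustrative):

leftjoin(df1, df2, on = :id, update = :left_col => :right_col)   # pair (:left => :right)
leftjoin(df1, df2, on = :id, update = :val)                      # shorthand for :val => :val
leftjoin(df1, df2, on = :id, update = [:val, :a => :b])          # vector of the above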

@kescobo
Contributor

kescobo commented May 26, 2020

As proposed above maybe we can add to leftjoin an update kwarg which takes a:
• pair ( :left => :right )
• single value as a shorthand for :val => :val
• vector of the above (if more updates are required in one operation)

This works as long as you're not supporting various options on the update itself. Adding kwargs that only modify one other kwarg seems like a bad idea and cries out for splitting.

This is not a feature I've ever needed to use, so I don't have strong opinions on it, but the fact that it's found in Stata and SQL makes a compelling case to add it as a separate function that could have its own options.

(i.e. to have a swiss army knife function like e.g. combine not just function that does only one specific thing).

Not sure if you've ever owned a Swiss army knife, but the downside of being able to do everything is that you don't do any of them super well. Not sure how much to torture the metaphor, but as a kid I always wanted the big fat ones with tons of different tools, because they seemed so cool, but my (Swiss) grandfather's with 2 blades and a bottle opener was almost always a better choice for actually doing anything with.

@bkamins
Member

bkamins commented May 26, 2020

As discussed on Slack - we have a tension here. We can either leave joins "as is" and add a new method doing updates of existing columns and working in place (something like update!) or we can extend existing functions with these functionalities. I am not really sure what would be the best approach.

Adding a new function is probably easier to understand. On the other hand it would do almost the same as leftjoin "on steroids" so I am not really sure what would be best.

@pdeffebach
Contributor

I think another function might be best, since I was also going to bring up that not only does Stata have an update option, it also has a replace option.

So adding that would risk bloating the function even further. I would propose update and overwrite as added functions.

@bkamins
Member

bkamins commented May 26, 2020

I would propose update and overwrite as added functions.

What would be the exact difference between them? I was thinking of an API of the kind:

update!(df1, df2, on=..., args) and then args would be things that would specify the operations you want (so you could mix different operations you want: just add a column, replace, fill missings, whatever other ideas we have)

@pdeffebach
Contributor

One only replaces if missing; the other overwrites whole columns. Maybe they could be the same function, though.

@bkamins
Member

bkamins commented May 26, 2020

I would prefer the same function and just have some nice way to specify the operation you want to do, I guess there are 4 natural operations:

  • fill missings
  • replace where second data frame has a value
  • fully replace (if second data frame does not have a value fill with missing)
  • add a column
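
As a baseline, a rough sketch of how each of these can be expressed today, after df_full = leftjoin(df1, df2, on = :id, makeunique = true) has brought in the second data frame's column as :val_1 (names follow the example at the top of the issue; the first three lines are alternatives, not a sequence):

df_full.val .= coalesce.(df_full.val, df_full.val_1)   # 1. fill missings only
df_full.val .= coalesce.(df_full.val_1, df_full.val)   # 2. replace wherever the second data frame has a value
df_full.val = copy(df_full.val_1)                      # 3. fully replace (missing where there is no match)
# 4. add a column: leftjoin already adds columns that exist only in the second data frame
select!(df_full, Not(:val_1))                          # drop the helper column afterwards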

@pdeffebach
Contributor

Actually, I've just thought of a problem with this approach. Let's say you have df1 and df2, where df2 has many new columns you are interested in, :age, :marital_status, but also complete information for some columns in df1. So df1 and df2 both have the variable :income, but :income has missing values in df1.

When you do

innerjoin(df1, df2, on = :id, makeunique = true)

you get :income and :income_1. A separate update function wouldn't help with this:

  • No need to use update after the join happens: you can clean it up yourself easily
  • Doing update before the join happens will still result in either an error or :income_1
  • You can't do just the update since then you would lose the demographic variables of interest.

This kind of scenario is an argument for an update argument in innerjoin.

@bkamins
Member

bkamins commented May 26, 2020

I do not think it would be a problem, as adding demographic columns would be operation type 4 above (just add columns).

So in a sense update would be a kind of select from df2 into df1. If you just pass a column, it behaves in the same way as a join (except that by default nothing is added - a join is a kind of transform in this sense), but you are allowed to do 3 different operations (update an existing column in one of the modes: only missing, everywhere a value is present, or just replace the old column).

@pdeffebach
Contributor

In Stata terms this is the keepusing keyword argument, where you choose to keep certain variables from the using (RHS) data set.

I'm not sure I totally agree with the approach above because generally the number of columns I want to be added and not overridden is an order of magnitude larger than the columns I do want overridden. I would inevitably do update(df1, df2, on = :id, overwrite = [:income], add = :) because I don't want to specify all of the demographic variables to tack on individually.

An additional problem: We have 5 different join functions, and all of them, save maybe antijoin, could use this kind of feature. I'm worried that we will have to re-invent complicated rules for the creation of new columns.

@bkamins
Member

bkamins commented May 26, 2020

Yeah, as usual - given the Stata experience - is it possible to put together some proposal to start with?

@Nosferican
Contributor

I find the SQL syntax easier (an example of in-place):

UPDATE 
    t1
SET 
    t1.c1 = COALESCE(t1.c1, t2.c1)
FROM 
    t1
    INNER JOIN t2
    ON t1.id = t2.id
;

@bkamins
Member

bkamins commented May 26, 2020

@Nosferican this is mostly what I have ideally imagined, but without metaprogramming it is hard to come up with a concise syntax; i.e. it could be something like:

update!(fun, df1, df2, on=:id)

where fun would get two data frames as arguments, containing only the matching rows of df1 and df2, and would follow the same output-processing rules as combine(fun, df). Then what it produces would be appended to df1, and if the produced result had columns matching existing columns, those columns would be updated in the matching places.

So your example would read:

update!(t1, t2, on=:id) do (t1, t2)
    return (c1 = coalesce.(t1.c1, t2.c1),)
end

This would be almost fully flexible (it would not allow full column replacement, but it is easy enough to almost achieve it using leftjoin).

Still I am not fully happy with this (nor any other syntax I could come up with). But maybe we will find the update! I propose "good enough"?

@bkamins
Member

bkamins commented May 26, 2020

OK. I have thought about it and here are some conclusions. Joins in DataFrames.jl work in three steps (I omit implementation details):

  1. ROW MATCHING: appropriately resize df1 and df2 in terms of rows, drop the joining columns from df2, and create two vectors indicating whether a given row was originally present in df1 and df2
  2. TABLE MERGING: merge resized df1 and df2
  3. POST PROCESSING: optionally add a column which is a transformation of indicator column

What we discuss here is essentially replacing the TABLE MERGING step with a custom function. This means (I give an example for leftjoin, but it applies to all other joins) that:

leftjoin(df1, df2, on=cols)

can be seen as calling

leftjoin(default_joiner, df1, df2, on=cols)

where default_joiner is a function taking 4 arguments (resized df1, resized df2, vector of row indicators for df1, vector of row indicators for df2) and performs a merge of these two data frames (by default just hcat-ing them; note that default_joiner mutates the first argument passed in place).

And actually we could expose this leftjoin(default_joiner, df1, df2, on=cols) as a public API and allow for users to provide custom joiners (where all things that are asked for can be implemented, most probably using do block).

Now, if we went this way, it is also natural how leftjoin! should be defined. Currently leftjoin passes a copy of df1 to default_joiner. leftjoin! would instead manipulate df1 in place before passing it to default_joiner, and by this the effect would be that leftjoin! mutates df1 in place as needed.

The only problem is that leftjoin(custom_joiner, df1, df2, on=cols) is a very low level API, but maybe it is good enough for practical purposes?
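
To make this concrete, a purely hypothetical custom joiner with the 4-argument signature sketched above (this API does not exist; the function only illustrates a fill-missings merge in the spot where default_joiner would hcat):

function fillmissing_joiner(left, right, left_ind, right_ind)
    # columns present in both tables: keep left values, fill missings from right
    for name in intersect(names(left), names(right))
        left[!, name] .= coalesce.(left[!, name], right[!, name])
    end
    # columns present only in the right table: append them, as the default merge would
    for name in setdiff(names(right), names(left))
        left[!, name] = right[!, name]
    end
    return left
end

# hypothetical call under the proposed public API
# leftjoin(fillmissing_joiner, df1, df2, on = :id)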

@babaq

babaq commented Jul 26, 2020

I also think this feature would be very helpful in our work.

update! suggests that the contents of df1 will be refreshed with the corresponding contents of df2, which matches leftjoin, while merge! suggests that the contents of df1 and df2 should be combined, which matches outerjoin. So I think the join APIs are the appropriate place for this feature.

The problem here is how to handle content conflicts. The current implementation throws an error or, if makeunique=true, leaves the conflicts to the user.

But for some common use cases, for example updating/left-joining df1, if conflicts occur we usually want to overwrite the content of df1 with the corresponding content of df2. From a user perspective, the keywords makeunique=false, overwrite=true would be clear enough to express this intention. Of course, the content here would be a single cell, not the whole column. If missing values are to be filled as much as possible - meaning that if df1 has missing and df2 does not, the missing value is replaced, but if df1 is not missing and df2 is, the data is not wiped out by missing - then another keyword, writemissing=false, should work.

For flexibility, the relatively low-level API @bkamins mentioned may be needed, but for most users I think join would be a good target for incorporating this feature.

@hongtaoh

hongtaoh commented Jun 29, 2021

I do not think you can do any better than joining and then combining the results. In particular in general (if if were to become a standard library function) you would have to define what should happen if either of the data frames has a duplicate entry in :id, what should happen if the right data frame has some entry, and the left data frame also contains it but it is not missing etc.

I general I believe that there are so many possible options that it is hard to define an API for such a function, simply an appropriate join should be used. Note then that the last operation you mention can be written as df_full.val .= coalesce.(df_full.val, df_full_val_1) which will be fast and update df_full.val in place.

I guess df_full_val_1 should be df_full.val_1?

If it's not rude to ask: may I know what df_full.val_1 means? A duplicate of df_full.val? Also, could I know the mechanism underlying df_full.val .= coalesce.(df_full.val, df_full.val_1)? I didn't find a clue in either Julia's docs or DataFrames' docs.

Thanks for your tremendous contribution to DataFrames, @bkamins !

@bkamins
Member

bkamins commented Jun 29, 2021

I guess df_full_val_1 should be df_full.val_1?

Yes, it was a typo; fixed

May I know what does df_full.val_1 mean?

It gives you a vector stored as the column val_1 in the data frame df_full. I quoted it as it was created by the leftjoin(df_missing, df_completion, on = :id, makeunique = true) operation in the original question. This style is not recommended (because it can be confusing, as you witnessed), but it is supported. The thing is that df_missing and df_completion have a column with the same name, so makeunique=true de-duplicates them by adding a _1 suffix to the column name from the right data frame, as explained in the docstring of the leftjoin function.

Also, could I know the mechanism underlying df_full.val .= coalesce.(df_full.val, df_full.val_1)

This operation, mostly, has nothing to do with DataFrames.jl. It is standard Julia Base broadcasting syntax as explained here in the Julia Manual, and the coalesce function is documented here in the Julia Manual. Additionally, broadcasting rules specific to DataFrames.jl are documented here.

@hongtaoh

Thank you so much @bkamins for this quick response! Even though the df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true) operation is not desirable, we have to run it before running df_full.val .= coalesce.(df_full.val, df_full.val_1). Otherwise we won't be able to get df_full in the first place.

Thank you for pointing me to the documentation for the dot syntax, coalesce and broadcasting. It helped me a lot!

@bkamins bkamins added this to the 1.x milestone Aug 25, 2021
@bkamins bkamins modified the milestones: 1.x, 1.5 Dec 11, 2022
@bkamins
Member

bkamins commented Dec 11, 2022

OK. I think we are ready to add it in the DataFrames.jl 1.5 release.

I was thinking about the signature and came up with (in-place version):

update!(df1, df2, cols...; on, matchmissing=:error, rule=:all)

(also a copying update version would be added; also maybe the order of df1 and df2 should be reversed? - more on it below)

The update! function, like leftjoin!, would require that df2 has unique matching rows (so if duplicates in matching rows are present, an error is thrown).

The meaning of the arguments is:

  • cols positional arguments specifying columns to be updated; if multiple cols arguments are passed they are processed sequentially left to right. Any cols argument can be:
    • any column selector (single column, vector, Cols, Not, ...) that will be applied to df2 table; then columns from df1 with the same name will be updated (if columns are missing in df1 they are ignored)
    • a pair df2_col => df1_col indicating which column from df2 should be used to update which column in df1 (again - maybe the order should be reversed; in this case both selectors must be single-column selectors); if df1_col is missing, an error is thrown
  • on, matchmissing: as in other joins
  • rule defining how to handle the update. The options would be:
    • a function taking two arguments, the value from df2 and the value from df1, and returning the desired value (the first passed value will be from df2 and the second from df1, as this is the direction of updating - but we might want to reverse this order)
    • :all - update all matching rows in df1 with rows from df2, equivalent to (x, y) -> x
    • :missing - only update rows in df1 that have a missing value with a value from df2, equivalent to passing coalesce function
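
Hypothetical calls following this proposed signature (nothing here exists yet; :id, :val, :val1 and :val2 are placeholders):

update!(df1, df2, :val; on = :id)                         # rule = :all - overwrite all matching rows
update!(df1, df2, :val; on = :id, rule = :missing)        # fill only the missings in df1.val
update!(df1, df2, :val2 => :val1; on = :id,
        rule = (new, old) -> ismissing(old) ? new : old)  # custom rule; the first argument comes from df2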

Now the things to discuss are:

  • in three places the order of df1 and df2 matters: first, the update! call (whether the updated table should go first or second); then the syntax in cols for column-name matching; finally, the rule function also depends on the order of arguments. I was not sure which order to use
  • are we OK to error if duplicate keys in df2 are found (I find it natural, but maybe there are cases when this is not desirable)
  • I proposed to ignore the situation when columns in df2 that the user wants to use are missing from df1, but to throw an error if the pair notation is used. An alternative behavior could be an extra kwarg with three options: 1) error, 2) ignore, 3) add a new column to df1 if it is missing (and assume it initially had all missing values)

CC @nalimilan

@bkamins
Member

bkamins commented Dec 27, 2022

@jariji - this is another important PR where API design is a key challenge.

@bkamins bkamins modified the milestones: 1.5, 1.6 Feb 5, 2023
@alfaromartino
Contributor

alfaromartino commented Feb 21, 2023

There's an operation I perform a lot. I'm adding it here because I think it's related. Maybe this is worth considering now that the API is under development. Alternatively, it'd be nice to consider this functionality separately.

Basically, the type of operation is as follows:

gdf = groupby(df[condition,:], groups)

result = combine(gdf, col => func => pre_alloc)
leftjoin!(df, result, on=groups)

A MWE

dff = DataFrame(industry = ["beer", "beer", "wine", "wine"], 
                is_public =[true,false, true,false], 
                firm_revenue = collect(10:10:40))

4×3 DataFrame
 Row │ industry  is_public  firm_revenue 
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
   2 │ beer          false            20
   3 │ wine           true            30
   4 │ wine          false            40


gdf = groupby(dff[dff.is_public .== true,:], :industry)
result = combine(gdf, :firm_revenue => sum => :revenue_of_all_public_firms)

2×2 DataFrame
 Row │ industry  revenue_of_all_public_firms 
     │ String    Int64
─────┼───────────────────────────────────────
   1 │ beer                               10
   2 │ wine                               30

and then I "update" the original data frame:

leftjoin!(dff, result, on=:industry)
4×4 DataFrame
 Row │ industry  is_public  firm_revenue  revenue_of_all_public_firms 
     │ String    Bool       Int64         Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
   1 │ beer           true            10                           10
   2 │ beer          false            20                           10
   3 │ wine           true            30                           30
   4 │ wine          false            40                           30

I guess update could potentially handle this operation more efficiently than leftjoin?
I was thinking of updating an empty column with the results of combine. Notice that here, unlike the case of updating a data frame using a separate data frame, we'd update values with results that come from the parent DataFrame. Usually leftjoin! is quite expensive in my work, but I don't know if there's room for improvement for this particular application.

Thanks!!!

@bkamins
Member

bkamins commented Feb 21, 2023

Currently you can almost do what you want efficiently with:

julia> dff = DataFrame(industry = ["beer", "beer", "wine", "wine"],
                       is_public =[true,false, true,false],
                       firm_revenue = collect(10:10:40))
4×3 DataFrame
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
   2 │ beer          false            20
   3 │ wine           true            30
   4 │ wine          false            40

julia> gdf = groupby(subset(dff, :is_public, view=true), :industry)
GroupedDataFrame with 2 groups based on key: industry
First Group (1 row): industry = "beer"
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
⋮
Last Group (1 row): industry = "wine"
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ wine           true            30

julia> transform!(gdf, :firm_revenue => sum => :revenue_of_all_public_firms)
2×4 SubDataFrame
 Row │ industry  is_public  firm_revenue  revenue_of_all_public_firms
     │ String    Bool       Int64         Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
   1 │ beer           true            10                           10
   2 │ wine           true            30                           30

julia> dff
4×4 DataFrame
 Row │ industry  is_public  firm_revenue  revenue_of_all_public_firms
     │ String    Bool       Int64         Union{Missing, Int64}
─────┼────────────────────────────────────────────────────────────────
   1 │ beer           true            10                           10
   2 │ beer          false            20                      missing
   3 │ wine           true            30                           30
   4 │ wine          false            40                      missing

almost - because, as you can see, we will have missing in the rows that do not match the selection (this can be fixed in post-processing currently).

Another approach, which is less general, is:

julia> dff = DataFrame(industry = ["beer", "beer", "wine", "wine"],
                       is_public =[true,false, true,false],
                       firm_revenue = collect(10:10:40))
4×3 DataFrame
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
   2 │ beer          false            20
   3 │ wine           true            30
   4 │ wine          false            40

julia> gdf = groupby(dff, :industry)
GroupedDataFrame with 2 groups based on key: industry
First Group (2 rows): industry = "beer"
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ beer           true            10
   2 │ beer          false            20
⋮
Last Group (2 rows): industry = "wine"
 Row │ industry  is_public  firm_revenue
     │ String    Bool       Int64
─────┼───────────────────────────────────
   1 │ wine           true            30
   2 │ wine          false            40

julia> transform!(gdf, [:firm_revenue, :is_public] => ((a, b) -> sum(a[b])) => :revenue_of_all_public_firms) # this would be shorter with DataFramesMeta.jl
4×4 DataFrame
 Row │ industry  is_public  firm_revenue  revenue_of_all_public_firms
     │ String    Bool       Int64         Int64
─────┼────────────────────────────────────────────────────────────────
   1 │ beer           true            10                           10
   2 │ beer          false            20                           10
   3 │ wine           true            30                           30
   4 │ wine          false            40                           30

@alfaromartino
Contributor

Yes, that was my point. I use combine + leftjoin! to avoid having missing values, and because methods like the second example are less general and only applicable to specific cases. My solution is simple in terms of syntax, but really slow. That's why I thought that update could handle this.

I was thinking along the lines of using:

  • create view of df with some condition
  • create gdf with groups
  • use combine and get gdf_combine
  • update(parent_df, gdf_combine, on=groups).
    It'd be like filling the parent df for each combination of groups.

These operations are quite common I think. Basically, you put all industry information in columns to compute each firm's result relative to industry measures (e.g., a market share, although computing market shares can obviously be done in a simpler way than all this).

Maybe update needs to handle more general cases, and this should be done through a separate function. Just wanted to mention this, so you can keep it in mind.

Thanks!!!

@pdeffebach
Contributor

@alfaromartino I still don't understand: why would you not do transform with a data set grouped by industry?

@bkamins
Member

bkamins commented Feb 21, 2023

Because @alfaromartino wants to perform some operation on a filtered data frame and the approach like:

transform!(gdf, [:firm_revenue, :is_public] => ((a, b) -> sum(a[b])) => :revenue_of_all_public_firms)

is not wanted (this is my understanding)

@pdeffebach
Contributor

Ah I see. This would give like 0 for non-public firms, or potentially throw an error.

@bkamins
Member

bkamins commented Feb 21, 2023

give like 0 for non-public firms, or potentially throw an error.

or - in other words, leftjoin! handles empty groups cleanly, while in the above code you would have to handle this manually.

@pdeffebach
Contributor

Yes. Hopefully in DataFramesMeta.jl we can make this solution a bit better with

@transform gd @when(:is_public) @transform :revenue_of_all_firms = sum(:firm_revenue)

@bkamins bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023
@leei

leei commented Jul 31, 2023

To add to the ongoing discussion here, I have a non-breaking patch that replaces makeunique=true throughout with a dupcol keyword that has three possible values:

  • :error which raises an error if the DataFrame constructor or join has duplicate column names (the default),
  • :makeunique which replicates the functionality of makeunique=true, and
  • :update which consolidates the duplicate columns with non-missing values from the right-hand columns.

It also marks the use of makeunique=true as deprecated in favour of this newer model.

N.B. This approach is potentially extensible to other actions to combine duplicate columns e.g. :overwrite or :ignore.

PR here #3366
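
A hypothetical usage on the example from the top of this issue (dupcol exists only in that PR, not in a released DataFrames.jl):

df_full = leftjoin(df_missing, df_completion, on = :id, dupcol = :update)
# the duplicate :val columns would be consolidated, taking non-missing values from the right-hand side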
