Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broadcast erroneously claims cast_on and onto_on are equivalents of left_on and right_on parameters of pandas.merge. #37

Open
toliwaga opened this issue Apr 26, 2018 · 5 comments

Comments

@toliwaga
Copy link

toliwaga commented Apr 26, 2018

They used to be, since there were passed through as of at least orca 1.1.

But now, they are assumed to be type str (or None) whereas pandas merge accepts label or list, or array-like.

So the following, which worked under orca 1.1, no longer works:

orca.broadcast(
               cast='persons',
               onto='trips',
               cast_on=['hh_id', 'person_num'],
               onto_on=['hh_id', 'person_num'])

At least the documentation should be revise, whcih currently reads:

def broadcast(cast, onto, cast_on=None, onto_on=None,
              cast_index=False, onto_index=False):
    """
    Register a rule for merging two tables by broadcasting one onto
    the other.

    Parameters
    ----------
    cast, onto : str
        Names of registered tables.
    cast_on, onto_on : str, optional
        Column names used for merge, equivalent of ``left_on``/``right_on``
        parameters of pandas.merge.
    """
@jiffyclub
Copy link
Member

Are you getting a specific error?

The cast_on and onto_on values are passed straight through to pandas.merge here:

onto_table = pd.merge(
    onto_table, cast_table,
    suffixes=['_'+onto, '_'+cast],
    left_on=bc.onto_on, right_on=bc.cast_on,
    left_index=bc.onto_index, right_index=bc.cast_index)

However, using sequences may break other parts of the code, for example this looks like it might not properly handle lists of column names.

@toliwaga
Copy link
Author

The problem is on line 1808. I don't recall the exact error, but this is where things go wrong:

                # intersection is ok if it's the join key
                intersection.discard(bc.onto_on)
                intersection.discard(bc.cast_on)

The comment for broadcast pretty much says it all:

    cast_on, onto_on : str, optional
        Column names used for merge, equivalent of ``left_on``/``right_on``
        parameters of pandas.merge.

It wants expects that cast_on and onto_on are strings, but pandas.merge also accepts lists.

If you want the exact error, I can easily generate it.

@toliwaga
Copy link
Author

The error is:

  File ".../orca/orca.py", line 1803, in merge_tables
    intersection.discard(bc.onto_on)
TypeError: unhashable type: 'list'

generated by passing lists of strs to broadcast:

orca.broadcast(cast='persons_merged',
               onto='disaggregate_trips',
               cast_on=['hh_id', 'person_idx'],
               onto_on=['hh_id', 'person_idx'])

@jiffyclub
Copy link
Member

Thanks, that's helpful. Looks like that code was added in e52fe6c as part of #20. That and the code I mentioned above both look like they wouldn't properly handle lists as cast_on/onto_on arguments.

@toliwaga
Copy link
Author

Yes.

@smmaurer smmaurer mentioned this issue Aug 17, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants