Pandas Column Typing #143

Open · gdj0nes opened this issue Sep 18, 2017 · 11 comments


@gdj0nes

gdj0nes commented Sep 18, 2017

Overview: Add functionality to get column types (including dates) and sortkeys from Redshift metadata for pandas IO.

I modified the client to allow me to pass arguments to pd.read_csv on line 227 in civis/io/_tables.py. I gather the column types using client.tables.get, after which I map the SQL types to their corresponding Python types. Doing this prevents the automatic str-to-int conversion (e.g. for zipcodes) and automatically identifies the date columns. Is there any interest in adding this in a more thoughtful way to the code?
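
For illustration, a minimal example of the str-to-int problem this avoids (the column name is made up):

    import io
    import pandas as pd

    csv = io.StringIO("zipcode\n02139\n10001")
    # Without type information, pandas infers int64 and drops the leading zero
    print(pd.read_csv(csv)['zipcode'].tolist())  # [2139, 10001]

    csv.seek(0)
    # With the type taken from the Redshift metadata, it stays a string
    print(pd.read_csv(csv, dtype={'zipcode': str})['zipcode'].tolist())  # ['02139', '10001']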

@beckermr
Contributor

Yes, this is a good idea! I am not sure who has time to get to it, but if you want to make a PR, feel free!

@stephen-hoover

+1! You can tag me for review. I've dealt with similar issues in my own code.

@gdj0nes
Author

gdj0nes commented Sep 18, 2017

## Found Answer

I have the time right now, but I'm not super familiar with the code. What's the fastest way to get the table metadata given the database and table names, since those seem to be readily available in the existing code? I currently do the following in my code:

1. Get the table in the database by searching over the tables:

       client.tables.list(database_id=DATABASE_META['id'], limit=1000)

2. Use the table id to get the column data:

       table_cols = client.tables.get(table['id'])['columns']

@beckermr
Contributor

Hmmm. Those calls might require the table scanner. You could query the database for the info directly.
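
For example (a sketch using Redshift's pg_table_def system view, which only lists tables on the connection's search_path; schema and table names here are placeholders):

    import civis

    # Sketch: pull column names and types straight from Redshift instead
    # of the Civis tables endpoint.
    sql = """
        SELECT "column", type
        FROM pg_table_def
        WHERE schemaname = 'myschema' AND tablename = 'mytable'
    """
    type_info = civis.io.read_civis_sql(sql, 'database', use_pandas=True)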

@stephen-hoover

The types will be populated without a full table scanner run. You can use something like

    db_id = client.get_database_id(database)
    tables = client.tables.list(database_id=db_id, schema=schema, name=table)
    columns = client.tables.get(tables[0].id).columns
    dtypes = {col.name: redshift_to_py_type(col.sql_type)
              for col in columns}
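
Note: `redshift_to_py_type` isn't an existing helper in the client; a minimal sketch of what it could look like, with an illustrative (not exhaustive) type map:

    # Hypothetical helper: map a Redshift SQL type name to a Python type.
    _SQL_TO_PY = {
        'smallint': int, 'integer': int, 'bigint': int,
        'real': float, 'double precision': float,
        'numeric': float, 'decimal': float,
        'boolean': bool,
        'char': str, 'varchar': str, 'character varying': str,
    }

    def redshift_to_py_type(sql_type):
        # Drop any length/precision suffix, e.g. "varchar(256)" -> "varchar"
        base = sql_type.split('(')[0].strip().lower()
        return _SQL_TO_PY.get(base, str)  # fall back to str for unknown types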

@gdj0nes
Author

gdj0nes commented Sep 18, 2017

Here is the function I plan on using:

from civis import APIClient  # get_table_id is a method on APIClient

@gdj0nes
Author

gdj0nes commented Sep 18, 2017

Initial code, where table_id is passed from read_civis into read_civis_sql:

    if use_pandas:
        table_meta = client.tables.get(table_id)
        # The split is to handle e.g. DECIMAL(17,9)
        redshift_types = {col['name']: col['sql_type'].split('(')[0]
                          for col in table_meta['columns']}
        py_types = {name: SQL_PANDAS_MAP[type_]
                    for name, type_ in redshift_types.items()
                    if type_ not in DATE_TYPES}
        date_cols = [name for name, type_ in redshift_types.items()
                     if type_ in DATE_TYPES]

        sortkeys = table_meta['sortkeys'].split(',')
        # Not sure about the best way to handle null fields
        # or how they may appear
        data = pd.read_csv(url, parse_dates=date_cols,
                           index_col=sortkeys,
                           converters=py_types)
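
On the null-field question in the comment above, one option is a converter factory that tolerates empty fields (a sketch, not a settled design):

    # Converters receive raw strings, and e.g. int('') raises ValueError,
    # so wrap each Python type in a null-tolerant callable.
    def null_safe(py_type):
        def convert(value):
            return py_type(value) if value != '' else None
        return convert

    py_types = {name: null_safe(type_) for name, type_ in py_types.items()}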

@beckermr
Contributor

Cool. Go ahead and make a branch and start the PR. We can do actual code review there.

@keithing
Contributor

This would be a major improvement. The difficulty I've had with this before is that there doesn't seem to be a good way to get the type information from a general query passed into read_civis_sql. If someone has a suggestion, I would love to hear it, but I think we could only provide types for read_civis.

That said, it would still be useful to implement this for just read_civis. I could imagine somewhere around here adding the code @stephen-hoover wrote above. Then we could pass dtypes to read_civis_sql here, since the kwargs in read_civis_sql ultimately get passed to pandas.read_csv.
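
Concretely, the resulting call might look roughly like this (a sketch; dtypes and date_cols would come from a metadata lookup like the one @stephen-hoover showed above):

    import civis

    # dtype and parse_dates ride along in read_civis_sql's kwargs and
    # end up in pandas.read_csv, per the comment above.
    df = civis.io.read_civis_sql(
        'SELECT * FROM myschema.mytable',  # illustrative query
        'database',
        use_pandas=True,
        dtype=dtypes,
        parse_dates=date_cols,
    )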

@keithing
Contributor

Oof, we might have another problem. read_civis accepts columns, which could be arbitrary SQL. So if someone were to call

    columns = ['my_var = 1']
    read_civis('table', 'database', columns=columns)

we simply wouldn't have the type information.

@stephen-hoover

Hm. Good point. pandas.read_csv accepts a dtype argument, which is how we'd implement this. The dtype doesn't have to include all columns, and it can include columns which aren't present. I think the only time we'd run into trouble would be if users tried to export columns with the same name as a column in the actual table, but with a different type. I suppose that could happen, couldn't it?
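
A quick illustration of that pandas behavior:

    import io
    import pandas as pd

    csv = io.StringIO("a,b\n1,x\n2,y")
    # dtype may cover only some columns, and keys that aren't present
    # in this particular export are simply ignored.
    df = pd.read_csv(csv, dtype={'a': float, 'not_in_export': str})
    print(df.dtypes)  # a -> float64, b -> object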

I'm not sure what the right answer is. We could give a boolean switch to enable type inference when reading with pandas. That would let users at least turn it off if they're transforming types. Or we could provide examples of how to do this yourself, or helper functions.
