Pandas Column Typing #143

Open · gdj0nes opened this issue Sep 18, 2017 · 11 comments


@gdj0nes

gdj0nes commented Sep 18, 2017

Overview: Add functionality to get column types (including dates) and sortkeys from Redshift metadata for pandas IO.

I modified the client to allow me to pass arguments to pd.read_csv on line 227 in civis/io/_tables.py. I gather the column types using client.tables.get, after which I map the SQL types to their corresponding Python types. Doing this prevents the automatic str-to-int conversion (e.g. for zipcodes) and automatically identifies the date columns. Is there any interest in adding this in a more thoughtful way to the code?
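
For illustration, a minimal example of the str-to-int problem this avoids (the column name is made up):

    import io
    import pandas as pd

    csv = io.StringIO("zipcode\n02139\n10001")
    # Without type information, pandas infers int64 and drops the leading zero
    print(pd.read_csv(csv)['zipcode'].tolist())  # [2139, 10001]

    csv.seek(0)
    # With the type taken from the Redshift metadata, it stays a string
    print(pd.read_csv(csv, dtype={'zipcode': str})['zipcode'].tolist())  # ['02139', '10001']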

@beckermr
Contributor

Yes, this is a good idea! I am not sure who has time to get to it, but if you want to make a PR, feel free!

@stephen-hoover

+1! You can tag me for review. I've dealt with similar issues in my own code.

@gdj0nes
Author

gdj0nes commented Sep 18, 2017

## Found Answer

I have the time right now, but I'm not super familiar with the code. What's the fastest way to get the table metadata given the database and table names, since those seem to be readily available in the existing code? I currently do the following in my code:

1. Get the table in the database by searching over the tables:

       client.tables.list(database_id=DATABASE_META['id'], limit=1000)

2. Use the table id to get the column data:

       table_cols = client.tables.get(table['id'])['columns']

@beckermr
Contributor

Hmmm. Those calls might require the table scanner. You could query the database for the info directly.
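
For example (a sketch using Redshift's pg_table_def system view, which only lists tables on the connection's search_path; schema and table names here are placeholders):

    import civis

    # Sketch: pull column names and types straight from Redshift instead
    # of the Civis tables endpoint.
    sql = """
        SELECT "column", type
        FROM pg_table_def
        WHERE schemaname = 'myschema' AND tablename = 'mytable'
    """
    type_info = civis.io.read_civis_sql(sql, 'database', use_pandas=True)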

@stephen-hoover

The types will be populated without a full table scanner run. You can use something like

    db_id = client.get_database_id(database)
    tables = client.tables.list(database_id=db_id, schema=schema, name=table)
    columns = client.tables.get(tables[0].id).columns
    dtypes = {col.name: redshift_to_py_type(col.sql_type)
              for col in columns}
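
Note: `redshift_to_py_type` isn't an existing helper in the client; a minimal sketch of what it could look like, with an illustrative (not exhaustive) type map:

    # Hypothetical helper: map a Redshift SQL type name to a Python type.
    _SQL_TO_PY = {
        'smallint': int, 'integer': int, 'bigint': int,
        'real': float, 'double precision': float,
        'numeric': float, 'decimal': float,
        'boolean': bool,
        'char': str, 'varchar': str, 'character varying': str,
    }

    def redshift_to_py_type(sql_type):
        # Drop any length/precision suffix, e.g. "varchar(256)" -> "varchar"
        base = sql_type.split('(')[0].strip().lower()
        return _SQL_TO_PY.get(base, str)  # fall back to str for unknown types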

@gdj0nes
Author

gdj0nes commented Sep 18, 2017

Here is the function I plan on using:

from civis import APIClient  # get_table_id is a method on APIClient

@gdj0nes
Author

gdj0nes commented Sep 18, 2017

Initial code, where table_id is passed from read_civis into read_civis_sql:

    if use_pandas:
        table_meta = client.tables.get(table_id)
        # The split is to handle e.g. DECIMAL(17,9)
        redshift_types = {col['name']: col['sql_type'].split('(')[0]
                          for col in table_meta['columns']}
        py_types = {name: SQL_PANDAS_MAP[type_]
                    for name, type_ in redshift_types.items()
                    if type_ not in DATE_TYPES}
        date_cols = [name for name, type_ in redshift_types.items()
                     if type_ in DATE_TYPES]

        sortkeys = table_meta['sortkeys'].split(',')
        # Not sure about the best way to handle null fields
        # or how they may appear
        data = pd.read_csv(url, parse_dates=date_cols,
                           index_col=sortkeys,
                           converters=py_types)
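
On the null-field question in the comment above, one option is a converter factory that tolerates empty fields (a sketch, not a settled design):

    # Converters receive raw strings, and e.g. int('') raises ValueError,
    # so wrap each Python type in a null-tolerant callable.
    def null_safe(py_type):
        def convert(value):
            return py_type(value) if value != '' else None
        return convert

    py_types = {name: null_safe(type_) for name, type_ in py_types.items()}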

@beckermr
Contributor

Cool. Go ahead and make a branch and start the PR. We can do actual code review there.

@keithing
Contributor

This would be a major improvement. The difficulty I've had with this before is that there doesn't seem to be a good way to get the type information from a general query passed into read_civis_sql. If someone has a suggestion, I would love to hear it, but I think we could only provide types for read_civis.

That said, it would still be useful to implement this for just read_civis. I could imagine somewhere around here adding the code @stephen-hoover wrote above. Then we could pass dtypes to read_civis_sql here, since the kwargs in read_civis_sql ultimately get passed to pandas.read_csv.
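
Concretely, the resulting call might look roughly like this (a sketch; dtypes and date_cols would come from a metadata lookup like the one @stephen-hoover showed above):

    import civis

    # dtype and parse_dates ride along in read_civis_sql's kwargs and
    # end up in pandas.read_csv, per the comment above.
    df = civis.io.read_civis_sql(
        'SELECT * FROM myschema.mytable',  # illustrative query
        'database',
        use_pandas=True,
        dtype=dtypes,
        parse_dates=date_cols,
    )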

@keithing
Contributor

Oof, we might have another problem. read_civis accepts columns, which could be arbitrary SQL. So if someone were to call

    columns = ['my_var = 1']
    read_civis('table', 'database', columns=columns)

we simply wouldn't have the type information.

@stephen-hoover

Hm. Good point. pandas.read_csv accepts a dtype argument, which is how we'd implement this. The dtype doesn't have to include all columns, and it can include columns which aren't present. I think the only time we'd run into trouble would be if users tried to export columns with the same name as a column in the actual table, but with a different type. I suppose that could happen, couldn't it?
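
A quick illustration of that pandas behavior:

    import io
    import pandas as pd

    csv = io.StringIO("a,b\n1,x\n2,y")
    # dtype may cover only some columns, and keys that aren't present
    # in this particular export are simply ignored.
    df = pd.read_csv(csv, dtype={'a': float, 'not_in_export': str})
    print(df.dtypes)  # a -> float64, b -> object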

I'm not sure what the right answer is. We could give a boolean switch to enable type inference when reading with pandas. That would let users at least turn it off if they're transforming types. Or we could provide examples of how to do this yourself, or helper functions.
