fix: set columns numeric datatypes when exporting to excel #27229
Conversation
Approving CI run 🤞
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
## master #27229 +/- ##
===========================================
+ Coverage 60.48% 71.70% +11.21%
===========================================
Files 1931 1928 -3
Lines 76236 81499 +5263
Branches 8568 8324 -244
===========================================
+ Hits 46114 58439 +12325
+ Misses 28017 21007 -7010
+ Partials 2105 2053 -52

☔ View full report in Codecov by Sentry.
Looks like you need to run the pre-commit hook for formatting/linting.
Thank you for the fix @squalou. Could you add a test for it to avoid future regressions?
Yeah, sorry about that. Should be fixed now; my IDE for this project is not on "auto format", I totally forgot.
I'd love to! Really. But I admit I'm a bit lost in the Superset architecture and test framework, and I'm not a Python pro dev to begin with, which does not help. I've tried to guess where 'export csv' might be tested and enhance from there, but did not find anything. Got myself lost in viz_tests.py and didn't go any further. Any help / pointer on where to begin would be appreciated.
You can create a test in tests/unit_tests/common, similar to test_get_aggregated_join_column.py.
Thank you! Will have a look ASAP (on vacation break this week :D ).
[EDIT] Simply writing things down here helped me realize I should break the tested function in two, and only test for the changes really introduced. I submitted "something". Now... I do not test in these tests that the Excel export really works (it does, see further), and that's still a shame, maybe? Leaving the "excel export" testability issues I encountered below.

(original post below the line)

Hi, I managed to run tests locally and I'm trying to add one... but I have (at least) two issues. You'll see from these that I'm full of good will, but a desperate Python newbie :)

First: BaseDatasource. Following the test example, I initiate this:

query_context_processor = QueryContextProcessor(
    QueryContext(
        datasource=BaseDatasource(),
        queries=[],
        result_type=ChartDataResultType.COLUMNS,
        form_data={},
        slice_=None,
        result_format=ChartDataResultFormat.XLSX,
        cache_values={},
    )
)

But then, in context_query_processor.py, the get_data() method makes a call that ends up calling "uid", which finally raises a lovely NotImplementedError. Seeing the code, unless there's an object overriding the type method somewhere, I have no idea what to do to avoid this:

@property
def type(self) -> str:
    raise NotImplementedError()

@property
def uid(self) -> str:
    """Unique id across datasource types"""
    return f"{self.id}__{self.type}"
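A minimal sketch of one way around this NotImplementedError (the class names below are simplified stand-ins for illustration, not Superset's real classes): subclass the datasource in the test and override the abstract `type` property so that `uid` can resolve.

```python
# Simplified stand-ins for illustration; not Superset's actual classes.
class BaseDatasource:
    id = 1

    @property
    def type(self) -> str:
        raise NotImplementedError()

    @property
    def uid(self) -> str:
        """Unique id across datasource types"""
        return f"{self.id}__{self.type}"


class FakeDatasource(BaseDatasource):
    # Test-only subclass: overriding `type` lets `uid` resolve
    # without touching the production code.
    @property
    def type(self) -> str:
        return "table"


print(FakeDatasource().uid)  # -> 1__table
```

(`unittest.mock.PropertyMock` is another common way to stub a property in tests, if subclassing feels too heavy.)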
So, in order to go on a bit, I locally put a try/catch around the verbose_map thing, which may be a solution eventually? Which leads to my even bigger issue...

Second: how to check the result. The method I must test returns this:

result = excel.df_to_excel(ndf, **config["EXCEL_EXPORT"])
return result or ""

which is not very testable. A log output gave me, as expected, binary content, so, yeah, the test runs. Is there a way to "mock" or "override" the Excel conversion? Or else, is it advisable to somehow refactor the method and extract a more testable one before the call to the Excel conversion? Thanks for any advice! Regards
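For asserting on the binary result without mocking, one cheap smoke test (a sketch independent of Superset's API) exploits the fact that an .xlsx file is a ZIP archive: the bytes should open as a ZIP and contain the workbook entry.

```python
import io
import zipfile


def looks_like_xlsx(blob: bytes) -> bool:
    # An .xlsx file is a ZIP archive; a cheap smoke test is to confirm
    # the bytes open as a ZIP and contain the workbook entry.
    try:
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            return "xl/workbook.xml" in zf.namelist()
    except zipfile.BadZipFile:
        return False


# Build a fake "xlsx" blob for illustration (real bytes would come
# from df_to_excel).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")

print(looks_like_xlsx(buf.getvalue()))      # True
print(looks_like_xlsx(b"not an excel file"))  # False
```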
Force-pushed from 0aa2191 to 2cf4443.
(damn, import sort was not set, fixed and re-pushed)
@@ -571,11 +574,22 @@ def get_data(self, df: pd.DataFrame) -> str | list[dict[str, Any]]:
            df, index=include_index, **config["CSV_EXPORT"]
        )
    elif self._query_context.result_format == ChartDataResultFormat.XLSX:
-       result = excel.df_to_excel(df, **config["EXCEL_EXPORT"])
+       ndf = self.copy_df_for_excel_export(df, coltypes)
@squalou Should this change actually be implemented in superset.utils.excel.df_to_excel, to ensure consistent behavior? You could move your tests to tests/unit_tests/utils/excel_tests. We already have one there called test_timezone_conversion, which is a similar type of conversion.
Makes a lot of sense, sure. What would be better:
add "coltypes" as an optional parameter to the excel.df_to_excel(...) method?
Or move the method as-is into the superset.utils.excel package?
The first option seems better, but I'm not at ease with changing method signatures there without advice; I really have no global overview of what may call what, and the consequences.
EDIT: from a quick check, the method is called in one place and one test, so... let's go.
I checked that coltypes are extracted from the DataFrame here.
superset/utils/excel.py (Outdated)
    output = io.BytesIO()

    # timezones are not supported
    for column in df.select_dtypes(include=["datetimetz"]).columns:
        df[column] = df[column].astype(str)

+   ndf = copy_df_with_datatype_conversion(df, coltypes)
@squalou Can the column types be extracted from the DataFrame, like we are doing for timezones:

for column in df.select_dtypes(include=["datetimetz"]).columns:
    df[column] = df[column].astype(str)

or are they not matching what's defined in coltypes?
Thing is: you could have data with leading zeros, for instance, defined as string in the original datasource, and you want them as string in the end, or else Excel would omit leading zeros, maybe round things, whatever.
I don't know much about the DataFrame object, and the whole issue I'm trying to solve is to get the datatype as it was defined as the column type to begin with.
If you know of a way to get this data from the DataFrame itself, it sure would be MUCH more elegant than forwarding the column_types all the way down to the Excel export.
To be honest, I discovered pandas objects only to make this patch... so if you think this metadata can be stored or handled differently, I'll be glad to learn.
I also did not want to take any risk of changing the original DataFrame object, as I have no idea of the global Superset architecture.
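A small illustration of why the inferred DataFrame dtype alone is not enough (hypothetical data, assuming pandas' default inference): a zero-padded code and a decimal number both arrive as plain strings, and only the source column types say which one should become numeric.

```python
import pandas as pd

# Both columns arrive as strings ("object" dtype); from the DataFrame
# alone there is no way to tell a zero-padded code from a number.
df = pd.DataFrame({"code": ["0042", "0007"], "value": ["1.5", "2.0"]})
print(df.dtypes)  # both columns: object

# Convert only the column the source metadata marks as numeric.
df["value"] = pd.to_numeric(df["value"])

print(df["code"].iloc[0])  # 0042 -- leading zeros preserved
print(df["value"].sum())   # 3.5
```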
I almost forgot the most important part: I have absolutely no idea how to know what dtypes are 'stored' in the DataFrame at the beginning :) so I used the 'coltypes' that seemed to be there for a reason :)
def test_timezone_conversion() -> None:
    """
    Test that columns with timezones are converted to a string.
    """
    df = pd.DataFrame({"dt": [datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)]})
-   contents = df_to_excel(df)
+   contents = df_to_excel(df, [])
-   contents = df_to_excel(df, [])
+   contents = df_to_excel(df, [GenericDataType.TEMPORAL])
superset/utils/excel.py (Outdated)
    return output.getvalue()


def convert_df_with_datatypes(
I believe we could make it more explicit that the DataFrame can be modified, and avoid the copy overhead. To do that, we need to separate column conversion from exporting. Something like:

def df_to_excel(df: pd.DataFrame, **kwargs: Any) -> bytes:
    output = io.BytesIO()
    # pylint: disable=abstract-class-instantiated
    with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
        df.to_excel(writer, **kwargs)
    return output.getvalue()


def apply_column_types(
    df: pd.DataFrame, column_types: list[GenericDataType]
) -> pd.DataFrame:
    for column, column_type in zip(df.columns, column_types):
        if column_type == GenericDataType.NUMERIC:
            df[column] = pd.to_numeric(df[column])
        elif pd.api.types.is_datetime64tz_dtype(df[column]):
            # timezones are not supported
            df[column] = df[column].astype(str)
        else:
            df[column] = df[column].astype(column_type)
    return df
I'm probably over-conservative, but I have one remaining concern:

else:
    df[column] = df[column].astype(column_type)

... again, I'm a bit scared of the effect on the Excel export side, because I don't know how it works. I'd rather "leave untouched" the data that is not to be converted, but here again you probably know better.
Addendum: a quick test regarding the "else" conversion; it fails with this error:

TypeError: Cannot interpret '<GenericDataType.STRING: 1>' as a data type
pd.to_numeric may raise an exception; per the documentation, the "ignore" errors option is deprecated, and it's recommended to catch exceptions instead.
Good point. Here we have two options:
- Catch and convert to a default. We also need to alert the user that some values were modified when exporting.
- Don't catch, and show a good message clearly indicating the column/row that is preventing the export.
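Both options can be sketched with pandas directly (illustrative data, not the PR's final code):

```python
import pandas as pd

df = pd.DataFrame({"amount": ["1.5", "2.25", "oops"]})

# Option 1: coerce unparseable values to NaN (a "default"); the export
# succeeds, but the user should be warned that values were modified.
coerced = pd.to_numeric(df["amount"], errors="coerce")
print(coerced.isna().sum())  # 1 value could not be converted

# Option 2: fail loudly with a message naming the offending column.
try:
    pd.to_numeric(df["amount"])
except (ValueError, TypeError) as exc:
    print(f"Column 'amount' blocks the export: {exc}")
```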
> else: df[column] = df[column].astype(column_type) ... again I'm a bit scared of the effect on the excel export side ... I'd rather "leave untouched" data not to be converted ...

I'm not sure what you mean by untouched data, but I'm assuming that when exporting, users want the data to comply with what is defined as the column type. That was your original problem for numeric types, so I'm assuming that other non-numeric columns not matching their defined types should also be converted.
Let me ask for @betodealmeida's and @eschutho's input here, as they have more context on this feature and might help us.
As an end user, I much prefer having an xlsx file with data I have to manually reformat than having nothing at all.

> Catch and convert to a default. We also need to alert the user that some values were modified when exporting.

Probably the same as the "dates with timezone" fix: df[column].astype(str). Alerting the user is far beyond anything I would be able to do, and again, as an end user, I think I wouldn't care. Nothing is said about timezone conversion either, and I bet nobody cares.

> Don't catch and show a good message clearly indicating the column/row that is preventing the export.

As said, as an end user I would rather have a not-perfectly-formatted file with all the data in it than no data at all.

About "untouched fields": I don't really remember all the details (because it dates a bit, and it's not my passion, tbh :) ), but at the time I manually opened the produced xlsx files, meaning: unzip them, look at the xml-ish data inside, and try to understand the differences between converted and unconverted data. From there, open them in LibreOffice or MS Office, observe the behaviors, and try to find the best.
Numeric values are stored "in the cells", while others have references in the cell and values mapped somewhere else. Or maybe it's the other way around; I can't even remember which is which. Anyway, there's a clear separation between numeric and 'other'. Then, inside the 'other' category, of course there may be further differences.
From my understanding, then, the best thing one can do to expect any office software to deal with these files is to make sure numerics are numerics; anything else doesn't matter that much, and autodetection of strings works. That's probably why df[column] = df[column].astype(str) in the date conversion has no real bad effect on usability.
That's all I know, which is not much for such a long text :)
Damn, I still have a minor pylint issue ("except ... as e" as a bad habit, moreover e is not used :) ). I have an issue, though, with an error in a file I did not change.
That's weird. I rebased just in case.
Damn! My local "black" version doesn't behave exactly the same as the one on CI! Will be fixed on the next run.
CI run fail: hm, no idea how this could be related to the PR. Could it be failing on master before the changes?
Agreed, seems unrelated... I'll rerun it in hopes that it's just a flaky test.
I've been away a bit; I'm back. Is there anything I should do? Right now I'll wait for further instructions rather than adding mess to the situation :)
Hah... I'm rerunning the failing test now in hopes that it's just flaky. Fingers crossed.
Damn, from what I remember, the previous crash was not this one. (Do we have a history of previous runs somewhere, to be sure?)
It seems that GitHub Actions runners are random machines from different clusters... these things are hard to nail down, but certain machines don't match and have trouble. It makes the tests look flaky, which is really annoying. I'm re-running that 1/3 of the E2E tests now 🤞
Should I try a rebase, or is it useless?
Wow, we really fell off the wagon on this one. @squalou, if you want to rebase, I'm optimistic we can get this through. CI has stabilized quite a bit lately, and it seems the conflicts might not be too crazy.
I just fixed the conflict for you, so I think that should fix CI.
Not sure why the mysql test is still flaky here, but I'll close/reopen to kick it again. Hopefully we can (finally!) get this across the finish line.
Sorry, totally missed the notification, and thank you @eschutho for the rebase! (And many thanks to all the others too, for giving this long story a happy ending :) )
Co-authored-by: Elizabeth Thompson <[email protected]> (cherry picked from commit ce72a0a)
SUMMARY
When exporting to Excel, datatypes are currently not specified, leading to decimal numbers being stored as strings, which may cause issues when opening the file.
Depending on the locale, the issue may not even be visible, with Excel managing to convert things (that's the case in the us/en locale). When using another locale (say fr), numbers output with '.' as the decimal separator and stored as strings won't be usable as numbers in Excel.
The idea here is to use the same "datatype guessing" method that already exists, and use it to convert DataFrame column types when required, before exporting.
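As a rough sketch of the approach (the names below are illustrative stand-ins for Superset's GenericDataType and its helper, not the merged code): the already-guessed generic column types drive a per-column coercion before the frame reaches the Excel writer.

```python
import pandas as pd

# Stand-ins for GenericDataType values; illustrative only.
NUMERIC, STRING = "NUMERIC", "STRING"


def apply_column_types(df: pd.DataFrame, column_types: list) -> pd.DataFrame:
    # Coerce columns flagged as numeric; leave the rest untouched so
    # values like zero-padded codes keep their leading zeros.
    for column, ctype in zip(df.columns, column_types):
        if ctype == NUMERIC:
            df[column] = pd.to_numeric(df[column])
    return df


df = pd.DataFrame({"price": ["1.5", "2.5"], "sku": ["0042", "0007"]})
out = apply_column_types(df, [NUMERIC, STRING])
print(out["price"].dtype)  # float64
print(out["sku"].iloc[0])  # 0042
```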
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
not applicable
TESTING INSTRUCTIONS
Editing the sheet1.xml file inside the produced archive is the safest way to check the issue, due to the magical operations software like Excel, LibreOffice Calc, or others perform when opening files, which may hide the issue. When a value is stored as a string, an <s> markup is used containing an id referencing the string.
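To illustrate the inspection described above (using a tiny hand-made stand-in worksheet, not a real export): an .xlsx is a ZIP archive, so sheet1.xml can be read directly; numeric cells store their value inline in a <v> element, while string cells instead carry a type marker and an index into xl/sharedStrings.xml.

```python
import io
import zipfile

# Build a tiny stand-in workbook to demonstrate the technique
# (a real file would come from Superset's Excel export).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "xl/worksheets/sheet1.xml",
        '<row><c r="A1"><v>1.5</v></c></row>',
    )

# An .xlsx is a ZIP archive, so the sheet XML can be read directly.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    sheet = zf.read("xl/worksheets/sheet1.xml").decode()

# A numeric cell stores its value inline in <v>; a shared-string cell
# would instead reference an entry in xl/sharedStrings.xml.
print('t="s"' in sheet)  # False: the value is stored as a number
```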
ADDITIONAL INFORMATION