fix: set columns numeric datatypes when exporting to excel #27229
Conversation
Approving CI run 🤞
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
## master #27229 +/- ##
===========================================
+ Coverage 60.48% 71.70% +11.21%
===========================================
Files 1931 1928 -3
Lines 76236 81499 +5263
Branches 8568 8324 -244
===========================================
+ Hits 46114 58439 +12325
+ Misses 28017 21007 -7010
+ Partials 2105 2053 -52

☔ View full report in Codecov by Sentry.
Looks like you need to run the pre-commit hook for formatting/linting.
Thank you for the fix @squalou. Could you add a test for it to avoid future regressions?
Yeah, sorry about that. Should be fixed now; my IDE for this project is not on "auto format", I totally forgot.
I'd love to! Really. But I admit I'm a bit lost in the Superset architecture and test framework, and I'm not a Python pro dev to begin with, which does not help. I've tried to guess where 'export csv' might be tested and enhance from there, but did not find anything. Got myself lost in viz_tests.py and didn't go any further. Any help / pointer on where to begin would be appreciated.
You can create a test in tests/unit_tests/common, similar to test_get_aggregated_join_column.py.
Thank you! Will have a look ASAP (on vacation break this week :D ).
[EDIT] Simply writing things down here helped me realize I should break the tested function in two, and only test for the changes really introduced. I submitted "something". Now... I do not test in these tests that the Excel export really works (it does, see further), and that's still a shame, maybe? Leaving the "excel export" testability issues I encountered below.

(original post below the line)

Hi, I managed to run tests locally and I'm trying to add one... but I have (at least) two issues. You'll see from these that I'm full of good will, but a desperate Python newbie :)

First: BaseDatasource. Following the test example, I initiate this:

query_context_processor = QueryContextProcessor(
    QueryContext(
        datasource=BaseDatasource(),
        queries=[],
        result_type=ChartDataResultType.COLUMNS,
        form_data={},
        slice_=None,
        result_format=ChartDataResultFormat.XLSX,
        cache_values={},
    )
)

But then, in context_query_processor.py, the get_data() method makes a call that ends up calling "uid", which finally raises a lovely NotImplementedError. Seeing the code, unless there's an object overriding the type method somewhere, I have no idea what to do to avoid this:

@property
def type(self) -> str:
    raise NotImplementedError()

@property
def uid(self) -> str:
    """Unique id across datasource types"""
    return f"{self.id}__{self.type}"
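A minimal sketch of one way around this NotImplementedError (the class names below are simplified stand-ins for illustration, not Superset's real classes): subclass the datasource in the test and override the abstract `type` property so that `uid` can resolve.

```python
# Simplified stand-ins for illustration; not Superset's actual classes.
class BaseDatasource:
    id = 1

    @property
    def type(self) -> str:
        raise NotImplementedError()

    @property
    def uid(self) -> str:
        """Unique id across datasource types"""
        return f"{self.id}__{self.type}"


class FakeDatasource(BaseDatasource):
    # Test-only subclass: overriding `type` lets `uid` resolve
    # without touching the production code.
    @property
    def type(self) -> str:
        return "table"


print(FakeDatasource().uid)  # -> 1__table
```

(`unittest.mock.PropertyMock` is another common way to stub a property in tests, if subclassing feels too heavy.)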
So, in order to go on a bit, I locally put a try/catch around the verbose_map thing, which may be a solution eventually? Which leads to my even bigger issue...

Second: how to check the result. The method I must test returns this:

result = excel.df_to_excel(ndf, **config["EXCEL_EXPORT"])
return result or ""

which is not very testable. A log output gave me, as expected, binary content, so, yeah, the test runs. Is there a way to "mock" or "override" the Excel conversion? Or else, is it advisable to somehow refactor the method and extract a more testable one before the call to the Excel conversion? Thanks for any advice! Regards
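For asserting on the binary result without mocking, one cheap smoke test (a sketch independent of Superset's API) exploits the fact that an .xlsx file is a ZIP archive: the bytes should open as a ZIP and contain the workbook entry.

```python
import io
import zipfile


def looks_like_xlsx(blob: bytes) -> bool:
    # An .xlsx file is a ZIP archive; a cheap smoke test is to confirm
    # the bytes open as a ZIP and contain the workbook entry.
    try:
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            return "xl/workbook.xml" in zf.namelist()
    except zipfile.BadZipFile:
        return False


# Build a fake "xlsx" blob for illustration (real bytes would come
# from df_to_excel).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")

print(looks_like_xlsx(buf.getvalue()))      # True
print(looks_like_xlsx(b"not an excel file"))  # False
```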
Force-pushed from 0aa2191 to 2cf4443.
(damn, import sort was not set, fixed and re-pushed)
@@ -571,11 +574,22 @@ def get_data(self, df: pd.DataFrame) -> str | list[dict[str, Any]]:
            df, index=include_index, **config["CSV_EXPORT"]
        )
    elif self._query_context.result_format == ChartDataResultFormat.XLSX:
-       result = excel.df_to_excel(df, **config["EXCEL_EXPORT"])
+       ndf = self.copy_df_for_excel_export(df, coltypes)
@squalou Should this change actually be implemented in superset.utils.excel.df_to_excel, to ensure consistent behavior? You could move your tests to tests/unit_tests/utils/excel_tests. We already have one there called test_timezone_conversion, which is a similar type of conversion.
Makes a lot of sense, sure. What would be better:
add "coltypes" as an optional parameter to the excel.df_to_excel(...) method?
Or move the method as-is into the superset.utils.excel package?
The first option seems better, but I'm not at ease with changing method signatures there without advice; I really have no global overview of what may call what, and the consequences.
EDIT: from a quick check, the method is called in one place and one test, so... let's go.
I checked that coltypes are extracted from the DataFrame here.
superset/utils/excel.py (Outdated)
    output = io.BytesIO()

    # timezones are not supported
    for column in df.select_dtypes(include=["datetimetz"]).columns:
        df[column] = df[column].astype(str)

+   ndf = copy_df_with_datatype_conversion(df, coltypes)
@squalou Can the column types be extracted from the DataFrame, like we are doing for timezones:

for column in df.select_dtypes(include=["datetimetz"]).columns:
    df[column] = df[column].astype(str)

or are they not matching what's defined in coltypes?
Thing is: you could have data with leading zeros, for instance, defined as string in the original datasource, and you want them as string in the end, or else Excel would omit leading zeros, maybe round things, whatever.
I don't know much about the DataFrame object, and the whole issue I'm trying to solve is to get the datatype as it was defined as the column type to begin with.
If you know of a way to get this data from the DataFrame itself, it sure would be MUCH more elegant than forwarding the column_types all the way down to the Excel export.
To be honest, I discovered pandas objects only to make this patch... so if you think this metadata can be stored or handled differently, I'll be glad to learn.
I also did not want to take any risk of changing the original DataFrame object, as I have no idea of the global Superset architecture.
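A small illustration of why the inferred DataFrame dtype alone is not enough (hypothetical data, assuming pandas' default inference): a zero-padded code and a decimal number both arrive as plain strings, and only the source column types say which one should become numeric.

```python
import pandas as pd

# Both columns arrive as strings ("object" dtype); from the DataFrame
# alone there is no way to tell a zero-padded code from a number.
df = pd.DataFrame({"code": ["0042", "0007"], "value": ["1.5", "2.0"]})
print(df.dtypes)  # both columns: object

# Convert only the column the source metadata marks as numeric.
df["value"] = pd.to_numeric(df["value"])

print(df["code"].iloc[0])  # 0042 -- leading zeros preserved
print(df["value"].sum())   # 3.5
```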
I almost forgot the most important part: I have absolutely no idea how to know what dtypes are 'stored' in the DataFrame at the beginning :) so I used the 'coltypes' that seemed to be there for a reason :)
def test_timezone_conversion() -> None:
    """
    Test that columns with timezones are converted to a string.
    """
    df = pd.DataFrame({"dt": [datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)]})
-   contents = df_to_excel(df)
+   contents = df_to_excel(df, [])
-   contents = df_to_excel(df, [])
+   contents = df_to_excel(df, [GenericDataType.TEMPORAL])
superset/utils/excel.py (Outdated)
    return output.getvalue()


def convert_df_with_datatypes(
I believe we could make it more explicit that the DataFrame can be modified, and avoid the copy overhead. To do that, we need to separate column conversion from exporting. Something like:

def df_to_excel(df: pd.DataFrame, **kwargs: Any) -> bytes:
    output = io.BytesIO()
    # pylint: disable=abstract-class-instantiated
    with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
        df.to_excel(writer, **kwargs)
    return output.getvalue()


def apply_column_types(
    df: pd.DataFrame, column_types: list[GenericDataType]
) -> pd.DataFrame:
    for column, column_type in zip(df.columns, column_types):
        if column_type == GenericDataType.NUMERIC:
            df[column] = pd.to_numeric(df[column])
        elif pd.api.types.is_datetime64tz_dtype(df[column]):
            # timezones are not supported
            df[column] = df[column].astype(str)
        else:
            df[column] = df[column].astype(column_type)
    return df
I'm probably over-conservative, but I have one remaining concern:

else:
    df[column] = df[column].astype(column_type)

... again, I'm a bit scared of the effect on the Excel export side, because I don't know how it works. I'd rather "leave untouched" the data that is not to be converted, but here again you probably know better.
Addendum: a quick test regarding the "else" conversion; it fails with this error:

TypeError: Cannot interpret '<GenericDataType.STRING: 1>' as a data type
pd.to_numeric may raise an exception; per the documentation, the "ignore" errors option is deprecated, and it's recommended to catch exceptions instead.
Good point. Here we have two options:
- Catch and convert to a default. We also need to alert the user that some values were modified when exporting.
- Don't catch, and show a good message clearly indicating the column/row that is preventing the export.
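Both options can be sketched with pandas directly (illustrative data, not the PR's final code):

```python
import pandas as pd

df = pd.DataFrame({"amount": ["1.5", "2.25", "oops"]})

# Option 1: coerce unparseable values to NaN (a "default"); the export
# succeeds, but the user should be warned that values were modified.
coerced = pd.to_numeric(df["amount"], errors="coerce")
print(coerced.isna().sum())  # 1 value could not be converted

# Option 2: fail loudly with a message naming the offending column.
try:
    pd.to_numeric(df["amount"])
except (ValueError, TypeError) as exc:
    print(f"Column 'amount' blocks the export: {exc}")
```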
> else: df[column] = df[column].astype(column_type) ... again I'm a bit scared of the effect on the excel export side ... I'd rather "leave untouched" data not to be converted ...

I'm not sure what you mean by untouched data, but I'm assuming that when exporting, users want the data to comply with what is defined as the column type. That was your original problem for numeric types, so I'm assuming that other non-numeric columns not matching their defined types should also be converted.
Let me ask for @betodealmeida's and @eschutho's input here, as they have more context on this feature and might help us.
As an end user, I much prefer having an xlsx file with data I have to manually reformat than having nothing at all.

> Catch and convert to a default. We also need to alert the user that some values were modified when exporting.

Probably the same as the "dates with timezone" fix: df[column].astype(str). Alerting the user is far beyond anything I would be able to do, and again, as an end user, I think I wouldn't care. Nothing is said about timezone conversion either, and I bet nobody cares.

> Don't catch and show a good message clearly indicating the column/row that is preventing the export.

As said, as an end user I would rather have a not-perfectly-formatted file with all the data in it than no data at all.

About "untouched fields": I don't really remember all the details (because it dates a bit, and it's not my passion, tbh :) ), but at the time I manually opened the produced xlsx files, meaning: unzip them, look at the xml-ish data inside, and try to understand the differences between converted and unconverted data. From there, open them in LibreOffice or MS Office, observe the behaviors, and try to find the best.
Numeric values are stored "in the cells", while others have references in the cell and values mapped somewhere else. Or maybe it's the other way around; I can't even remember which is which. Anyway, there's a clear separation between numeric and 'other'. Then, inside the 'other' category, of course there may be further differences.
From my understanding, then, the best thing one can do to expect any office software to deal with these files is to make sure numerics are numerics; anything else doesn't matter that much, and autodetection of strings works. That's probably why df[column] = df[column].astype(str) in the date conversion has no real bad effect on usability.
That's all I know, which is not much for such a long text :)
Damn, I still have a minor pylint issue ("except ... as e" as a bad habit, moreover e is not used :) ). I have an issue, though, with an error in a file I did not change.
That's weird. I rebased just in case.
Damn! My local "black" version doesn't behave exactly the same as the one on CI! Will be fixed on the next run.
CI run fail: hm, no idea how this could be related to the PR. Could it be failing on master before the changes?
Agreed, seems unrelated... I'll rerun it in hopes that it's just a flaky test.
I've been away a bit; I'm back. Is there anything I should do? Right now I'll wait for further instructions rather than adding mess to the situation :)
Hah... I'm rerunning the failing test now in hopes that it's just flaky. Fingers crossed.
Damn, from what I remember, the previous crash was not this one. (Do we have a history of previous runs somewhere, to be sure?)
It seems that GitHub Actions runners are random machines from different clusters... these things are hard to nail down, but certain machines don't match and have trouble. It makes the tests look flaky, which is really annoying. I'm re-running that 1/3 of the E2E tests now 🤞
Should I try a rebase, or is it useless?
Wow, we really fell off the wagon on this one. @squalou, if you want to rebase, I'm optimistic we can get this through. CI has stabilized quite a bit lately, and it seems the conflicts might not be too crazy.
I just fixed the conflict for you, so I think that should fix CI.
Not sure why the mysql test is still flaky here, but I'll close/reopen to kick it again. Hopefully we can (finally!) get this across the finish line.
Sorry, totally missed the notification, and thank you @eschutho for the rebase! (And many thanks to all the others too, for giving this long story a happy ending :) )
Co-authored-by: Elizabeth Thompson <[email protected]> (cherry picked from commit ce72a0a)
SUMMARY
When exporting to Excel, datatypes are currently not specified, leading to decimal numbers being stored as strings, which may cause issues when opening the file.
Depending on the locale, the issue may not even be visible, with Excel managing to convert things (that's the case in the us/en locale). When using another locale (say fr), numbers output with '.' as the decimal separator and stored as strings won't be usable as numbers in Excel.
The idea here is to use the same "datatype guessing" method that already exists, and use it to convert DataFrame column types when required, before exporting.
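As a rough sketch of the approach (the names below are illustrative stand-ins for Superset's GenericDataType and its helper, not the merged code): the already-guessed generic column types drive a per-column coercion before the frame reaches the Excel writer.

```python
import pandas as pd

# Stand-ins for GenericDataType values; illustrative only.
NUMERIC, STRING = "NUMERIC", "STRING"


def apply_column_types(df: pd.DataFrame, column_types: list) -> pd.DataFrame:
    # Coerce columns flagged as numeric; leave the rest untouched so
    # values like zero-padded codes keep their leading zeros.
    for column, ctype in zip(df.columns, column_types):
        if ctype == NUMERIC:
            df[column] = pd.to_numeric(df[column])
    return df


df = pd.DataFrame({"price": ["1.5", "2.5"], "sku": ["0042", "0007"]})
out = apply_column_types(df, [NUMERIC, STRING])
print(out["price"].dtype)  # float64
print(out["sku"].iloc[0])  # 0042
```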
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
not applicable
TESTING INSTRUCTIONS
Editing the sheet1.xml file inside the produced archive is the safest way to check the issue, due to the magical operations software like Excel, LibreOffice Calc, or others perform when opening files, which may hide the issue. When a value is stored as a string, an <s> markup is used containing an id referencing the string.
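To illustrate the inspection described above (using a tiny hand-made stand-in worksheet, not a real export): an .xlsx is a ZIP archive, so sheet1.xml can be read directly; numeric cells store their value inline in a <v> element, while string cells instead carry a type marker and an index into xl/sharedStrings.xml.

```python
import io
import zipfile

# Build a tiny stand-in workbook to demonstrate the technique
# (a real file would come from Superset's Excel export).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "xl/worksheets/sheet1.xml",
        '<row><c r="A1"><v>1.5</v></c></row>',
    )

# An .xlsx is a ZIP archive, so the sheet XML can be read directly.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    sheet = zf.read("xl/worksheets/sheet1.xml").decode()

# A numeric cell stores its value inline in <v>; a shared-string cell
# would instead reference an entry in xl/sharedStrings.xml.
print('t="s"' in sheet)  # False: the value is stored as a number
```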
ADDITIONAL INFORMATION