fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv for pivot table #30961
…tiindex rows and columns
UPDATE: Export Pivot Tables into CSV Format
In the last commit, I made a small change to export pivot tables without flattening multi-index rows/columns. The flattening, initially implemented by the community to handle JSON export, is unhelpful for pivot tables exported to CSV with a large number of columns corresponding to a multi-index. To better explain the issue, I have attached an example. If I create a pivot table with many columns associated with multi-index rows, the flattening process collapses them into a single column whose value is the concatenation of all fields, separated by spaces. This is not very effective if we want to use these exports in Excel or other tools.
Example:
I created this pivot table using your example dataset:
The AS-IS behavior generates this type of file, where the MultiIndex columns are collapsed into one row and the MultiIndex rows are merged into a single column, obtained by concatenating all fields separated by a space. This export is not easy to use in Excel for the next steps.
export_as_is_20241119_101212.csv
The TO-BE behavior generates this type of file, where the MultiIndex rows/columns are preserved:
export_to_be_20241119_100853.csv
Complete Code
def apply_post_process(
    result: dict[Any, Any],
    form_data: Optional[dict[str, Any]] = None,
    datasource: Optional[Union["BaseDatasource", "Query"]] = None,
) -> dict[Any, Any]:
    form_data = form_data or {}
    viz_type = form_data.get("viz_type")
    if viz_type not in post_processors:
        return result
    post_processor = post_processors[viz_type]
    for query in result["queries"]:
        if query["result_format"] not in (rf.value for rf in ChartDataResultFormat):
            raise Exception(  # pylint: disable=broad-exception-raised
                f"Result format {query['result_format']} not supported"
            )
        data = query["data"]
        if isinstance(data, str):
            data = data.strip()
        if not data:
            # do not try to process empty data
            continue
        if query["result_format"] == ChartDataResultFormat.JSON:
            df = pd.DataFrame.from_dict(data)
        elif query["result_format"] == ChartDataResultFormat.CSV:
            df = pd.read_csv(
                StringIO(data),
                sep=csv_export_settings.get('sep', ','),
                encoding=csv_export_settings.get('encoding', 'utf-8'),
                decimal=csv_export_settings.get('decimal', '.'),
            )
        # convert all columns to verbose (label) name
        if datasource:
            df.rename(columns=datasource.data["verbose_map"], inplace=True)
        processed_df = post_processor(df, form_data, datasource)
        query["colnames"] = list(processed_df.columns)
        query["indexnames"] = list(processed_df.index)
        query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
        query["rowcount"] = len(processed_df.index)
        if query["result_format"] == ChartDataResultFormat.JSON:
            # Flatten hierarchical columns/index since they are represented as
            # `Tuple[str]`. Otherwise encoding to JSON later will fail because
            # maps cannot have tuples as their keys in JSON.
            processed_df.columns = [
                " ".join(str(name) for name in column).strip()
                if isinstance(column, tuple)
                else column
                for column in processed_df.columns
            ]
            processed_df.index = [
                " ".join(str(name) for name in index).strip()
                if isinstance(index, tuple)
                else index
                for index in processed_df.index
            ]
            query["data"] = processed_df.to_dict()
        elif query["result_format"] == ChartDataResultFormat.CSV:
            # CSV export keeps the MultiIndex rows/columns (no flattening)
            buf = StringIO()
            processed_df.to_csv(
                buf,
                sep=csv_export_settings.get('sep', ','),
                encoding=csv_export_settings.get('encoding', 'utf-8'),
                decimal=csv_export_settings.get('decimal', '.'),
            )
            buf.seek(0)
            query["data"] = buf.getvalue()
    return result
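To make the AS-IS / TO-BE difference concrete, here is a minimal standalone sketch (not the Superset code itself). The tiny pivot-like DataFrame, the level names, and the `flatten` helper are all made up for illustration; only the join-with-a-space logic mirrors the flattening shown above.

```python
import pandas as pd

# Hypothetical pivot-like frame with MultiIndex rows and columns.
columns = pd.MultiIndex.from_tuples(
    [("sales", 2023), ("sales", 2024)], names=["metric", "year"]
)
index = pd.MultiIndex.from_tuples(
    [("EU", "IT"), ("US", "NY")], names=["region", "area"]
)
pivot = pd.DataFrame([[1, 2], [3, 4]], index=index, columns=columns)


def flatten(labels):
    """Join tuple labels with a space, like the JSON branch above does."""
    return [
        " ".join(str(name) for name in label).strip()
        if isinstance(label, tuple)
        else label
        for label in labels
    ]


# AS-IS: flatten first, then write CSV (each tuple collapses into one cell).
flat = pivot.copy()
flat.columns = flatten(pivot.columns)
flat.index = flatten(pivot.index)
flat_lines = flat.to_csv().splitlines()

# TO-BE: write the MultiIndex as-is (one header row per column level,
# one CSV column per row level).
preserved_lines = pivot.to_csv().splitlines()

print("\n".join(flat_lines))
print("\n".join(preserved_lines))
```

In the flattened output the row key is a single cell such as `EU IT`, while the preserved output keeps `EU` and `IT` in separate columns, which is what spreadsheet tools can filter and group on.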
I don't understand what I have to do. Can you check if there are issues?
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #30961 +/- ##
===========================================
+ Coverage 60.48% 83.81% +23.33%
===========================================
Files 1931 536 -1395
Lines 76236 38930 -37306
Branches 8568 0 -8568
===========================================
- Hits 46114 32631 -13483
+ Misses 28017 6299 -21718
+ Partials 2105 0 -2105
Flags with carried forward coverage won't be shown.
I was trying to help with fixing (not fishing, as I mis-typed) some of the linting issues... hope that helps.
@rusackas No ephemeral environment action detected. Please use '/testenv up' or '/testenv down'.
Hi @kgabryje, can you review my pull request? The pre-commit checks give me a failure (details attached), but my changes work. I cannot understand how to fix this issue...
mypy.....................................................................Failed
superset/charts/post_processing.py:29: error: Name "app" already defined (by an import) [no-redef]
Title: fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv
Bug description
Function: apply_post_process
The issue is that pd.read_csv uses the pandas default values instead of the parameters defined in CSV_EXPORT in superset_config. This problem is rarely noticeable when using the separator "," and the decimal ".", because these match the pandas defaults. However, with the configuration CSV_EXPORT='{"encoding": "utf-8", "sep": ";", "decimal": ","}', the issue becomes evident. This change ensures that pd.read_csv uses the parameters defined in CSV_EXPORT.
Steps to reproduce error:
Configure CSV_EXPORT with the following parameters:
Click on Download > Export to Pivoted .CSV
Download is blocked by an error.
Cause: The error is generated by an anomaly in the input DataFrame df, which has the following format (a single column with all distinct fields separated by a semicolon separator):
Fix: Read the data using the configured CSV_EXPORT settings.
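The cause and the fix can be reproduced outside Superset with a small sketch. The sample CSV text and the local `csv_export_settings` dict are assumptions standing in for the exported data and for the CSV_EXPORT value from superset_config; only `pd.read_csv` and its `sep` / `decimal` parameters are real pandas API. `encoding` is omitted here because the text is already decoded inside a `StringIO`.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the CSV_EXPORT dict from superset_config.
csv_export_settings = {"encoding": "utf-8", "sep": ";", "decimal": ","}

# CSV text as it would look when written with those settings.
data = "country;amount\nItaly;1,5\nFrance;2,25\n"

# Cause: with pandas defaults (sep=",", decimal="."), the header is not
# split on ";" and comes back as one column named "country;amount".
broken = pd.read_csv(StringIO(data))
print(list(broken.columns))

# Fix: pass the configured separator and decimal mark, falling back to
# the pandas defaults when a key is missing.
fixed = pd.read_csv(
    StringIO(data),
    sep=csv_export_settings.get("sep", ","),
    decimal=csv_export_settings.get("decimal", "."),
)
print(list(fixed.columns))
print(fixed["amount"].tolist())
```

With the settings applied, the frame has the two expected columns and `amount` is parsed as a float column, so the pivot post-processing receives usable input.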
Code Changes:
Complete Code