
fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv for pivot table #30961

Open · wants to merge 5 commits into base: master
Conversation


@frlm frlm commented Nov 18, 2024

Title: fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv

Bug description

Function: apply_post_process

The issue is that pd.read_csv uses the pandas default parameters instead of those defined in CSV_EXPORT in superset_config. The problem is rarely noticeable with the default separator "," and decimal ".". However, with the configuration CSV_EXPORT = {"encoding": "utf-8", "sep": ";", "decimal": ","}, it becomes evident. This change ensures that pd.read_csv uses the parameters defined in CSV_EXPORT.
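To illustrate the effect of these parameters in isolation, here is a standalone sketch (not Superset code; the sample values are made up):

```python
from io import StringIO

import pandas as pd

# A payload written with CSV_EXPORT = {"sep": ";", "decimal": ","}
data = "name;value\nfoo;3,14\nbar;2,5\n"

# With the matching parameters, "3,14" is parsed as the float 3.14
df = pd.read_csv(StringIO(data), sep=";", decimal=",")
print(df["value"].tolist())  # [3.14, 2.5]

# With the pandas defaults (sep=",", decimal="."), the header contains no
# comma, so the payload is mis-parsed into a single semicolon-joined column
df_default = pd.read_csv(StringIO(data))
print(df_default.columns.tolist())  # ['name;value']
```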

Steps to reproduce error:

  • Configure CSV_EXPORT with the following parameters:
    CSV_EXPORT = {
        "encoding": "utf-8",
        "sep": ";",
        "decimal": ","
    }
  • Open a default chart in Superset of the Pivot Table type. In this example, we are using Pivot Table v2 within the USA Births Names dashboard:


  • Click on Download > Export to Pivoted .CSV

  • Download is blocked by an error.

Cause: the error is caused by a malformed input DataFrame df, which arrives as a single column containing all the fields joined by the semicolon separator:

,state;name;sum__num
0,other;Michael;1047996
1,other;Christopher;803607
2,other;James;749686

Fix: read the data with the configured CSV_EXPORT settings instead of the pandas defaults.

Code Changes:

        elif query["result_format"] == ChartDataResultFormat.CSV:
            df = pd.read_csv(StringIO(data), 
                             delimiter=superset_config.CSV_EXPORT.get('sep'),
                             encoding=superset_config.CSV_EXPORT.get('encoding'),
                             decimal=superset_config.CSV_EXPORT.get('decimal'))

Complete Code

def apply_post_process(
    result: dict[Any, Any],
    form_data: Optional[dict[str, Any]] = None,
    datasource: Optional[Union["BaseDatasource", "Query"]] = None,
) -> dict[Any, Any]:
    form_data = form_data or {}

    viz_type = form_data.get("viz_type")
    if viz_type not in post_processors:
        return result

    post_processor = post_processors[viz_type]

    for query in result["queries"]:
        if query["result_format"] not in (rf.value for rf in ChartDataResultFormat):
            raise Exception(  # pylint: disable=broad-exception-raised
                f"Result format {query['result_format']} not supported"
            )

        data = query["data"]

        if isinstance(data, str):
            data = data.strip()

        if not data:
            # do not try to process empty data
            continue

        if query["result_format"] == ChartDataResultFormat.JSON:
            df = pd.DataFrame.from_dict(data)
        elif query["result_format"] == ChartDataResultFormat.CSV:
            df = pd.read_csv(StringIO(data), 
                             delimiter=superset_config.CSV_EXPORT.get('sep'),
                             encoding=superset_config.CSV_EXPORT.get('encoding'),
                             decimal=superset_config.CSV_EXPORT.get('decimal'))
            
        # convert all columns to verbose (label) name
        if datasource:
            df.rename(columns=datasource.data["verbose_map"], inplace=True)

        processed_df = post_processor(df, form_data, datasource)

        query["colnames"] = list(processed_df.columns)
        query["indexnames"] = list(processed_df.index)
        query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
        query["rowcount"] = len(processed_df.index)

        # Flatten hierarchical columns/index since they are represented as
        # `Tuple[str]`. Otherwise encoding to JSON later will fail because
        # maps cannot have tuples as their keys in JSON.
        processed_df.columns = [
            " ".join(str(name) for name in column).strip()
            if isinstance(column, tuple)
            else column
            for column in processed_df.columns
        ]
        processed_df.index = [
            " ".join(str(name) for name in index).strip()
            if isinstance(index, tuple)
            else index
            for index in processed_df.index
        ]

        if query["result_format"] == ChartDataResultFormat.JSON:
            query["data"] = processed_df.to_dict()
        elif query["result_format"] == ChartDataResultFormat.CSV:
            buf = StringIO()
            processed_df.to_csv(buf)
            buf.seek(0)
            query["data"] = buf.getvalue()

    return result

@dosubot added the data:csv (Related to import/export of CSVs) label Nov 18, 2024
@frlm frlm mentioned this pull request Nov 18, 2024
@frlm frlm changed the title from "Pivoted CSV export fix when CSV_EXPORT values are not default ones" to "Pivoted CSV export fix" Nov 19, 2024

frlm commented Nov 19, 2024

UPDATE: Export Pivot Tables into CSV Format

In the last commit I made a small change to export pivot tables without flattening multi-index rows/columns. Flattening was originally introduced by the community for JSON export, and it is particularly unhelpful when a pivot table with many multi-index columns is exported to CSV.

To illustrate the issue, I have attached an example. When a pivot table has many columns associated with multi-index rows, the flattening step collapses them into a single column whose values are all the index fields concatenated with a space. That output is hard to work with in Excel or other tools.
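Both behaviors can be sketched on a toy MultiIndex frame (hypothetical data, not the Superset dataset):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=pd.MultiIndex.from_tuples(
        [("CA", "girl"), ("CA", "boy")], names=["state", "gender"]
    ),
    columns=pd.MultiIndex.from_tuples([("num", "2008"), ("num", "2009")]),
)

# AS-IS: flatten the tuple labels into space-joined strings
flat_columns = [
    " ".join(str(name) for name in col).strip() if isinstance(col, tuple) else col
    for col in df.columns
]
print(flat_columns)  # ['num 2008', 'num 2009']

# TO-BE: let to_csv serialize the MultiIndex directly,
# writing one header row per column level
csv_text = df.to_csv()
print(csv_text)
```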

Example:

I created this pivot table using your example dataset:

[screenshot: pivot table]

The AS-IS behavior generates this kind of file: the MultiIndex columns are collapsed into one header row and the MultiIndex rows are merged into a single column, with all fields concatenated by a space. This export is not easy to use in Excel for further processing.


export_as_is_20241119_101212.csv

The TO-BE behavior generates this type of file, where the MultiIndex rows/columns get preserved:


export_to_be_20241119_100853.csv

Complete Code

def apply_post_process(
    result: dict[Any, Any],
    form_data: Optional[dict[str, Any]] = None,
    datasource: Optional[Union["BaseDatasource", "Query"]] = None,
) -> dict[Any, Any]:
    form_data = form_data or {}

    viz_type = form_data.get("viz_type")
    if viz_type not in post_processors:
        return result

    post_processor = post_processors[viz_type]

    for query in result["queries"]:
        if query["result_format"] not in (rf.value for rf in ChartDataResultFormat):
            raise Exception(  # pylint: disable=broad-exception-raised
                f"Result format {query['result_format']} not supported"
            )

        data = query["data"]

        if isinstance(data, str):
            data = data.strip()

        if not data:
            # do not try to process empty data
            continue

        if query["result_format"] == ChartDataResultFormat.JSON:
            df = pd.DataFrame.from_dict(data)
        elif query["result_format"] == ChartDataResultFormat.CSV:
            # csv_export_settings holds the CSV_EXPORT dict from superset_config
            df = pd.read_csv(
                StringIO(data),
                sep=csv_export_settings.get("sep", ","),
                encoding=csv_export_settings.get("encoding", "utf-8"),
                decimal=csv_export_settings.get("decimal", "."),
            )
            
        # convert all columns to verbose (label) name
        if datasource:
            df.rename(columns=datasource.data["verbose_map"], inplace=True)

        processed_df = post_processor(df, form_data, datasource)

        query["colnames"] = list(processed_df.columns)
        query["indexnames"] = list(processed_df.index)
        query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
        query["rowcount"] = len(processed_df.index)
                
        if query["result_format"] == ChartDataResultFormat.JSON:
            # Flatten hierarchical columns/index since they are represented as
            # `Tuple[str]`. Otherwise encoding to JSON later will fail because
            # maps cannot have tuples as their keys in JSON.
            processed_df.columns = [
                " ".join(str(name) for name in column).strip()
                if isinstance(column, tuple)
                else column
                for column in processed_df.columns
            ]
            processed_df.index = [
                " ".join(str(name) for name in index).strip()
                if isinstance(index, tuple)
                else index
                for index in processed_df.index
            ]
            query["data"] = processed_df.to_dict()
            
        elif query["result_format"] == ChartDataResultFormat.CSV:
            buf = StringIO()
            # write back using the same CSV_EXPORT settings used for reading
            processed_df.to_csv(
                buf,
                sep=csv_export_settings.get("sep", ","),
                encoding=csv_export_settings.get("encoding", "utf-8"),
                decimal=csv_export_settings.get("decimal", "."),
            )
            buf.seek(0)
            query["data"] = buf.getvalue()

    return result
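The symmetric write path can be exercised in isolation (a toy DataFrame, not the Superset pipeline):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"name": ["foo", "bar"], "value": [3.14, 2.5]})

buf = StringIO()
# Mirror the read parameters so tools expecting ";" as the list separator
# and "," as the decimal mark open the file correctly
df.to_csv(buf, sep=";", decimal=",", index=False)
out = buf.getvalue()
print(out)
```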

@sadpandajoe sadpandajoe changed the title from "Pivoted CSV export fix" to "fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv for pivot table" Nov 19, 2024

frlm commented Nov 21, 2024

I don't understand what I have to do; can you check whether there are any issues?


codecov bot commented Nov 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.81%. Comparing base (76d897e) to head (8640f72).
Report is 1065 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #30961       +/-   ##
===========================================
+ Coverage   60.48%   83.81%   +23.33%     
===========================================
  Files        1931      536     -1395     
  Lines       76236    38930    -37306     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    32631    -13483     
+ Misses      28017     6299    -21718     
+ Partials     2105        0     -2105     
Flag         Coverage Δ
hive         48.81% <33.33%> (-0.36%) ⬇️
javascript   ?
mysql        76.66% <33.33%> (?)
postgres     76.71% <33.33%> (?)
presto       53.29% <33.33%> (-0.51%) ⬇️
python       83.81% <100.00%> (+20.33%) ⬆️
sqlite       76.18% <33.33%> (?)
unit         60.85% <100.00%> (+3.22%) ⬆️


@rusackas (Member) commented

I was trying to help with fixing (not fishing, as I mis-typed) some of the linting issues... hope that helps.


@rusackas No ephemeral environment action detected. Please use '/testenv up' or '/testenv down'. View workflow run.


frlm commented Dec 3, 2024

Hi @kgabryje, can you review my pull request? The pre-commit checks give me a failure (details attached), but my changes work. I can't understand how to fix this issue...

mypy.....................................................................Failed

  • hook id: mypy
  • exit code: 1

superset/charts/post_processing.py:29: error: Name "app" already defined (by an import) [no-redef]
superset/charts/post_processing.py:48: error: Module has no attribute "config" [attr-defined]
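For context, mypy raises [no-redef] whenever a name bound by an import is bound again in the same module; a minimal reproduction (unrelated to the Superset code) and the usual fix look like this:

```python
# Rebinding an imported name is what triggers mypy's [no-redef], e.g.:
#
#     from flask import current_app as app
#     app = create_app()  # error: Name "app" already defined (by an import)
#
# Aliasing the import (or renaming the variable) removes the clash:
from os import path as os_path

path = "/tmp/example"  # plain variable, no longer shadows the import

print(os_path.basename(path))  # example
```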

Labels: data:csv (Related to import/export of CSVs) · size/M