
add encoding support for CSV files #767

Open: wants to merge 1 commit into base: main


MonRani commented on Jun 27, 2025


name: Pull Request
about: Create a pull request to contribute to the project
title: 'Fix CSV encoding error for ISO-8859-1/Latin-1 encoded files'
labels: 'bug-fix, csv, encoding'

Related Issue
Fixes #649

Description of Changes
This PR adds encoding support to CSV loading so that ISO-8859-1 (Latin-1) encoded files, which previously failed with "Invalid Input Error: CSV Error", can be read. The changes include:

  • Added optional encoding parameter to CSVConfig with default "utf-8" for backward compatibility
  • Updated CSVSource to pass encoding parameter to DuckDB's read_csv_auto function
  • Modified configuration parsing to support encoding from TOML files
  • Enhanced documentation with encoding parameter examples and usage instructions
  • Added comprehensive test suite to verify encoding support
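
As a rough illustration of the first two changes, the sketch below shows how an optional encoding field with a "utf-8" default might be forwarded to DuckDB. The `CSVConfig` dataclass and `build_read_csv_query` helper here are hypothetical stand-ins, not the code from this PR, and the sketch assumes a DuckDB version whose CSV reader accepts an `encoding` argument. It only builds the query string, so it runs without DuckDB installed.

```python
from dataclasses import dataclass


@dataclass
class CSVConfig:
    """Hypothetical stand-in for the project's CSV source configuration."""

    path: str
    encoding: str = "utf-8"  # default keeps existing configs working unchanged


def build_read_csv_query(config: CSVConfig) -> str:
    """Build a DuckDB query string that forwards the configured encoding."""
    # Double any single quotes so the values are safe inside SQL string literals.
    path = config.path.replace("'", "''")
    encoding = config.encoding.replace("'", "''")
    return f"SELECT * FROM read_csv_auto('{path}', encoding = '{encoding}')"


# A config without an explicit encoding falls back to UTF-8.
default_query = build_read_csv_query(CSVConfig(path="data/users.csv"))

# A config that opts in to Latin-1 for a legacy file.
latin1_query = build_read_csv_query(
    CSVConfig(path="data/mobile_dataset.csv", encoding="latin-1")
)
```

Keeping "utf-8" as the default means configs written before this change produce the same behaviour as before, which is the backward-compatibility property the description relies on.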

The fix maintains full backward compatibility while enabling support for legacy encoded files that contain special characters.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • New example
  • Test improvement

Testing

  • Created test suite (simple_encoding_test.py) that verifies:
    • Default UTF-8 encoding works with regular files
    • Loading a Latin-1 file with the default UTF-8 encoding fails as expected
    • Latin-1 encoding successfully loads Latin-1 files with special characters
  • Verified backward compatibility with existing UTF-8 encoded files
  • Tested with files containing special characters (José, François, Müller, etc.)
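
The three cases above can be reproduced in miniature with the standard library alone. This is not the PR's simple_encoding_test.py, and it uses Python's `csv` module rather than DuckDB; it is only a sketch of why a Latin-1 file fails under a UTF-8 reader and loads once the correct encoding is named. The `read_names` helper is hypothetical.

```python
import csv
import os
import tempfile

# Sample rows with the special characters mentioned in the testing notes.
rows = [["name"], ["José"], ["François"], ["Müller"]]


def read_names(path, encoding="utf-8"):
    """Read the single 'name' column, skipping the header row."""
    with open(path, newline="", encoding=encoding) as handle:
        return [row[0] for row in csv.reader(handle)][1:]


with tempfile.TemporaryDirectory() as tmp:
    # Write the same rows once as UTF-8 and once as Latin-1.
    utf8_path = os.path.join(tmp, "utf8.csv")
    latin1_path = os.path.join(tmp, "latin1.csv")
    for path, encoding in ((utf8_path, "utf-8"), (latin1_path, "latin-1")):
        with open(path, "w", newline="", encoding=encoding) as handle:
            csv.writer(handle).writerows(rows)

    # Case 1: the default UTF-8 encoding reads a UTF-8 file.
    utf8_names = read_names(utf8_path)

    # Case 2: the default UTF-8 encoding rejects the Latin-1 bytes.
    try:
        read_names(latin1_path)
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Case 3: naming the right encoding recovers the original text.
    latin1_names = read_names(latin1_path, encoding="latin-1")
```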

Usage Example
To load an ISO-8859-1 encoded CSV file, set the encoding on its data source in your preswald.toml:

[data.mobile_dataset]
type = "csv"
path = "data/mobile_dataset.csv"
encoding = "latin-1"

Development

Successfully merging this pull request may close these issues.

[BUG] Encoding error when read csv
1 participant