Normalization: convert all encodings to unicode #120

Open
seasidesparrow opened this issue Aug 12, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@seasidesparrow
Member

Is your feature request related to a problem? Please describe.
Publishers often send data encoded in character sets other than UTF-8 (e.g. latin-1, windows-125X). We want to normalize these data before we start processing. Some of our legacy code mishandles these encodings, so we want to detect such data and replace them with their Unicode equivalents whenever possible.

Describe the solution you'd like
We need a pre-parsing step, between reading the file and parsing its contents, that detects the encoding and, where possible, automatically converts the data to Unicode. One possible way to do this is BeautifulSoup's bs4.UnicodeDammit class.
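
A minimal sketch of what such a pre-parsing step might look like, assuming the raw file contents are available as bytes. The function name to_unicode and the CANDIDATE_ENCODINGS list are hypothetical and not taken from the existing codebase. It uses UnicodeDammit when bs4 is installed and otherwise falls back to trying the same encodings in order with the standard library.

```python
try:
    # bs4 is optional in this sketch; the stdlib fallback below works without it.
    from bs4 import UnicodeDammit
except ImportError:
    UnicodeDammit = None

# Hypothetical preference order: UTF-8 first, then common publisher encodings.
CANDIDATE_ENCODINGS = ("utf-8", "windows-1252", "latin-1")


def to_unicode(raw: bytes, candidates=CANDIDATE_ENCODINGS) -> str:
    """Decode raw bytes to a str, trying each candidate encoding in order."""
    if UnicodeDammit is not None:
        # The second positional argument is the list of encodings to try first.
        dammit = UnicodeDammit(raw, list(candidates))
        if dammit.unicode_markup is not None:
            return dammit.unicode_markup

    # Fallback (or bs4 could not decode): strict decoding, one candidate at a time.
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue

    # Last resort: keep what can be decoded and mark the rest as U+FFFD.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Bytes as a publisher might send them in windows-1252 / latin-1.
    print(to_unicode(b"Schr\xf6dinger"))
```

Putting UTF-8 first matters because windows-1252 and latin-1 will accept almost any byte sequence, so trying them earlier would silently mis-decode valid UTF-8 input.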

Additional context
We encounter this issue when parsing reference data originating from ADSImportPipeline/ADSManualParser, where it results in unmatched references solely because of publisher encoding problems.

@seasidesparrow seasidesparrow added the enhancement New feature or request label Aug 12, 2024
@seasidesparrow seasidesparrow self-assigned this Aug 12, 2024