Normalization: convert all encodings to unicode #120

seasidesparrow · 2024-08-12T17:29:04Z

Is your feature request related to a problem? Please describe.
Publishers often send data encoded in character sets other than Unicode (e.g. latin-1, windows-125X). We want to normalize these data before processing begins. Some of our legacy code mishandles these encodings, so we want to detect such data and replace them with their Unicode equivalents whenever possible.

Describe the solution you'd like
We need a pre-parsing step, between reading the file and parsing its contents, that detects the encoding and, where possible, automatically converts the data to Unicode. One possible method is BeautifulSoup's bs4.UnicodeDammit class.
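
As a rough illustration of such a pre-parsing step, here is a minimal sketch. The function name `to_unicode` and the candidate list `CANDIDATE_ENCODINGS` are hypothetical, not existing pipeline code. It assumes the raw file contents are available as bytes, uses `bs4.UnicodeDammit` when BeautifulSoup is installed, and otherwise falls back to plain `bytes.decode` attempts in the same order.

```python
from typing import Sequence, Tuple

try:
    from bs4 import UnicodeDammit  # optional dependency
except ImportError:
    UnicodeDammit = None

# Hypothetical candidate list; the order matters because the first
# encoding that decodes without error wins.
CANDIDATE_ENCODINGS = ("utf-8", "windows-1252", "latin-1")


def to_unicode(
    raw: bytes, candidates: Sequence[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding_used) for raw bytes read from a file."""
    if not raw:
        return "", "utf-8"
    if UnicodeDammit is not None:
        # The second positional argument lists encodings to try first.
        dammit = UnicodeDammit(raw, list(candidates))
        if dammit.unicode_markup is not None and dammit.original_encoding:
            return dammit.unicode_markup, dammit.original_encoding
    # Fallback: strict decoding with each candidate in turn.
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"


text, used = to_unicode("café".encode("latin-1"))
```

Trying UTF-8 first is a deliberate choice in this sketch: strict UTF-8 decoding rejects most latin-1 or windows-125X byte sequences, so it rarely produces a false match, whereas latin-1 accepts any byte sequence and is only useful as the final candidate.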

Additional context
We encounter this issue when parsing reference data originating from ADSImportPipeline/ADSManualParser: references go unmatched solely because of publisher encoding problems.

seasidesparrow added the enhancement (New feature or request) label Aug 12, 2024

seasidesparrow self-assigned this Aug 12, 2024
