Releases: jawah/charset_normalizer
Version 2.1.1
2.1.1 (2022-08-19)
Deprecated
- Function `normalize` scheduled for removal in 3.0
Changed
- Removed useless call to decode in fn is_unprintable (#206)
Fixed
- Third-party library (i18n xgettext) crashing because it does not recognize utf_8 (PEP 263) with an underscore, from @aleksandernovikov (#204)
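To illustrate the spelling issue behind this fix, here is a minimal sketch that reads a PEP 263 coding declaration and resolves it through Python's codec registry. The regular expression is the one given in PEP 263; the helper name `declared_encoding` is hypothetical and not part of charset_normalizer. Python treats `utf_8` and `utf-8` as the same codec, whereas external tools may be stricter about the spelling.

```python
import codecs
import re

# Regular expression given in PEP 263 for the coding declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(first_lines):
    """Return the canonical codec name declared in the first two lines, or None.

    Hypothetical helper, for illustration only.
    """
    for line in first_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            # codecs.lookup accepts both spellings and reports one canonical name.
            return codecs.lookup(match.group(1)).name
    return None


hyphen = declared_encoding(["# -*- coding: utf-8 -*-"])
underscore = declared_encoding(["# -*- coding: utf_8 -*-"])
print(hyphen, underscore)
```

Both declarations resolve to the same codec in Python, so the underscore form works for the interpreter even when another tool rejects it.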
Version 3.0.0b1
3.0.0b1 (2022-08-15)
Changed
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
Removed
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function `normalize`
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
- Support for the backport `unicodedata2`
Version 2.1.0
2.1.0 (2022-06-19)
Added
- Output the Unicode table version when running the CLI with `--version` (PR #194)
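For reference, the Unicode table version bundled with a given Python interpreter is exposed by the standard library. The sketch below assumes this is the value the CLI reports; the function name `unicode_table_version` is hypothetical.

```python
import unicodedata


def unicode_table_version():
    """Return the Unicode database version shipped with this interpreter."""
    return unicodedata.unidata_version


version = unicode_table_version()
print(f"Unicode {version}")
```

The value depends on the interpreter build, so two Python versions may report different table versions for the same installed package.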
Changed
- Re-use decoded buffer for single byte character sets from @nijel (PR #175)
- Fixing some performance bottlenecks from @deedy5 (PR #183)
Fixed
- Work around a potential bug in CPython where Zero Width No-Break Space (located in Arabic Presentation Forms-B, Unicode 1.1) is not acknowledged as a space (PR #175)
- CLI default threshold aligned with the API threshold from @oleksandr-kuzmenko (PR #181)
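The Zero Width No-Break Space behaviour mentioned above can be observed with the standard library alone. This is a small sketch of the observation, not the library's actual workaround.

```python
import unicodedata

ZWNBSP = "\ufeff"

name = unicodedata.name(ZWNBSP)               # official character name
category = unicodedata.category(ZWNBSP)       # general category (format character)
in_forms_b = 0xFE70 <= ord(ZWNBSP) <= 0xFEFF  # Arabic Presentation Forms-B block range
is_space = ZWNBSP.isspace()                   # str.isspace() does not count it as whitespace

print(name, category, in_forms_b, is_space)
```

Because `str.isspace()` returns `False` for this character, code that relies on it to spot separators has to handle U+FEFF explicitly.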
Removed
- Support for Python 3.5 (PR #192)
Deprecated
- Use of backport unicodedata from `unicodedata2` as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)
Version 2.0.12
Version 2.0.11
Version 2.0.10
Version 2.0.9
Version 2.0.8
Changed
- Improvement over Vietnamese detection (PR #126)
- MD improvement on trailing data and long foreign (non-pure latin) data (PR #124)
- Efficiency improvements in cd/alphabet_languages from @adbar (PR #122)
- Call sum() without an intermediary list, following PEP 289 recommendations, from @adbar (PR #129)
- Code style as refactored by Sourcery-AI (PR #131)
- Minor adjustment on the MD around european words (PR #133)
- Remove and replace SRTs from assets / tests (PR #139)
- Initialize the library logger with a `NullHandler` by default from @nmaynes (PR #135)
- Setting kwarg `explain` to True will add provisionally (bounded to function lifespan) a specific stream handler (PR #135)
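The PEP 289 change listed above (calling `sum()` without an intermediary list) can be sketched as follows. Both function names and the sample data are hypothetical; the library's real loops differ.

```python
def count_in_alphabet_list(text, alphabet):
    # Builds a full temporary list before summing.
    return sum([1 for ch in text if ch in alphabet])


def count_in_alphabet_gen(text, alphabet):
    # Generator expression (PEP 289): values are consumed one at a time,
    # so no temporary list is allocated.
    return sum(1 for ch in text if ch in alphabet)


sample = "héllo wörld"
accented = {"é", "ö"}
list_count = count_in_alphabet_list(sample, accented)
gen_count = count_in_alphabet_gen(sample, accented)
print(list_count, gen_count)
```

The two forms return the same result; the generator form only avoids the intermediate allocation.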
Fixed
- Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)
- Avoid using too insignificant chunk (PR #137)
Added
- Add and expose function `set_logging_handler` to configure a specific StreamHandler from @nmaynes (PR #135)
- Add `CHANGELOG.md` entries, format is based on Keep a Changelog (PR #141)
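The logging entries of this release describe a common pattern: a library logger that stays silent by default, plus a stream handler attached only for the duration of one call. The sketch below shows that pattern with the standard `logging` module. The logger name `demo_library`, the function `detect` and its `stream` parameter are hypothetical stand-ins, not charset_normalizer's implementation.

```python
import io
import logging

# Library-style logger: a NullHandler keeps it silent unless the user opts in.
logger = logging.getLogger("demo_library")  # hypothetical logger name
logger.addHandler(logging.NullHandler())


def detect(payload, explain=False, stream=None):
    """Hypothetical stand-in for a detection function with an `explain` switch."""
    handler = None
    previous_level = logger.level
    if explain:
        # Attach a stream handler only for the lifespan of this call.
        handler = logging.StreamHandler(stream)
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
    try:
        logger.debug("inspecting %d bytes", len(payload))
        return payload.decode("utf-8")
    finally:
        if handler is not None:
            # Restore the logger to its previous, silent state.
            logger.removeHandler(handler)
            logger.setLevel(previous_level)


buffer = io.StringIO()
detect(b"abc", explain=True, stream=buffer)
print(buffer.getvalue().strip())
```

Removing the handler in a `finally` clause ensures it does not outlive the call even when decoding raises.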
Version 2.0.7
We have arrived at a pretty stable state.
Changes:
- Addition: 🍱 Add support for Kazakh (Cyrillic) language detection #109
- Improvement: ❇️ Further improve inferring the language from a given code page (single-byte) #112
- Removed: 🔥 Remove redundant logging entry about detected language(s) #115
- Miscellaneous: 🔧 Trying to leverage PEP263 when PEP3120 is not supported #116
  - While I do not think that this (116) will actually fix something, it will rather raise a `SyntaxError` (Not about ASCII decoding error) for those trying to install this package using a non-supported Python version
- Improvement: ⚡ Refactoring for potential performance improvements in loops #113 @adbar
- Improvement: ✨ Various detection improvement (MD+CD) #117
- Bugfix: 🐛 Fix a minor inconsistency between Python 3.5 and other versions regarding language detection #117 #102
This version pushes forward the detection-coverage to 98%! https://github.com/Ousret/charset_normalizer/runs/3863881150
With the current dataset, the great filter (the ceiling that detection cannot exceed) should be 99%, a target for future releases.