Skip to content

Fix PEP 8 naming violations: add is_thai_char/is_thai/count_thai; create pythainlp.chunk with CRFChunkParser; enforce private API boundaries#1339

Open
Copilot wants to merge 9 commits intodevfrom
copilot/refactor-inconsistent-naming-conventions
Open

Fix PEP 8 naming violations: add is_thai_char/is_thai/count_thai; create pythainlp.chunk with CRFChunkParser; enforce private API boundaries#1339
Copilot wants to merge 9 commits intodevfrom
copilot/refactor-inconsistent-naming-conventions

Conversation

Copy link
Contributor

Copilot AI commented Mar 15, 2026

Public API had several snake_case violations, inconsistent class naming, inappropriate name mangling, internal symbols leaking into the public namespace, and chunk-parsing logic housed in the wrong module.

What do these changes do

  • Adds is_thai_char(), is_thai(), count_thai() as properly named replacements for isthaichar(), isthai(), countthai()
  • Creates a new pythainlp.chunk module (following the NLTK nltk.chunk convention) containing:
    • CRFChunkParser — canonical CRF-based chunk parser class (moved from pythainlp.tag.crfchunk)
    • chunk_parse() — convenience function (moved from pythainlp.tag)
  • Makes internal-only helpers explicitly private (adds _ prefix):
    • extract_features_extract_features in tag/crfchunk.py and tokenize/crfcut.py
    • ThaiNameTagger.__extract_features_extract_features in tag/thainer.py (removes unnecessary name mangling)
    • THAI_CHAR_NAMES_THAI_CHAR_NAMES in util/thai.py, additionally wrapped in types.MappingProxyType to make it read-only at runtime
  • Removes cross-module imports of private symbols:
    • tokenize/nlpo3.py no longer imports _THAI_WORDS_FILENAME from corpus/common
    • Tests now use the public corpus_path() function instead of _CORPUS_PATH
  • Updates pythainlp.tools.warn_deprecation: message format made generic (applies to both functions and classes); parameters renamed from deprecated_func/replacing_func to deprecated_symbol/replacing_symbol
  • Exports new names from pythainlp.util
  • Updates all internal call sites, docstrings, and code examples

What was wrong

isthaichar, isthai, and countthai violated PEP 8 snake_case and NLTK naming conventions (is_* for boolean checks). Chunk-parsing logic lived in pythainlp.tag even though it is a chunker, not a tagger — the correct home is pythainlp.chunk, mirroring NLTK's nltk.chunk module.

Additionally, several internal-only symbols lacked a _ prefix, making them appear public. ThaiNameTagger.__extract_features used Python name mangling (double underscore) unnecessarily, making the method harder to discover and debug. _THAI_CHAR_NAMES was a plain mutable dict, allowing accidental mutation of a constant. Private constants (_THAI_WORDS_FILENAME, _CORPUS_PATH) were being imported directly by other modules and tests, creating hidden coupling to internal implementation details. warn_deprecation's message said "function" even when deprecating classes.

How this fixes it

New Pythonic names are the canonical implementations. Old names become deprecated wrappers emitting DeprecationWarning (scheduled for removal in 6.0):

# Before (still works, now deprecated)
from pythainlp.util import isthaichar, isthai, countthai
isthaichar("ก")   # DeprecationWarning → use is_thai_char
isthai("ไทย")     # DeprecationWarning → use is_thai
countthai("ไทย")  # DeprecationWarning → use count_thai

# After
from pythainlp.util import is_thai_char, is_thai, count_thai
is_thai_char("ก")   # True
is_thai("ไทย")      # True
count_thai("ไทย")   # 100.0

pythainlp.chunk is the new canonical location for chunk parsing:

# Before (still works, now deprecated)
from pythainlp.tag import chunk_parse
from pythainlp.tag.crfchunk import CRFchunk

# After
from pythainlp.chunk import chunk_parse, CRFChunkParser
result = chunk_parse(tokens_pos)

with CRFChunkParser() as parser:
    result = parser.parse(tokens_pos)

CRFchunk (lowercase c) is kept as a deprecated subclass of CRFChunkParser that emits the same pattern of warning on instantiation. CRFChunk was introduced in this same PR so it is renamed directly to CRFChunkParser without a separate deprecation path. pythainlp.tag.chunk_parse is kept as a deprecated wrapper around pythainlp.chunk.chunk_parse.

_THAI_CHAR_NAMES is wrapped in types.MappingProxyType — a stdlib read-only proxy with negligible overhead (~2%) and no new dependencies — so accidental mutation raises TypeError at runtime.

Internal helpers are renamed with _ prefix so they are clearly not part of the public API and can be changed freely in future releases. Cross-module private imports are replaced with either local constants or public accessor functions, so the internal implementation of each module is no longer exposed to its callers.

All five static type checkers pass with zero errors on changed files: ruff, mypy, pyright, flake8, and pyrefly.

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

Scan the code and looking for inconsistent property names, class names, function/method signatures, including ordering of parameters.

Make these consistent, Pythonic, following best practices, and NLTK conventions, allowing developers to predict the pattern. Public API must be in English.

Provide smooth deprecation path and documentation when possible.
Breaking changes are allowed.

The goal is to have a coherent and cohesive internal code and public API at the same time, reducing cognitive complexity and reduce mistakes from misunderstanding.


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 2 commits March 15, 2026 23:32
…Fchunk to CRFChunk

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title [WIP] Refactor inconsistent property names and method signatures Fix PEP 8 naming violations: add is_thai_char, is_thai, count_thai; rename CRFchunkCRFChunk Mar 15, 2026
Copilot AI requested a review from bact March 15, 2026 23:34
Copilot finished work on behalf of bact March 15, 2026 23:34
@bact
Copy link
Member

bact commented Mar 16, 2026

@copilot also check the uses of name mangling, private members. See if all them are appropriate.

The goal is to have consistent stable public API and try not to expose anything internal or unnecessary (that would make it difficult to change the API in the future)

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 16, 2026

@copilot also check the uses of name mangling, private members. See if all them are appropriate.

The goal is to have consistent stable public API and try not to expose anything internal or unnecessar...

Done in commit 1f99736. Here's what was fixed:

Name mangling removed:

  • ThaiNameTagger.__extract_features_extract_features in tag/thainer.py (double-underscore name mangling was unnecessary and made debugging harder)

Internal helpers made private (added _ prefix):

  • extract_features_extract_features in tag/crfchunk.py (only used by CRFChunk.parse)
  • extract_features_extract_features in tokenize/crfcut.py (only used internally)
  • THAI_CHAR_NAMES_THAI_CHAR_NAMES in util/thai.py (internal lookup dict, not part of public API)

Cross-module private imports removed:

  • tokenize/nlpo3.py was importing _THAI_WORDS_FILENAME from corpus/common — now uses a local constant
  • tests/core/test_util.py was importing _CORPUS_PATH and _THAI_WORDS_FILENAME — now uses the public corpus_path() function and the literal filename "words_th.txt"

Copilot AI changed the title Fix PEP 8 naming violations: add is_thai_char, is_thai, count_thai; rename CRFchunkCRFChunk Fix PEP 8 naming violations: add is_thai_char, is_thai, count_thai; rename CRFchunkCRFChunk; enforce private API boundaries Mar 16, 2026
Copilot finished work on behalf of bact March 16, 2026 19:06
@coveralls
Copy link

coveralls commented Mar 16, 2026

Coverage Status

coverage: 66.45% (+0.1%) from 66.312%
when pulling 4d863d7 on copilot/refactor-inconsistent-naming-conventions
into fcb9210 on dev.

@bact bact added the refactoring a technical improvement which does not add any new features or change existing features. label Mar 18, 2026
@bact
Copy link
Member

bact commented Mar 18, 2026

@wannaphong Do we like to have the THAI_CHAR_NAMES as public ?

@wannaphong
Copy link
Member

@wannaphong Do we like to have the THAI_CHAR_NAMES as public ?

No, I think I forget to add.

Updated changelog for version 5.3.2, including security improvements and deprecated functions.
Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Deprecate pythainlp.tag.crfchunk.CRFchunk and use pythainlp.chunk.CRFChunkParser instead. Apply suggestions in comments.

…ers pass

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 19, 2026

@copilot Deprecate pythainlp.tag.crfchunk.CRFchunk and use pythainlp.chunk.CRFChunkParser instead. Apply suggestions in comments.

Done in commit 21b214b. Here's what was implemented:

New pythainlp.chunk module (following NLTK nltk.chunk convention):

  • pythainlp/chunk/crfchunk.py — canonical CRFChunkParser class (implementation moved from CRFChunk)
  • pythainlp/chunk/__init__.py — exports CRFChunkParser and chunk_parse()

Deprecated wrappers (removed in 6.0):

  • pythainlp.tag.crfchunk.CRFchunkpythainlp.chunk.CRFChunkParser (since 5.3.2)
  • pythainlp.tag.chunk_parse()pythainlp.chunk.chunk_parse() (since 5.3.2)

Note: CRFChunk was only introduced in this PR, so no deprecation path is needed for it — it was renamed directly to CRFChunkParser in pythainlp.chunk.

Deprecation consistency fixes:

  • warn_deprecation params renamed: deprecated_funcdeprecated_symbol, replacing_funcreplacing_symbol; message made generic (no longer says "function")
  • All new deprecations use version "5.3.2" (not the incorrect "5.1")

_THAI_CHAR_NAMES now frozen: wrapped in types.MappingProxyType — read-only at runtime, ~2% overhead, no new dependencies. frozendict (third-party, hashable) was considered but rejected since the dict is never used as a key.

All 5 type checkers pass: ruff ✓ mypy ✓ pyright ✓ flake8 ✓ pyrefly ✓. 85 tests pass.

Copilot AI changed the title Fix PEP 8 naming violations: add is_thai_char, is_thai, count_thai; rename CRFchunkCRFChunk; enforce private API boundaries Fix PEP 8 naming violations: add is_thai_char/is_thai/count_thai; create pythainlp.chunk with CRFChunkParser; enforce private API boundaries Mar 19, 2026
Copilot finished work on behalf of bact March 19, 2026 10:24
Copilot AI requested a review from bact March 19, 2026 10:24
@sonarqubecloud
Copy link

@bact bact marked this pull request as ready for review March 19, 2026 10:31
@bact bact modified the milestone: 5.3.2 Mar 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactoring a technical improvement which does not add any new features or change existing features.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants