fix: prevent path traversal (CWE-22) in corpus and data path functions#1342
fix: prevent path traversal (CWE-22) in corpus and data path functions#1342
Conversation
…pus, and get_path_folder_corpus Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds path traversal protections to path-construction helpers used for PyThaiNLP data/corpus access, and introduces security regression tests to validate rejection of traversal attempts.
Changes:
- Add base-directory containment validation to
get_full_data_path(),path_pythainlp_corpus(), andget_path_folder_corpus(). - Add/expand unit tests for traversal rejection on
get_full_data_path()andpath_pythainlp_corpus(). - Document the new
ValueErrorbehavior in docstrings and the changelog.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
pythainlp/tools/path.py |
Add traversal validation to get_full_data_path() and document new ValueError. |
pythainlp/corpus/core.py |
Add traversal validation to bundled-corpus path helper and downloaded-corpus folder path helper. |
tests/core/test_security.py |
Add regression tests covering safe inputs and traversal rejections for the updated helpers. |
CHANGELOG.md |
Document the security hardening behavior change. |
You can also share your feedback on Copilot code review. Take the survey.
|
@copilot fix all the issues as addressed in this thread #1342 (review) Also mind a possible code redundancy / opportunities for code/function reuse in relevant functions like get_pythainlp_data_path(), get_path_folder_corpus(), get_corpus_path(), get_full_data_path(), path_pythainlp_corpus(). The path transversal could be rejected at some common point that some of these functions has to rely/depend on, without the need to repeat the path transversal check in a lot of places. Also mind that typical string .endsWith("file.txt") will also match "afile.txt", aside from "dir/file.txt". When refactor, thinking about security implications as the security is our main goal here. |
…tests Co-authored-by: bact <128572+bact@users.noreply.github.com>
Done in 503f2c2. Here's what changed: Refactoring — single check point: Normalized return value: Fixed weak test assertions: New tests for |
|
@copilot Remove redundant functions, remove the attack surfaces. Consider removing get_path_folder_corpus() and path_pythainlp_corpus() functions completely from the library.
This can be considered as "non-breaking" changes, because these functions were originally intended for internal use. After changes are made, review again for security, zip-safety, and documentation/comment consistency. |
…e _safe_path_join inline Co-authored-by: bact <128572+bact@users.noreply.github.com>
…lp.tools Co-authored-by: bact <128572+bact@users.noreply.github.com>
Done in 6271e24.
|
Co-authored-by: bact <128572+bact@users.noreply.github.com>
|
@copilot Run unittest tests.noauto_onnx test suite, to confirm there is no issue with the new pythainlp/tag/wangchanberta_onnx.py. Run mypy, pyright, pyrefly, ruff, flake8, black, and any static type checkers. Fix any issue. Review code and documentation again. The goal is to completely avoid path traversal (CWE-22) and related path and security issues. |
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Done in cad6cdd. Here's the summary:
Static analysis results:
Security ( |
This release addresses security issues related to path traversal by validating that paths stay within their expected base directory.
|
|
@wannaphong please review Note that |



Path-building functions accepted external input without verifying the resolved path stayed inside its intended base directory, enabling path traversal attacks (CWE-22). Two internal helper functions (
path_pythainlp_corpusandget_path_folder_corpus) were identified as redundant attack surfaces and removed.What do these changes do
safe_path_join(base, *parts)as a public helper inpythainlp/tools/path.pyand exported frompythainlp.tools— the single authoritative traversal guard used throughout the library, eliminating code duplicationget_full_data_path(path)— delegates tosafe_path_join; now rejects anypaththat resolves outside the PyThaiNLP data directory and returns a normalized absolute pathpath_pythainlp_corpus()— its one internal call site inget_corpus_default_db()now callssafe_path_join(corpus_path(), filename)directlyget_path_folder_corpus()— its one call site inpythainlp/tag/wangchanberta_onnx.pynow callsget_corpus_path()once to obtain the base path, then usessafe_path_join(base, file)for each file access; relies onget_corpus_path()to raiseFileNotFoundErrorfor missing corpora (including offline-mode checks); a minimalif not _corpus_baseguard prevents aTypeErrorwhenget_corpus_path()returnsNoneon download failurepythainlp/corpus/__init__.pytests/core/test_security.pycovering safe inputs and traversal rejections, using_is_within_directoryfor proper containment assertionsblackformatting topythainlp/tag/wangchanberta_onnx.py; fixed a flake8 E303 warning intests/core/test_security.pyWhat was wrong
Path-building helpers called
os.path.join(base, user_input)and returned the raw joined string without normalization or containment validation. A caller supplying"../../etc/passwd"or similar would silently receive a path outside the intended directory. The traversal check was also duplicated inline in multiple functions rather than sharing a common implementation. Two helper functions (path_pythainlp_corpus,get_path_folder_corpus) each had only a single internal call site, making them unnecessary intermediaries that widened the attack surface.How this fixes it
safe_path_join(base, *parts)normalizesbasewithos.path.abspath, joins it with*parts, normalizes the result, and asserts the result starts with the base prefix before returning the normalized absolute path. All path-building code delegates to this single public helper. Removingpath_pythainlp_corpusandget_path_folder_corpuseliminates two redundant layers through which external input could flow toward the filesystem.get_corpus_path()is the sole authority for resolving corpus locations and raisingFileNotFoundError(including offline-mode handling), avoiding duplicated error logic at call sites.Your checklist for this pull request
Original prompt
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.