fix: prevent path traversal (CWE-22) in corpus and data path functions by Copilot · Pull Request #1342 · PyThaiNLP/pythainlp

Copilot · 2026-03-18T09:42:10Z

Path-building functions accepted external input without verifying the resolved path stayed inside its intended base directory, enabling path traversal attacks (CWE-22). Two internal helper functions (path_pythainlp_corpus and get_path_folder_corpus) were identified as redundant attack surfaces and removed.

What do these changes do

Added safe_path_join(base, *parts) as a public helper in pythainlp/tools/path.py and exported from pythainlp.tools — the single authoritative traversal guard used throughout the library, eliminating code duplication
get_full_data_path(path) — delegates to safe_path_join; now rejects any path that resolves outside the PyThaiNLP data directory and returns a normalized absolute path
Removed path_pythainlp_corpus() — its one internal call site in get_corpus_default_db() now calls safe_path_join(corpus_path(), filename) directly
Removed get_path_folder_corpus() — its one call site in pythainlp/tag/wangchanberta_onnx.py now calls get_corpus_path() once to obtain the base path, then uses safe_path_join(base, file) for each file access; relies on get_corpus_path() to raise FileNotFoundError for missing corpora (including offline-mode checks); a minimal if not _corpus_base guard prevents a TypeError when get_corpus_path() returns None on download failure
Both removed functions are no longer exported from pythainlp/corpus/__init__.py
14 tests in tests/core/test_security.py covering safe inputs and traversal rejections, using _is_within_directory for proper containment assertions
Applied black formatting to pythainlp/tag/wangchanberta_onnx.py; fixed a flake8 E303 warning in tests/core/test_security.py

What was wrong

Path-building helpers called os.path.join(base, user_input) and returned the raw joined string without normalization or containment validation. A caller supplying "../../etc/passwd" or similar would silently receive a path outside the intended directory. The traversal check was also duplicated inline in multiple functions rather than sharing a common implementation. Two helper functions (path_pythainlp_corpus, get_path_folder_corpus) each had only a single internal call site, making them unnecessary intermediaries that widened the attack surface.

# Before — no validation
get_full_data_path("../../etc/passwd")
# → /root/pythainlp-data/../../etc/passwd  (resolves to /etc/passwd)

How this fixes it

safe_path_join(base, *parts) normalizes base with os.path.abspath, joins it with *parts, normalizes the result, and asserts the result starts with the base prefix before returning the normalized absolute path. All path-building code delegates to this single public helper. Removing path_pythainlp_corpus and get_path_folder_corpus eliminates two redundant layers through which external input could flow toward the filesystem. get_corpus_path() is the sole authority for resolving corpus locations and raising FileNotFoundError (including offline-mode handling), avoiding duplicated error logic at call sites.

# After — traversal blocked
get_full_data_path("../../etc/passwd")
# → ValueError: Path traversal attempt detected: resolved path '/etc/passwd'
#              is outside the base directory '/root/pythainlp-data'.

Your checklist for this pull request

Passed code styles and structures
Passed code linting checks and unit test

Original prompt

This section details on the original issue you should resolve

<issue_title>bug: Security issue - Path manupulation</issue_title>
<issue_description>### Description

After download PyThaiNLP v5.3.1 and use Fortify Tool to scan source code, a number of security issues/vulnerabilities are still found.
One category of issues is about Path Manipulation.

pythainlp-v5.3.1 17MAR2026 DeveloperWorkbook.pdf

Here is part of explanation from Fortify Tool
Explanation
Path manipulation errors occur when the following two conditions are met: 1. An attacker can specify a path
used in an operation on the file system. 2. By specifying the resource, the attacker gains a capability that
would not otherwise be permitted. For example, the program might give the attacker the ability to overwrite
the specified file or run with a configuration controlled by the attacker. Example 1: The following code uses
input from an HTTP request to create a file name. The programmer has not considered the possibility that
an attacker could provide a file name such as "../../tomcat/conf/server.xml", which causes the
application to delete one of its own configuration files.
rName = req.field('reportName')
rFile = os.open("/usr/local/apfr/reports/" + rName)
...
os.unlink(rFile);
Example 2: The following code uses input from a configuration file to determine which file to open and
echo back to the user. If the program runs with adequate privileges and malicious users can change the
configuration file, they can use the program to read any file on the system that ends with the extension
.txt.
...
filename = CONFIG_TXT['sub'] + ".txt";
handle = os.open(filename)
print handle
...

Expected results

If possible, security issue about Path Manipulation should be fixed.
Here is recommendation from Fortify tool:

The best way to prevent path manipulation is with a level of indirection: create a list of legitimate values
from which the user must select. With this approach, the user-provided input is never used directly to
specify the resource name. In some situations this approach is impractical because the set of legitimate
resource names is too large or too hard to maintain. Programmers often resort to implementing a deny list
in these situations. A deny list is used to selectively reject or escape potentially dangerous characters
before using the input. However, any such list of unsafe characters is likely to be incomplete and will almost
certainly become out of date. A better approach is to create a list of characters that are permitted to appear
in the resource name and accept input composed exclusively of characters in the approved set.

Current results

There are 34 issues found about Path Manipulation.

Steps to reproduce

Run source code scan tool, such as Fortify.

PyThaiNLP version

5.3.1

Python version

3.12

Operating system and version

Windows 11

More info

No response

Possible solution

No response

Files

No response</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes bug: Security issue - Path manupulation #1341

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

…pus, and get_path_folder_corpus Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

coveralls · 2026-03-18T09:59:49Z

coverage: 66.333% (+0.02%) from 66.312%
when pulling 54d11e6 on copilot/fix-security-path-manipulation
into fcb9210 on dev.

Copilot

Pull request overview

Adds path traversal protections to path-construction helpers used for PyThaiNLP data/corpus access, and introduces security regression tests to validate rejection of traversal attempts.

Changes:

Add base-directory containment validation to get_full_data_path(), path_pythainlp_corpus(), and get_path_folder_corpus().
Add/expand unit tests for traversal rejection on get_full_data_path() and path_pythainlp_corpus().
Document the new ValueError behavior in docstrings and the changelog.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`pythainlp/tools/path.py`	Add traversal validation to `get_full_data_path()` and document new `ValueError`.
`pythainlp/corpus/core.py`	Add traversal validation to bundled-corpus path helper and downloaded-corpus folder path helper.
`tests/core/test_security.py`	Add regression tests covering safe inputs and traversal rejections for the updated helpers.
`CHANGELOG.md`	Document the security hardening behavior change.

You can also share your feedback on Copilot code review. Take the survey.

tests/core/test_security.py

pythainlp/corpus/core.py

pythainlp/tools/path.py

bact · 2026-03-18T16:53:24Z

@copilot fix all the issues as addressed in this thread #1342 (review)

Also mind a possible code redundancy / opportunities for code/function reuse in relevant functions like get_pythainlp_data_path(), get_path_folder_corpus(), get_corpus_path(), get_full_data_path(), path_pythainlp_corpus(). The path transversal could be rejected at some common point that some of these functions has to rely/depend on, without the need to repeat the path transversal check in a lot of places.

Also mind that typical string .endsWith("file.txt") will also match "afile.txt", aside from "dir/file.txt".

When refactor, thinking about security implications as the security is our main goal here.

…tests Co-authored-by: bact <128572+bact@users.noreply.github.com>

Copilot · 2026-03-18T17:01:07Z

@copilot fix all the issues as addressed in this thread #1342 (review)

Also mind a possible code redundancy / opportunities for code/f...

Done in 503f2c2. Here's what changed:

Refactoring — single check point:
Extracted _safe_path_join(base, *parts) -> str into pythainlp/tools/path.py. It normalizes both the base and the joined result with os.path.abspath, asserts containment, and returns the normalized absolute path. All three functions (get_full_data_path, path_pythainlp_corpus, get_path_folder_corpus) now delegate to this one helper — no duplicated logic.

Normalized return value:
All three functions now return the abspath-normalized result instead of the raw os.path.join string, removing any ambiguity from relative segments or platform differences.

Fixed weak test assertions:
test_get_full_data_path_safe and test_path_pythainlp_corpus_safe now use _is_within_directory(base, result) instead of endswith(...), which actually verifies the security property (and avoids the "afile.txt" false-positive you noted).

New tests for get_path_folder_corpus:
Added test_get_path_folder_corpus_safe and test_get_path_folder_corpus_rejects_traversal, both using unittest.mock.patch to supply a temp directory as the corpus base without needing a real downloaded corpus.

bact · 2026-03-19T08:23:15Z

@copilot Remove redundant functions, remove the attack surfaces.

Consider removing get_path_folder_corpus() and path_pythainlp_corpus() functions completely from the library.

There is only one use of get_path_folder_corpus() in pythainlp/tag/wangchanberta_onnx.py. Replace it with combination of get_corpus_path() and _safe_path_join()
path_pythainlp_corpus() is only used inside get_corpus_default_db() in pythainlp/corpus/core.py. Replace it with _safe_path_join().

This can be considered as "non-breaking" changes, because these functions were originally intended for internal use.

After changes are made, review again for security, zip-safety, and documentation/comment consistency.

…e _safe_path_join inline Co-authored-by: bact <128572+bact@users.noreply.github.com>

…lp.tools Co-authored-by: bact <128572+bact@users.noreply.github.com>

Copilot · 2026-03-19T08:32:04Z

@copilot Remove redundant functions, remove the attack surfaces.

Consider removing get_path_folder_corpus() and path_pythainlp_corpus() functions completely from the library.
...

Done in 6271e24.

path_pythainlp_corpus() removed — its two call sites in get_corpus_default_db() now call safe_path_join(corpus_path(), filename) inline.
get_path_folder_corpus() removed — wangchanberta_onnx.py now calls get_corpus_path() once to get the base path, raises FileNotFoundError if it is None, then uses safe_path_join(base, file) for each of the three file accesses.
Both names removed from pythainlp/corpus/__init__.py (__all__ and import).
_safe_path_join promoted to the public safe_path_join and exported from pythainlp.tools — so no private function is imported outside pythainlp/tools/path.py.

bact

@copilot Fix the issues as addressed.

pythainlp/tag/wangchanberta_onnx.py

CHANGELOG.md

Co-authored-by: bact <128572+bact@users.noreply.github.com>

bact · 2026-03-19T08:56:39Z

@copilot Run unittest tests.noauto_onnx test suite, to confirm there is no issue with the new pythainlp/tag/wangchanberta_onnx.py.

Run mypy, pyright, pyrefly, ruff, flake8, black, and any static type checkers.

Fix any issue.

Review code and documentation again.

The goal is to completely avoid path traversal (CWE-22) and related path and security issues.

Co-authored-by: bact <128572+bact@users.noreply.github.com>

Copilot · 2026-03-19T09:01:57Z

@copilot Run unittest tests.noauto_onnx test suite, to confirm there is no issue with the new pythainlp/tag/wangchanberta_onnx.py.

Run mypy, pyright, pyrefly, ruff, flake8, black...

Done in cad6cdd. Here's the summary:

tests.noauto_onnx: The test suite requires oskut and other heavy ONNX dependencies not present in this environment (these tests are explicitly excluded from automated CI). The wangchanberta_onnx.py module imports cleanly, its class signature is intact, and all method signatures are preserved — validated via py_compile and AST parsing.

Static analysis results:

ruff: All checks pass
mypy: No errors on changed files
flake8: Fixed a pre-existing E303 (too many blank lines) in tests/core/test_security.py
black: Applied reformatting to wangchanberta_onnx.py (unwrapped some overly-wrapped lines)
pyright: 6 errors — all pre-existing, all due to missing onnxruntime/sentencepiece type stubs in this environment (same errors exist on the original file before any PR changes)
CodeQL: 0 alerts

Security (tests/core/test_security.py): 26/26 tests pass — path traversal is blocked at safe_path_join for all get_full_data_path and corpus path calls.

pythainlp/tag/wangchanberta_onnx.py

This release addresses security issues related to path traversal by validating that paths stay within their expected base directory.

sonarqubecloud · 2026-03-19T09:18:26Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

bact · 2026-03-19T09:37:04Z

@wannaphong please review

Note that get_path_folder_corpus() and path_pythainlp_corpus() were only used internally, so removing them can be considered "non-breaking change".

Initial plan

3aca45d

Copilot AI assigned Copilot and wannaphong Mar 18, 2026

Copilot started work on behalf of wannaphong March 18, 2026 09:42 View session

fix: prevent path traversal in get_full_data_path, path_pythainlp_cor…

7795c2b

…pus, and get_path_folder_corpus Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix security issue related to path manipulation~~ fix: prevent path traversal (CWE-22) in corpus and data path functions Mar 18, 2026

Copilot AI requested a review from wannaphong March 18, 2026 09:54

Copilot finished work on behalf of wannaphong March 18, 2026 09:54

bact requested a review from Copilot March 18, 2026 14:01

Copilot started reviewing on behalf of bact March 18, 2026 14:02 View session

Copilot AI reviewed Mar 18, 2026

View reviewed changes

tests/core/test_security.py Show resolved Hide resolved

tests/core/test_security.py Outdated Show resolved Hide resolved

pythainlp/corpus/core.py Outdated Show resolved Hide resolved

pythainlp/tools/path.py Outdated Show resolved Hide resolved

Copilot started work on behalf of bact March 18, 2026 16:53 View session

refactor: consolidate path traversal check into _safe_path_join; fix …

503f2c2

…tests Co-authored-by: bact <128572+bact@users.noreply.github.com>

Copilot AI requested a review from bact March 18, 2026 17:02

Copilot finished work on behalf of bact March 18, 2026 17:02

bact marked this pull request as ready for review March 19, 2026 07:55

bact added the security label Mar 19, 2026

bact added this to the 5.3.2 milestone Mar 19, 2026

Copilot started work on behalf of bact March 19, 2026 08:23 View session

Copilot AI and others added 2 commits March 19, 2026 08:29

refactor: remove path_pythainlp_corpus and get_path_folder_corpus; us…

c08e199

…e _safe_path_join inline Co-authored-by: bact <128572+bact@users.noreply.github.com>

refactor: promote _safe_path_join to public safe_path_join in pythain…

6271e24

…lp.tools Co-authored-by: bact <128572+bact@users.noreply.github.com>

Copilot finished work on behalf of bact March 19, 2026 08:33

bact requested changes Mar 19, 2026

View reviewed changes

pythainlp/tag/wangchanberta_onnx.py Outdated Show resolved Hide resolved

CHANGELOG.md Outdated Show resolved Hide resolved

Copilot started work on behalf of bact March 19, 2026 08:40 View session

fix: remove duplicate FileNotFoundError guard; update CHANGELOG entry

10e981c

Co-authored-by: bact <128572+bact@users.noreply.github.com>

Copilot AI requested a review from bact March 19, 2026 08:44

Copilot finished work on behalf of bact March 19, 2026 08:44

Copilot started work on behalf of bact March 19, 2026 08:56 View session

style: apply black formatting; fix flake8 E303 in test_security.py

cad6cdd

Co-authored-by: bact <128572+bact@users.noreply.github.com>

Copilot finished work on behalf of bact March 19, 2026 09:03

bact requested changes Mar 19, 2026

View reviewed changes

bact added 8 commits March 19, 2026 09:10

Update pythainlp/tag/wangchanberta_onnx.py

57d2839

Update pythainlp/tag/wangchanberta_onnx.py

cde810f

Update pythainlp/tag/wangchanberta_onnx.py

a61e4e6

Update pythainlp/tag/wangchanberta_onnx.py

504216d

Update pythainlp/tag/wangchanberta_onnx.py

d0ccffb

Update pythainlp/tag/wangchanberta_onnx.py

c2881ae

Update CHANGELOG for release 5.3.2

c01ad33

This release addresses security issues related to path traversal by validating that paths stay within their expected base directory.

Update AGENTS.md about path join

54d11e6

bact approved these changes Mar 19, 2026

View reviewed changes

Conversation

Copilot AI commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What do these changes do

What was wrong

How this fixes it

Your checklist for this pull request

Expected results

Current results

Steps to reproduce

PyThaiNLP version

Python version

Operating system and version

More info

Possible solution

Files

Comments on the Issue (you are @copilot in this section)

Uh oh!

coveralls commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bact commented Mar 18, 2026

Uh oh!

Copilot AI commented Mar 18, 2026

Uh oh!

bact commented Mar 19, 2026

Uh oh!

Copilot AI commented Mar 19, 2026

Uh oh!

bact left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bact commented Mar 19, 2026

Uh oh!

Copilot AI commented Mar 19, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud bot commented Mar 19, 2026

Quality Gate passed

Uh oh!

bact commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Copilot AI commented Mar 18, 2026 •

edited

Loading

coveralls commented Mar 18, 2026 •

edited

Loading

bact commented Mar 19, 2026 •

edited

Loading