Skip to content

fix: prevent path traversal (CWE-22) in corpus and data path functions#1342

Open
Copilot wants to merge 15 commits intodevfrom
copilot/fix-security-path-manipulation
Open

fix: prevent path traversal (CWE-22) in corpus and data path functions#1342
Copilot wants to merge 15 commits intodevfrom
copilot/fix-security-path-manipulation

Conversation

Copy link
Contributor

Copilot AI commented Mar 18, 2026

Path-building functions accepted external input without verifying the resolved path stayed inside its intended base directory, enabling path traversal attacks (CWE-22). Two internal helper functions (path_pythainlp_corpus and get_path_folder_corpus) were identified as redundant attack surfaces and removed.

What do these changes do

  • Added safe_path_join(base, *parts) as a public helper in pythainlp/tools/path.py and exported from pythainlp.tools — the single authoritative traversal guard used throughout the library, eliminating code duplication
  • get_full_data_path(path) — delegates to safe_path_join; now rejects any path that resolves outside the PyThaiNLP data directory and returns a normalized absolute path
  • Removed path_pythainlp_corpus() — its one internal call site in get_corpus_default_db() now calls safe_path_join(corpus_path(), filename) directly
  • Removed get_path_folder_corpus() — its one call site in pythainlp/tag/wangchanberta_onnx.py now calls get_corpus_path() once to obtain the base path, then uses safe_path_join(base, file) for each file access; relies on get_corpus_path() to raise FileNotFoundError for missing corpora (including offline-mode checks); a minimal if not _corpus_base guard prevents a TypeError when get_corpus_path() returns None on download failure
  • Both removed functions are no longer exported from pythainlp/corpus/__init__.py
  • 14 tests in tests/core/test_security.py covering safe inputs and traversal rejections, using _is_within_directory for proper containment assertions
  • Applied black formatting to pythainlp/tag/wangchanberta_onnx.py; fixed a flake8 E303 warning in tests/core/test_security.py

What was wrong

Path-building helpers called os.path.join(base, user_input) and returned the raw joined string without normalization or containment validation. A caller supplying "../../etc/passwd" or similar would silently receive a path outside the intended directory. The traversal check was also duplicated inline in multiple functions rather than sharing a common implementation. Two helper functions (path_pythainlp_corpus, get_path_folder_corpus) each had only a single internal call site, making them unnecessary intermediaries that widened the attack surface.

# Before — no validation
get_full_data_path("../../etc/passwd")
# → /root/pythainlp-data/../../etc/passwd  (resolves to /etc/passwd)

How this fixes it

safe_path_join(base, *parts) normalizes base with os.path.abspath, joins it with *parts, normalizes the result, and asserts the result starts with the base prefix before returning the normalized absolute path. All path-building code delegates to this single public helper. Removing path_pythainlp_corpus and get_path_folder_corpus eliminates two redundant layers through which external input could flow toward the filesystem. get_corpus_path() is the sole authority for resolving corpus locations and raising FileNotFoundError (including offline-mode handling), avoiding duplicated error logic at call sites.

# After — traversal blocked
get_full_data_path("../../etc/passwd")
# → ValueError: Path traversal attempt detected: resolved path '/etc/passwd'
#              is outside the base directory '/root/pythainlp-data'.

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

This section details on the original issue you should resolve

<issue_title>bug: Security issue - Path manupulation</issue_title>
<issue_description>### Description

After download PyThaiNLP v5.3.1 and use Fortify Tool to scan source code, a number of security issues/vulnerabilities are still found.
One category of issues is about Path Manipulation.

pythainlp-v5.3.1 17MAR2026 DeveloperWorkbook.pdf

Here is part of explanation from Fortify Tool
Explanation
Path manipulation errors occur when the following two conditions are met: 1. An attacker can specify a path
used in an operation on the file system. 2. By specifying the resource, the attacker gains a capability that
would not otherwise be permitted. For example, the program might give the attacker the ability to overwrite
the specified file or run with a configuration controlled by the attacker. Example 1: The following code uses
input from an HTTP request to create a file name. The programmer has not considered the possibility that
an attacker could provide a file name such as "../../tomcat/conf/server.xml", which causes the
application to delete one of its own configuration files.
rName = req.field('reportName')
rFile = os.open("/usr/local/apfr/reports/" + rName)
...
os.unlink(rFile);
Example 2: The following code uses input from a configuration file to determine which file to open and
echo back to the user. If the program runs with adequate privileges and malicious users can change the
configuration file, they can use the program to read any file on the system that ends with the extension
.txt.
...
filename = CONFIG_TXT['sub'] + ".txt";
handle = os.open(filename)
print handle
...

Expected results

If possible, security issue about Path Manipulation should be fixed.
Here is recommendation from Fortify tool:

The best way to prevent path manipulation is with a level of indirection: create a list of legitimate values
from which the user must select. With this approach, the user-provided input is never used directly to
specify the resource name. In some situations this approach is impractical because the set of legitimate
resource names is too large or too hard to maintain. Programmers often resort to implementing a deny list
in these situations. A deny list is used to selectively reject or escape potentially dangerous characters
before using the input. However, any such list of unsafe characters is likely to be incomplete and will almost
certainly become out of date. A better approach is to create a list of characters that are permitted to appear
in the resource name and accept input composed exclusively of characters in the approved set.

Current results

There are 34 issues found about Path Manipulation.

Steps to reproduce

Run source code scan tool, such as Fortify.

PyThaiNLP version

5.3.1

Python version

3.12

Operating system and version

Windows 11

More info

No response

Possible solution

No response

Files

No response</issue_description>

Comments on the Issue (you are @copilot in this section)


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

…pus, and get_path_folder_corpus

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix security issue related to path manipulation fix: prevent path traversal (CWE-22) in corpus and data path functions Mar 18, 2026
Copilot AI requested a review from wannaphong March 18, 2026 09:54
@coveralls
Copy link

coveralls commented Mar 18, 2026

Coverage Status

coverage: 66.333% (+0.02%) from 66.312%
when pulling 54d11e6 on copilot/fix-security-path-manipulation
into fcb9210 on dev.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds path traversal protections to path-construction helpers used for PyThaiNLP data/corpus access, and introduces security regression tests to validate rejection of traversal attempts.

Changes:

  • Add base-directory containment validation to get_full_data_path(), path_pythainlp_corpus(), and get_path_folder_corpus().
  • Add/expand unit tests for traversal rejection on get_full_data_path() and path_pythainlp_corpus().
  • Document the new ValueError behavior in docstrings and the changelog.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
pythainlp/tools/path.py Add traversal validation to get_full_data_path() and document new ValueError.
pythainlp/corpus/core.py Add traversal validation to bundled-corpus path helper and downloaded-corpus folder path helper.
tests/core/test_security.py Add regression tests covering safe inputs and traversal rejections for the updated helpers.
CHANGELOG.md Document the security hardening behavior change.

You can also share your feedback on Copilot code review. Take the survey.

@bact
Copy link
Member

bact commented Mar 18, 2026

@copilot fix all the issues as addressed in this thread #1342 (review)

Also mind a possible code redundancy / opportunities for code/function reuse in relevant functions like get_pythainlp_data_path(), get_path_folder_corpus(), get_corpus_path(), get_full_data_path(), path_pythainlp_corpus(). The path transversal could be rejected at some common point that some of these functions has to rely/depend on, without the need to repeat the path transversal check in a lot of places.

Also mind that typical string .endsWith("file.txt") will also match "afile.txt", aside from "dir/file.txt".

When refactor, thinking about security implications as the security is our main goal here.

…tests

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 18, 2026

@copilot fix all the issues as addressed in this thread #1342 (review)

Also mind a possible code redundancy / opportunities for code/f...

Done in 503f2c2. Here's what changed:

Refactoring — single check point:
Extracted _safe_path_join(base, *parts) -> str into pythainlp/tools/path.py. It normalizes both the base and the joined result with os.path.abspath, asserts containment, and returns the normalized absolute path. All three functions (get_full_data_path, path_pythainlp_corpus, get_path_folder_corpus) now delegate to this one helper — no duplicated logic.

Normalized return value:
All three functions now return the abspath-normalized result instead of the raw os.path.join string, removing any ambiguity from relative segments or platform differences.

Fixed weak test assertions:
test_get_full_data_path_safe and test_path_pythainlp_corpus_safe now use _is_within_directory(base, result) instead of endswith(...), which actually verifies the security property (and avoids the "afile.txt" false-positive you noted).

New tests for get_path_folder_corpus:
Added test_get_path_folder_corpus_safe and test_get_path_folder_corpus_rejects_traversal, both using unittest.mock.patch to supply a temp directory as the corpus base without needing a real downloaded corpus.

Copilot AI requested a review from bact March 18, 2026 17:02
Copilot finished work on behalf of bact March 18, 2026 17:02
@bact bact marked this pull request as ready for review March 19, 2026 07:55
@bact bact added the security label Mar 19, 2026
@bact bact added this to the 5.3.2 milestone Mar 19, 2026
@bact
Copy link
Member

bact commented Mar 19, 2026

@copilot Remove redundant functions, remove the attack surfaces.

Consider removing get_path_folder_corpus() and path_pythainlp_corpus() functions completely from the library.

  • There is only one use of get_path_folder_corpus() in pythainlp/tag/wangchanberta_onnx.py. Replace it with combination of get_corpus_path() and _safe_path_join()

  • path_pythainlp_corpus() is only used inside get_corpus_default_db() in pythainlp/corpus/core.py. Replace it with _safe_path_join().

This can be considered as "non-breaking" changes, because these functions were originally intended for internal use.

After changes are made, review again for security, zip-safety, and documentation/comment consistency.

Copilot AI and others added 2 commits March 19, 2026 08:29
…e _safe_path_join inline

Co-authored-by: bact <128572+bact@users.noreply.github.com>
…lp.tools

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 19, 2026

@copilot Remove redundant functions, remove the attack surfaces.

Consider removing get_path_folder_corpus() and path_pythainlp_corpus() functions completely from the library.
...

Done in 6271e24.

  • path_pythainlp_corpus() removed — its two call sites in get_corpus_default_db() now call safe_path_join(corpus_path(), filename) inline.
  • get_path_folder_corpus() removed — wangchanberta_onnx.py now calls get_corpus_path() once to get the base path, raises FileNotFoundError if it is None, then uses safe_path_join(base, file) for each of the three file accesses.
  • Both names removed from pythainlp/corpus/__init__.py (__all__ and import).
  • _safe_path_join promoted to the public safe_path_join and exported from pythainlp.tools — so no private function is imported outside pythainlp/tools/path.py.

Copilot finished work on behalf of bact March 19, 2026 08:33
Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Fix the issues as addressed.

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI requested a review from bact March 19, 2026 08:44
Copilot finished work on behalf of bact March 19, 2026 08:44
@bact
Copy link
Member

bact commented Mar 19, 2026

@copilot Run unittest tests.noauto_onnx test suite, to confirm there is no issue with the new pythainlp/tag/wangchanberta_onnx.py.

Run mypy, pyright, pyrefly, ruff, flake8, black, and any static type checkers.

Fix any issue.

Review code and documentation again.

The goal is to completely avoid path traversal (CWE-22) and related path and security issues.

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 19, 2026

@copilot Run unittest tests.noauto_onnx test suite, to confirm there is no issue with the new pythainlp/tag/wangchanberta_onnx.py.

Run mypy, pyright, pyrefly, ruff, flake8, black...

Done in cad6cdd. Here's the summary:

tests.noauto_onnx: The test suite requires oskut and other heavy ONNX dependencies not present in this environment (these tests are explicitly excluded from automated CI). The wangchanberta_onnx.py module imports cleanly, its class signature is intact, and all method signatures are preserved — validated via py_compile and AST parsing.

Static analysis results:

  • ruff: All checks pass
  • mypy: No errors on changed files
  • flake8: Fixed a pre-existing E303 (too many blank lines) in tests/core/test_security.py
  • black: Applied reformatting to wangchanberta_onnx.py (unwrapped some overly-wrapped lines)
  • pyright: 6 errors — all pre-existing, all due to missing onnxruntime/sentencepiece type stubs in this environment (same errors exist on the original file before any PR changes)
  • CodeQL: 0 alerts

Security (tests/core/test_security.py): 26/26 tests pass — path traversal is blocked at safe_path_join for all get_full_data_path and corpus path calls.

Copilot finished work on behalf of bact March 19, 2026 09:03
@sonarqubecloud
Copy link

@bact
Copy link
Member

bact commented Mar 19, 2026

@wannaphong please review

Note that get_path_folder_corpus() and path_pythainlp_corpus() were only used internally, so removing them can be considered "non-breaking change".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Security issue - Path manupulation

5 participants