Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 59 additions & 0 deletions tools/hf_to_arch.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
#!/usr/bin/env python3
"""Convert HuggingFace config.json to architecture_config format."""

import json
import sys


REQUIRED_MAPPINGS = {
"numLayers": "num_hidden_layers",
"hiddenSize": "hidden_size",
"numAttentionHeads": "num_attention_heads",
}


def convert_hf_config(hf_config: dict) -> dict:
"""Convert HuggingFace config to architecture_config format."""
arch_config = {"type": "transformer"}

for arch_key, hf_key in REQUIRED_MAPPINGS.items():
if hf_key not in hf_config:
raise ValueError(f"missing required field: {hf_key}")
value = hf_config[hf_key]
Comment on lines +15 to +22
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert_hf_config assumes hf_config is a dict-like JSON object. If the input JSON is a list/string/etc., the hf_key not in hf_config / hf_config[hf_key] access can raise a TypeError (or behave unexpectedly) and will bypass the current ValueError handling, resulting in a raw stack trace. Add an explicit top-level type check (e.g., require dict) and raise a ValueError with a clear message when the JSON root is not an object.

Copilot uses AI. Check for mistakes.
if not isinstance(value, int) or isinstance(value, bool):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current type check for integers is functionally correct but a bit subtle because it relies on bool being a subclass of int in Python, which can be non-obvious. Using type(value) is not int is a more explicit and clearer way to check that the value is strictly an integer and not a boolean.

Suggested change
if not isinstance(value, int) or isinstance(value, bool):
if type(value) is not int:

raise ValueError(f"field {hf_key} must be an integer, got {type(value).__name__}")
if value < 1:
raise ValueError(f"field {hf_key} must be >= 1, got {value}")
arch_config[arch_key] = value

return arch_config


def main():
if len(sys.argv) != 2:
print(f"usage: {sys.argv[0]} <config.json>", file=sys.stderr)
sys.exit(1)

config_path = sys.argv[1]
Comment on lines +33 to +37
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better command-line interface design and maintainability, consider using the argparse module instead of manually parsing sys.argv. argparse provides automatic help messages, error handling for arguments, and is the standard way to create CLI tools in Python. You will also need to add import argparse at the top of the file.

Suggested change
if len(sys.argv) != 2:
print(f"usage: {sys.argv[0]} <config.json>", file=sys.stderr)
sys.exit(1)
config_path = sys.argv[1]
parser = argparse.ArgumentParser(description="Convert HuggingFace config.json to architecture_config format.")
parser.add_argument("config_path", help="Path to the HuggingFace config.json file.")
args = parser.parse_args()
config_path = args.config_path


try:
with open(config_path, "r") as f:
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open(config_path, "r") relies on the platform default encoding, which can cause failures or mis-parsing on systems where the locale encoding isn’t UTF-8. Since HuggingFace config.json files are UTF-8 JSON, open the file with an explicit encoding="utf-8" for consistent behavior across environments.

Suggested change
with open(config_path, "r") as f:
with open(config_path, "r", encoding="utf-8") as f:

Copilot uses AI. Check for mistakes.
hf_config = json.load(f)
except FileNotFoundError:
print(f"error: file not found: {config_path}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"error: invalid JSON: {e}", file=sys.stderr)
sys.exit(1)

try:
arch_config = convert_hf_config(hf_config)
except ValueError as e:
print(f"error: {e}", file=sys.stderr)
sys.exit(1)

print(json.dumps(arch_config, indent=2, sort_keys=True))


if __name__ == "__main__":
main()
146 changes: 146 additions & 0 deletions tools/hf_to_arch_test.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,146 @@
#!/usr/bin/env python3
"""Tests for hf_to_arch.py"""

import json
import subprocess
import sys
import tempfile
import os

SCRIPT_PATH = os.path.join(os.path.dirname(__file__), "hf_to_arch.py")


def run_script(config_content: str) -> tuple:
"""Run hf_to_arch.py with given config content, return (exitcode, stdout, stderr)."""
with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
f.write(config_content)
f.flush()
temp_path = f.name

try:
result = subprocess.run(
[sys.executable, SCRIPT_PATH, temp_path],
capture_output=True,
text=True,
)
return result.returncode, result.stdout, result.stderr
finally:
os.unlink(temp_path)


def test_valid_config():
"""Valid HuggingFace config produces correct output."""
config = json.dumps({
"num_hidden_layers": 32,
"hidden_size": 4096,
"num_attention_heads": 32,
"vocab_size": 32000,
})

exitcode, stdout, stderr = run_script(config)

assert exitcode == 0, f"expected exit 0, got {exitcode}: {stderr}"
output = json.loads(stdout)
assert output == {
"type": "transformer",
"numLayers": 32,
"hiddenSize": 4096,
"numAttentionHeads": 32,
}, f"unexpected output: {output}"
print("PASS: test_valid_config")


def test_missing_field():
"""Missing required field produces error."""
config = json.dumps({
"num_hidden_layers": 32,
"hidden_size": 4096,
})

exitcode, stdout, stderr = run_script(config)

assert exitcode != 0, "expected non-zero exit for missing field"
assert "num_attention_heads" in stderr, f"error should mention missing field: {stderr}"
print("PASS: test_missing_field")


def test_invalid_json():
"""Invalid JSON produces error."""
exitcode, stdout, stderr = run_script("not valid json {")

assert exitcode != 0, "expected non-zero exit for invalid JSON"
assert "invalid JSON" in stderr.lower() or "json" in stderr.lower(), f"error should mention JSON: {stderr}"
print("PASS: test_invalid_json")


def test_file_not_found():
"""Non-existent file produces error."""
result = subprocess.run(
[sys.executable, SCRIPT_PATH, "/nonexistent/path/config.json"],
capture_output=True,
text=True,
)

assert result.returncode != 0, "expected non-zero exit for missing file"
assert "not found" in result.stderr.lower(), f"error should mention file not found: {result.stderr}"
print("PASS: test_file_not_found")


def test_invalid_field_type():
"""Non-integer field produces error."""
config = json.dumps({
"num_hidden_layers": "32",
"hidden_size": 4096,
"num_attention_heads": 32,
})

exitcode, stdout, stderr = run_script(config)

assert exitcode != 0, "expected non-zero exit for invalid type"
assert "integer" in stderr.lower(), f"error should mention type: {stderr}"
print("PASS: test_invalid_field_type")


def test_zero_value():
"""Zero value produces error."""
config = json.dumps({
"num_hidden_layers": 0,
"hidden_size": 4096,
"num_attention_heads": 32,
})

exitcode, stdout, stderr = run_script(config)

assert exitcode != 0, "expected non-zero exit for zero value"
assert ">= 1" in stderr, f"error should mention minimum: {stderr}"
print("PASS: test_zero_value")


def test_bool_value():
"""Boolean value produces error."""
config = json.dumps({
"num_hidden_layers": True,
"hidden_size": 4096,
"num_attention_heads": 32,
})

exitcode, stdout, stderr = run_script(config)

assert exitcode != 0, "expected non-zero exit for bool value"
assert "integer" in stderr.lower(), f"error should mention type: {stderr}"
print("PASS: test_bool_value")


def main():
test_valid_config()
test_missing_field()
test_invalid_json()
test_file_not_found()
test_invalid_field_type()
test_zero_value()
test_bool_value()
print("\nAll tests passed.")


if __name__ == "__main__":
main()
Comment on lines +134 to +146
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test file implements a custom test runner by calling test functions from main() and using print statements to show progress. While this works, adopting a standard test framework like pytest would offer several advantages:

  • No need for a manual runner (main function and if __name__ == '__main__' block).
  • More detailed and readable output on test failures.
  • A rich ecosystem of plugins.
  • Use of simple assert statements with no need for custom failure messages.

Migrating to pytest would involve removing the main function, the if __name__ == '__main__' block, and all print statements for test status. Tests could then be run with the pytest command. This would make the test suite more standard and easier to maintain.

Comment on lines +134 to +146
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test module is a standalone script (custom main() and print-based status) rather than a framework-discoverable test (e.g., pytest/unittest). In the current repo, CI/make test only runs go test, so these tests won’t execute automatically and may silently regress. Consider integrating a Python test runner into CI/Makefile or converting these checks into the repo’s existing Go test harness; if you intend to use pytest later, also rename to test_*.py and remove the manual main().

Suggested change
def main():
test_valid_config()
test_missing_field()
test_invalid_json()
test_file_not_found()
test_invalid_field_type()
test_zero_value()
test_bool_value()
print("\nAll tests passed.")
if __name__ == "__main__":
main()

Copilot uses AI. Check for mistakes.
Loading