[ New features ] : add aiXcoder model implementation with tokenizer and validation #2902

Open

YoctoHan wants to merge 4 commits into PaddlePaddle:develop from YoctoHan:feat/aixcoder

YoctoHan commented Nov 10, 2025

✨ Pull Request Summary

🚀 Implemented Features

Implemented the aiXcoder model architecture.
Added custom tokenization logic specific to aiXcoder.
Completed initial pre-training validation tests.
Verified training and fine-tuning pipelines.
Added model usage and reproduction instructions in the README.

🧩 Code Quality

✅ All changes have passed pre-commit checks successfully.
Code format, style, and lint validations fully comply with project conventions.

🧪 To‑Do

Add additional unit tests under the tests folder.
If any codecov coverage issues arise, please include corresponding test cases first.

📂 PR Type

🧱 PR Changes

Models
APIs
Docs
Others

📝 Description

This PR introduces the aiXcoder model to PaddleFormers, including its complete model architecture, tokenizer, and validation for both pre-training and SFT/fine-tuning workflows.
All code components have passed pre-commit checks and are aligned with PaddleFormers’ coding and documentation standards.
The PR expands the model zoo with a reproducible aiXcoder training pipeline that is ready for integration and further evaluation.

🧑‍💻 Checklist

Code passes all pre-commit lint and format checks
Model training/finetuning pipeline validated
Unit tests added for new components
Documentation updated (README, configs, or tutorials)

paddle-bot bot commented Nov 10, 2025

Thanks for your contribution!

CLAassistant commented Nov 10, 2025

All committers have signed the CLA.

paddle-bot bot added the contributor label

YoctoHan force-pushed the feat/aixcoder branch from 310b58c to 45fe78e Compare

November 10, 2025 10:00

codecov-commenter commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 97.00000% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@daba927). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
paddleformers/transformers/auto/tokenizer.py	25.00%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #2902   +/-   ##
==========================================
  Coverage           ?   31.73%           
==========================================
  Files              ?      425           
  Lines              ?    68827           
  Branches           ?        0           
==========================================
  Hits               ?    21839           
  Misses             ?    46988           
  Partials           ?        0

YoctoHan force-pushed the feat/aixcoder branch 2 times, most recently from 8db79fe to 68c92d8 Compare

November 11, 2025 07:36

YoctoHan closed this

YoctoHan reopened this

YoctoHan closed this

YoctoHan reopened this

YoctoHan closed this

YoctoHan reopened this

YoctoHan closed this

YoctoHan reopened this

YoctoHan force-pushed the feat/aixcoder branch 2 times, most recently from 49ec0fd to 1b73a8c Compare

November 14, 2025 09:14

YoctoHan closed this

YoctoHan reopened this

YoctoHan force-pushed the feat/aixcoder branch from 1b73a8c to 901cac8 Compare

November 15, 2025 02:48

YoctoHan and others added 2 commits

November 15, 2025 03:11


          feat(model): add new Aixcoder model implementation with tokenizer and…

901cac8

… validation

- Implemented the Aixcoder model architecture
- Added custom tokenization logic
- Completed initial validation tests (pre-training verification)
- Prepared for upcoming training and fine-tuning validation
- Documented usage in README for model reproduction


          Merge branch 'develop' into feat/aixcoder

7e56ae8

WYB27 reviewed

paddleformers/transformers/aixcoder/modeling.py (outdated)

YoctoHan added 2 commits

November 19, 2025 10:56


          temp

ed2aea2


          temp2

de821c4

WYB27 reviewed

paddleformers/transformers/aixcoder/modeling.py

+                              # Row Linear
+                              "aixcoder.embed_tokens.weight": partial(fn, is_column=False),
+                              "aixcoder.layers.0.self_attn.o_proj.weight": partial(fn, is_column=False),
+                              "aixcoder.layers.0.mlp.down_proj.weight": partial(fn, is_column=False),
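
Editor's note: the `partial(fn, is_column=False)` entries in the hunk above mark weights as row-parallel. The sketch below illustrates the idea on plain nested lists. It assumes that `is_column=False` means "shard along the first axis" and `is_column=True` means "shard along the last axis". `split_weight`, its `degree`/`rank` parameters, and the shortened key names are hypothetical and are not taken from PaddleFormers; `q_proj` is used only as an assumed column-parallel example.

```python
from functools import partial


def split_weight(weight, is_column, degree, rank):
    """Return the shard of a 2-D weight (a list of rows) owned by `rank`.

    Hypothetical helper for illustration only, not the PaddleFormers function.
    """
    rows, cols = len(weight), len(weight[0])
    if is_column:
        # Column-parallel: every rank keeps all rows and a slice of the columns.
        assert cols % degree == 0, "columns must divide evenly across ranks"
        step = cols // degree
        return [row[rank * step:(rank + 1) * step] for row in weight]
    # Row-parallel: every rank keeps a contiguous block of rows.
    assert rows % degree == 0, "rows must divide evenly across ranks"
    step = rows // degree
    return weight[rank * step:(rank + 1) * step]


# Bind the parallel layout once, then choose the split axis per weight name,
# mirroring the `partial(fn, is_column=...)` pattern in the hunk.
fn = partial(split_weight, degree=2, rank=0)
actions = {
    "layers.0.self_attn.o_proj.weight": partial(fn, is_column=False),
    "layers.0.self_attn.q_proj.weight": partial(fn, is_column=True),
}

weight = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
row_shard = actions["layers.0.self_attn.o_proj.weight"](weight)
col_shard = actions["layers.0.self_attn.q_proj.weight"](weight)
```

With two ranks, rank 0 receives the first two rows for the row-parallel entry and the first two columns of every row for the column-parallel entry.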

Collaborator

WYB27 Nov 19, 2025

Are you sure the weight-name prefix should be `aixcoder` rather than `model`? All the weight names I found in the Hugging Face repository use the `model` prefix.

Author

YoctoHan Nov 20, 2025

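We plan to open-source the Paddle version of the weights on the Xinghe (星河) community, and that version uses the prefix aiXcoder. I will unify the prefix to `model` next, both in the modeling code and in the weights.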
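
Editor's note: unifying the prefix on the weight side amounts to renaming checkpoint keys. A minimal sketch follows, assuming the checkpoint is available as a flat name-to-tensor mapping. `rename_prefix` is a hypothetical helper, and the default `aixcoder.` and `model.` prefixes are taken from the keys shown in the review hunk rather than from a released checkpoint.

```python
def rename_prefix(state_dict, old="aixcoder.", new="model."):
    """Return a copy of `state_dict` with a leading `old` prefix replaced by `new`.

    Hypothetical helper for illustration; keys without the prefix are kept as-is.
    """
    renamed = {}
    for key, value in state_dict.items():
        new_key = new + key[len(old):] if key.startswith(old) else key
        if new_key in renamed:
            # Refuse to silently overwrite a weight if two keys collide.
            raise KeyError(f"duplicate key after renaming: {new_key}")
        renamed[new_key] = value
    return renamed


# Placeholder values stand in for tensors.
checkpoint = {
    "aixcoder.embed_tokens.weight": "W_embed",
    "aixcoder.layers.0.self_attn.o_proj.weight": "W_o",
    "lm_head.weight": "W_head",
}
converted = rename_prefix(checkpoint)
```

The same renaming would also have to be reflected wherever the modeling code refers to these names, such as the tensor-parallel mapping keys quoted in the review.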

paddleformers/transformers/aixcoder/modeling.py

+                          "aixcoder.layers.0.self_attn.q_proj.weight",
+                          "aixcoder.layers.0.self_attn.k_proj.weight",
+                          "aixcoder.layers.0.self_attn.v_proj.weight",
+                          "aixcoder.layers.0.self_attn.qkv_proj.weight",

Collaborator

WYB27 Nov 19, 2025

Please confirm the weight-name prefix here as well.

paddleformers/transformers/aixcoder/modeling.py

+                      >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+                      "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+                      ```"""
+                      output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions

Collaborator

WYB27 Nov 19, 2025

`output_attentions` is no longer needed. Please remove this logic throughout the modeling code.

paddleformers/transformers/aixcoder/modeling.py

+                  def set_decoder(self, decoder):
+                      self.aixcoder = decoder
+                  def get_decoder(self):

Collaborator

WYB27 Nov 19, 2025

For the get_xxx/set_xxx functions above, please check in PretrainedModel whether they still need to be defined in the modeling code.

PaddleFormers/paddleformers/transformers/model_utils.py

Line 1311 in 9594389

class PretrainedModel(Layer, GenerationMixin, ConversionMixin):

paddleformers/transformers/aixcoder/modeling.py

		]


		class AixcoderPretrainedModel(PretrainedModel):

Collaborator

WYB27 Nov 19, 2025

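The llama modeling code was recently refactored, and that change is expected to be merged within this week. Please update the aiXcoder modeling code to follow the new standard, paying particular attention to parts such as `_init_weights` that are redundant under the new structure.
#2770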

tests/transformers/aixcoder/test_modeling.py

Collaborator

WYB27 Nov 19, 2025

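Please add a test_modeling unit test file in the same way as the other models in the library, covering cases such as ModelTest and IntegrationTest.
https://github.com/PaddlePaddle/PaddleFormers/blob/develop/tests/transformers/qwen3/test_modeling.py
The other unit test files can be removed.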

tests/transformers/aixcoder/test_tokenizer.py

Collaborator

WYB27 Nov 19, 2025

For the tokenizer, please add tests modeled on the llama unit tests: https://github.com/PaddlePaddle/PaddleFormers/blob/develop/tests/transformers/llama/test_tokenizer.py

paddleformers/transformers/aixcoder/LICENSE

Collaborator

WYB27 Nov 19, 2025

The LICENSE file can be deleted.

paddleformers/transformers/aixcoder/README.md

Collaborator

WYB27 Nov 19, 2025

The README file can be deleted.
