
Conversation


@Difers Difers commented Sep 11, 2025

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

New features

PR changes

Models

Description

Migrate the PaddleNLP DeepSeek V3 model to PaddleFormers

This PR

  • Migrates the DeepSeek V3 model from PaddleNLP to PaddleFormers, keeps basic functionality working, and broadly follows the PaddleFormers modeling conventions

Follow-up TODOs

  • Add a single-GPU MoE model; compare against hf transformers to adapt the modeling conventions and align numerical precision with hf transformers
  • Add unit tests for the model
  • Add model configurations and end-to-end usage documentation
  • Add MoE filtering of padding tokens

Related PRs for the newly added features and changes, for reference

Convergence verification at 4K sequence length

[convergence curve screenshot]


paddle-bot bot commented Sep 11, 2025

Thanks for your contribution!

@Difers Difers force-pushed the add_dsv3_from_nlp branch 2 times, most recently from 7cc21a4 to a0539fe on September 11, 2025 08:22

codecov-commenter commented Sep 11, 2025

Codecov Report

❌ Patch coverage is 16.78161% with 362 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@03b533c). Learn more about missing BASE report.

Files with missing lines Patch % Lines
paddleformers/transformers/deepseek_v2/modeling.py 15.50% 267 Missing ⚠️
paddleformers/trainer/utils/offload_optimizer.py 0.00% 43 Missing ⚠️
paddleformers/trainer/trainer.py 11.76% 15 Missing ⚠️
paddleformers/nn/pp_model.py 30.00% 14 Missing ⚠️
paddleformers/transformers/deepseek_v3/modeling.py 65.21% 8 Missing ⚠️
paddleformers/trainer/trainer_utils.py 14.28% 6 Missing ⚠️
paddleformers/transformers/moe_layer.py 0.00% 6 Missing ⚠️
paddleformers/trl/model_config.py 0.00% 2 Missing ⚠️
paddleformers/transformers/moe_gate.py 0.00% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (16.78%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #2593   +/-   ##
==========================================
  Coverage           ?   29.89%           
==========================================
  Files              ?      308           
  Lines              ?    53980           
  Branches           ?        0           
==========================================
  Hits               ?    16136           
  Misses             ?    37844           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.



model_class = AutoModelForCausalLMPipe

model_config.using_flex_token = model_args.using_flex_token
Reviewer:

The model_config setup logic all needs to be moved before line 202.

Author:
done~

if training_args.use_expert_parallel:
callbacks += [MoeExpertsGradScaleCallback(training_args)]

print("callbacks:", callbacks, flush=True)
Reviewer:

Change this to logger.info; print must not appear in production code.
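
For reference, the fix might look like this (a minimal sketch; the logger import path follows the `from ..utils.log import logger` style used elsewhere in this PR):

from paddleformers.utils.log import logger

if training_args.use_expert_parallel:
    callbacks += [MoeExpertsGradScaleCallback(training_args)]

logger.info(f"callbacks: {callbacks}")  # replaces the temporary print(...)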

Author:

done~

attn_mask_startend_row_indices = attn_mask_startend_row_indices[
:,
:,
: -self.config.num_nextn_predict_layers,
Reviewer:

Once the data flow supports this, is the truncation here still needed?

Author:

The data flow adds num_nextn_predict_layers to the output dimensions of input_ids, attn_mask_startend_row_indices, and so on, so the truncation is still needed.
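
For context, the truncation pattern is roughly as follows (a minimal sketch with assumed tensor shapes):

# The data pipeline appends num_nextn_predict_layers extra positions for MTP,
# so model inputs are sliced back to seq_len before the main forward pass.
n = self.config.num_nextn_predict_layers
if n > 0:
    input_ids = input_ids[:, :-n]
    attn_mask_startend_row_indices = attn_mask_startend_row_indices[:, :, :-n]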

"unified_checkpoint": true,
"use_flash_attention": true,
"flash_mask": true,
"using_fake_gate": true,
Reviewer:

Change fake gate to false.

Collaborator:

If fake_gate is not needed, can it be removed?

Reviewer:

Megatron also has a fake gate; every MoE scenario needs it. It is used for performance testing.

"expert_parallel_degree": 16,
"continue_training": true,
"pipeline_parallel_config": "enable_delay_scale_loss disable_partial_send_recv disable_batch_p2p_comm",
"tensor_parallel_config": "enable_delay_scale_loss",
Reviewer:

For TP, also add "tensor_parallel_config": "sync_param sync_grad".
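
A sketch of the combined setting, assuming the options are space-separated as in the existing configs:

"tensor_parallel_config": "enable_delay_scale_loss sync_param sync_grad",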

"do_eval": false,
"disable_tqdm": true,
"use_expert_parallel": true,
"expert_parallel_degree": 8,
Reviewer:

The distributed strategy for 4K is sharding16 ep16 pp8.
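
One plausible way to express sharding16 ep16 pp8 in this JSON config (the degree key names are assumptions based on common trainer arguments):

"sharding": "stage1",
"sharding_parallel_degree": 16,
"expert_parallel_degree": 16,
"pipeline_parallel_degree": 8,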

@Difers Difers force-pushed the add_dsv3_from_nlp branch 4 times, most recently from 6f20001 to d5a1f8e on September 24, 2025 13:20
"continue_training": true,
"pipeline_parallel_config": "enable_delay_scale_loss disable_partial_send_recv disable_batch_p2p_comm",
"tensor_parallel_config": "enable_delay_scale_loss",
"load_best_model_at_end": true,
Collaborator:

"load_best_model_at_end": true and "metric_for_best_model": "loss" can be removed now.

"sharding": "stage1",
"unified_checkpoint": true,
"use_flash_attention": true,
"flash_mask": true,
Collaborator:

Remove these two switches: "use_flash_attention": true and "flash_mask": true.


"use_flash_attention": true,
"flash_mask": true,
"using_fake_gate": true,
"using_flex_token": true,
@lugimzzz (Collaborator) commented Sep 24, 2025:

Keep only the DeepEP implementation in the model and delete the all2all one; then there is no need for the flex_token check.

Reviewer:

Why is the alltoall version no longer needed?

Collaborator:

That MoE path hangs in full-parameter training, and it is not efficient, so it will no longer be maintained.

"flash_mask": true,
"using_fake_gate": true,
"using_flex_token": true,
"use_fused_rms_norm": true,

@@ -0,0 +1,62 @@
{
"model_name_or_path": "/root/paddlejob/tmpspace/huggingface_model/huggingface/deepseek-ai/DeepSeek-V3-bf16/",
Collaborator:

Write the model name directly: opensourcerelease/DeepSeek-V3-bf16

@@ -0,0 +1,62 @@
{
"model_name_or_path": "/root/paddlejob/tmpspace/huggingface_model/huggingface/deepseek-ai/DeepSeek-V3-bf16/",
"dataset_name_or_path": "/root/paddlejob/tmpspace/chenzhichao/PaddleNLP-SFT/llm/en_data",
Collaborator:

Use a generic path here, and check for this throughout the PR.

from paddle.incubate.nn.functional import fused_rms_norm_ext

from ..generation.configuration_utils import PretrainedConfig
from ..transformers.llama import fusion_ops
Collaborator:

Can fusion_ops be removed?

from ..generation.configuration_utils import PretrainedConfig
from ..transformers.llama import fusion_ops
from ..utils.log import logger
from ..utils.tools import get_env_device
Collaborator:

Why does get_env_device need to be imported?

),
)
self.gate_proj = getattr(self, gate_proj_name)
def linear_type_gaurd():
Collaborator:

Wouldn't this FP8 handling fit better in paddleformers.nn.linear? If you want the MoE part not to use TP linears, pass config.tensor_parallel_degree = 1 when creating the MLP:
https://github.com/PaddlePaddle/PaddleFormers/blob/develop/paddleformers/transformers/ernie4_5_moe/modeling.py#L134
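
A minimal sketch of that suggestion (the DeepseekV2MLP call is hypothetical; the point is the config copy):

import copy

# Give expert MLPs a config whose tensor_parallel_degree is 1 so that
# GeneralLinear creates plain (non-TP) linears for the MoE experts.
expert_config = copy.deepcopy(config)
expert_config.tensor_parallel_degree = 1
expert = DeepseekV2MLP(expert_config)  # actual constructor args may differ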

"unified_checkpoint": true,
"use_flash_attention": true,
"flash_mask": true,
"using_fake_gate": true,
Collaborator:

If fake_gate is not needed, can it be removed?

"sequence_parallel": true,
"tensor_parallel_output": true,
"amp_master_grad": true,
"sharding_parallel_config": "split_param",
Collaborator:

With sharding stage1 v2 (split_param) plus tensorwise_offload_optimizer enabled, have you verified that unified-checkpoint warm restart resumes correctly?

"amp_master_grad": true,
"sharding_parallel_config": "split_param",
"num_nextn_predict_layers": 1,
"convert_from_hf": true
Collaborator:

convert_from_hf can be removed now; it defaults to True.


@@ -0,0 +1,60 @@
{
"model_name_or_path": "/root/paddlejob/tmpspace/huggingface_model/huggingface/deepseek-ai/DeepSeek-V3-bf16/",
Collaborator:

Suggest adding examples/best_practice/deepseek_v3_sft/... for this model, holding the concrete model configuration and the end-to-end usage docs.

model_config.num_nextn_predict_layers = model_args.num_nextn_predict_layers
model_config._attn_implementation = model_args.attn_impl
model_config.moe_subbatch_token_num = model_args.moe_subbatch_token_num
model_config.gradient_accumulation_steps = training_args.gradient_accumulation_steps
Collaborator:

What is this for?

param.split_axis = 0

def forward(self, hidden_states, tensor_parallel_output=None):
def forward(self, hidden_states, tensor_parallel_output=None, gather_hidden_states=True):
Collaborator:

Not a big deal, but why does this need to change?

group=self._hcg.get_pipe_parallel_group(),
)

# logger.info(
Collaborator:

Don't leave unneeded code commented out; delete it.

Collaborator:

Watch for this issue throughout the PR.

__all__ = []


class MoEHybridParallelClipGrad:
Collaborator:

What is MoEHybridParallelClipGrad for?

Reviewer:

This is the ClipGrad that computes the global norm correctly in the sharding-EP scenario; otherwise the computed global_norm is wrong. The original dp-moe and mp-moe setups can keep using the original one.
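
Conceptually, the EP-aware clipping must combine squared norms across the expert-parallel group before taking the square root; a purely illustrative sketch (all names hypothetical, not the actual implementation):

import paddle
import paddle.distributed as dist

def moe_global_grad_norm(params_grads, moe_group):
    # Expert grads are sharded across EP ranks, so their squared norms must
    # be all-reduced over the EP group; dense grads are already replicated.
    expert_sq = paddle.zeros([1], dtype="float32")
    dense_sq = paddle.zeros([1], dtype="float32")
    for param, grad in params_grads:
        sq = paddle.sum(paddle.square(grad.astype("float32")))
        if getattr(param, "no_sync", False):  # assumed marker for expert params
            expert_sq += sq
        else:
            dense_sq += sq
    dist.all_reduce(expert_sq, group=moe_group)
    return paddle.sqrt(dense_sq + expert_sq)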


if getattr(model_config, "topk_method", None) == "noaux_tc":
callbacks += [MoECorrectionBiasAdjustCallback(lr=0)]
# deepseek_v3 finetune do not update the bias, so set lr to 0.0
Collaborator:

Does DeepSeek V3 actually use this strategy? The config doesn't seem to enable it.

Reviewer:

See the if check above; this is enabled via model_config.

Collaborator:

I see topk_method defaults to greedy? In what cases does noaux_tc need to be enabled?

if self._decoder_layer_cls is None:
raise ValueError("_decoder_layer_cls must be set before init.")
DecoderLayerPipe = make_decoder_layer_pipe(self._decoder_layer_cls)

Collaborator:

Add a guard for if config.num_nextn_predict_layers > 0 and _mtp_layer_pipe_cls is None: _mtp_layer_pipe_cls needs to be defined.
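
That is, something along these lines (a sketch; _mtp_layer_pipe_cls is assumed to mirror _decoder_layer_cls):

if self._decoder_layer_cls is None:
    raise ValueError("_decoder_layer_cls must be set before init.")
if config.num_nextn_predict_layers > 0 and self._mtp_layer_pipe_cls is None:
    raise ValueError("_mtp_layer_pipe_cls must be set when num_nextn_predict_layers > 0.")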

hidden_states, _, _, _, _ = parse_args(args)
hidden_states = super().forward(hidden_states)
return hidden_states

Collaborator:

Suggest keeping this inside the DeepSeek model code for now and passing in an _rmsnorm_pipe_cls; it is not yet clear whether this pattern suits other models.

[batch_size, sequence_length, vocab_size]
representing unnormalized log probabilities for each token
"""
if self.config.num_nextn_predict_layers > 0:
Collaborator:

Same as the rmsnorm comment.

group=self._hcg.get_pipe_parallel_group(),
)

# logger.info(
Collaborator:

Watch for this issue throughout the PR.

return self._dygraph_clip(params_grads)


class MoEHybridParallelOptimizer(HPBase):
Collaborator:

In what scenario is MoEHybridParallelOptimizer needed?

from ...nn.mlp import MLP as DeepseekV2MLP
from ...nn.norm import Norm as GeneralNorm
from ...nn.pp_model import EmbeddingPipe, GeneralModelForCausalLMPipe, parse_args

Collaborator:

Confirm whether fused_rotary_position_embedding still exists in Paddle 3.2 and later.

from paddle.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

from ...nn.criterion.interface import CriterionLayer
from ...nn.embedding import Embedding as GeneralEmbedding
Collaborator:

Add unit tests, following qwen2.
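
A minimal sketch of what such a test might look like (the config arguments, output indexing, and availability of tiny defaults are assumptions; mirror the real qwen2 tests under the tests folder):

import unittest

import paddle

from paddleformers.transformers.deepseek_v3 import DeepseekV3Config, DeepseekV3Model


class DeepseekV3ModelTest(unittest.TestCase):
    def test_forward_shape(self):
        # Tiny hypothetical config so the test runs quickly on one card.
        config = DeepseekV3Config(
            hidden_size=64,
            num_hidden_layers=2,
            num_attention_heads=4,
            vocab_size=128,
        )
        model = DeepseekV3Model(config)
        input_ids = paddle.randint(0, config.vocab_size, shape=[2, 8])
        hidden = model(input_ids=input_ids)[0]
        self.assertEqual(list(hidden.shape), [2, 8, config.hidden_size])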

self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.use_rmsnorm = use_rmsnorm
Collaborator:

There is no need to add use_rmsnorm to the config; just pass the type when creating the rmsnorm: https://github.com/PaddlePaddle/PaddleFormers/blob/develop/paddleformers/transformers/ernie4_5_moe/modeling.py#L320
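
That is, roughly (a sketch; the exact keyword is an assumption based on the linked ernie4_5_moe example and the norm_type suggestion later in this thread):

self.enorm = GeneralNorm.create(config, norm_type="rms_norm")
self.hnorm = GeneralNorm.create(config, norm_type="rms_norm")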

norm_topk_prob=False,
scoring_func="softmax",
aux_loss_alpha=0.001,
aux_loss_alpha=0.0001,
Collaborator:

Compare the transformers and paddleformers configs for the DeepSeek V2 & V3 models, and list what each extra config option does.

Reviewer:

The paper uses 0.0001 for this coefficient; the earlier value was simply a mistake.

from ...nn.mlp import MLP as DeepseekV2MLP
from ...nn.norm import Norm as GeneralNorm
from ...nn.pp_model import EmbeddingPipe, GeneralModelForCausalLMPipe, parse_args

Collaborator:

Confirm which of these are unused, or already exist in Paddle 3.2+ and no longer need a try/except import.

Collaborator:

Can the dependency on llama be removed?

@lugimzzz (Collaborator):

Fix the CI and codestyle issues; run pre-commit install before committing code.

self.vocab_size = config.vocab_size
self.lm_head = DeepseekV2LMHead(config)
self.criterion = DeepseekV2PretrainingCriterion(config)
self.lm_head = GeneralLMHead(config)
Collaborator:

DeepSeek V2 did not change base_model_prefix; update it and the other corresponding parts as well.

Collaborator:

get_input_embeddings, set_input_embeddings, get_output_embeddings, and so on can all use the parent-class PretrainedModel implementations directly.

)
using_flex_token: bool = field(default=False, metadata={"help": "Whether to use deepep moe_layer"})
using_fake_gate: bool = field(default=False, metadata={"help": "Whether to fake gate"})
moe_subbatch_token_num: int = field(
Collaborator:

This has already been added in an existing PR, so it can be removed.

"down_proj",
"gate",
"eh_proj",
"lm_head",
Collaborator:

Remove lm_head; LM heads are now [vocab_size, hidden_size] and no longer need a transpose.


# decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.deepseek_v3(
outputs = self.model(
Collaborator:

Does DeepseekV3ForCausalLM pass MTP verification with PP disabled and a reduced number of layers?

f"Implementation of fused_rms_norm is not available on {get_env_device()}. Please install paddle_xpu to use this feature"
)
if self.config.use_fused_rms_norm:
if get_env_device() == "xpu":
Collaborator:

Is DeepseekV2RMSNorm still needed? The generic norm can be used directly.


if self.using_flex_token:
scores, routing_map, exp_counts, l_aux, l_zloss = self.topkgating_nodrop(scores)
with paddle.no_grad():
Collaborator:

For DeepseekV2MoE, suggest keeping a single-GPU implementation as a reference against transformers, alongside the DeepEP EP-parallel version.

"DeepseekV2ForSequenceClassification",
"DeepseekV2Model",
"DeepseekV2PretrainedModel",
"DeepseekV2ForCausalLMPipe",

self.kv_b_proj = ColumnParallelLinear(config.kv_lora_rank, self.num_heads * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim), has_bias=False, gather_output=True)
self.o_proj = RowParallelLinear(self.num_heads * self.v_head_dim, self.hidden_size, has_bias=config.attention_bias, input_is_parallel=False)
self.kv_a_layernorm = DeepseekV2RMSNorm(config=config, hidden_size=config.kv_lora_rank, use_sequence_parallel=False)
self.q_proj = GeneralLinear.create(
Collaborator:

Move linear_dtype_gaurd() into GeneralLinear and control it via config.

pg.allreduce(param.main_grad).wait()
else:
pg.allreduce(param.grad).wait()

Collaborator:

What does grad_allreduce_hook do?

)


class DeepseekV2MTPLayerPipe(DeepseekV2MTPLayer):
Collaborator:

There is quite a lot of PP-related code; put DeepseekV2MTPLayerPipe, DeepseekV2EmbeddingPipe, DeepseekV2DecoderLayerPipe, and similar classes in modeling_pp.py and import them from there.

@lugimzzz (Collaborator):

The PaddleFormers model code differs somewhat from PaddleNLP's. When writing the model, reuse the existing PaddleFormers modules wherever possible; refer to the qwen2 & ernie4.5 models in PaddleFormers and the DeepSeek V3 implementation in transformers. Apart from the configuration required for EP training (please list those options, and check whether other models need them generalized too), delete all redundant code.
https://github.com/PaddlePaddle/PaddleFormers/blob/develop/paddleformers/transformers/qwen2/modeling.py


self.enorm = DeepseekV2RMSNorm(config)
self.hnorm = DeepseekV2RMSNorm(config)
self.enorm = GeneralNorm.create(
Collaborator:

Suggest passing norm_type to specify the norm type.

@Difers Difers force-pushed the add_dsv3_from_nlp branch 2 times, most recently from a1631f3 to 9adce24 on September 28, 2025 09:40
@Difers Difers closed this Sep 28, 2025
@Difers Difers reopened this Sep 28, 2025
@Difers Difers closed this Sep 29, 2025
@Difers Difers reopened this Sep 29, 2025
Difers commented Sep 29, 2025

/re-run all-failed



self.kv_b_proj = Linear(config.kv_lora_rank, self.num_heads * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim), bias_attr=False)
self.o_proj = Linear(self.num_heads * self.v_head_dim, self.hidden_size, bias_attr=config.attention_bias)
self.kv_a_layernorm = DeepseekV2RMSNorm(config=config, hidden_size=config.kv_lora_rank)
self.q_a_proj = GeneralLinear.create(
Collaborator:

FP8 isn't supported right now, so remove linear_dtype_guard for the time being?


def get_input_embeddings(self):
return self.deepseek_v3.embed_tokens
return self.model.embed_tokens
Collaborator:

PretrainedModel already provides these functions, so there is no need to redefine them; update DeepSeek V2 the same way: https://github.com/PaddlePaddle/PaddleFormers/blob/develop/paddleformers/transformers/model_utils.py#L1455

pp_seg_method: Optional[str] = field(
default="layer:DecoderLayer|EmptyLayer", metadata={"help": "PP Segmentation Method"}
)
using_fake_gate: bool = field(default=False, metadata={"help": "Whether to fake gate"})
Collaborator:

using_fake_gate and aux_loss_alpha are parameters that any MoE model can use; suggest adding them directly to LlmMetaConfig & PretrainedConfig. See section 2.1.1 of the PaddleFormers model-contribution guide.

Collaborator:

Once they are added to LlmMetaConfig & PretrainedConfig, no new code is needed in model_config.py & run_finetune.py.


if getattr(model_config, "topk_method", None) == "noaux_tc":
callbacks += [MoECorrectionBiasAdjustCallback(lr=0)]
# deepseek_v3 finetune do not update the bias, so set lr to 0.0
Collaborator:

glm4.5 will also use this; don't name a specific model in the comment.

return f"hidden_size={self.hidden_size}, dtype={self.weight.dtype}"


class DeepseekV2RotaryEmbedding(nn.Layer):
Collaborator:

Rotary embeddings are now used by precomputing position_embeddings and passing them into the model, and the pp_model layer-to-layer argument parsing also assumes position_embeddings by default. Check whether the current code has redundancy or hidden issues, and switch to the new style:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/deepseek_v2/modeling_deepseek_v2.py#L535
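
A compact sketch of the precomputed style (illustrative only; names and shapes are assumptions modeled on the HF implementation linked above):

import paddle
import paddle.nn as nn


class RotaryEmbedding(nn.Layer):
    def __init__(self, dim, base=10000.0):
        super().__init__()
        self.inv_freq = 1.0 / (base ** (paddle.arange(0, dim, 2, dtype="float32") / dim))

    def forward(self, position_ids):
        # [seq_len] x [dim/2] -> [seq_len, dim/2]; duplicated to cover the full dim.
        freqs = paddle.outer(position_ids.astype("float32"), self.inv_freq)
        emb = paddle.concat([freqs, freqs], axis=-1)
        return emb.cos(), emb.sin()


# In the top-level model forward, compute once and thread through the layers:
#   position_embeddings = self.rotary_emb(position_ids)
#   for layer in self.layers:
#       hidden_states = layer(hidden_states, position_embeddings=position_embeddings)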

if len(args) == 5:


from ...utils.tools import get_env_device
from ..activations import ACT2FN
from ..conversion_utils import StateDictNameMapping, init_name_mappings
from ..llama import fusion_ops

"DeepseekV2ForSequenceClassification",
"DeepseekV2Model",
"DeepseekV2PretrainedModel",
"DeepseekV2ForCausalLMPipe",
Collaborator:

Redundant functions such as get_triangle_upper_mask, assign_kv_heads, and so on can be deleted; compare with the transformers implementation:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/deepseek_v2/modeling_deepseek_v2.py#L386

from ..utils import device_guard
from . import fp8_linear as linear_utils
from .configuration import DeepseekV2Config
from .fp8_linear import Linear
Collaborator:

FP8 hasn't been verified; remove it for now.

from ..activations import ACT2FN
from ..conversion_utils import StateDictNameMapping, init_name_mappings
from ..llama import fusion_ops
from ..llama.modeling import get_use_casual_mask
Collaborator:

Remove the dependencies on llama-related code.

@lugimzzz (Collaborator):

The code needs further changes to conform to the PaddleFormers code conventions. If that is not finished in this PR, leave a note listing what will be fixed in the next PR.

@Difers (Author) commented Sep 30, 2025:

> The code needs further changes to conform to the PaddleFormers code conventions. If that is not finished in this PR, leave a note listing what will be fixed in the next PR.

Many of the current issues actually come from needing to align the model migrated from PaddleNLP with the HF transformers implementation; that work is already on the TODO list and will be handled together in the next PR.
