validate use_remove_padding when applying sequence parallelism #153

Merged
merged 7 commits into volcengine:main on Jan 31, 2025

Conversation

chujiezheng
Contributor

This check is needed because ulysses_sp is activated only when use_remove_padding is enabled:

if self.use_remove_padding:
    input_ids_rmpad, indices, *_ = unpad_input(input_ids.unsqueeze(-1),
                                               attention_mask)  # input_ids_rmpad (total_nnz, ...)
    input_ids_rmpad = input_ids_rmpad.transpose(0, 1)  # (1, total_nnz)

    # unpad the position_ids to align the rotary
    position_ids_rmpad = index_first_axis(rearrange(position_ids.unsqueeze(-1), "b s ... -> (b s) ..."),
                                          indices).transpose(0, 1)

    # for compute the log_prob
    input_ids_rmpad_rolled = torch.roll(input_ids_rmpad, shifts=-1, dims=1)  # (1, total_nnz)

    # pad and slice the inputs if sp > 1
    if self.use_ulysses_sp:
        input_ids_rmpad, position_ids_rmpad, pad_size = ulysses_pad_and_slice_inputs(
            input_ids_rmpad, position_ids_rmpad, sp_size=self.ulysses_sequence_parallel_size)
        input_ids_rmpad_rolled, _, _ = ulysses_pad_and_slice_inputs(input_ids_rmpad_rolled, None,
                                                                    self.ulysses_sequence_parallel_size)

    input_ids_rmpad_rolled = input_ids_rmpad_rolled.squeeze(0)  # ((total_nnz / sp) + pad)

    # only pass input_ids and position_ids to enable flash_attn_varlen
    output = self.actor_module(input_ids=input_ids_rmpad,
                               attention_mask=None,
                               position_ids=position_ids_rmpad,
                               use_cache=False)  # prevent model thinks we are generating
    logits_rmpad = output.logits.squeeze(0)  # (total_nnz, vocab_size)

    logits_rmpad.div_(temperature)

    # compute entropy
    entropy_rmpad = self.compute_entropy_from_logits(logits_rmpad)  # ((total_nnz / sp) + pad)

    # if use_sp: ((total_nnz / sp) + pad) ; if not use_sp: (batch, seqlen)
    log_probs = logprobs_from_logits(logits=logits_rmpad, labels=input_ids_rmpad_rolled)

    # gather log_prob if sp > 1
    if self.use_ulysses_sp:
        # gather and unpad for the ulysses sp
        log_probs = gather_outpus_and_unpad(log_probs, gather_dim=0, unpad_dim=0, padding_size=pad_size)
        entropy_rmpad = gather_outpus_and_unpad(entropy_rmpad,
                                                gather_dim=0,
                                                unpad_dim=0,
                                                padding_size=pad_size)

    # pad back to (bsz, seqlen)
    full_entropy = pad_input(hidden_states=entropy_rmpad.unsqueeze(-1),
                             indices=indices,
                             batch=batch_size,
                             seqlen=seqlen)
    full_log_probs = pad_input(hidden_states=log_probs.unsqueeze(-1),
                               indices=indices,
                               batch=batch_size,
                               seqlen=seqlen)

    # only return response part:
    entropy = full_entropy.squeeze(-1)[:, -response_length - 1:-1]  # (bsz, response_length)
    log_probs = full_log_probs.squeeze(-1)[:, -response_length - 1:-1]  # (bsz, response_length)

Without this check, users may encounter OOM issues when they set sp_size > 1 but use_remove_padding is mistakenly disabled.
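The memory impact is easy to see from the snippet above: the pad-and-slice step under if self.use_ulysses_sp is nested inside if self.use_remove_padding, so with remove padding disabled each rank still processes the full padded sequence and full-size logits, no matter how large sp_size is. A rough back-of-envelope illustration (all sizes below are hypothetical, chosen only to show the order of magnitude):

# Illustrative only: compare per-rank logits memory with and without the
# remove-padding + Ulysses slicing path. All numbers below are made up.
batch, seqlen, vocab, sp_size = 8, 4096, 152064, 4   # hypothetical shapes
bytes_per_elem = 2                                    # bf16 logits

# Padded path: every rank holds logits for the full (batch, seqlen, vocab) tensor.
padded_bytes = batch * seqlen * vocab * bytes_per_elem

# Remove-padding + sp path: only real tokens, sliced across sp ranks.
total_nnz = int(batch * seqlen * 0.5)                 # assume ~50% of tokens are non-pad
rmpad_sp_bytes = (total_nnz // sp_size) * vocab * bytes_per_elem

print(f"padded logits per rank:          {padded_bytes / 2**30:.1f} GiB")
print(f"rmpad + ulysses logits per rank: {rmpad_sp_bytes / 2**30:.1f} GiB")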

@vermouth1992
Collaborator

Could you format the code by running bash script/format.sh?

@chujiezheng
Contributor Author

Done!

@@ -86,4 +86,13 @@ def check_mutually_exclusive(mbs, mbs_per_gpu, name: str):
     assert config.critic.ppo_mini_batch_size % config.critic.ppo_micro_batch_size == 0
     assert config.critic.ppo_micro_batch_size * sp_size >= n_gpus
 
+    # Check if use_remove_padding is enabled when using sequence parallelism
+    if config.actor_rollout_ref.actor.ulysses_sequence_parallel_size > 1:

Collaborator
It seems that the correct keys are config.actor_rollout_ref.model.use_remove_padding, critic.model.use_remove_padding, and reward_model.model.use_remove_padding.

Contributor Author
Fixed now!

Collaborator
Unfortunately, the key is still not correct :(

Collaborator
@chujiezheng The key should be actor_rollout_ref.model.use_remove_padding. Separate actor and ref keys are not necessary, as they share the same model type.

Contributor Author
@vermouth1992 @PeterSH6 Fixed now 🥹
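
For readers landing on this thread later: combining the hunk above with the key names settled in the review, the final validation plausibly looks something like the sketch below. This is not the exact merged diff; in particular it assumes critic and reward_model expose ulysses_sequence_parallel_size analogously to the actor, and that the reward-model check only matters when a reward model is enabled.

# Sketch of the intended check (not the exact merged code).
if config.actor_rollout_ref.actor.ulysses_sequence_parallel_size > 1:
    assert config.actor_rollout_ref.model.use_remove_padding, \
        "actor/ref: set actor_rollout_ref.model.use_remove_padding=True to use sequence parallelism"

if config.critic.ulysses_sequence_parallel_size > 1:  # assumed key, analogous to the actor
    assert config.critic.model.use_remove_padding, \
        "critic: set critic.model.use_remove_padding=True to use sequence parallelism"

if config.reward_model.ulysses_sequence_parallel_size > 1:  # assumed key, analogous to the actor
    assert config.reward_model.model.use_remove_padding, \
        "reward_model: set reward_model.model.use_remove_padding=True to use sequence parallelism"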

@vermouth1992 merged commit fb3793a into volcengine:main on Jan 31, 2025
10 checks passed