How to use thd format qkv with cp + packed_seq_params #1368

Open
Wraythh opened this issue Dec 12, 2024 · 4 comments

@Wraythh
Wraythh commented Dec 12, 2024

Suppose my dataset has sequences of lengths [4, 8, 6, 10] and I use CP=2 to split the data. I observe that TE computes cu_seqlen_q / cp_size on cu_seqlen_q. So I need to split each subsequence of the packed sequence into two halves and concatenate them, which gives each rank subsequences of lengths [2, 4, 3, 5]. In that case, should I pass cu_seqlen_q as [0, 4, 12, 18, 20] to both CP ranks, or is there a problem with this usage?

@xrennvidia
Collaborator

Hi @Wraythh

CP splits the sequence into CP*2 chunks, and each GPU gets 2 chunks (GPU0 gets the first and last chunks, GPU1 gets the second and second-to-last chunks, and so on). This balances the load under causal masking.
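The chunk assignment described above can be sketched in a few lines of Python. This is only an illustration, not TE code: the function names are hypothetical, and the cost model (chunk i attends to i + 1 chunks under a causal mask) is a simplifying assumption used to show why pairing the first and last chunks balances the work.

```python
from typing import Tuple


def chunks_for_rank(rank: int, cp_size: int) -> Tuple[int, int]:
    """Return the two chunk ids (out of 2 * cp_size) assigned to `rank`."""
    if not 0 <= rank < cp_size:
        raise ValueError("rank must be in [0, cp_size)")
    # Rank r takes the r-th chunk from the front and the r-th from the back.
    return rank, 2 * cp_size - 1 - rank


def causal_cost(rank: int, cp_size: int) -> int:
    """Rough cost model (assumption): chunk i attends to chunks 0..i."""
    return sum(chunk_id + 1 for chunk_id in chunks_for_rank(rank, cp_size))


if __name__ == "__main__":
    cp_size = 2
    for rank in range(cp_size):
        print(rank, chunks_for_rank(rank, cp_size), causal_cost(rank, cp_size))
```

With CP=2, rank 0 holds chunks 0 and 3 and rank 1 holds chunks 1 and 2, and under this cost model both ranks do the same amount of work.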

TE's THD+CP implementation splits each individual sequence of the packed sequence into CP*2 chunks, so you need to pad each individual sequence to a length that is divisible by CP*2. Here is an example of how we split the input.
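To make the per-sequence split concrete, here is a pure-Python sketch that computes which token positions of a packed buffer a given rank would keep. It is not TE's implementation and does not call tex.thd_get_partitioned_indices; the function name partition_indices is hypothetical, and the sketch assumes every sequence has already been padded to a multiple of CP*2.

```python
from typing import List


def partition_indices(cu_seqlens_padded: List[int], cp_size: int, rank: int) -> List[int]:
    """Token indices of the packed buffer kept by `rank` (illustrative only)."""
    indices: List[int] = []
    # Walk over the individual sequences inside the packed buffer.
    for start, end in zip(cu_seqlens_padded[:-1], cu_seqlens_padded[1:]):
        seqlen = end - start
        if seqlen % (2 * cp_size) != 0:
            raise ValueError(f"sequence length {seqlen} is not divisible by 2 * cp_size")
        chunk = seqlen // (2 * cp_size)
        # Each rank keeps two chunks per sequence: one from the front, one from the back.
        for chunk_id in (rank, 2 * cp_size - 1 - rank):
            first = start + chunk_id * chunk
            indices.extend(range(first, first + chunk))
    return indices


if __name__ == "__main__":
    # Padded lengths [4, 8, 8, 12] packed into one buffer of 32 tokens.
    cu_seqlens_padded = [0, 4, 12, 20, 32]
    for rank in range(2):
        print(rank, partition_indices(cu_seqlens_padded, cp_size=2, rank=rank))
```

Each rank ends up with half of every sequence, and the two ranks together cover every position of the packed buffer exactly once.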

You should pass [0, 4, 12, 18, 20] to the TE API; the CP code will handle everything under the hood. Padding each individual sequence to be divisible by CP*2 may leave padding between sequences, in which case you also need cu_seqlens_padded to describe it.

The TE CP unit test is a good reference for you.

Thanks.

@Wraythh
Author

Wraythh commented Dec 19, 2024

OK, thank you very much. What will happen if the length of an individual sequence is not divisible by CP*2? Will it cause the loss to blow up? I use the tex.thd_get_partitioned_indices API to split my sequence and pass cu_seqlen_q in a form like [0, 4, 12, 18, 20] to the TE API, but I found that the loss becomes NaN. Everything works fine when I don't pass the cu_seqlen_q parameter.

@xrennvidia
Collaborator

You need to pad each individual sequence to be divisible by CP*2 (refer here).

After you pad each sequence to meet the divisible requirement, you need both cu_seqlens and cu_seqlens_padded (refer here).
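As a rough sketch of the padding step, the following Python computes both offset lists from raw sequence lengths. It assumes cu_seqlens is the cumulative sum of the unpadded lengths and cu_seqlens_padded is the cumulative sum of the lengths after rounding each one up to a multiple of CP*2; the helper name build_cu_seqlens is hypothetical, and how the results are handed to TE (argument names, tensor dtype) is not shown here, so the TE CP unit test remains the reference for that.

```python
from itertools import accumulate
from typing import List, Tuple


def build_cu_seqlens(seq_lens: List[int], cp_size: int) -> Tuple[List[int], List[int]]:
    """Return (cu_seqlens, cu_seqlens_padded) for a packed batch (illustrative)."""
    multiple = 2 * cp_size
    # Round every length up to the next multiple of 2 * cp_size.
    padded_lens = [-(-n // multiple) * multiple for n in seq_lens]
    cu_seqlens = [0] + list(accumulate(seq_lens))
    cu_seqlens_padded = [0] + list(accumulate(padded_lens))
    return cu_seqlens, cu_seqlens_padded


if __name__ == "__main__":
    cu_seqlens, cu_seqlens_padded = build_cu_seqlens([4, 8, 6, 10], cp_size=2)
    print(cu_seqlens)         # [0, 4, 12, 18, 28]
    print(cu_seqlens_padded)  # [0, 4, 12, 20, 32]
```

For the lengths [4, 8, 6, 10] with CP=2, the sequences of length 6 and 10 are not divisible by 4, so they are padded to 8 and 12, and the two lists differ from the third entry onward.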

@Wraythh
Author

Wraythh commented Dec 23, 2024

Thank you very much
