
The provided qkv memory layout is not supported! When using RoPE #681

Closed
1049451037 opened this issue Feb 26, 2024 · 14 comments

@1049451037

1049451037 commented Feb 26, 2024

I think this problem was solved before, but now it has come back.

How can it be fixed? I have tested both the stable branch and the main branch; neither works.

#544
#455

I just ran the official Megatron-LM text generation example with the --position-embedding-type rope and --no-position-embedding args added:

https://github.com/NVIDIA/Megatron-LM/blob/main/examples/run_text_generation_server_345M.sh

This produced the error: The provided qkv memory layout is not supported!

Moreover, I use the mcore model instead of the legacy model, so you need to switch to it in text_generation_server.py to reproduce the error.

@1049451037
Author

I solved the problem with a hard-coded workaround:

NVIDIA/Megatron-LM#703 (comment)

@cyanguwa
Collaborator

cyanguwa commented Mar 1, 2024

Hi @1049451037, could you provide more details on how to run the job, please? Currently I can start the job but am stuck at:

 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 76705792
 * Serving Flask app 'megatron.text_generation_server'
 * Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://10.176.10.102:5000
INFO:werkzeug:Press CTRL+C to quit

Could you send me your complete run_text_generation_server_345M.sh script please?

@1049451037
Author

I don't see any problem in your log. You have started the server; you just need to start a client using https://github.com/NVIDIA/Megatron-LM/blob/main/tools/text_generation_cli.py:

python tools/text_generation_cli.py 10.176.10.102:5000
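If you prefer not to use the CLI, the server can also be queried directly over HTTP. A minimal sketch with the requests package (the PUT /api endpoint and JSON body are taken from the request logged later in this thread; adjust the address to your server):

import requests

# Hypothetical standalone client, equivalent to the CLI above.
resp = requests.put(
    "http://10.176.10.102:5000/api",
    json={"prompts": ["I am"], "tokens_to_generate": 10},
)
print(resp.json())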

@stgzr

stgzr commented Mar 5, 2024

Same issue. Any solutions?

@cyanguwa
Collaborator

cyanguwa commented Mar 9, 2024

I tested with TE main (8255f87) and Megatron-LM main (8957468) and I'm not seeing the issue above. Let me know if I'm not using the same run script as yours.

Thanks.

# container: nvcr.io/nvidia/pytorch:24.02-py3
# install the latest TransformerEngine and Megatron-LM

$ cat examples/run_text_generation_server_345M.sh
#!/bin/bash
# This example will start serving the 345M model.
DISTRIBUTED_ARGS="--nproc_per_node 1 \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT="" #<Path to checkpoint (e.g /345m)>
VOCAB_FILE=/code/gpt2-vocab.json
MERGE_FILE=/code/gpt2-merges.txt
DATA_PATH=/code/ds/ThePile/BookCorpus2_ftfy_cleaned_id_shuf_text_document

export CUDA_DEVICE_MAX_CONNECTIONS=1

pip install flask-restful

#torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py   \
torchrun tools/run_text_generation_server.py   \
       --tensor-model-parallel-size 1  \
       --pipeline-model-parallel-size 1  \
       --num-layers 2  \
       --hidden-size 1024  \
       --num-attention-heads 16  \
       --max-position-embeddings 1024  \
       --tokenizer-type GPT2BPETokenizer  \
       --position-embedding-type rope \
       --no-position-embedding \
       --fp16  \
       --micro-batch-size 1  \
       --seq-length 1024  \
       --vocab-file $VOCAB_FILE  \
       --merge-file $MERGE_FILE  \
       --data-path $DATA_PATH \
       --use-mcore-models \
       --micro-batch-size 2 \
       --global-batch-size 8 \
       --lr 0.00015 \
       --train-iters 50 \
       --lr-decay-iters 320000 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --lr-warmup-fraction .01 \
       --clip-grad 1.0 \
       --seed 42

$ pip list | grep transformer
transformer-engine        1.5.0.dev0

$ bash examples/run_text_generation_server_345M.sh
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://10.32.113.152:5000
INFO:werkzeug:Press CTRL+C to quit
request IP: 10.32.113.152
{"prompts": ["I am"], "tokens_to_generate": 10}
start time:  2024-03-09 04:22:05.314579
INFO:werkzeug:10.32.113.152 - - [09/Mar/2024 04:22:10] "PUT /api HTTP/1.1" 200 -

$ python tools/text_generation_cli.py 10.32.113.152:5000
Enter prompt: I am
Enter number of tokens to generate: 10
Megatron Response: 
I am perennlington vehiclesigning protagonistlon Peng surreal nostalgia ignorant
Enter prompt: 

@1049451037
Author

1049451037 commented Mar 13, 2024

You won't see the problem if you just run the example, because the example inference does not use the MCore model. It uses the legacy model, as you can see in model_provider.

@1049451037
Author

1049451037 commented Mar 13, 2024

@cyanguwa You can replace the model provider in the text generation server with the following to reproduce the error:

from megatron import get_args, print_rank_0  # used below; already imported at module level in run_text_generation_server.py
from megatron.core.models.gpt import GPTModel
import megatron.model
from megatron.training import get_model
from megatron.arguments import core_transformer_config_from_args
from megatron.text_generation_server import MegatronServer
from megatron.text_generation import generate_and_post_process
from megatron.text_generation import beam_search_and_post_process
import torch

from megatron.core.transformer.spec_utils import import_module
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec

def model_provider(pre_process=True, post_process=True):
    """Build the model."""
    args = get_args()

    print_rank_0('building GPT model ...')
    config = core_transformer_config_from_args(get_args())

    if args.use_mcore_models:
        print("building megatron core model!!!!!!!!!!!!!!")
        if args.spec is not None:
            transformer_layer_spec = import_module(args.spec)
        else:
            transformer_layer_spec = get_gpt_layer_with_transformer_engine_spec(args.num_experts, args.moe_grouped_gemm)

        model = GPTModel(
            config=config,
            transformer_layer_spec=transformer_layer_spec,
            vocab_size=args.padded_vocab_size,
            max_sequence_length=args.max_position_embeddings,
            pre_process=pre_process,
            post_process=post_process,
            fp16_lm_cross_entropy=args.fp16_lm_cross_entropy,
            parallel_output=False,
            share_embeddings_and_output_weights=not args.untie_embeddings_and_output_weights,
            position_embedding_type=args.position_embedding_type,
            rotary_percent=args.rotary_percent,
        )
    else:
        print("building megatron legacy model!!!!!!!!!!!!!!")
        assert False, "Never do this!"
        assert(args.context_parallel_size == 1), "Context parallelism is only supported with Megatron Core!"

        model = megatron.model.GPTModel(
            config,
            num_tokentypes=0,
            parallel_output=False,
            pre_process=pre_process,
            post_process=post_process
        )
    return model

@yaox12
Collaborator

yaox12 commented Mar 14, 2024

We're aware of this bug and will push a fix to MCore.
For now, you can add the following code at https://github.com/NVIDIA/Megatron-LM/blob/89574689447d694bb19dd86fc8a6153b4467ba9d/megatron/core/transformer/custom_layers/transformer_engine.py#L464:

        # In PyTorch, the following two tensors are in fact the same:
        #   Tensor with shape (1, S, H, D) and stride (S*H*D, H*D, D, 1)
        #   Tensor with shape (1, S, H, D) and stride (H*D, H*D, D, 1)
        # We unify them to the first one to pass the stride check in TE
        if value.shape == key.shape and value.stride() != key.stride():
            value = value.as_strided(value.shape, key.stride())
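For background on why this is safe: when the leading (batch) dimension has size 1, its stride never affects which element is addressed, so the two layouts alias exactly the same memory. A standalone PyTorch sketch (illustration only, not Megatron/TE code):

import torch

S, H, D = 4, 2, 8
base = torch.arange(S * H * D, dtype=torch.float32)

# The two stride patterns mentioned in the comment above, over the same data.
a = base.as_strided((1, S, H, D), (S * H * D, H * D, D, 1))
b = base.as_strided((1, S, H, D), (H * D, H * D, D, 1))

assert torch.equal(a, b)          # identical contents
assert a.stride() != b.stride()   # but different stride metadata

# The workaround rewrites b's metadata to match a's, without copying any data.
b_fixed = b.as_strided(b.shape, a.stride())
assert b_fixed.stride() == a.stride() and torch.equal(b_fixed, a)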

@1049451037
Author

No, this doesn't fix the bug. It makes inference work, but it breaks training (the training loss does not converge).

@1049451037
Author

My workaround for now is to add the as_strided call for inference and comment that line out during training... waiting for a more elegant official fix.
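One way to avoid manually commenting the line in and out would be to gate the rewrite on the module's training flag. A hypothetical variant of the snippet above (not the official fix; it assumes the surrounding method belongs to a torch.nn.Module so self.training is available):

        # Apply the stride rewrite only in eval/inference mode, leaving training untouched.
        if (not self.training
                and value.shape == key.shape
                and value.stride() != key.stride()):
            value = value.as_strided(value.shape, key.stride())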

@stgzr

stgzr commented Mar 14, 2024

Maybe it is the qkv_format; you can check whether the tensor format is sbhd or bshd.
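A quick way to see which layout is actually arriving is to print the q/k/v shapes just before the failing call (a hypothetical debug print, assuming the tensors are named query, key, value at that point): with sbhd the first dimension is the sequence length and the second is the micro-batch size, with bshd they are swapped.

print("q shape/stride:", query.shape, query.stride())
print("k shape/stride:", key.shape, key.stride())
print("v shape/stride:", value.shape, value.stride())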

@1049451037
Author

It's sbhd; this is just the official training and inference code in Megatron.

@ptrendx
Member

ptrendx commented May 16, 2024

I believe the MCore issue is fixed now. Is that correct, @yaox12? Can we close this issue?

@1049451037
Author

Yes, the issue is fixed in the latest main branch of Megatron-LM.
