
model.train_model(...) raises ValueError with latest version of transformers, preventing model training and evaluation #1579

Open
PaulTran47 opened this issue Sep 5, 2024 · 0 comments


On the latest version of Google Colab, I ran into an issue when running my usual simpletransformers training script:

# Training XLNet
model.train_model(train_df = train_df, eval_df = eval_df, r2 = r2, pearson_corr = pearson_corr, mae = sklearn.metrics.mean_absolute_error)
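
For context, a minimal setup along these lines would produce the call above. The model name and argument values here are placeholders rather than my exact configuration, but the custom metrics are passed to train_model as keyword arguments in the way simpletransformers documents (each metric function takes the true labels first and the predictions second):

# Illustrative setup (placeholders, not my exact script): a regression
# ClassificationModel whose train_model call matches the one above.
import sklearn.metrics
from scipy.stats import pearsonr
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Custom metrics receive (labels, preds), per the simpletransformers docs.
def r2(labels, preds):
    return sklearn.metrics.r2_score(labels, preds)

def pearson_corr(labels, preds):
    return pearsonr(labels, preds)[0]

model_args = ClassificationArgs(
    regression=True,              # single-output regression head
    save_model_every_epoch=True,  # the save path where the crash occurs
    evaluate_during_training=True,
)
model = ClassificationModel("xlnet", "xlnet-base-cased", num_labels=1, args=model_args)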

The following error occurs:

ValueError                                Traceback (most recent call last)
<ipython-input-7-34d9c76ad35f> in <cell line: 132>()
    130 
    131 # Training XLNet
--> 132 model.train_model(train_df = train_df, eval_df = eval_df, r2 = r2, pearson_corr = pearson_corr, mae = sklearn.metrics.mean_absolute_error)
    133 
    134 # Evaluating the selected trained version of XLNet. The evaluation metrics and

7 frames
/usr/local/lib/python3.10/dist-packages/simpletransformers/classification/classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args, eval_df, verbose, **kwargs)
    628         os.makedirs(output_dir, exist_ok=True)
    629 
--> 630         global_step, training_details = self.train(
    631             train_dataloader,
    632             output_dir,

/usr/local/lib/python3.10/dist-packages/simpletransformers/classification/classification_model.py in train(self, train_dataloader, output_dir, multi_label, show_running_loss, eval_df, test_df, verbose, **kwargs)
   1165 
   1166             if args.save_model_every_epoch:
-> 1167                 self.save_model(output_dir_current, optimizer, scheduler, model=model)
   1168 
   1169             if args.evaluate_during_training and args.evaluate_each_epoch:

/usr/local/lib/python3.10/dist-packages/simpletransformers/classification/classification_model.py in save_model(self, output_dir, optimizer, scheduler, model, results)
   2457             # Take care of distributed/parallel training
   2458             model_to_save = model.module if hasattr(model, "module") else model
-> 2459             model_to_save.save_pretrained(output_dir)
   2460             self.tokenizer.save_pretrained(output_dir)
   2461             torch.save(self.args, os.path.join(output_dir, "training_args.bin"))

/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in save_pretrained(self, save_directory, is_main_process, state_dict, save_function, push_to_hub, max_shard_size, safe_serialization, variant, token, save_peft_format, **kwargs)
   2791                 # At some point we will need to deal better with save_function (used for TPU and other distributed
   2792                 # joyfulness), but for now this enough.
-> 2793                 safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
   2794             else:
   2795                 save_function(shard, os.path.join(save_directory, shard_file))

/usr/local/lib/python3.10/dist-packages/safetensors/torch.py in save_file(tensors, filename, metadata)
    284     ```
    285     """
--> 286     serialize_file(_flatten(tensors), filename, metadata=metadata)
    287 
    288 

/usr/local/lib/python3.10/dist-packages/safetensors/torch.py in _flatten(tensors)
    494         )
    495 
--> 496     return {
    497         k: {
    498             "dtype": str(v.dtype).split(".")[-1],

/usr/local/lib/python3.10/dist-packages/safetensors/torch.py in <dictcomp>(.0)
    498             "dtype": str(v.dtype).split(".")[-1],
    499             "shape": v.shape,
--> 500             "data": _tobytes(v, k),
    501         }
    502         for k, v in tensors.items()

/usr/local/lib/python3.10/dist-packages/safetensors/torch.py in _tobytes(tensor, name)
    412 
    413     if not tensor.is_contiguous():
--> 414         raise ValueError(
    415             f"You are trying to save a non contiguous tensor: `{name}` which is not allowed. It either means you"
    416             " are trying to save tensors which are reference of each other in which case it's recommended to save"

ValueError: You are trying to save a non contiguous tensor: `transformer.layer.0.ff.layer_1.weight` which is not allowed. It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors, and reslice at load time, or simply call `.contiguous()` on your tensor to pack it before saving.
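
As the error message itself suggests, a possible interim workaround (an untested sketch on my part, not something from either library's docs) is to pack the underlying model's parameters into contiguous memory before the save is triggered:

# Untested workaround sketch: force every parameter to be contiguous so that
# safetensors' serialize_file() does not reject it. Run this after creating
# the ClassificationModel and before calling train_model; model.model is the
# underlying Hugging Face model that save_pretrained() serializes.
for param in model.model.parameters():
    param.data = param.data.contiguous()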

This error seems to stem from changes introduced in the newest version of transformers, 4.44.2. After downgrading to version 4.42.4, the error no longer appears and my usual training script runs as normal.
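
For anyone hitting the same crash, the pin that works for me in a Colab cell is below (restart the runtime afterwards so the downgraded version is actually imported):

!pip install transformers==4.42.4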

Apologies for the lack of further detail describing the problem; digging into the internals of both simpletransformers and transformers is outside my area of expertise.
