Under the latest version of Google Colab, I ran into an issue when running my usual training script with simpletransformers:
# Training XLNet
model.train_model(train_df = train_df, eval_df = eval_df, r2 = r2, pearson_corr = pearson_corr, mae = sklearn.metrics.mean_absolute_error)
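For reproduction purposes, the setup leading up to that call looks roughly like the sketch below. The model name, args, and metric definitions here are illustrative placeholders rather than my exact configuration; train_df and eval_df are pandas DataFrames in the usual simpletransformers text/labels format.

import sklearn.metrics
from scipy.stats import pearsonr
from simpletransformers.classification import ClassificationModel

# Metric callables passed through to train_model as keyword arguments.
def r2(labels, preds):
    return sklearn.metrics.r2_score(labels, preds)

def pearson_corr(labels, preds):
    return pearsonr(labels, preds)[0]

# Illustrative regression setup; "xlnet-base-cased" and the args values are
# placeholders. save_model_every_epoch is the code path where the crash below
# is raised.
model = ClassificationModel(
    "xlnet",
    "xlnet-base-cased",
    num_labels=1,
    args={"regression": True, "save_model_every_epoch": True},
)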
The following error occurs:
ValueError Traceback (most recent call last)
<ipython-input-7-34d9c76ad35f> in <cell line: 132>()
130
131 # Training XLNet
--> 132 model.train_model(train_df = train_df, eval_df = eval_df, r2 = r2, pearson_corr = pearson_corr, mae = sklearn.metrics.mean_absolute_error)
133
134 # Evaluating the selected trained version of XLNet. The evaluation metrics and
7 frames
/usr/local/lib/python3.10/dist-packages/simpletransformers/classification/classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args, eval_df, verbose, **kwargs)
628 os.makedirs(output_dir, exist_ok=True)
629
--> 630 global_step, training_details = self.train(
631 train_dataloader,
632 output_dir,
/usr/local/lib/python3.10/dist-packages/simpletransformers/classification/classification_model.py in train(self, train_dataloader, output_dir, multi_label, show_running_loss, eval_df, test_df, verbose, **kwargs)
1165
1166 if args.save_model_every_epoch:
-> 1167 self.save_model(output_dir_current, optimizer, scheduler, model=model)
1168
1169 if args.evaluate_during_training and args.evaluate_each_epoch:
/usr/local/lib/python3.10/dist-packages/simpletransformers/classification/classification_model.py in save_model(self, output_dir, optimizer, scheduler, model, results)
2457 # Take care of distributed/parallel training
2458 model_to_save = model.module if hasattr(model, "module") else model
-> 2459 model_to_save.save_pretrained(output_dir)
2460 self.tokenizer.save_pretrained(output_dir)
2461 torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in save_pretrained(self, save_directory, is_main_process, state_dict, save_function, push_to_hub, max_shard_size, safe_serialization, variant, token, save_peft_format, **kwargs)
2791 # At some point we will need to deal better with save_function (used for TPU and other distributed
2792 # joyfulness), but for now this enough.
-> 2793 safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
2794 else:
2795 save_function(shard, os.path.join(save_directory, shard_file))
/usr/local/lib/python3.10/dist-packages/safetensors/torch.py in save_file(tensors, filename, metadata)
284 ```
285 """
--> 286 serialize_file(_flatten(tensors), filename, metadata=metadata)
287
288
/usr/local/lib/python3.10/dist-packages/safetensors/torch.py in _flatten(tensors)
494 )
495
--> 496 return {
497 k: {
498 "dtype": str(v.dtype).split(".")[-1],
/usr/local/lib/python3.10/dist-packages/safetensors/torch.py in <dictcomp>(.0)
498 "dtype": str(v.dtype).split(".")[-1],
499 "shape": v.shape,
--> 500 "data": _tobytes(v, k),
501 }
502 for k, v in tensors.items()
/usr/local/lib/python3.10/dist-packages/safetensors/torch.py in _tobytes(tensor, name)
412
413 if not tensor.is_contiguous():
--> 414 raise ValueError(
415 f"You are trying to save a non contiguous tensor: `{name}` which is not allowed. It either means you"
416 " are trying to save tensors which are reference of each other in which case it's recommended to save"
ValueError: You are trying to save a non contiguous tensor: `transformer.layer.0.ff.layer_1.weight` which is not allowed. It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors, and reslice at load time, or simply call `.contiguous()` on your tensor to pack it before saving.
This error seems to stem from changes introduced in the newest version of transformers, 4.44.2. After downgrading to version 4.42.4, the error no longer appears and my usual training script runs as normal.
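For anyone hitting the same thing before a proper fix lands, two workarounds are sketched below. The first is what I actually did; the second is an untested guess that follows the error message's own suggestion, and it assumes the underlying Hugging Face model is reachable at model.model, which is how simpletransformers appears to store it.

# Workaround 1: pin transformers to the last release that worked for me,
# in a Colab cell before importing simpletransformers.
# !pip install transformers==4.42.4

# Workaround 2 (untested): pack all parameters into contiguous memory before
# training, so safetensors' save_file no longer rejects the XLNet weights.
for param in model.model.parameters():
    param.data = param.data.contiguous()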
Apologies for the lack of further detail describing the problem; digging into the internals of simpletransformers and transformers is outside my area of expertise.