[TFT] Cannot start training #1437

Open
jomach opened this issue Dec 2, 2024 · 0 comments
Labels
bug Something isn't working

Related to TFT/PyTorch

Describe the bug
I'm trying to add a new dataset to this framework using the YAML config shown under "To Reproduce". I ran into several different errors, but most of them look like this:

  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1445, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/workspace/models/tft_pyt/modeling.py", line 229, in forward
    t_observed_tgt = fused_pointwise_linear_v2(t_tgt_obs, self.t_tgt_embedding_vectors, self.t_tgt_embedding_bias)
RuntimeError: Error instantiating 'training.trainer.CTLTrainer' : The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/workspace/models/tft_pyt/modeling.py", line 89, in fused_pointwise_linear_v2
def fused_pointwise_linear_v2(x, a, b):
    out = x.unsqueeze(3) * a
    out = out + b
          ~~~~~~~ <--- HERE
    return out
**RuntimeError: CUDA error: device-side assert triggered**
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
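
Since the message itself warns that the reported location may be wrong, a rerun with synchronous kernel launches is the usual next step. Below is a minimal sketch, assuming the training is started as a subprocess. `blocking_env` and `run_blocking` are hypothetical helpers, and the actual launch command is not shown in this issue, so it has to be supplied by the caller.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # Kernel errors are then raised at the launching call instead of a later one.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd):
    """Run `cmd` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1; return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode
```

Calling `run_blocking([...])` with the usual training command should produce a traceback that points at the operation that actually failed, which may not be the `out + b` line shown here.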

To Reproduce
Dataset:

zeus@b8ae237f7dad:/workspace$ head /workspace/datasets/sosd/timeseries_datasetcs.csv 
AUDAT,MATNR,WERKS,total_quantity
2022-04-01,M00000000213903201,D110,1
2022-04-01,M00000000215022201,D110,5
2022-04-01,M00000000214593302,D110,3
2022-04-01,M00000000215043701,D110,5
2022-04-01,M00000000213449504,D110,0
2022-04-01,M00000000214319300,D110,0
2022-04-01,M00000000214385102,D110,10
2022-04-01,M00000000214180004,D110,0
2022-04-01,M00000000214458104,D110,20
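
A device-side assert is often caused by a categorical index that falls outside its embedding table, so counting the distinct values per categorical column is a cheap first check. This is a sketch using only the standard library. `unique_counts` is a hypothetical helper, and `SAMPLE` holds just the rows printed by `head`, not the full file.

```python
import csv
import io

# First rows of the dataset, copied from the `head` output above.
SAMPLE = """AUDAT,MATNR,WERKS,total_quantity
2022-04-01,M00000000213903201,D110,1
2022-04-01,M00000000215022201,D110,5
2022-04-01,M00000000214593302,D110,3
2022-04-01,M00000000215043701,D110,5
2022-04-01,M00000000213449504,D110,0
2022-04-01,M00000000214319300,D110,0
2022-04-01,M00000000214385102,D110,10
2022-04-01,M00000000214180004,D110,0
2022-04-01,M00000000214458104,D110,20
"""


def unique_counts(lines, columns):
    """Count distinct values per column in CSV text given as an iterable of lines."""
    seen = {c: set() for c in columns}
    for row in csv.DictReader(lines):
        for c in columns:
            seen[c].add(row[c])
    return {c: len(values) for c, values in seen.items()}


counts = unique_counts(io.StringIO(SAMPLE), ["MATNR", "WERKS"])
print(counts)  # {'MATNR': 9, 'WERKS': 1}
```

For the full file, pass `open(path, newline="")` instead of the `StringIO` object and compare the result with the `cardinality` and `time_series_count` values in the config. Whether the framework expects `cardinality` to equal the distinct count or to leave room for an extra index is not stated in this issue.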

config:

_target_: data.datasets.create_datasets
config:
    graph: False
    source_path: /workspace/datasets/sosd/timeseries_datasetcs.csv
    dest_path: /workspace/datasets/sosd/
    train_range:
      - '2022-04-01'
      - '2023-09-02'
    valid_range:
      - '2023-10-26'
      - '2024-02-15'
    test_range:
      - '2023-09-02'
      - '2023-10-26'
    scale_per_id: True
    encoder_length: 5
    input_length: 5
    example_length: 10
    dataset_stride: 1
    MultiID: False
    features:
    - name: 'MATNR'
      feature_type: 'ID'
      feature_embed_type: 'CATEGORICAL'
      cardinality: 70908
    - name: 'MATNR'
      feature_type: 'STATIC'
      feature_embed_type: 'CATEGORICAL'
      cardinality: 70908      
    - name: 'WERKS'
      feature_type: 'ID'
      feature_embed_type: 'CATEGORICAL'
      cardinality: 1
    - name: 'AUDAT'
      feature_type: 'TIME'
      feature_embed_type: 'DATE'
    - name: 'WERKS'
      feature_type: 'KNOWN'
      feature_embed_type: 'CATEGORICAL'
      cardinality: 1
    - name: 'total_quantity'
      feature_type: 'TARGET'
      feature_embed_type: 'CONTINUOUS'
      scaler:
        _target_: sklearn.preprocessing.StandardScaler
    train_samples: 619765
    valid_samples: 174172
    binarized: True
    time_series_count: 70908

Expected behavior
Training starts without errors.

Environment

  • NVIDIA-SMI 535.216.03
  • Driver Version: 535.216.03
  • CUDA Version: 12.2