CUDA error when enabling compile_prefill for an int8-quantized model

Repro command:
```
python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
```

Errors:
```
(pt) [ybliang@devgpu002.ash8 ~/local/gpt-fast (main)]$ python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
/home/ybliang/local/miniconda3/envs/pt/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
Using device=cuda
Loading model ...
Using int8 weight-only quantization!
Time to load model: 6.15 seconds
/home/ybliang/local/pytorch/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  warnings.warn(
unknown:0: unknown: block: [0,0,0], thread: [128,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [129,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [130,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [131,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [132,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [133,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [134,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [135,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [136,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [137,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [138,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [139,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [140,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [141,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [142,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [143,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [144,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [145,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [146,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [147,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [148,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [149,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [150,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [151,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [152,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [153,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [154,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [155,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [156,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [157,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [158,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [159,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [192,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [193,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [194,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [195,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [196,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [197,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [198,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [199,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [200,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [201,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [202,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [203,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [204,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [205,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [206,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [207,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [208,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [209,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [210,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [211,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [212,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [213,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [214,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [215,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [216,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [217,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [218,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [219,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [220,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [221,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [222,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [223,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [160,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [161,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [162,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [163,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [164,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [165,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [166,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [167,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [168,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [169,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [170,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [171,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [172,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [173,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [174,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [175,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [176,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [177,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [178,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [179,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [180,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [181,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [182,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [183,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [184,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [185,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [186,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [187,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [188,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [189,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [190,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [191,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [64,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [65,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [66,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [67,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [68,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [69,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [70,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [71,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [72,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [73,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [74,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [75,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [76,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [77,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [78,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [79,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [80,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [81,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [82,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [83,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [84,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [85,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [86,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [87,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [88,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [89,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [90,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [91,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [92,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [93,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [94,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [95,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [224,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [225,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [226,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [227,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [228,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [229,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [230,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [231,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [232,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [233,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [234,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [235,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [236,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [237,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [238,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [239,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [240,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [241,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [242,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [243,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [244,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [245,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [246,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [247,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [248,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [249,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [250,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [251,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [252,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [253,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [254,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [255,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [32,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [33,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [34,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [35,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [36,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [37,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [38,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [39,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [40,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [41,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [42,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [43,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [44,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [45,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [46,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [47,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [48,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [49,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [50,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [51,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [52,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [54,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [56,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [57,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [58,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [59,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [60,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [61,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [62,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [63,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [0,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [1,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [2,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [3,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [5,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [7,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [8,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [9,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [10,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [11,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [12,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [13,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [14,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [15,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [16,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [17,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [18,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [19,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [20,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [21,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [22,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [23,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [24,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [25,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [26,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [27,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [28,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [29,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [30,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [31,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [96,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [97,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [98,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [99,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [100,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [101,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [102,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [103,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [104,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [105,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [106,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [107,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [108,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [109,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [110,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [111,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [112,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [113,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [114,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [115,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [116,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [117,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [118,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [119,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [120,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [121,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [122,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [123,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [124,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [125,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [126,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [127,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
Traceback (most recent call last):
  File "/data/users/ybliang/gpt-fast/generate.py", line 421, in <module>
    main(
  File "/data/users/ybliang/gpt-fast/generate.py", line 359, in main
    y, metrics = generate(
  File "/home/ybliang/local/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/users/ybliang/gpt-fast/generate.py", line 202, in generate
    generated_tokens, _ = decode_n_tokens(model, next_token.view(1, -1), input_pos, max_new_tokens - 1, callback=callback, **sampling_kwargs)
  File "/data/users/ybliang/gpt-fast/generate.py", line 74, in decode_n_tokens
    next_token, next_prob = decode_one_token(
  File "/home/ybliang/local/pytorch/torch/_dynamo/eval_frame.py", line 450, in _fn
    return fn(*args, **kwargs)
  File "/data/users/ybliang/gpt-fast/generate.py", line 64, in decode_one_token
    def decode_one_token(model: Transformer, x: torch.Tensor, input_pos: torch.Tensor, **sampling_kwargs) -> Tuple[torch.Tensor, torch.Tensor]:
  File "/home/ybliang/local/pytorch/torch/_dynamo/eval_frame.py", line 450, in _fn
    return fn(*args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_dynamo/external_utils.py", line 36, in inner
    return fn(*args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_functorch/aot_autograd.py", line 917, in forward
    return compiled_fn(full_args)
  File "/home/ybliang/local/pytorch/torch/_functorch/_aot_autograd/utils.py", line 89, in g
    return f(*args)
  File "/home/ybliang/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 106, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
  File "/home/ybliang/local/pytorch/torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "/home/ybliang/local/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 152, in rng_functionalization_wrapper
    return compiled_fw(args)
  File "/home/ybliang/local/pytorch/torch/_inductor/codecache.py", line 906, in __call__
    return self.get_current_callable()(inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/compile_fx.py", line 838, in run
    return compiled_fn(new_inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 383, in deferred_cudagraphify
    fn, out = cudagraphify(model, inputs, new_static_input_idxs, *args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 411, in cudagraphify
    return manager.add_function(
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 1943, in add_function
    return fn, fn(inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 1757, in run
    out = self._run(new_inputs, function_id)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 1798, in _run
    return self.run_eager(new_inputs, function_id)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 1913, in run_eager
    return node.run(new_inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 616, in run
    out = self.wrapped_function.model(new_inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/codecache.py", line 934, in _run_from_cache
    return compiled_graph.compiled_artifact(inputs)
  File "/tmp/torchinductor_ybliang/mi/cmiek2ltsrliaqercc2b6xcfebjyeel2kxpgdgc65xbyxpekhh5j.py", line 2020, in call
    triton_red_fused_add_bmm_embedding_mm_mul_11.run(buf19, arg75_1, buf20, arg77_1, arg78_1, arg455_1, arg65_1, buf16, arg73_1, arg79_1, buf22, 4096, 11008, grid=grid(4096), stream=stream0)
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 635, in run
    self.autotune_to_one_config(*args, grid=grid, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 531, in autotune_to_one_config
    timings = self.benchmark_all_configs(*args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 507, in benchmark_all_configs
    timings = {
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 508, in <dictcomp>
    launcher: self.bench(launcher, *args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 479, in bench
    return do_bench(kernel_call, rep=40, fast_flush=True)
  File "/home/ybliang/local/pytorch/torch/_inductor/utils.py", line 170, in do_bench
    return triton_do_bench(*args, **kwargs)[0]
  File "/data/users/ybliang/triton/python/triton/testing.py", line 101, in do_bench
    torch.cuda.synchronize()
  File "/home/ybliang/local/pytorch/torch/cuda/__init__.py", line 792, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Generated kernel file: https://gist.github.com/yanboliang/6f5c1171e63909b995b5372dc7c88ab7
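
The assertion in the log, `0 <= tmp4 < 32000`, is a bounds check on an index, and the failing kernel name contains `embedding`. One possible reading (an assumption, not confirmed here) is that an index fed to a 32000-row embedding lookup falls outside the table. A minimal sketch of the same check on the host side, in plain Python, could help narrow down which positions are invalid. The helper name `find_out_of_range` and the sample ids are hypothetical and are not part of gpt-fast:

```python
from typing import Iterable, List, Tuple

# Upper bound copied from the assertion message; assumed to be the vocab size.
VOCAB_SIZE = 32000


def find_out_of_range(
    token_ids: Iterable[int], vocab_size: int = VOCAB_SIZE
) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that violate 0 <= id < vocab_size."""
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if not 0 <= tok < vocab_size
    ]


# Hypothetical ids: the last two fall outside the valid range.
bad = find_out_of_range([1, 15043, 31999, 32000, -1])
print(bad)  # [(3, 32000), (4, -1)]
```

If the ids passed into the compiled decode step turn out to be in range, the out-of-bounds value is more likely produced inside the generated kernel itself, which is what the linked kernel file would need to confirm.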


CUDA error if enabling compile_prefill for quantization model (int8) #137
