1.CUDA error: CUBLAS_STATUS_INTERNAL_ERROR

  • Problem log

    CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
  • Cause: the system CUDA version does not match the CUDA version the installed PyTorch build was compiled against

  • Diagnosis

    • Run nvcc -V to check the system CUDA version

    • Start a Python interpreter and check which CUDA version PyTorch was built with:

      import torch
      print(torch.version.cuda)
    • If the two versions differ (e.g., one reports 12.1 and the other 12.4), the installed PyTorch does not match your CUDA toolkit
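
    • To confirm, a small bf16 batched matrix multiply exercises the same cublasGemmStridedBatchedEx path shown in the log (a minimal sketch; it assumes a visible CUDA GPU, and on a broken install it fails with the same error):

      import torch

      # Batched bf16 GEMM on the GPU dispatches to cublasGemmStridedBatchedEx
      a = torch.randn(4, 64, 64, dtype=torch.bfloat16, device="cuda")
      b = torch.randn(4, 64, 64, dtype=torch.bfloat16, device="cuda")
      print(torch.bmm(a, b).shape)  # succeeds on a healthy install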

  • Solution

    • Go to the official PyTorch website and look up the install command that matches your CUDA version

    • For example, the command for CUDA 12.1 is:

      pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    • If PyTorch is already installed, append --upgrade to the command to overwrite the existing installation
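
    • After reinstalling, a quick sanity check confirms the build now matches (a sketch; the version strings in the comments are illustrative):

      import torch

      print(torch.__version__)          # e.g. 2.5.1+cu121
      print(torch.version.cuda)         # should now match nvcc -V, e.g. 12.1
      print(torch.cuda.is_available())  # True on a working install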

2.tokenizer.chat_template is not set and no template argument was passed

  • Problem log

    Error in applying chat template from request: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
  • Cause: the model's tokenizer_config.json does not contain a chat_template field

  • Diagnosis

    • Go to the model directory and open the tokenizer_config.json file

    • Check whether it contains a chat_template field, like the following:

      {
        "chat_template": "{{ '<|begin_of_text|>' }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% endif %}{% if system_message is defined %}{{ '<|start_header_id|>system<|end_header_id|>\n\n' + system_message + '<|eot_id|>' }}{% endif %}{% for message in loop_messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|start_header_id|>user<|end_header_id|>\n\n' + content + '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|eot_id|>' }}{% endif %}{% endfor %}"
      }
    • If the field is missing, the tokenizer has no chat template, and any call to tokenizer.apply_chat_template will raise this error

    • For example, the Llama-3.1-8B base model ships without a chat_template, while Llama-3.1-8B-Instruct includes one
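
    • The same check can also be done programmatically (a sketch; the model path is a placeholder for your local directory or hub ID):

      from transformers import AutoTokenizer

      tok = AutoTokenizer.from_pretrained("/path/to/model")  # hypothetical path
      print(tok.chat_template)  # None means no template is configured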

  • Solution

    • TBD; one possible workaround is sketched below
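
    • One option (a sketch, not an official fix) is to assign a template string to tokenizer.chat_template before calling apply_chat_template. The Jinja template below is a hypothetical minimal example, not the official Llama 3.1 template:

      from transformers import AutoTokenizer

      tok = AutoTokenizer.from_pretrained("/path/to/Llama-3.1-8B")  # hypothetical path

      # Hypothetical minimal template: one "role: content" line per turn
      tok.chat_template = (
          "{% for message in messages %}"
          "{{ message['role'] + ': ' + message['content'] + '\\n' }}"
          "{% endfor %}"
      )

      print(tok.apply_chat_template(
          [{"role": "user", "content": "Hello"}], tokenize=False
      ))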

3.NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
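
  • Solution

    • As the message itself says, set the two NCCL variables before any distributed initialization, or launch with accelerate launch, which sets them automatically. A minimal sketch for the top of a training script (the variable names come from the error text):

      import os

      # Disable P2P and InfiniBand transports, per the error message
      os.environ["NCCL_P2P_DISABLE"] = "1"
      os.environ["NCCL_IB_DISABLE"] = "1"

      import torch  # import torch after setting the variables, before NCCL initializes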