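Putting a Huawei Ascend 310P to Use
Note: the 310P does not support BF16 or W4A4.
Bandwidth is 200 GB/s. The dual-chip version, the 300I Duo, comes in 48 GB and 96 GB variants.
None of the Ascend cards currently on the market support FP8.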
Final result after performance tuning:
Qwen3-8B-W8A8
TPS: 15 tokens/s
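As a rough sanity check on the 15 tokens/s figure, single-stream decoding is often approximated as bandwidth-bound: every generated token reads all weights once, so tokens/s is at most bandwidth divided by weight size. The sketch below takes the bandwidth figure quoted above as 200 GB/s of memory bandwidth on one chip and assumes W8A8 weights for an 8B model occupy about 8 GB (1 byte per parameter). Both are assumptions rather than measurements, KV cache traffic is ignored, and the function names are only illustrative.

```shell
#!/usr/bin/env bash
# Rough bandwidth-bound ceiling for single-stream decode throughput.
# Assumptions (not measurements): 200 GB/s is per-chip memory bandwidth,
# and W8A8 weights take ~1 byte per parameter (8e9 params -> ~8 GB).

# decode_ceiling BANDWIDTH_GBPS WEIGHT_GB -> upper bound in tokens/s
decode_ceiling() {
  awk -v bw="$1" -v w="$2" 'BEGIN { printf "%.1f\n", bw / w }'
}

# efficiency MEASURED_TPS CEILING_TPS -> measured/ceiling as a percentage
efficiency() {
  awk -v m="$1" -v c="$2" 'BEGIN { printf "%.0f\n", 100 * m / c }'
}

ceiling=$(decode_ceiling 200 8)
echo "ceiling: ${ceiling} tokens/s"
echo "measured 15 tokens/s is $(efficiency 15 "$ceiling")% of that ceiling"
```

Under these assumptions the measured throughput sits somewhat below the ceiling, which is the kind of gap one would expect once attention, KV cache reads, and kernel launch overhead are included.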
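According to the Ascend PyTorch graph-mode usage guide and the vllm-ascend source, graph mode offers two modes: reduce-overhead and max-autotune. reduce-overhead supports only the 910B and 910C, and vllm-ascend hard-codes reduce-overhead, so that path does not apply to the 310P.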
MindIE + Qwen3-8B-W8A8
1. Launch the container on the host

    docker run -it -d --net=host --shm-size=16g \
      --name mindie-qwen3-8b-310p \
      -w /workspace/MindIE-LLM/examples/atb_models \
      --device=/dev/davinci0:rwm \
      --device=/dev/davinci1:rwm \
      --device=/dev/davinci2:rwm \
      --device=/dev/davinci3:rwm \
      --device=/dev/davinci_manager:rwm \
      --device=/dev/hisi_hdc:rwm \
      --device=/dev/devmm_svm:rwm \
      -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
      -v /usr/local/dcmi:/usr/local/dcmi:ro \
      -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
      -v /usr/local/sbin:/usr/local/sbin:ro \
      -v /Users/zhaojiacheng/repos/MindIE-LLM:/workspace/MindIE-LLM \
      -v /home/s_zhaojiacheng:/home/s_zhaojiacheng \
      swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:3.0.0b2-300I-Duo-py311-openeuler24.03-lts \
      bash

Enter the container:

    docker exec -it mindie-qwen3-8b-310p bash

2. Prepare the environment inside the container

    cd /workspace/MindIE-LLM
    scripts/qwen3_8b_310p_w8a8sc.sh prepare-env

3. Download the model from ModelScope

Recommended: download directly into a normal directory, not only into the default cache.

    mkdir -p /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s
    modelscope download \
      --model Eco-Tech/Qwen3-8B-w8a8s-310 \
      --local_dir /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s

If you already downloaded it earlier into the default cache with:

    modelscope download --model Eco-Tech/Qwen3-8B-w8a8s-310

then flatten it into a real directory first:

    mkdir -p /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s
    cp -aL \
      /home/s_zhaojiacheng/.cache/modelscope/hub/models/Eco-Tech/Qwen3-8B-w8a8s-310/. \
      /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s/

Check the files exist:

    ls /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s

4. Compress W8A8S into W8A8SC

    cd /workspace/MindIE-LLM
    scripts/qwen3_8b_310p_w8a8sc.sh compress \
      --w8a8s-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s \
      --w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc

After it finishes, check the output directory exists:

    ls /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc

5. Start the OpenAI-compatible server

    cd /workspace/MindIE-LLM
    scripts/qwen3_8b_310p_w8a8sc.sh serve \
      --w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc \
      --model-name qwen3-8b-w8a8sc \
      --port 1025

This should start mindie_llm_server and expose the OpenAI-compatible endpoint on 127.0.0.1:1025.

6. Verify the service

List models:

    curl http://127.0.0.1:1025/v1/models

Expected model id: qwen3-8b-w8a8sc

Test one inference request:

    curl http://127.0.0.1:1025/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "qwen3-8b-w8a8sc",
        "messages": [
          {"role": "user", "content": "What is deep learning?"}
        ],
        "max_tokens": 128,
        "stream": false
      }'

Short version

The shortest working sequence inside the container:

    cd /workspace/MindIE-LLM
    scripts/qwen3_8b_310p_w8a8sc.sh prepare-env
    modelscope download \
      --model Eco-Tech/Qwen3-8B-w8a8s-310 \
      --local_dir /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s
    scripts/qwen3_8b_310p_w8a8sc.sh compress \
      --w8a8s-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s \
      --w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc
    scripts/qwen3_8b_310p_w8a8sc.sh serve \
      --w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc \
      --model-name qwen3-8b-w8a8sc \
      --port 1025

Then test:

    curl http://127.0.0.1:1025/v1/models

One important detail: for this single-310P flow, do not try to serve Qwen3-8B-w8a8s-310 directly. The supported path is download W8A8S -> compress to W8A8SC -> serve W8A8SC.
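
Since the server may take some time to load weights before it answers, the verification step can be scripted as a readiness poll instead of a one-off curl. The following is a minimal sketch, assuming the /v1/models response lists the served name in an "id" field, as OpenAI-compatible servers usually do. The helper names has_model_id and wait_for_model are hypothetical and are not part of MindIE-LLM.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the verification step; not part of MindIE-LLM.

# has_model_id MODEL_ID
# Read a /v1/models JSON response on stdin and succeed if it lists MODEL_ID.
# Uses a plain-text match so jq is not required; assumes MODEL_ID contains
# no regex metacharacters.
has_model_id() {
  grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# wait_for_model BASE_URL MODEL_ID [TRIES] [DELAY_SECONDS]
# Poll BASE_URL/v1/models until MODEL_ID appears or the tries run out.
wait_for_model() {
  local base_url=$1 model_id=$2 tries=${3:-30} delay=${4:-2} i
  for ((i = 1; i <= tries; i++)); do
    if curl -fsS --max-time 5 "$base_url/v1/models" 2>/dev/null \
        | has_model_id "$model_id"; then
      echo "ready: $model_id"
      return 0
    fi
    sleep "$delay"
  done
  echo "timed out waiting for $model_id at $base_url" >&2
  return 1
}
```

After sourcing these functions, a call such as `wait_for_model http://127.0.0.1:1025 qwen3-8b-w8a8sc` would block until the model is listed or give up after roughly a minute with the default settings, at which point the chat completion request can be sent.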