CANN a2 Vector Reduction Constraints

Vec Reduction on a2 (cmax + brcb Pattern)

【Free download link】cannbot-skills: CANNBot is a family of agents for CANN development aimed at improving development efficiency; this repository provides its reusable Skills modules. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when implementing per-row reductions (max, sum) on a2 using the vec pipeline. On a2 there is no Reg/RegList, so reductions use UB-to-UB `cmax`/`cadd` + `brcb`.

Goal

Get per-row max (or sum) correct on a2, including the broadcast step that is easy to forget.

1. The cmax output format

`cmax(dst, src)` reduces one repeat (64 float elements = 8 blocks of 8) to a single scalar. The scalar is stored at `dst[rep * dst_rep_stride]`: one float element per repeat.

With the default `dst_rep_stride=1`, the scalars are packed densely:

```
dst[0]  = max of row 0
dst[1]  = max of row 1
...
dst[63] = max of row 63
```

This is not a C0 block layout. The 8-element block structure that `sub`/`vmax` expect is not satisfied.

2. The bug: using cmax output directly in sub

If you pass the cmax output to `sub` with `blk_stride=0`:

  • `sub` reads a C0 block (8 elements) and broadcasts it across all 8 blocks of each repeat
  • But the 8 elements in that block are maxes of 8 different rows, not 8 copies of one row's max
  • Result: each row gets subtracted by the wrong max → `exp` produces huge or wrong values

Symptom: output values > 1.0 from `exp(score - max)` where max should be the row max.
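The mismatch can be demonstrated with a simplified pure-Python model (plain lists, not the DSL). The model assumes the narrow source keeps re-reading the first packed C0 block, which is the failure mode described above; data values and variable names are illustrative only.

```python
import math

# Deterministic data: row r's values sit near r, so row maxes grow with r.
rows, cols = 64, 64
data = [[r + 0.01 * c for c in range(cols)] for r in range(rows)]

# Dense cmax output: one scalar per row, packed contiguously.
row_max = [max(row) for row in data]

# Buggy model: blk_stride=0 keeps re-reading the first packed C0 block,
# so every element is offset by one of row_max[0:8] (maxes of rows 0..7).
bad = [data[r][c] - row_max[c % 8] for r in range(rows) for c in range(cols)]

# Correct model: every element of row r is offset by row r's own max.
good = [data[r][c] - row_max[r] for r in range(rows) for c in range(cols)]

assert max(math.exp(v) for v in good) <= 1.0   # proper row max: exp capped at 1
assert max(math.exp(v) for v in bad) > 1.0     # wrong max: exp blows past 1
```

With the wrong offsets, later rows subtract maxes that belong to rows 0..7, so `exp` receives large positive arguments, exactly the ">1.0" symptom above.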

3. The fix: brcb broadcast between cmax and sub

After cmax, use `brcb` to expand each scalar to fill a full C0 block:

```python
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)  # cmax scalars
ub_max = Tensor(DT.float, [HALF_M, 8], Position.UB)    # broadcast result
cmax(ub_max_s, ub_tmp)
brcb(ub_max, ub_max_s, dst_blk_stride=1, dst_rep_stride=8)
```

How brcb works:

  • `repeat = infer_repeat_brcb(src) = HALF_M * 1 // 8 = 8`
  • For each repeat: reads 8 scalars from `src[rep*8 : rep*8+8]`
  • For each of 8 blocks: fills `dst[block_begin : block_begin + C0]` with one scalar
  • With `dst_blk_stride=1, dst_rep_stride=8`: blocks are contiguous, repeats advance by 8 blocks

Result: `ub_max[n*8 : n*8+8]` all contain `max_of_row_n` for n in 0..63.
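As a mental model (plain Python, not the DSL), `brcb` with these parameters behaves like the sketch below; the function name is hypothetical.

```python
def brcb_model(scalars, c0=8):
    """Expand each scalar into a C0 block of identical elements.

    Models brcb(dst, src, dst_blk_stride=1, dst_rep_stride=8):
    each repeat consumes 8 scalars and writes 8 contiguous blocks.
    """
    out = []
    for s in scalars:
        out.extend([s] * c0)   # one full C0 block per scalar
    return out

row_max = [float(n) for n in range(64)]   # stand-in for dense cmax output
ub_max = brcb_model(row_max)

# ub_max[n*8 : n*8+8] all contain max_of_row_n
assert all(ub_max[n * 8 + k] == row_max[n] for n in range(64) for k in range(8))
```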

3a. Dense row `[1, 64]` -> broadcast `[64, 8]` also needs explicit `brcb` params

When the scalar statistics arrive as one dense row such as:

  • `qkmaxbuf = Tensor(DT.float, [1, 64], Position.UB)`
  • `qksumbuf = Tensor(DT.float, [1, 64], Position.UB)`

and the destination is the usual broadcast format:

  • `qkmaxbrcb = Tensor(DT.float, [64, 8], Position.UB)`

do not rely on default `brcb(...)` parameter inference.

Validated pattern:

```python
qkmaxbuf <<= qkmax[bh:bh + 1, row0:row0 + 64]
brcb(qkmaxbrcb, qkmaxbuf, repeat=64 // 8, dst_blk_stride=1, dst_rep_stride=8)
```

Why this matters:

  • the source load into `[1, 64]` is fine
  • the failure comes from the broadcast configuration, not from the GM -> UB read itself
  • with the validated explicit parameters, row r is expanded to `qkmaxbrcb[r, 0:8]`

Concrete reproducer:

  • tmp/validate_row64_brcb.py

Practical rule:

  • for row-stat broadcasts on a2, treat `brcb(..., dst_blk_stride=1, dst_rep_stride=8)` as mandatory
  • when the source is `[1, 64]`, also pin `repeat=64 // 8` explicitly in validated kernels instead of trusting defaults

4. Complete row-max pattern for [HALF_M, 128] float data

```python
HALF_M = 64
HALF_N = 64
ub_data = Tensor(DT.float, [HALF_M, 128], Position.UB)
ub_tmp = Tensor(DT.float, [HALF_M, HALF_N], Position.UB)
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
ub_max = Tensor(DT.float, [HALF_M, 8], Position.UB)

# Step 1: element-wise max of two 64-col halves -> 64 values per row
vmax(ub_tmp, ub_data[0:HALF_M, 0:HALF_N], ub_data[0:HALF_M, HALF_N:128])
# Step 2: reduce 64 -> 1 scalar per row
cmax(ub_max_s, ub_tmp)
# Step 3: broadcast each scalar to fill a C0 block (8 identical elements)
brcb(ub_max, ub_max_s, dst_blk_stride=1, dst_rep_stride=8)
# Step 4: subtract (sliced to align repeat with narrow max buf)
sub(ub_data[0:HALF_M, 0:HALF_N], ub_data[0:HALF_M, 0:HALF_N], ub_max)
sub(ub_data[0:HALF_M, HALF_N:128], ub_data[0:HALF_M, HALF_N:128], ub_max)
```

Why each step is needed:

  • vmax: 128 columns exceed one repeat (64 elements). Must merge to 64 first.
  • cmax: reduces 64 → 1 scalar per row. Output is dense, not block-aligned.
  • brcb: fills C0 blocks so that `sub` with `blk_stride=0` broadcasts correctly.
  • sub with slicing: see `agent/references/constraints/vec-stride.md` for why.
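The four steps can be emulated in plain Python (lists, not the DSL) and checked against a direct max over all 128 columns; data values here are illustrative only.

```python
HALF_M, HALF_N = 64, 64
data = [[(r * 131 + c * 17) % 97 - 48.0 for c in range(128)] for r in range(HALF_M)]

# Step 1: vmax merges the two 64-column halves element-wise.
tmp = [[max(row[c], row[c + HALF_N]) for c in range(HALF_N)] for row in data]
# Step 2: cmax reduces 64 -> 1 scalar per row (dense layout).
max_s = [max(row) for row in tmp]
# Step 3: brcb fills one C0 block (8 copies) per row.
max_b = [[m] * 8 for m in max_s]
# Step 4: sub broadcasts the 8-element block across each 64-column half.
out = [[data[r][c] - max_b[r][c % 8] for c in range(128)] for r in range(HALF_M)]

assert all(max_s[r] == max(data[r]) for r in range(HALF_M))   # steps 1+2 = row max
assert all(max(out[r]) == 0.0 for r in range(HALF_M))         # each row peaks at 0
```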

5. Why `[M, 8]` broadcast format fails for binary ops between two narrow buffers

After `brcb`, the result tensor has shape `[HALF_M, 8]` with `span[1]=8=C0`. Stride inference for `[64, 8]` float gives: `blk_stride=0, rep_stride=1, repeat=8`.

With `blk_stride=0`, all 8 blocks within one repeat address the same 8 elements. So each repeat touches 8 unique elements, and 8 repeats touch 8×8 = 64 elements. But the buffer contains 64×8 = 512 elements. The remaining 448 are never reached.

This means `vmax(buf_a[64,8], buf_a[64,8], buf_b[64,8])` only computes the max for the first 8 rows. Rows 8–63 are left unchanged.

Root cause: `blk_stride=0` is the broadcast stride designed for `sub(wide, wide, narrow)`, where the wide destination's repeat cadence drives iteration and the narrow source stays per-row. It was never intended for element-wise operations between two identically shaped narrow buffers.

Diagnostic method: before choosing a tensor format for any vec binary operation, manually trace:

  1. `infer_repeat(dst) = span[0] * span[1] / (256 // dtype.size)`
  2. `infer_strides(tensor)`: check whether `blk_stride` is 0 or 1
  3. total unique elements = `repeat × (8 if blk_stride==1 else 1) × elements_per_block`
  4. compare against the actual element count (`shape[0] * shape[1]`)

If the totals disagree, the operation will silently skip elements.
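A hypothetical helper sketching this trace (the inference rules are transcribed from this note, not from the real `vecutils.py`):

```python
def trace_coverage(shape, blk_stride, dtype_size=4):
    """Return (unique elements touched, total elements in the buffer)."""
    elems_per_block = 32 // dtype_size                  # one C0 block = 32 bytes
    repeat = shape[0] * shape[1] // (256 // dtype_size) # step 1: infer_repeat
    blocks = 8 if blk_stride == 1 else 1                # step 3: blk_stride gate
    unique = max(repeat, 1) * blocks * elems_per_block
    total = shape[0] * shape[1]                         # step 4: actual count
    return unique, total

# [64, 8] float with broadcast strides: only 64 of 512 elements are reached.
assert trace_coverage([64, 8], blk_stride=0) == (64, 512)
# [64, 1] float with default strides: all 64 elements are reached.
assert trace_coverage([64, 1], blk_stride=1) == (64, 64)
```

When the two numbers disagree, as for `[64, 8]` above, the op silently skips elements.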

Reference implementation: `easyasc/stub_functions/vec/vecutils.py` (`infer_strides`, `infer_repeat`).

6. Using `[M, 1]` scalar format for binary ops between reduction outputs

The `cmax` output `[HALF_M, 1]` has `span[1]=1`. Stride inference for `[64, 1]` float: `span[1]=1` matches neither 64 nor 8, so defaults apply: `blk_stride=1, rep_stride=8, repeat=1`.

With `blk_stride=1` and 8 blocks per repeat:

  • Block 0: elements `[0:8]`
  • Block 1: elements `[8:16]`
  • ...
  • Block 7: elements `[56:64]`
  • Total: 1 repeat × 8 blocks × 8 elements = 64 elements = all rows

So `vmax(dst[64,1], src1[64,1], src2[64,1])` correctly computes per-row element-wise max over all 64 dense scalars from `cmax` output. No rows are skipped.

Key insight: operate on the dense scalar `[M, 1]` format BEFORE the `brcb` broadcast. Only `brcb` to `[M, 8]` after the scalar-level operation is complete.

Validated pattern for running max across tiles:

```python
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)   # per-tile cmax output
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)  # running max (persistent)
ub_max = Tensor(DT.float, [HALF_M, 8], Position.UB)     # broadcast for sub

# before inner loop: initialize running max
dup(ub_rmax_s, neg_large)

# inside each tile:
cmax(ub_max_s, ub_tmp)                # per-tile row max
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # update in [M,1] format
brcb(ub_max, ub_rmax_s, dst_blk_stride=1, dst_rep_stride=8)  # broadcast AFTER update
sub(ub_data[0:M, 0:64], ub_data[0:M, 0:64], ub_max)
sub(ub_data[0:M, 64:128], ub_data[0:M, 64:128], ub_max)
```

Here `neg_large` is a sufficiently large finite negative sentinel, not literal `float("-inf")`.

UB overhead for running max: one extra `[64, 1]` float tensor = 0.25 KB.
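The algebra of the running-max update can be checked in plain Python (a model of the math, not the DSL); tile shapes and data are illustrative.

```python
M, TILE, NTILES = 64, 64, 4
neg_large = -3.0e38   # finite sentinel instead of float("-inf")

tiles = [[[(r * 7 + t * 13 + c) % 101 - 50.0 for c in range(TILE)]
          for r in range(M)] for t in range(NTILES)]

rmax = [neg_large] * M                                   # dup(ub_rmax_s, neg_large)
for tile in tiles:
    tile_max = [max(row) for row in tile]                # cmax per tile
    rmax = [max(a, b) for a, b in zip(rmax, tile_max)]   # vmax in [M,1] format

# Reference: max over all tiles' columns per row.
full = [max(max(tiles[t][r]) for t in range(NTILES)) for r in range(M)]
assert rmax == full
```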

6a. Copying `[M,1]` scalar state across iterations

The validated running-max pattern often needs a snapshot of the previous scalar state before updating it, for example to compute `exp(prev_m - curr_m)` in streamed attention.

Do not snapshot `[M,1]` buffers with `ub_to_ub`.

Why this fails:

  • `ub_to_ub` works in C0-sized blocks
  • for float `[64,1]`, that means an 8-element block copy per row
  • the operation does not mean "copy one scalar per row"

Stable fix:

  • allocate a zero buffer in the same `[M,1]` format
  • use a vec binary op such as `add(dst, src, zero)` to make the copy

Example:

```python
ub_prev_s = DBuff(DT.float, [HALF_M, 1], Position.UB)
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
ub_zero_s = Tensor(DT.float, [HALF_M, 1], Position.UB)

dup(ub_zero_s, 0.0)
add(ub_prev_s[slot], ub_rmax_s, ub_zero_s)  # safe scalar-format copy
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)
sub(ub_prev_s[slot], ub_prev_s[slot], ub_rmax_s)
exp(ub_prev_s[slot], ub_prev_s[slot])
```

Study:

  • agent/example/kernels/a2/flash_attn_unnorm.py
  • agent/references/patterns/a2-cube-vec-cube-vec.md

7. Adapting for row sum (cadd)

Same pattern, replace `vmax` → `add` and `cmax` → `cadd`:

```python
add(ub_tmp, ub_data[0:M, 0:64], ub_data[0:M, 64:128])
cadd(ub_sum_s, ub_tmp)
brcb(ub_sum, ub_sum_s, dst_blk_stride=1, dst_rep_stride=8)
div(ub_data[0:M, 0:64], ub_data[0:M, 0:64], ub_sum)
div(ub_data[0:M, 64:128], ub_data[0:M, 64:128], ub_sum)
```
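A plain-Python model of the same sum pipeline (lists, not the DSL) confirms each row normalizes to 1; the data is illustrative.

```python
M = 64
data = [[((r * 3 + c * 7) % 23) + 1.0 for c in range(128)] for r in range(M)]

tmp = [[row[c] + row[c + 64] for c in range(64)] for row in data]        # add
sum_s = [sum(row) for row in tmp]                                        # cadd
sum_b = [[s] * 8 for s in sum_s]                                         # brcb
out = [[data[r][c] / sum_b[r][c % 8] for c in range(128)] for r in range(M)]

# add + cadd together sum all 128 columns, so each output row sums to 1.
assert all(abs(sum(out[r]) - 1.0) < 1e-9 for r in range(M))
```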

For streamed normalized attention on a2, the stable update order is:

  1. compute `expdiff = exp(prev_max - curr_max)` in `[M,1]`
  2. compute the float probability tile `p = exp(score - curr_max)`
  3. reduce `sum_j` from that float tile with `add` + `cadd`
  4. update `row_sum = row_sum * expdiff + sum_j` in `[M,1]`
  5. cast `p` to half only after the sum update if the downstream cube stage needs `p.half().float()`
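The update order can be verified numerically in plain Python (a model of the math, not the DSL): the streamed running sum with `expdiff` rescaling equals the full softmax denominator per row. Shapes and scores below are illustrative.

```python
import math

M, TILE, NTILES = 8, 16, 4
scores = [[[(r * 5 + t * 11 + c * 3) % 37 - 18.0 for c in range(TILE)]
           for r in range(M)] for t in range(NTILES)]

neg_large = -3.0e38
row_max = [neg_large] * M
row_sum = [0.0] * M
for t in range(NTILES):
    prev = row_max[:]                                          # snapshot prev_max
    tile_max = [max(scores[t][r]) for r in range(M)]
    row_max = [max(a, b) for a, b in zip(row_max, tile_max)]
    expdiff = [math.exp(p - m) for p, m in zip(prev, row_max)]         # step 1
    p_tile = [[math.exp(s - row_max[r]) for s in scores[t][r]]         # step 2
              for r in range(M)]
    sum_j = [sum(p_tile[r]) for r in range(M)]                         # step 3
    row_sum = [rs * d + sj                                             # step 4
               for rs, d, sj in zip(row_sum, expdiff, sum_j)]

# Reference: one-pass softmax denominator over all tiles per row.
gmax = [max(scores[t][r][c] for t in range(NTILES) for c in range(TILE))
        for r in range(M)]
full = [sum(math.exp(scores[t][r][c] - gmax[r])
            for t in range(NTILES) for c in range(TILE)) for r in range(M)]
assert all(abs(a - b) <= 1e-9 * b for a, b in zip(row_sum, full))
```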

8. UB cost

| Buffer | Shape | Bytes (float) |
| --- | --- | --- |
| ub_tmp | [64, 64] | 16 KB |
| ub_max_s | [64, 1] | 0.25 KB |
| ub_max | [64, 8] | 2 KB |
| Total reduction overhead | | ~18.25 KB |
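The byte counts follow directly from element count × 4 bytes per float; the helper name is illustrative.

```python
def ub_bytes(shape, dtype_size=4):
    """Bytes occupied by a dense UB tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_size

assert ub_bytes([64, 64]) == 16 * 1024   # ub_tmp: 16 KB
assert ub_bytes([64, 1]) == 256          # ub_max_s: 0.25 KB
assert ub_bytes([64, 8]) == 2 * 1024     # ub_max: 2 KB
# Total: 16 + 0.25 + 2 = 18.25 KB
```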

Files to study

  • `agent/example/kernels/a2/flash_attn_score.py`: per-tile independent row max
  • `agent/example/kernels/a2/flash_attn_score_iter.py`: running max across tiles using `[M,1]` scalar `vmax`
  • `agent/example/kernels/a2/flash_attn_unnorm.py`: delayed `expdiff` computed from copied `[M,1]` running state
  • `agent/example/kernels/a2/flash_attn_full.py`: running sum + final sliced `div` on top of the delayed numerator pipeline
  • `easyasc/simulator_v2/ops/vec/v.py` and `easyasc/simulator_v2/ops/vec/_legacy_vpipe.py`: current vec runtime path for `cmax`, `brcb`, and `dup`
  • `easyasc/stub_functions/vec/group.py`: cmax stub with dst_rep_stride default
  • `easyasc/stub_functions/vec/dupbrcb.py`: dup and brcb stubs
  • `easyasc/stub_functions/vec/vecutils.py`: `infer_strides` and `infer_repeat` logic

Creation statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
