CANN a2 Vector Reduction Constraints

Vec Reduction on a2 (cmax + brcb Pattern)

【Free download link】cannbot-skills: CANNBot is a family of agents for CANN development aimed at improving development efficiency; this repository provides its reusable Skills modules. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when implementing per-row reductions (max, sum) on a2 using the vec pipeline. On a2 there is no Reg/RegList, so reductions use UB-to-UB `cmax`/`cadd` + `brcb`.

Goal

Get per-row max (or sum) correct on a2, including the broadcast step that is easy to forget.

1. The cmax output format

`cmax(dst, src)` reduces one repeat (64 float elements = 8 blocks of 8) to a single scalar. The scalar is stored at `dst[rep * dst_rep_stride]`: one float element per repeat.

With the default `dst_rep_stride=1`, the scalars are packed densely:

```
dst[0]  = max of row 0
dst[1]  = max of row 1
...
dst[63] = max of row 63
```

This is not a C0 block layout. The 8-element block structure that `sub`/`vmax` expect is not satisfied.

2. The bug: using cmax output directly in sub

If you pass the cmax output to `sub` with `blk_stride=0`:

  • `sub` reads a C0 block (8 elements) and broadcasts it across all 8 blocks of each repeat
  • But the 8 elements in that block are maxes of 8 different rows, not 8 copies of one row's max
  • Result: each row gets subtracted by the wrong max → `exp` produces huge or wrong values

Symptom: output values > 1.0 from `exp(score - max)` where max should be the row max.
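The mismatch can be demonstrated with a simplified pure-Python model (plain lists, not the DSL). The model assumes the narrow source keeps re-reading the first packed C0 block, which is the failure mode described above; data values and variable names are illustrative only.

```python
import math

# Deterministic data: row r's values sit near r, so row maxes grow with r.
rows, cols = 64, 64
data = [[r + 0.01 * c for c in range(cols)] for r in range(rows)]

# Dense cmax output: one scalar per row, packed contiguously.
row_max = [max(row) for row in data]

# Buggy model: blk_stride=0 keeps re-reading the first packed C0 block,
# so every element is offset by one of row_max[0:8] (maxes of rows 0..7).
bad = [data[r][c] - row_max[c % 8] for r in range(rows) for c in range(cols)]

# Correct model: every element of row r is offset by row r's own max.
good = [data[r][c] - row_max[r] for r in range(rows) for c in range(cols)]

assert max(math.exp(v) for v in good) <= 1.0   # proper row max: exp capped at 1
assert max(math.exp(v) for v in bad) > 1.0     # wrong max: exp blows past 1
```

With the wrong offsets, later rows subtract maxes that belong to rows 0..7, so `exp` receives large positive arguments, exactly the ">1.0" symptom above.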

3. The fix: brcb broadcast between cmax and sub

After cmax, use `brcb` to expand each scalar to fill a full C0 block:

```python
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)  # cmax scalars
ub_max = Tensor(DT.float, [HALF_M, 8], Position.UB)    # broadcast result
cmax(ub_max_s, ub_tmp)
brcb(ub_max, ub_max_s, dst_blk_stride=1, dst_rep_stride=8)
```

How brcb works:

  • `repeat = infer_repeat_brcb(src) = HALF_M * 1 // 8 = 8`
  • For each repeat: reads 8 scalars from `src[rep*8 : rep*8+8]`
  • For each of 8 blocks: fills `dst[block_begin : block_begin + C0]` with one scalar
  • With `dst_blk_stride=1, dst_rep_stride=8`: blocks are contiguous, repeats advance by 8 blocks

Result: `ub_max[n*8 : n*8+8]` all contain `max_of_row_n` for n in 0..63.
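As a mental model (plain Python, not the DSL), `brcb` with these parameters behaves like the sketch below; the function name is hypothetical.

```python
def brcb_model(scalars, c0=8):
    """Expand each scalar into a C0 block of identical elements.

    Models brcb(dst, src, dst_blk_stride=1, dst_rep_stride=8):
    each repeat consumes 8 scalars and writes 8 contiguous blocks.
    """
    out = []
    for s in scalars:
        out.extend([s] * c0)   # one full C0 block per scalar
    return out

row_max = [float(n) for n in range(64)]   # stand-in for dense cmax output
ub_max = brcb_model(row_max)

# ub_max[n*8 : n*8+8] all contain max_of_row_n
assert all(ub_max[n * 8 + k] == row_max[n] for n in range(64) for k in range(8))
```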

3a. Dense row `[1, 64]` -> broadcast `[64, 8]` also needs explicit `brcb` params

When the scalar statistics arrive as one dense row such as:

  • `qkmaxbuf = Tensor(DT.float, [1, 64], Position.UB)`
  • `qksumbuf = Tensor(DT.float, [1, 64], Position.UB)`

and the destination is the usual broadcast format:

  • `qkmaxbrcb = Tensor(DT.float, [64, 8], Position.UB)`

do not rely on default `brcb(...)` parameter inference.

Validated pattern:

```python
qkmaxbuf <<= qkmax[bh:bh + 1, row0:row0 + 64]
brcb(qkmaxbrcb, qkmaxbuf, repeat=64 // 8, dst_blk_stride=1, dst_rep_stride=8)
```

Why this matters:

  • the source load into `[1, 64]` is fine
  • the failure comes from the broadcast configuration, not from the GM -> UB read itself
  • with the validated explicit parameters, row r is expanded to `qkmaxbrcb[r, 0:8]`

Concrete reproducer:

  • tmp/validate_row64_brcb.py

Practical rule:

  • for row-stat broadcasts on a2, treat `brcb(..., dst_blk_stride=1, dst_rep_stride=8)` as mandatory
  • when the source is `[1, 64]`, also pin `repeat=64 // 8` explicitly in validated kernels instead of trusting defaults

4. Complete row-max pattern for [HALF_M, 128] float data

```python
HALF_M = 64
HALF_N = 64
ub_data = Tensor(DT.float, [HALF_M, 128], Position.UB)
ub_tmp = Tensor(DT.float, [HALF_M, HALF_N], Position.UB)
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
ub_max = Tensor(DT.float, [HALF_M, 8], Position.UB)

# Step 1: element-wise max of two 64-col halves -> 64 values per row
vmax(ub_tmp, ub_data[0:HALF_M, 0:HALF_N], ub_data[0:HALF_M, HALF_N:128])
# Step 2: reduce 64 -> 1 scalar per row
cmax(ub_max_s, ub_tmp)
# Step 3: broadcast each scalar to fill a C0 block (8 identical elements)
brcb(ub_max, ub_max_s, dst_blk_stride=1, dst_rep_stride=8)
# Step 4: subtract (sliced to align repeat with narrow max buf)
sub(ub_data[0:HALF_M, 0:HALF_N], ub_data[0:HALF_M, 0:HALF_N], ub_max)
sub(ub_data[0:HALF_M, HALF_N:128], ub_data[0:HALF_M, HALF_N:128], ub_max)
```

Why each step is needed:

  • vmax: 128 columns exceed one repeat (64 elements). Must merge to 64 first.
  • cmax: reduces 64 → 1 scalar per row. Output is dense, not block-aligned.
  • brcb: fills C0 blocks so that `sub` with `blk_stride=0` broadcasts correctly.
  • sub with slicing: see `agent/references/constraints/vec-stride.md` for why.
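The four steps can be emulated in plain Python (lists, not the DSL) and checked against a direct max over all 128 columns; data values here are illustrative only.

```python
HALF_M, HALF_N = 64, 64
data = [[(r * 131 + c * 17) % 97 - 48.0 for c in range(128)] for r in range(HALF_M)]

# Step 1: vmax merges the two 64-column halves element-wise.
tmp = [[max(row[c], row[c + HALF_N]) for c in range(HALF_N)] for row in data]
# Step 2: cmax reduces 64 -> 1 scalar per row (dense layout).
max_s = [max(row) for row in tmp]
# Step 3: brcb fills one C0 block (8 copies) per row.
max_b = [[m] * 8 for m in max_s]
# Step 4: sub broadcasts the 8-element block across each 64-column half.
out = [[data[r][c] - max_b[r][c % 8] for c in range(128)] for r in range(HALF_M)]

assert all(max_s[r] == max(data[r]) for r in range(HALF_M))   # steps 1+2 = row max
assert all(max(out[r]) == 0.0 for r in range(HALF_M))         # each row peaks at 0
```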

5. Why `[M, 8]` broadcast format fails for binary ops between two narrow buffers

After `brcb`, the result tensor has shape `[HALF_M, 8]` with `span[1]=8=C0`. Stride inference for `[64, 8]` float gives: `blk_stride=0, rep_stride=1, repeat=8`.

With `blk_stride=0`, all 8 blocks within one repeat address the same 8 elements. So each repeat touches 8 unique elements, and 8 repeats touch 8×8 = 64 elements. But the buffer contains 64×8 = 512 elements. The remaining 448 are never reached.

This means `vmax(buf_a[64,8], buf_a[64,8], buf_b[64,8])` only computes the max for the first 8 rows. Rows 8–63 are left unchanged.

Root cause: `blk_stride=0` is the broadcast stride designed for `sub(wide, wide, narrow)`, where the wide destination's repeat cadence drives iteration and the narrow source stays per-row. It was never intended for element-wise operations between two identically shaped narrow buffers.

Diagnostic method: before choosing a tensor format for any vec binary operation, manually trace:

  1. `infer_repeat(dst) = span[0] * span[1] / (256 // dtype.size)`
  2. `infer_strides(tensor)`: check whether `blk_stride` is 0 or 1
  3. total unique elements = `repeat × (8 if blk_stride==1 else 1) × elements_per_block`
  4. compare against the actual element count (`shape[0] * shape[1]`)

If the totals disagree, the operation will silently skip elements.
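A hypothetical helper sketching this trace (the inference rules are transcribed from this note, not from the real `vecutils.py`):

```python
def trace_coverage(shape, blk_stride, dtype_size=4):
    """Return (unique elements touched, total elements in the buffer)."""
    elems_per_block = 32 // dtype_size                  # one C0 block = 32 bytes
    repeat = shape[0] * shape[1] // (256 // dtype_size) # step 1: infer_repeat
    blocks = 8 if blk_stride == 1 else 1                # step 3: blk_stride gate
    unique = max(repeat, 1) * blocks * elems_per_block
    total = shape[0] * shape[1]                         # step 4: actual count
    return unique, total

# [64, 8] float with broadcast strides: only 64 of 512 elements are reached.
assert trace_coverage([64, 8], blk_stride=0) == (64, 512)
# [64, 1] float with default strides: all 64 elements are reached.
assert trace_coverage([64, 1], blk_stride=1) == (64, 64)
```

When the two numbers disagree, as for `[64, 8]` above, the op silently skips elements.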

Reference implementation: `easyasc/stub_functions/vec/vecutils.py` (`infer_strides`, `infer_repeat`).

6. Using `[M, 1]` scalar format for binary ops between reduction outputs

The `cmax` output `[HALF_M, 1]` has `span[1]=1`. Stride inference for `[64, 1]` float: `span[1]=1` matches neither 64 nor 8, so defaults apply: `blk_stride=1, rep_stride=8, repeat=1`.

With `blk_stride=1` and 8 blocks per repeat:

  • Block 0: elements `[0:8]`
  • Block 1: elements `[8:16]`
  • ...
  • Block 7: elements `[56:64]`
  • Total: 1 repeat × 8 blocks × 8 elements = 64 elements = all rows

So `vmax(dst[64,1], src1[64,1], src2[64,1])` correctly computes per-row element-wise max over all 64 dense scalars from `cmax` output. No rows are skipped.

Key insight: operate on the dense scalar `[M, 1]` format BEFORE the `brcb` broadcast. Only `brcb` to `[M, 8]` after the scalar-level operation is complete.

Validated pattern for running max across tiles:

```python
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)   # per-tile cmax output
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)  # running max (persistent)
ub_max = Tensor(DT.float, [HALF_M, 8], Position.UB)     # broadcast for sub

# before inner loop: initialize running max
dup(ub_rmax_s, neg_large)

# inside each tile:
cmax(ub_max_s, ub_tmp)                # per-tile row max
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # update in [M,1] format
brcb(ub_max, ub_rmax_s, dst_blk_stride=1, dst_rep_stride=8)  # broadcast AFTER update
sub(ub_data[0:M, 0:64], ub_data[0:M, 0:64], ub_max)
sub(ub_data[0:M, 64:128], ub_data[0:M, 64:128], ub_max)
```

Here `neg_large` is a sufficiently large finite negative sentinel, not literal `float("-inf")`.

UB overhead for running max: one extra `[64, 1]` float tensor = 0.25 KB.
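The algebra of the running-max update can be checked in plain Python (a model of the math, not the DSL); tile shapes and data are illustrative.

```python
M, TILE, NTILES = 64, 64, 4
neg_large = -3.0e38   # finite sentinel instead of float("-inf")

tiles = [[[(r * 7 + t * 13 + c) % 101 - 50.0 for c in range(TILE)]
          for r in range(M)] for t in range(NTILES)]

rmax = [neg_large] * M                                   # dup(ub_rmax_s, neg_large)
for tile in tiles:
    tile_max = [max(row) for row in tile]                # cmax per tile
    rmax = [max(a, b) for a, b in zip(rmax, tile_max)]   # vmax in [M,1] format

# Reference: max over all tiles' columns per row.
full = [max(max(tiles[t][r]) for t in range(NTILES)) for r in range(M)]
assert rmax == full
```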

6a. Copying `[M,1]` scalar state across iterations

The validated running-max pattern often needs a snapshot of the previous scalar state before updating it, for example to compute `exp(prev_m - curr_m)` in streamed attention.

Do not snapshot `[M,1]` buffers with `ub_to_ub`.

Why this fails:

  • `ub_to_ub` works in C0-sized blocks
  • for float `[64,1]`, that means an 8-element block copy per row
  • the operation does not mean "copy one scalar per row"

Stable fix:

  • allocate a zero buffer in the same `[M,1]` format
  • use a vec binary op such as `add(dst, src, zero)` to make the copy

Example:

```python
ub_prev_s = DBuff(DT.float, [HALF_M, 1], Position.UB)
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
ub_zero_s = Tensor(DT.float, [HALF_M, 1], Position.UB)

dup(ub_zero_s, 0.0)
add(ub_prev_s[slot], ub_rmax_s, ub_zero_s)  # safe scalar-format copy
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)
sub(ub_prev_s[slot], ub_prev_s[slot], ub_rmax_s)
exp(ub_prev_s[slot], ub_prev_s[slot])
```

Study:

  • agent/example/kernels/a2/flash_attn_unnorm.py
  • agent/references/patterns/a2-cube-vec-cube-vec.md

7. Adapting for row sum (cadd)

Same pattern, replace `vmax` → `add` and `cmax` → `cadd`:

```python
add(ub_tmp, ub_data[0:M, 0:64], ub_data[0:M, 64:128])
cadd(ub_sum_s, ub_tmp)
brcb(ub_sum, ub_sum_s, dst_blk_stride=1, dst_rep_stride=8)
div(ub_data[0:M, 0:64], ub_data[0:M, 0:64], ub_sum)
div(ub_data[0:M, 64:128], ub_data[0:M, 64:128], ub_sum)
```
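A plain-Python model of the same sum pipeline (lists, not the DSL) confirms each row normalizes to 1; the data is illustrative.

```python
M = 64
data = [[((r * 3 + c * 7) % 23) + 1.0 for c in range(128)] for r in range(M)]

tmp = [[row[c] + row[c + 64] for c in range(64)] for row in data]        # add
sum_s = [sum(row) for row in tmp]                                        # cadd
sum_b = [[s] * 8 for s in sum_s]                                         # brcb
out = [[data[r][c] / sum_b[r][c % 8] for c in range(128)] for r in range(M)]

# add + cadd together sum all 128 columns, so each output row sums to 1.
assert all(abs(sum(out[r]) - 1.0) < 1e-9 for r in range(M))
```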

For streamed normalized attention on a2, the stable update order is:

  1. compute `expdiff = exp(prev_max - curr_max)` in `[M,1]`
  2. compute the float probability tile `p = exp(score - curr_max)`
  3. reduce `sum_j` from that float tile with `add` + `cadd`
  4. update `row_sum = row_sum * expdiff + sum_j` in `[M,1]`
  5. cast `p` to half only after the sum update if the downstream cube stage needs `p.half().float()`
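The update order can be verified numerically in plain Python (a model of the math, not the DSL): the streamed running sum with `expdiff` rescaling equals the full softmax denominator per row. Shapes and scores below are illustrative.

```python
import math

M, TILE, NTILES = 8, 16, 4
scores = [[[(r * 5 + t * 11 + c * 3) % 37 - 18.0 for c in range(TILE)]
           for r in range(M)] for t in range(NTILES)]

neg_large = -3.0e38
row_max = [neg_large] * M
row_sum = [0.0] * M
for t in range(NTILES):
    prev = row_max[:]                                          # snapshot prev_max
    tile_max = [max(scores[t][r]) for r in range(M)]
    row_max = [max(a, b) for a, b in zip(row_max, tile_max)]
    expdiff = [math.exp(p - m) for p, m in zip(prev, row_max)]         # step 1
    p_tile = [[math.exp(s - row_max[r]) for s in scores[t][r]]         # step 2
              for r in range(M)]
    sum_j = [sum(p_tile[r]) for r in range(M)]                         # step 3
    row_sum = [rs * d + sj                                             # step 4
               for rs, d, sj in zip(row_sum, expdiff, sum_j)]

# Reference: one-pass softmax denominator over all tiles per row.
gmax = [max(scores[t][r][c] for t in range(NTILES) for c in range(TILE))
        for r in range(M)]
full = [sum(math.exp(scores[t][r][c] - gmax[r])
            for t in range(NTILES) for c in range(TILE)) for r in range(M)]
assert all(abs(a - b) <= 1e-9 * b for a, b in zip(row_sum, full))
```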

8. UB cost

| Buffer | Shape | Bytes (float) |
| --- | --- | --- |
| ub_tmp | [64, 64] | 16 KB |
| ub_max_s | [64, 1] | 0.25 KB |
| ub_max | [64, 8] | 2 KB |
| Total reduction overhead | | ~18.25 KB |
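The byte counts follow directly from element count × 4 bytes per float; the helper name is illustrative.

```python
def ub_bytes(shape, dtype_size=4):
    """Bytes occupied by a dense UB tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_size

assert ub_bytes([64, 64]) == 16 * 1024   # ub_tmp: 16 KB
assert ub_bytes([64, 1]) == 256          # ub_max_s: 0.25 KB
assert ub_bytes([64, 8]) == 2 * 1024     # ub_max: 2 KB
# Total: 16 + 0.25 + 2 = 18.25 KB
```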

Files to study

  • `agent/example/kernels/a2/flash_attn_score.py`: per-tile independent row max
  • `agent/example/kernels/a2/flash_attn_score_iter.py`: running max across tiles using `[M,1]` scalar `vmax`
  • `agent/example/kernels/a2/flash_attn_unnorm.py`: delayed `expdiff` computed from copied `[M,1]` running state
  • `agent/example/kernels/a2/flash_attn_full.py`: running sum + final sliced `div` on top of the delayed numerator pipeline
  • `easyasc/simulator_v2/ops/vec/v.py` and `easyasc/simulator_v2/ops/vec/_legacy_vpipe.py`: current vec runtime path for `cmax`, `brcb`, and `dup`
  • `easyasc/stub_functions/vec/group.py`: cmax stub with dst_rep_stride default
  • `easyasc/stub_functions/vec/dupbrcb.py`: dup and brcb stubs
  • `easyasc/stub_functions/vec/vecutils.py`: `infer_strides` and `infer_repeat` logic

Creation statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
