
Vector Mask Constraints

cannbot-skills: CANNBot is a family of agents for improving CANN development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when implementing or debugging A2 vec mask behavior in the simulator.

Goal

Keep mask semantics explicit so that:

every vec op either clearly consumes the current mask or clearly ignores it
`set_mask` / `reset_mask` state is reproducible across tests
masked arithmetic matches the repository's agreed behavior instead of ad-hoc guesses

1. Current mask state

Each vec lane/runtime owns one `vector_mask: uint8[256]`.

Default state:

all 256 entries are `1`

Interpretation:

one mask slot corresponds to one logical vec element
the active prefix depends on dtype / logical repeat payload size

Typical active prefix lengths:

float / int32: `mask[0:64]`
half / int16: `mask[0:128]`
int8 / uint8: `mask[0:256]`

Current simulator behavior:

each repeat reuses the same mask prefix for the dtype
the mask does not advance per repeat
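The state described above can be sketched in a few lines of plain Python. This is only an illustrative model: `ACTIVE_PREFIX`, `new_mask`, and `active_mask_for_repeat` are hypothetical names, not simulator APIs.

```python
# Hypothetical model of the per-lane mask state; names are illustrative only.
MASK_SLOTS = 256

# Active prefix length per dtype, as listed above.
ACTIVE_PREFIX = {
    "float": 64, "int32": 64,
    "half": 128, "int16": 128,
    "int8": 256, "uint8": 256,
}


def new_mask():
    """Default state: all 256 slots are 1."""
    return [1] * MASK_SLOTS


def active_mask_for_repeat(mask, dtype, repeat_index):
    """Return the mask slice applied to one repeat.

    repeat_index is deliberately unused: the same prefix is reused for
    every repeat, so the mask does not advance.
    """
    return mask[:ACTIVE_PREFIX[dtype]]


mask = new_mask()
mask[3] = 0  # mask off one logical element
first = active_mask_for_repeat(mask, "half", 0)
second = active_mask_for_repeat(mask, "half", 1)
```

Here `first` and `second` are identical, which is the "no advance per repeat" behavior.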

2. `set_mask` / `reset_mask`

Logical mask parts

Instruction semantics:

`low` and `high` are treated as `uint64`
bit `i` of `low` writes `mask[i]`
bit `i` of `high` writes `mask[64 + i]`
`mask[128:256]` stays unchanged by `set_mask`

Current bit order:

bit `0` -> lowest mask slot in the covered range
bit `63` -> highest slot in the covered range
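A minimal sketch of the decoding rules above, assuming the mask is held as a 256-entry Python list. `apply_set_mask` is a hypothetical helper written for illustration; it is not the simulator's implementation.

```python
U64 = (1 << 64) - 1


def apply_set_mask(mask, high, low):
    """Decode two 64-bit words into mask[0:128] (illustrative sketch).

    Bit i of `low` writes mask[i]; bit i of `high` writes mask[64 + i].
    mask[128:256] is intentionally left untouched.
    """
    low &= U64    # treat both words as uint64
    high &= U64
    for i in range(64):
        mask[i] = (low >> i) & 1
        mask[64 + i] = (high >> i) & 1
    return mask


mask = [1] * 256
apply_set_mask(mask, high=0, low=0b101)  # enable slots 0 and 2 only
```

After this call only `mask[0]` and `mask[2]` are set within `mask[0:128]`, while `mask[128:256]` keeps its previous value of all ones.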

Stub call shape

The current A2 stub API is:

set_mask(mask_high, mask_low)

The stub takes the high word first, and the emitted instruction keeps the two call arguments in that order without swapping them:

`set_mask(hi, lo)` emits instruction fields `high=hi`, `low=lo`

This is why many call sites that only want the low 64-bit prefix use:

set_mask(0, low_mask)
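As a sketch of how a call site might build the two arguments for "enable the first `n` slots", the hypothetical helper below returns them in the stub's `(mask_high, mask_low)` order. `prefix_mask_args` is not part of the repository; it only illustrates the argument order.

```python
U64 = (1 << 64) - 1


def prefix_mask_args(n):
    """Return (mask_high, mask_low) enabling slots 0..n-1, for 0 <= n <= 128.

    Hypothetical helper: the tuple order matches set_mask(mask_high, mask_low).
    """
    if not 0 <= n <= 128:
        raise ValueError("set_mask only covers mask[0:128]")
    bits = (1 << n) - 1
    return bits >> 64, bits & U64


hi, lo = prefix_mask_args(16)
# A call site would then pass them high-first: set_mask(hi, lo)
```

For any `n <= 64` the high word is `0`, which is the `set_mask(0, low_mask)` shape shown above.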

Validated repository tests:

testcases/simulator/micro/test_simulator_v2_muladddst_mask.py
testcases/simulator/micro/test_simulator_v2_vec_ops_extended.py

`reset_mask()`

resets the full `mask[0:256]` back to `1`

3. Ops that consume the current mask

3.1 Unary / binary / unaryscalar

These ops consume the mask at `dst` writeback.

Rule:

`mask == 1` -> write the newly computed result
`mask == 0` -> keep the old `dst` value unchanged
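The writeback rule can be illustrated with a small pure-Python sketch. `masked_writeback` is a hypothetical name, and an elementwise add stands in for any op in this group.

```python
def masked_writeback(dst, result, mask):
    """Write result[i] where mask[i] == 1, otherwise keep dst[i] (sketch)."""
    return [r if m == 1 else d for d, r, m in zip(dst, result, mask)]


old_dst = [10.0, 20.0, 30.0, 40.0]
src0 = [1.0, 2.0, 3.0, 4.0]
src1 = [0.5, 0.5, 0.5, 0.5]

# The full result is computed first; the mask only gates what is stored.
computed = [a + b for a, b in zip(src0, src1)]
new_dst = masked_writeback(old_dst, computed, [1, 0, 1, 0])
```

Lanes 1 and 3 keep `20.0` and `40.0` from the old `dst`, while lanes 0 and 2 receive the new sums.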

This currently applies to:

Unary

exp
ln
abs
rec
sqrt
rsqrt
vnot
relu

Binary

add
sub
mul
div
vmax
vmin
vand
vor
muladddst

Unaryscalar

adds
muls
vmaxs
vmins
lrelu
axpy

Dup

dup

Cast

cast

Additional cast rule:

the active cast-element domain is determined by the wider of the `src` / `dst` dtypes
for example, `float <-> half` uses `64` mask slots per full repeat, not `128`
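One way to express the cast rule is to derive the slot count from the wider element size. The sketch below assumes a 256-byte repeat payload, which is inferred from the prefix lengths in section 1 rather than stated by the repository; `cast_mask_slots` is a hypothetical helper.

```python
# Element sizes in bytes for the dtypes mentioned in this document.
DTYPE_BYTES = {
    "float": 4, "int32": 4,
    "half": 2, "int16": 2,
    "int8": 1, "uint8": 1,
}

# Assumption: 64 float / 128 half / 256 int8 slots imply 256 bytes per repeat.
REPEAT_BYTES = 256


def cast_mask_slots(src_dtype, dst_dtype):
    """Mask slots consumed per full repeat by a cast (illustrative sketch)."""
    wider = max(DTYPE_BYTES[src_dtype], DTYPE_BYTES[dst_dtype])
    return REPEAT_BYTES // wider


float_half_slots = cast_mask_slots("float", "half")
```

The result is the same in both cast directions because only the wider dtype matters.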

3.2 Additive group reductions

For these ops, the mask does not preserve the old `dst`. Instead, `mask == 0` means the corresponding `src` slot contributes `0`.

This currently applies to:

cadd
cgadd
cpadd

Detailed behavior:

`cadd`: masked-off elements do not contribute to the full-repeat sum; if all lanes in the active prefix are masked off, the destination scalar is not overwritten
`cgadd`: masked-off elements do not contribute to their block-local sum; if one block's mask prefix is entirely zero, that block's destination scalar is not overwritten
`cpadd`: masked-off elements are zeroed before flat-stream pairwise add
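The three behaviors above can be compared side by side in a toy model. This is a sketch under stated assumptions: the function bodies are illustrative, `block_len` is a free parameter here (the real block size is not given in this document), and "pairwise" is assumed to mean adding adjacent elements `2i` and `2i + 1`.

```python
def cadd(src, mask, old_dst):
    """Full-repeat sum; the destination is kept if every lane is masked off."""
    if not any(mask):
        return old_dst
    return sum(s for s, m in zip(src, mask) if m)


def cgadd(src, mask, old_dst, block_len):
    """Block-local sums; a fully masked-off block keeps its old destination."""
    out = list(old_dst)
    for b in range(len(src) // block_len):
        lo, hi = b * block_len, (b + 1) * block_len
        if any(mask[lo:hi]):
            out[b] = sum(s for s, m in zip(src[lo:hi], mask[lo:hi]) if m)
    return out


def cpadd(src, mask):
    """Zero masked-off elements, then add adjacent pairs (assumed pairing)."""
    zeroed = [s if m else 0 for s, m in zip(src, mask)]
    return [zeroed[i] + zeroed[i + 1] for i in range(0, len(zeroed) - 1, 2)]


src = [1, 2, 3, 4, 5, 6, 7, 8]
mask = [1, 1, 0, 0, 0, 0, 1, 0]

total = cadd(src, mask, old_dst=-1)                         # 1 + 2 + 7
groups = cgadd(src, mask, old_dst=[-1] * 4, block_len=2)    # blocks 1, 2 kept
pairs = cpadd(src, mask)                                    # zeros are written
```

With the same input, `cgadd` leaves `-1` in the two fully masked-off blocks, whereas `cpadd` writes `0` there.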

3.3 Max/min reductions with masked infinities

For these ops, masked-off source slots do not preserve the old `dst` directly. Instead, they are replaced with an extreme sentinel before reduction.

Max family

Masked-off slots act like `-inf`.

This currently applies to:

cmax
cgmax

Detailed behavior:

`cmax`: masked-off elements are replaced by `-inf`; if all lanes in the active prefix are masked off, the destination scalar is not overwritten
`cgmax`: masked-off elements are replaced by `-inf`; if one block's mask prefix is entirely zero, that block's destination scalar is not overwritten

Min family

Masked-off slots act like `+inf`.

This currently applies to:

cmin
cgmin

Detailed behavior:

`cmin`: masked-off elements are replaced by `+inf`; if all lanes in the active prefix are masked off, the destination scalar is not overwritten
`cgmin`: masked-off elements are replaced by `+inf`; if one block's mask prefix is entirely zero, that block's destination scalar is not overwritten
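The sentinel rule for both the max and min families can be sketched as follows. The functions are illustrative stand-ins for the full-repeat variants only and return just the reduced value; the group variants would apply the same rule per block.

```python
import math


def cmax(src, mask, old_dst):
    """Max over the repeat; masked-off lanes behave like -inf (sketch)."""
    if not any(mask):
        return old_dst  # every lane masked off: destination not overwritten
    return max(s if m else -math.inf for s, m in zip(src, mask))


def cmin(src, mask, old_dst):
    """Min over the repeat; masked-off lanes behave like +inf (sketch)."""
    if not any(mask):
        return old_dst
    return min(s if m else math.inf for s, m in zip(src, mask))


src = [3.0, 9.0, -2.0, 5.0]
mask = [1, 0, 0, 1]

hi_val = cmax(src, mask, old_dst=0.0)  # 9.0 is masked off
lo_val = cmin(src, mask, old_dst=0.0)  # -2.0 is masked off
```

The sentinels guarantee that a masked-off lane can never win the reduction, so `9.0` and `-2.0` are ignored here.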

4. Ops that explicitly do NOT consume the current mask

These ops currently ignore the vec mask entirely and follow only their own normal semantics:

select
compare
compare_scalar
gather
scatter
sort32
mergesort4
mergesort_2seq
brcb

For the current V2 path, `compare(...)` / `compare_scalar(...)` and `select(...)` also use their own explicit control tensor semantics:

the current vec mask from `set_mask(...)` is irrelevant to them
when the control tensor is a packed-bit `uint8` mask, the shape is `[..., N // 8]`, not an expanded `[..., N]` byte-per-element mask
practical example: packed causal control for a `[64, 64]` half-tile is `Tensor(DT.uint8, [64, 8], Position.UB)`
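To make the `[..., N // 8]` shape concrete, the sketch below packs a 64 x 64 causal pattern into 64 rows of 8 bytes. Two things are assumptions here and are not stated by the repository: the bit order inside each byte (flag `j` goes to bit `j % 8`) and the causal convention (`col <= row` is enabled). `pack_bits` is a hypothetical helper.

```python
def pack_bits(flags):
    """Pack 0/1 flags into bytes, 8 flags per byte.

    Assumption: flag j lands in bit (j % 8) of byte j // 8.
    """
    if len(flags) % 8 != 0:
        raise ValueError("flag count must be a multiple of 8")
    packed = []
    for start in range(0, len(flags), 8):
        byte = 0
        for j in range(8):
            byte |= (flags[start + j] & 1) << j
        packed.append(byte)
    return packed


N = 64
# Assumed causal convention: position `col` is enabled when col <= row.
causal = [[1 if col <= row else 0 for col in range(N)] for row in range(N)]
packed = [pack_bits(row) for row in causal]

packed_shape = (len(packed), len(packed[0]))
```

The packed control has one byte per eight elements, which is why the tensor shape is `[64, 8]` rather than `[64, 64]`.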

5. Deprecated / pending pieces

Deprecated

set_cmpmask
- deprecated because its role overlaps with `compare()` / `compare_scalar()` result generation
- kept only for compatibility, not for simulator feature growth

Pending / not yet decided

Mask semantics are still not fixed for:

cast

If you touch new undecided ops later, document the decision here first.

6. Implementation rules

When adding mask support to a new vec op, decide which of these two patterns it belongs to:

Pattern A: mask gates writeback

Use for elementwise ops.

Meaning:

compute the full result
write only where the mask is `1`
preserve the old `dst` where the mask is `0`

Pattern B: mask zeroes src contribution

Use for additive reductions.

Meaning:

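masked-off `src` slots act like `0`
the reduction result is still written normally, except when every active lane (or every lane of a block) is masked off, in which case that destination scalar is left unchanged as described in 3.2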

Do not mix the two patterns casually.

7. Validation checklist

When testing mask behavior:

verify default mask is all ones
verify `set_mask` changes only `mask[0:128]`
verify `reset_mask` restores full ones
for writeback-gated ops, confirm masked-off lanes keep the old `dst`
for additive reductions, confirm masked-off lanes contribute zero instead of preserving the old `dst`
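The checklist can be exercised against a toy model before porting the assertions to real simulator tests. Everything below is a hypothetical stand-in: `ToyVecMask`, `masked_add`, and `masked_cadd` mimic the rules in this document and are not the repository's classes or test helpers.

```python
class ToyVecMask:
    """Toy stand-in for the simulator's mask state (hypothetical)."""

    def __init__(self):
        self.mask = [1] * 256

    def set_mask(self, mask_high, mask_low):
        # Same argument order as the stub: high word first.
        for i in range(64):
            self.mask[i] = (mask_low >> i) & 1
            self.mask[64 + i] = (mask_high >> i) & 1

    def reset_mask(self):
        self.mask = [1] * 256


def masked_add(dst, a, b, mask):
    """Writeback-gated add: masked-off lanes keep the old dst."""
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, mask)]


def masked_cadd(src, mask, old_dst):
    """Additive reduction: masked-off lanes contribute zero."""
    if not any(mask):
        return old_dst
    return sum(s for s, m in zip(src, mask) if m)


def run_checklist():
    vm = ToyVecMask()
    assert vm.mask == [1] * 256              # default mask is all ones
    vm.set_mask(0, 0b0101)
    assert vm.mask[:4] == [1, 0, 1, 0]
    assert sum(vm.mask[:128]) == 2           # only mask[0:128] was rewritten
    assert vm.mask[128:] == [1] * 128        # upper half untouched
    active = vm.mask[:4]
    assert masked_add([9, 9, 9, 9], [1, 2, 3, 4],
                      [10, 20, 30, 40], active) == [11, 9, 33, 9]
    assert masked_cadd([1, 2, 3, 4], active, old_dst=-1) == 4
    vm.reset_mask()
    assert vm.mask == [1] * 256              # reset restores full ones
    return True


checklist_passed = run_checklist()
```

Each assertion maps to one checklist item, so a failing line points directly at the rule that was broken.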

