
CANN NPU Compressor Operator Documentation


Compressor

[Free download link] cann-recipes-infer: this project provides CANN-based optimization samples for typical models and acceleration algorithms in LLM and multimodal inference. Project address: https://gitcode.com/cann/cann-recipes-infer

Product Support

| Product | Supported |
| --- | --- |
| Atlas A3 inference series products | √ |

Function Description

  • API function: Compressor is a preprocessing operator for SAS and QLI in inference scenarios. It compresses the KV cache of every 4 or 128 tokens into a single entry, and each token then performs DSA computation against these compressed KV caches. For long sequences, Compressor effectively reduces computation overhead.

  • Formulas:

    Compression stage:

    1. Matrix multiplications:
      • C4A: $\left[\mathrm{kv\_state}^a, \mathrm{score\_state}^a\right] = X @ \left[W^{aKV}, W^{aGate}\right],\ \left[\mathrm{kv\_state}^b, \mathrm{score\_state}^b\right] = X @ \left[W^{bKV}, W^{bGate}\right];$
      • C128A: $\left[\mathrm{kv\_state}, \mathrm{score\_state}\right] = X @ \left[W^{KV}, W^{Gate}\right]$
    2. Grouped addition:
      • C4A: $\mathrm{score\_state}_i^\prime = \left[\mathrm{score\_state}^a_{\left[4(i-1)+1:4i,\,:\right]};\ \mathrm{score\_state}^b_{\left[4i+1:4(i+1),\,:\right]}\right] + Ape,~i=1,2,\cdots,\frac{s}{4};$
      • C128A: $\mathrm{score\_state}_i^\prime = \mathrm{score\_state}_{\left[128(i-1)+1:128i,\,:\right]} + Ape,~i=1,2,\cdots,\frac{s}{128};$
    3. Grouped softmax:
      • C4A: $S_i^\prime = \mathrm{softmax}(\mathrm{score\_state}_i^\prime),~i=1,2,\cdots,\frac{s}{4};$
      • C128A: $S_i^\prime = \mathrm{softmax}(\mathrm{score\_state}_i^\prime),~i=1,2,\cdots,\frac{s}{128};$
    4. Hadamard product:
      • C4A: $(S_H)_i = S_i^\prime \odot \left[\mathrm{kv\_state}^a_{\left[4(i-1)+1:4i,\,:\right]};\ \mathrm{kv\_state}^b_{\left[4i+1:4(i+1),\,:\right]}\right],~i=1,2,\cdots,\frac{s}{4};$
      • C128A: $(S_H)_i = S_i^\prime \odot \mathrm{kv\_state}_{\left[128(i-1)+1:128i,\,:\right]},~i=1,2,\cdots,\frac{s}{128};$
    5. Grouped sum along the compression axis:
      • C4A: $C_{i}^{\text{Comp}} = \left[1\right]_{1\times8} @ (S_H)_i,~i=1,2,\cdots,\frac{s}{4};$
      • C128A: $C_{i}^{\text{Comp}} = \left[1\right]_{1\times128} @ (S_H)_i,~i=1,2,\cdots,\frac{s}{128};$

    Post-processing stage:

    1. RMSNorm:
      • $\text{RMS}(C^{\text{Comp}}) = \sqrt{\frac{1}{N} \sum_{i=jN}^{(j+1)N} (C_{i}^{\text{Comp}})^{2} + \mathrm{norm\_eps}},~N=\mathrm{head\_dim},~j=1,2,\cdots,\frac{s}{\mathrm{cmp\_ratio}}$
      • $\text{RmsNorm}(C^{\text{Comp}}) = \mathrm{norm\_weight} \cdot \frac{C_{i}^{\text{Comp}}}{\text{RMS}(C^{\text{Comp}})}$
    2. Apply RoPE.
  • The main computation flow is:

    1. Multiply the input $X$ with $W^{KV}$ (Matmul) to obtain $\mathrm{kv\_state}$; multiply $X$ with $W^{Gate}$ and then add $Ape$ to obtain $\mathrm{score\_state}$. Both $\mathrm{kv\_state}$ and $\mathrm{score\_state}$ are updated according to the start_pos and cu_seqlens inputs.
    2. When coff is 2, rearrange the data of $\mathrm{kv\_state}$ and $\mathrm{score\_state}$ for overlap.
    3. Apply softmax to $\mathrm{score\_state}$, multiply the softmax result with $\mathrm{kv\_state}$ (Mul), then perform ReduceSum.
    4. Using the inputs norm_weight, rope_sin, and rope_cos, perform RMSNorm and RoPE to produce the $\mathrm{cmp\_kv}$ output. A minimal sketch of this flow follows the list.
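For intuition only, here is a minimal NumPy sketch of the C128A path (coff=1, cmp_ratio=128) described above. It is not the operator's implementation: cache updates via start_pos/cu_seqlens, padding, and the RoPE step are omitted, the softmax axis is assumed to be the compression axis, and all names are illustrative.

```python
import numpy as np

def c128a_compress_ref(X, Wkv, Wgate, Ape, norm_weight, cmp_ratio=128, norm_eps=1e-6):
    """Illustrative reference for the C128A compression math (RoPE omitted).

    Assumes X is [s, H] with s a multiple of cmp_ratio, Wkv/Wgate are [D, H],
    Ape is [cmp_ratio, D], norm_weight is [D].
    """
    kv_state = X @ Wkv.T                                  # [s, D]
    score_state = X @ Wgate.T                             # [s, D]
    D = kv_state.shape[1]
    kv = kv_state.reshape(-1, cmp_ratio, D)               # group tokens: [s/cmp_ratio, cmp_ratio, D]
    score = score_state.reshape(-1, cmp_ratio, D) + Ape   # grouped add of positional biases
    score = score - score.max(axis=1, keepdims=True)      # numerically stable grouped softmax
    S = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
    C = (S * kv).sum(axis=1)                              # Hadamard product + sum along compression axis
    rms = np.sqrt((C ** 2).mean(axis=-1, keepdims=True) + norm_eps)
    return norm_weight * C / rms                          # RMSNorm over head_dim
```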

Function Prototype

custom.compressor(x, wkv, wgate, state_cache, ape, norm_weight, rope_sin, rope_cos, rope_head_dim, cmp_ratio, *, state_block_table=None, cu_seqlens=None, seqused=None, start_pos=None, coff=1, norm_eps=1e-6, rotary_mode=1, cache_mode=1) -> Tensor
  • For the Transformer Compressor operator implementation, see: Compressor
  • For building and installing the Transformer Compressor operator, see: the Compressor build and installation guide

Parameter Description

Note:

  • Meaning of the x dimensions: B (Batch Size) is the number of input samples in a batch, S (Sequence Length) is the input sequence length, H (Head Size) is the hidden-layer size, D (Head Dim) is the smallest unit size of the hidden layer, and T is the sum of the sequence lengths of all batches.
  • x (Tensor): required. The original, uncompressed data, corresponding to $X$ in the formulas. Non-contiguous tensors are not supported; data format: ND; data types: bfloat16, float16. Supported input shapes: [B,S,H] and [T,H].

  • wkv (Tensor): required. The KV compression weight, corresponding to $W^{KV}$ in the formulas. Non-contiguous tensors are not supported; data format: ND; data types: bfloat16, float16. Supported input shape: [coff*D,H].

  • wgate (Tensor): required. The gate compression weight, corresponding to $W^{Gate}$ in the formulas. Non-contiguous tensors are not supported; data format: ND; data types: bfloat16, float16. Supported input shape: [coff*D,H].

  • state_cache (Tensor): required. The historical kv_state and score_state data, corresponding to $\left[\mathrm{kv\_state}, \mathrm{score\_state}\right]$ in the formulas. Non-contiguous tensors are not supported; data format: ND; data type: float32. Supported input shape: [block_num,block_size,2*coff*D], with block_num > 0.

  • ape (Tensor): required. The positional biases, corresponding to $Ape$ in the formulas. Non-contiguous tensors are not supported; data format: ND; data type: float32. Supported input shape: [cmp_ratio,coff*D].

  • norm_weight (Tensor): required. The weight used in the RmsNorm computation. Data types: bfloat16, float16. Supported input shape: [D,].

  • rope_sin (Tensor): required. The sin weights for the RoPE computation. Data types: bfloat16, float16. When x has shape [B,S,H], the required shape is [B,ceil(S/cmp_ratio),rope_head_dim]; when x has shape [T,H], the required shape is [min(T,T//cmp_ratio+B),rope_head_dim] (see the shape sketch after this list).

  • rope_cos (Tensor): required. The cos weights for the RoPE computation. Data types: bfloat16, float16. Shape requirements are the same as for rope_sin.

  • rope_head_dim (int): required. The smallest hidden-unit size of rope_cos and rope_sin. Currently only 64 is supported.

  • cmp_ratio (int): required. The data compression ratio.

  • *: parameters before this marker are positional and must be passed in order; parameters after it are optional keyword arguments that fall back to their defaults when omitted.

  • state_block_table (Tensor): optional. The block mapping table used for state_cache storage. Non-contiguous tensors are not supported; data format: ND; data type: int32. Supported input shape: [B,ceil(Smax/block_size)], where Smax is the largest Sequence Length over all batches: when x has shape [B,S,H], Smax = max(start_pos)+S; when x has shape [T,H], Smax = max(start_pos)+max(cu_seqlens[n+1] - cu_seqlens[n]). An element value of 0 means the corresponding position does not update state_cache.

  • cu_seqlens (Tensor): optional. The number of valid tokens in each batch. Non-contiguous tensors are not supported; data format: ND; data type: int32. When x has shape [B,S,H], this parameter must be None. When x has shape [T,H], the input shape must be [B+1,]. Each element is the prefix sum of the token counts of the current and all preceding batches, so the values must be non-decreasing and the first element must be 0.

  • seqused (Tensor): optional. The number of tokens in each batch that actually participate in compression. Non-contiguous tensors are not supported; data format: ND; data type: int32. Supported input shape: [B,]. When None, it defaults to each batch's full Sequence Length. Each value must be non-negative and must not exceed the corresponding Sequence Length: when x has shape [B,S,H], seqused[n] <= S; when x has shape [T,H], seqused[n] <= cu_seqlens[n+1] - cu_seqlens[n].

  • start_pos (Tensor): optional. The start position of the computation. Non-contiguous tensors are not supported; data format: ND; data type: int32. Supported input shape: [B,]. When None, computation starts from position 0.

  • coff (int): optional. Default 1; supports 1 and 2. When coff=1, no overlap data rearrangement is performed; when coff=2, overlap data rearrangement is performed.

  • norm_eps (float): optional. The epsilon added in the RmsNorm computation for numerical stability, corresponding to $\mathrm{norm\_eps}$ in the formulas. Default 1e-6.

  • rotary_mode (int): optional. The RoPE computation mode. Default 1; supports 1 and 2. rotary_mode=1 is half mode; rotary_mode=2 is interleave mode.

  • cache_mode (int): optional. The storage mode of state_cache: 1 for a contiguous buffer, 2 for a circular buffer. Default 1. Mode 2 is not supported yet.
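To make the rope_sin/rope_cos and cu_seqlens rules above easy to check, here is a small illustrative helper; it only transcribes the documented shape rules, and the function names are hypothetical, not part of the API:

```python
import math
import torch

def expected_rope_shape(x_shape, B, cmp_ratio, rope_head_dim=64):
    """Expected rope_sin / rope_cos shape for both supported x layouts."""
    if len(x_shape) == 3:                       # x: [B, S, H]
        _, S, _ = x_shape
        return (B, math.ceil(S / cmp_ratio), rope_head_dim)
    T, _ = x_shape                              # x: [T, H]
    return (min(T, T // cmp_ratio + B), rope_head_dim)

def check_cu_seqlens(cu_seqlens, B):
    """cu_seqlens must be a length-(B+1) non-decreasing prefix sum starting at 0."""
    assert cu_seqlens.shape == (B + 1,), "shape must be [B+1,]"
    assert int(cu_seqlens[0]) == 0, "first element must be 0"
    assert bool((cu_seqlens[1:] >= cu_seqlens[:-1]).all()), "must be non-decreasing"

# e.g. x of shape [T, H] = [4096, 4096] with B=2, cmp_ratio=128 -> (34, 64)
print(expected_rope_shape((4096, 4096), B=2, cmp_ratio=128))
check_cu_seqlens(torch.tensor([0, 2048, 4096], dtype=torch.int32), B=2)
```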

Return Value

  • cmp_kv (Tensor): required output. The compressed data. Non-contiguous tensors are not supported; data format: ND; data types: bfloat16, float16. When x has shape [B,S,H], the output shape is [B,ceil(S/cmp_ratio),D], laid out as (compressed_tokens+pad0) + (compressed_tokens+pad1) + ... + (compressed_tokens+padN); when x has shape [T,H], the output shape is [min(T,T//cmp_ratio+B),D], laid out as compressed_tokens + compressed_tokens + ... + compressed_tokens + pad. The sketch below transcribes these shape rules.
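A direct, illustrative transcription of the output shape rules above (the helper name is hypothetical):

```python
import math

def expected_cmp_kv_shape(x_shape, B, cmp_ratio, D):
    """Expected cmp_kv shape for both supported x layouts."""
    if len(x_shape) == 3:                       # x: [B, S, H]
        _, S, _ = x_shape
        return (B, math.ceil(S / cmp_ratio), D)
    T, _ = x_shape                              # x: [T, H]
    return (min(T, T // cmp_ratio + B), D)

# e.g. C128A with x of shape [B, S, H] = [1, 8192, 4096], D=512 -> (1, 64, 512)
print(expected_cmp_kv_shape((1, 8192, 4096), B=1, cmp_ratio=128, D=512))
```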

Constraints

  • The interface supports generalized B and S, with the following scenario restrictions:
    • In some long-sequence scenarios, an excessive computation volume may exceed NPU memory and raise an error. Note: the computation volume grows with the input shape of x (larger values mean more computation). Typical long-sequence scenarios (i.e., large B*S or large T) include but are not limited to:

      | B | S | H |
      | --- | --- | --- |
      | 100 | 65525 | 4096 |
      | 25 | 261120 | 4096 |
      | 100 | 131072 | 4096 |
      | 100 | 261120 | 4096 |
  • D supports 128 and 512.
  • H supports 1K to 10K, aligned to 512.
  • cmp_ratio supports 4 and 128. The following three configurations are supported (a validation sketch follows this list):
    • C4A: D=512, coff=2, cmp_ratio=4;
    • C4Li: D=128, coff=2, cmp_ratio=4;
    • C128A: D=512, coff=1, cmp_ratio=128.
  • rotary_mode=2 is supported, i.e., the interleave RoPE mode.
  • The interface supports aclgraph mode.
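A small illustrative pre-flight check that mirrors the constraints above; the function and constant names are hypothetical, not part of the API:

```python
# Supported (D, coff, cmp_ratio) combinations from the constraints above.
SUPPORTED_CONFIGS = {
    "C4A":   (512, 2, 4),
    "C4Li":  (128, 2, 4),
    "C128A": (512, 1, 128),
}

def check_compressor_config(D, coff, cmp_ratio, H):
    """Raise if the configuration is outside the documented support matrix."""
    if (D, coff, cmp_ratio) not in SUPPORTED_CONFIGS.values():
        raise ValueError(f"unsupported combination: D={D}, coff={coff}, cmp_ratio={cmp_ratio}")
    if not (1024 <= H <= 10240 and H % 512 == 0):
        raise ValueError("H must be within 1K-10K and 512-aligned")

check_compressor_config(D=512, coff=1, cmp_ratio=128, H=4096)   # C128A, OK
```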

Invocation on Atlas A3 Inference Series Products

  • Single-operator mode invocation

```python
import torch
import torch_npu
import numpy as np
import custom_ops
import torch.nn as nn
import math


def get_seq_used_by_batch(batch_idx, S, seqused, cu_seqlens):
    if seqused is not None:
        return seqused[batch_idx]
    if cu_seqlens is not None:
        return cu_seqlens[batch_idx + 1] - cu_seqlens[batch_idx]
    return S


data_type = torch.bfloat16
hidden_size = 4096
rope_head_dim = 64
norm_eps = 1e-6
coff = 1  # 1: no overlap, 2: overlap
cmp_ratio = 128
rotary_mode = 2
cache_mode = 1
head_dim = 512
cu_seqlens = [0, 1]
# -------------
B = 1
S = 1
S_max = 0
block_size = 128
start_pos = [8191] * B  # (B,)
start_p = 8191
seqused = None  # (B,); when None, all tokens given by cu_seqlens participate, otherwise the passed values are used
# whether the B and S axes are combined
bs_combine_flag = True
update_flag = 1
save_state_seqlens = None

if seqused is not None:
    seqused = torch.tensor(seqused).to(torch.int32)
if start_pos is not None:
    start_pos = torch.tensor(start_pos).to(torch.int32)
else:
    start_pos = torch.full((B,), start_p, dtype=torch.int32)

if bs_combine_flag:
    if cu_seqlens is None:
        T = B * S
        if T != 0:
            cu_seqlens = torch.arange(0, T + 1, S, dtype=torch.int32)
        else:
            cu_seqlens = torch.zeros((B + 1), dtype=torch.int32)
    else:
        cu_seqlens = torch.tensor(cu_seqlens).to(torch.int32)
    for i in range(B):
        if start_pos[i] + cu_seqlens[i + 1] - cu_seqlens[i] > S_max:
            S_max = start_pos[i] + cu_seqlens[i + 1] - cu_seqlens[i]
else:
    cu_seqlens = None
    S_max = max(start_pos) + S

# ======================== gen input data start =============================
# page state
if cache_mode == 1:
    max_block_num_per_batch = (S_max + block_size - 1) // block_size
    block_num = B * max_block_num_per_batch
    next_block_id = 1
    print(f"max_block_num_per_batch: {max_block_num_per_batch}")
    block_table = torch.zeros(size=(B, max_block_num_per_batch), dtype=torch.int32)
    for i in range(B):
        # range of state_cache that needs to be read
        cur_start = start_pos[i] // cmp_ratio * cmp_ratio - cmp_ratio
        cur_end = start_pos[i] // cmp_ratio * cmp_ratio + cmp_ratio
        if start_pos[i] % cmp_ratio == 0:
            cur_end = start_pos[i]
        cur_end = min(cur_end, start_pos[i] + S)
        cur_start_block_id = (cur_start // block_size) if cur_start >= 0 else 0
        cur_end_block_id = (cur_end - 1) // block_size
        for j in range(cur_start_block_id, cur_end_block_id + 1):
            block_table[i][j] = next_block_id
            next_block_id = next_block_id + 1
        # range of state_cache that needs to be written
        end_pos = get_seq_used_by_batch(i, S, seqused, cu_seqlens)
        if save_state_seqlens is not None:
            next_start = start_pos[i] + end_pos - save_state_seqlens[i]
            next_end = start_pos[i] + end_pos
        else:
            next_start = (start_pos[i] + end_pos) // cmp_ratio * cmp_ratio - cmp_ratio
            next_end = (start_pos[i] + end_pos) // cmp_ratio * cmp_ratio + cmp_ratio
        if (start_pos[i] + end_pos) % cmp_ratio == 0:
            next_end = start_pos[i] + end_pos
        next_end = min(next_end, start_pos[i] + end_pos)
        next_start_block_id = (next_start // block_size) if next_start >= 0 else 0
        next_end_block_id = (next_end - 1) // block_size
        for j in range(next_start_block_id, next_end_block_id + 1):
            if block_table[i][j] == 0:
                block_table[i][j] = next_block_id
                next_block_id = next_block_id + 1

if B == 0:
    kv_state = torch.tensor(np.random.uniform(-10, 10, (0, block_size, coff * head_dim))).to(torch.float32)
    score_state = torch.tensor(np.random.uniform(-10, 10, (0, block_size, coff * head_dim))).to(torch.float32)
else:
    kv_state = torch.tensor(np.random.uniform(-10, 10, (torch.max(block_table) + 1, block_size, coff * head_dim))).to(torch.float32)
    score_state = torch.tensor(np.random.uniform(-10, 10, (torch.max(block_table) + 1, block_size, coff * head_dim))).to(torch.float32)

# other inputs
if bs_combine_flag:
    x_shape = (cu_seqlens[-1], hidden_size)
    rope_sin_shape = (min(x_shape[0], x_shape[0] // cmp_ratio + B), rope_head_dim)
    rope_cos_shape = rope_sin_shape
else:
    x_shape = (B, S, hidden_size)
    rope_sin_shape = (B, (S + cmp_ratio - 1) // cmp_ratio, rope_head_dim)
    rope_cos_shape = rope_sin_shape

x = torch.tensor(np.random.uniform(-10.0, 10.0, x_shape)).to(data_type).npu()
wkv = torch.tensor(np.random.uniform(-10, 10, (coff * head_dim, hidden_size))).to(data_type).npu()
wgate = torch.tensor(np.random.uniform(-10, 10, (coff * head_dim, hidden_size))).to(data_type).npu()
ape = torch.tensor(np.random.uniform(-10, 10, (cmp_ratio, coff * head_dim))).to(torch.float32).npu()
norm_weight = torch.tensor(np.random.uniform(-10, 10, (head_dim))).to(data_type).npu()
rope_sin = torch.tensor(np.random.uniform(-1, 1, rope_sin_shape)).to(data_type).npu()
rope_cos = torch.tensor(np.random.uniform(-1, 1, rope_cos_shape)).to(data_type).npu()

if cache_mode == 1:
    # contiguous buffer: kv_state in the first half, score_state in the second half
    state_cache = torch.zeros((kv_state.shape[0], kv_state.shape[1], 2 * kv_state.shape[2]))
    state_cache = state_cache.npu()
    state_cache[:, :, :state_cache.shape[2] // 2] = kv_state.clone()
    state_cache[:, :, state_cache.shape[2] // 2:] = score_state.clone()

block_table = block_table.npu()
start_pos = torch.tensor(start_pos).to(torch.int32).npu()
if cu_seqlens is not None:
    cu_seqlens = torch.tensor(cu_seqlens).to(torch.int32).npu()
if seqused is not None:
    seqused = torch.tensor(seqused).to(torch.int32).npu()

cmp_kv = torch.ops.custom.compressor(
    x, wkv, wgate, state_cache, ape, norm_weight, rope_sin, rope_cos,
    rope_head_dim=rope_head_dim, cmp_ratio=cmp_ratio,
    state_block_table=block_table, cu_seqlens=cu_seqlens,
    seqused=seqused, start_pos=start_pos, coff=coff,
    norm_eps=norm_eps, rotary_mode=rotary_mode, cache_mode=cache_mode)
```
  • aclgraph mode invocation

```python
import torch
import torch_npu
import numpy as np
import torch.nn as nn
import torchair
import custom_ops
import math


def get_seq_used_by_batch(batch_idx, S, seqused, cu_seqlens):
    if seqused is not None:
        return seqused[batch_idx]
    if cu_seqlens is not None:
        return cu_seqlens[batch_idx + 1] - cu_seqlens[batch_idx]
    return S


data_type = torch.bfloat16
hidden_size = 4096
rope_head_dim = 64
norm_eps = 1e-6
coff = 1  # 1: no overlap, 2: overlap
cmp_ratio = 128
rotary_mode = 2
cache_mode = 1
head_dim = 512
cu_seqlens = [0, 1]
# -------------
B = 1
S = 1
S_max = 0
block_size = 128
start_pos = [8191] * B  # (B,)
start_p = 8191
seqused = None  # (B,); when None, all tokens given by cu_seqlens participate, otherwise the passed values are used
# whether the B and S axes are combined
bs_combine_flag = True
update_flag = 1
save_state_seqlens = None

if seqused is not None:
    seqused = torch.tensor(seqused).to(torch.int32)
if start_pos is not None:
    start_pos = torch.tensor(start_pos).to(torch.int32)
else:
    start_pos = torch.full((B,), start_p, dtype=torch.int32)

if bs_combine_flag:
    if cu_seqlens is None:
        T = B * S
        if T != 0:
            cu_seqlens = torch.arange(0, T + 1, S, dtype=torch.int32)
        else:
            cu_seqlens = torch.zeros((B + 1), dtype=torch.int32)
    else:
        cu_seqlens = torch.tensor(cu_seqlens).to(torch.int32)
    for i in range(B):
        if start_pos[i] + cu_seqlens[i + 1] - cu_seqlens[i] > S_max:
            S_max = start_pos[i] + cu_seqlens[i + 1] - cu_seqlens[i]
else:
    cu_seqlens = None
    S_max = max(start_pos) + S

# ======================== gen input data start =============================
# page state
if cache_mode == 1:
    max_block_num_per_batch = (S_max + block_size - 1) // block_size
    block_num = B * max_block_num_per_batch
    next_block_id = 1
    print(f"max_block_num_per_batch: {max_block_num_per_batch}")
    block_table = torch.zeros(size=(B, max_block_num_per_batch), dtype=torch.int32)
    for i in range(B):
        # range of state_cache that needs to be read
        cur_start = start_pos[i] // cmp_ratio * cmp_ratio - cmp_ratio
        cur_end = start_pos[i] // cmp_ratio * cmp_ratio + cmp_ratio
        if start_pos[i] % cmp_ratio == 0:
            cur_end = start_pos[i]
        cur_end = min(cur_end, start_pos[i] + S)
        cur_start_block_id = (cur_start // block_size) if cur_start >= 0 else 0
        cur_end_block_id = (cur_end - 1) // block_size
        for j in range(cur_start_block_id, cur_end_block_id + 1):
            block_table[i][j] = next_block_id
            next_block_id = next_block_id + 1
        # range of state_cache that needs to be written
        end_pos = get_seq_used_by_batch(i, S, seqused, cu_seqlens)
        if save_state_seqlens is not None:
            next_start = start_pos[i] + end_pos - save_state_seqlens[i]
            next_end = start_pos[i] + end_pos
        else:
            next_start = (start_pos[i] + end_pos) // cmp_ratio * cmp_ratio - cmp_ratio
            next_end = (start_pos[i] + end_pos) // cmp_ratio * cmp_ratio + cmp_ratio
        if (start_pos[i] + end_pos) % cmp_ratio == 0:
            next_end = start_pos[i] + end_pos
        next_end = min(next_end, start_pos[i] + end_pos)
        next_start_block_id = (next_start // block_size) if next_start >= 0 else 0
        next_end_block_id = (next_end - 1) // block_size
        for j in range(next_start_block_id, next_end_block_id + 1):
            if block_table[i][j] == 0:
                block_table[i][j] = next_block_id
                next_block_id = next_block_id + 1

if B == 0:
    kv_state = torch.tensor(np.random.uniform(-10, 10, (0, block_size, coff * head_dim))).to(torch.float32)
    score_state = torch.tensor(np.random.uniform(-10, 10, (0, block_size, coff * head_dim))).to(torch.float32)
else:
    kv_state = torch.tensor(np.random.uniform(-10, 10, (torch.max(block_table) + 1, block_size, coff * head_dim))).to(torch.float32)
    score_state = torch.tensor(np.random.uniform(-10, 10, (torch.max(block_table) + 1, block_size, coff * head_dim))).to(torch.float32)

# other inputs
if bs_combine_flag:
    x_shape = (cu_seqlens[-1], hidden_size)
    rope_sin_shape = (min(x_shape[0], x_shape[0] // cmp_ratio + B), rope_head_dim)
    rope_cos_shape = rope_sin_shape
else:
    x_shape = (B, S, hidden_size)
    rope_sin_shape = (B, (S + cmp_ratio - 1) // cmp_ratio, rope_head_dim)
    rope_cos_shape = rope_sin_shape

x = torch.tensor(np.random.uniform(-10.0, 10.0, x_shape)).to(data_type).npu()
wkv = torch.tensor(np.random.uniform(-10, 10, (coff * head_dim, hidden_size))).to(data_type).npu()
wgate = torch.tensor(np.random.uniform(-10, 10, (coff * head_dim, hidden_size))).to(data_type).npu()
ape = torch.tensor(np.random.uniform(-10, 10, (cmp_ratio, coff * head_dim))).to(torch.float32).npu()
norm_weight = torch.tensor(np.random.uniform(-10, 10, (head_dim))).to(data_type).npu()
rope_sin = torch.tensor(np.random.uniform(-1, 1, rope_sin_shape)).to(data_type).npu()
rope_cos = torch.tensor(np.random.uniform(-1, 1, rope_cos_shape)).to(data_type).npu()

if cache_mode == 1:
    # contiguous buffer: kv_state in the first half, score_state in the second half
    state_cache = torch.zeros((kv_state.shape[0], kv_state.shape[1], 2 * kv_state.shape[2]))
    state_cache = state_cache.npu()
    state_cache[:, :, :state_cache.shape[2] // 2] = kv_state.clone()
    state_cache[:, :, state_cache.shape[2] // 2:] = score_state.clone()

block_table = block_table.npu()
start_pos = torch.tensor(start_pos).to(torch.int32).npu()
if cu_seqlens is not None:
    cu_seqlens = torch.tensor(cu_seqlens).to(torch.int32).npu()
if seqused is not None:
    seqused = torch.tensor(seqused).to(torch.int32).npu()


class CompressorNetwork(nn.Module):
    def __init__(self):
        super(CompressorNetwork, self).__init__()

    def forward(self, x, wkv, wgate, state_cache, ape, norm_weight, rope_sin, rope_cos,
                rope_head_dim, cmp_ratio, state_block_table=None, cu_seqlens=None,
                seqused=None, start_pos=None, coff=1, norm_eps=1e-6, rotary_mode=1,
                cache_mode=1):
        cmp_kv = torch.ops.custom.compressor(
            x, wkv, wgate, state_cache, ape, norm_weight, rope_sin, rope_cos,
            state_block_table=state_block_table, cu_seqlens=cu_seqlens,
            seqused=seqused, start_pos=start_pos,
            rope_head_dim=rope_head_dim, cmp_ratio=cmp_ratio, coff=coff,
            norm_eps=norm_eps, rotary_mode=rotary_mode, cache_mode=cache_mode)
        return cmp_kv


from torchair.configs.compiler_config import CompilerConfig

config = CompilerConfig()
config.mode = "reduce-overhead"
npu_backend = torchair.get_npu_backend(compiler_config=config)
torch._dynamo.reset()
npu_mode = torch.compile(CompressorNetwork(), fullgraph=True, backend=npu_backend, dynamic=False)
cmp_kv = npu_mode(
    x, wkv, wgate, state_cache, ape, norm_weight, rope_sin, rope_cos,
    rope_head_dim=rope_head_dim, cmp_ratio=cmp_ratio,
    state_block_table=block_table, cu_seqlens=cu_seqlens,
    seqused=seqused, start_pos=start_pos, coff=coff,
    norm_eps=norm_eps, rotary_mode=rotary_mode, cache_mode=cache_mode)
```
