2026/5/9 13:07:54

CANN PTO-ISA AUTO Mode


张小明

Front-End Development Engineer


PTO AUTO Mode

[Free download link] pto-isa: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. Project URL: https://gitcode.com/cann/pto-isa

What is PTO AUTO

PTO AUTO is a programming mode for PTO that provides two major benefits:

  • It simplifies the development of efficient PTO code while still providing kernel developers with the mechanisms necessary to implement their own optimizations.
  • It provides compatibility across different generations of the Ascend architecture.

More specifically, in PTO AUTO the kernel developer does not need to explicitly specify tile memory addresses or synchronization between different pipes. Instead, the PTO AUTO compiler automatically allocates optimal memory addresses for the tiles in the different on-chip buffers. Moreover, the compiler automatically synchronizes the PTO tile operations to maximize parallelism among the different pipes. Finally, the kernel developer does not need to be concerned with the minor differences between generations of the Ascend architecture (particularly in the way Cube and Vector computations are coordinated).

Note: AUTO mode currently only supports the compiler -O2 option.

Simple Example

A simple example, elementwise multiplication, demonstrates the key differences between PTO AUTO mode and manual mode:

TMUL Manual Mode

template <typename T, int kGRows_, int kGCols_, int kTRows_, int kTCols_>
__global__ AICORE void runTMul(__gm__ T __out__ *out, __gm__ T __in__ *src0, __gm__ T __in__ *src1) {
    using DynShapeDim5 = Shape<1, 1, 1, kGRows_, kGCols_>;
    using DynStridDim5 = Stride<1, 1, 1, kGCols_, 1>;
    using GlobalData = GlobalTensor<T, DynShapeDim5, DynStridDim5>;
    using TileData = Tile<TileType::Vec, T, kTRows_, kTCols_, BLayout::RowMajor, -1, -1>;

    TileData src0Tile(kGRows_, kGCols_);
    TileData src1Tile(kGRows_, kGCols_);
    TileData dstTile(kGRows_, kGCols_);

    // Manual mode: explicitly assign each tile a buffer address.
    TASSIGN(src0Tile, 0x0 + 0x400 * block_idx);
    TASSIGN(src1Tile, 0x4000 + 0x400 * block_idx);
    TASSIGN(dstTile, 0x8000 + 0x400 * block_idx);

    int offset = (block_idx / 4) * (64 * 16) + (block_idx % 4) * 16;
    GlobalData src0Global(src0 + offset);
    GlobalData src1Global(src1 + offset);
    GlobalData dstGlobal(out + offset);

    TLOAD(src0Tile, src0Global);
    TLOAD(src1Tile, src1Global);

    // Manual mode: synchronize the MTE2 (load) pipe with the V (vector) pipe.
    set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
    wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);

    TMUL(dstTile, src0Tile, src1Tile);

    // Manual mode: synchronize the V pipe with the MTE3 (store) pipe.
    set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
    wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);

    TSTORE(dstGlobal, dstTile);
    out = dstGlobal.data();
}

TMUL AUTO Mode

template <typename T, int kGRows_, int kGCols_, int kTRows_, int kTCols_>
__global__ AICORE void runTMul(__gm__ T __out__ *out, __gm__ T __in__ *src0, __gm__ T __in__ *src1) {
    using DynShapeDim5 = Shape<1, 1, 1, kGRows_, kGCols_>;
    using DynStridDim5 = Stride<1, 1, 1, kGCols_, 1>;
    using GlobalData = GlobalTensor<T, DynShapeDim5, DynStridDim5>;
    using TileData = Tile<TileType::Vec, T, kTRows_, kTCols_, BLayout::RowMajor, -1, -1>;

    // AUTO mode: no TASSIGN needed; the compiler allocates tile addresses.
    TileData src0Tile(kGRows_, kGCols_);
    TileData src1Tile(kGRows_, kGCols_);
    TileData dstTile(kGRows_, kGCols_);

    int offset = (block_idx / 4) * (64 * 16) + (block_idx % 4) * 16;
    GlobalData src0Global(src0 + offset);
    GlobalData src1Global(src1 + offset);
    GlobalData dstGlobal(out + offset);

    // AUTO mode: no set_flag/wait_flag pairs; the compiler inserts
    // inter-pipe synchronization automatically.
    TLOAD(src0Tile, src0Global);
    TLOAD(src1Tile, src1Global);
    TMUL(dstTile, src0Tile, src1Tile);
    TSTORE(dstGlobal, dstTile);
    out = dstGlobal.data();
}

PTO AUTO Compiler Features

Cross-Architecture Compatibility

The PTO AUTO compiler ensures that a single-source PTO program can be compiled for different Ascend architecture generations, without any source-level modifications, while maintaining performance.

Automatic Synchronization

In manual mode, users normally have to account for the asynchronous nature of the hardware by applying PTO's event model at precise code locations, in order to ensure both functional correctness and high execution performance. This can be tedious and error-prone.

AUTO mode compilation allows users to avoid using the event model to synchronize their code. The compiler automatically determines where to insert synchronization under the hood, ensuring functional correctness and competitive performance.

Tile Memory Allocation

In the default (manual) mode of PTO compilation, after instantiating Tile variables we need to complement them with a TASSIGN instruction that manually assigns the dedicated buffer address each tile operates on. In AUTO mode this is no longer required: simply instantiating the Tile variable is enough, and the compiler allocates the buffer addresses under the hood.

PTO AUTO Documents

More detailed documentation of PTO AUTO programming and compilation is organized into the following documents:

  • PTO_AUTO_kernel_developer_rules_and_limitations
  • PTO_AUTO_library_developer_rules_and_limitations
  • PTO AUTO Code Examples

