Expanding CPU Capabilities for On-device AI with Arm SME2

The need for a flexible on-device AI ecosystem leveraging multiple computing resources

On-device AI has recently emerged as a central theme in mobile computing. While the proliferation of powerful NPUs has accelerated the adoption of AI workloads executed directly on mobile devices, a closer look at real-world AI applications reveals a more nuanced picture. In practice, AI workloads are typically deployed using a heterogeneous mix of computing resources—including CPUs, GPUs, and NPUs—depending on application objectives and data characteristics.

A significant portion of AI applications continues to rely on CPUs. CPU-based processing offers clear advantages, such as ease of software development, broad ecosystem compatibility, and reduced overhead associated with offloading workloads to dedicated accelerators. However, conventional CPU architectures face inherent limitations when it comes to efficiently handling the large-scale parallel computations required by machine learning workloads.

To address these limitations, Arm introduced SME2 (Scalable Matrix Extension 2). SME2 is an extension to the Arm ISA¹ for mobile CPUs, specifically designed to accelerate matrix operations. By integrating SME2, CPUs can retain their inherent programmability and flexibility while achieving the computational performance required for on-device AI workloads.

In this context, Arm describes the significance of SME2 as implemented in the Exynos 2600 as follows:

“As on-device AI becomes central to the mobile experience, efficiency and responsiveness are increasingly critical. Built on Arm compute subsystems with SME2-enabled C1-Ultra and C1-Pro, Exynos 2600 leverages SME2 to expand the potential of CPU-based AI, reducing the latency associated with offloading to discrete accelerators and making it well suited for short, interactive, and real-time AI workloads. This enables developers to deploy AI capabilities more flexibly across the system, even under strict power and thermal constraints. Arm will continue to work closely with Samsung to further expand the CPU-centric AI ecosystem.”

Stefan Rosinger, Senior Director, Product Management, Arm

 

SME2 instructions and unit architecture for accelerating machine learning on CPUs

To understand the motivation behind SME2, it is important to first examine the structural characteristics of CPU-based machine learning processing. CPUs are fundamentally designed to excel at control-oriented programming rather than raw computational throughput, making them less suitable for processing large volumes of data in parallel. SIMD² instructions were introduced to mitigate this limitation by enabling data-level parallelism, but the amount of data that can be processed concurrently remains constrained by register width.
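To make the register-width constraint concrete, the short C sketch below uses 128-bit Arm NEON intrinsics: each fused multiply-add instruction covers exactly four 32-bit lanes, so the degree of parallelism per instruction is fixed by the register width. This is a generic illustration written for this article, not code from any particular product, and the function name is our own.

```c
#include <arm_neon.h>
#include <stddef.h>

// acc[i] += a[i] * b[i] using 128-bit NEON registers.
// Each vfmaq_f32 processes exactly four 32-bit lanes, so the amount of
// data handled per instruction is bounded by the register width.
void scale_accumulate(float *acc, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    // load 4 elements of a
        float32x4_t vb = vld1q_f32(b + i);    // load 4 elements of b
        float32x4_t vc = vld1q_f32(acc + i);  // load 4 accumulator elements
        vc = vfmaq_f32(vc, va, vb);           // acc += a * b, 4 lanes at once
        vst1q_f32(acc + i, vc);
    }
    for (; i < n; ++i)                        // scalar tail for leftovers
        acc[i] += a[i] * b[i];
}
```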

In contrast, the dominant computational pattern in AI applications is matrix multiplication, typically represented by GEMM³ operations. These workloads involve applying identical operations across large datasets and therefore demand wide registers capable of holding large amounts of data, along with a substantial number of parallel multiply units to process individual elements simultaneously. Existing SIMD-based architectures alone struggle to meet these requirements.
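The GEMM pattern is easy to state in code. The unoptimized reference loop below, included purely as an illustration, shows why matrix workloads reward wide registers and many parallel multipliers: every one of the M × N × K inner steps is the same multiply-accumulate applied to different data.

```c
#include <stddef.h>

// Reference GEMM: C = C + A * B, with A (M x K), B (K x N), C (M x N),
// all stored row-major. Every element of C needs K multiply-accumulates,
// so the total work is M * N * K identical operations, exactly the kind
// of uniform data-parallel pattern that matrix hardware is built for.
void gemm_ref(size_t M, size_t N, size_t K,
              const float *A, const float *B, float *C) {
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = C[m * N + n];
            for (size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}
```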

SME2 was introduced against this backdrop. As an extended ISA, SME2 is designed to enable efficient large-scale matrix computation while preserving the programmability of the CPU. The hardware component that physically executes SME2 instructions is referred to as the SME2 unit.
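Because SME2 is an optional ISA extension, software generally probes for it at run time before selecting an SME2-optimized code path. The sketch below is a minimal Linux/arm64 example, assuming the HWCAP2_SME2 bit exposed by recent arm64 kernel headers; whether that macro is available depends on the kernel and header versions in use, which is why it is guarded.

```c
#include <stdio.h>
#include <sys/auxv.h>   // getauxval, AT_HWCAP2
#include <asm/hwcap.h>  // HWCAP2_* bits on Linux/arm64

int main(void) {
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

#ifdef HWCAP2_SME2
    if (hwcap2 & HWCAP2_SME2) {
        puts("SME2 reported: dispatch to the SME2-optimized kernel");
    } else
#endif
    {
        puts("SME2 not reported: fall back to NEON/SVE or scalar code");
    }
    return 0;
}
```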

Figure 1. SME2 Operations

The SME2 unit is designed as a shared resource within a CPU cluster, allowing all CPU cores to execute the same software transparently. Integrating large registers and numerous multipliers into every individual CPU core would result in excessive silicon area overhead. To avoid this, the SME2 unit is implemented as a separate block shared across multiple CPU cores.

Software threads execute on individual CPU cores as usual, but when an SME2 instruction is encountered, the corresponding operation is dispatched to the SME2 unit. To support this mechanism, the SME2 unit incorporates context storage structures capable of handling instruction streams from multiple CPUs simultaneously. The DSU⁴, which connects multiple CPU cores, provides a ring-based transport mechanism to deliver data and instructions to the SME2 unit. By including an L3 cache, the DSU ensures that required data streams are supplied efficiently to the SME2 unit. This architecture minimizes hardware area growth while maintaining full software compatibility across CPU cores.
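From the software side this hand-off is largely transparent: a thread calls a function whose body runs in streaming mode and uses the ZA accumulator state, and the SME2 instructions inside it are the work that gets routed to the shared unit. The sketch below is based on our reading of the Arm C Language Extensions (ACLE) intrinsics in <arm_sme.h>; the helper name is hypothetical, vector-length handling is simplified, and a recent compiler (for example, Clang with -march=armv9-a+sme2) is assumed.

```c
#include <arm_sme.h>
#include <stdint.h>

// Hypothetical helper: accumulate the outer product of one vector-length
// slice of A and one slice of B into ZA tile 0, then store the tile to
// c_tile (which must hold at least svcntw() * svcntw() floats).
// __arm_locally_streaming makes the body execute in streaming mode and
// __arm_new("za") gives it private ZA state; this streaming/ZA work is
// what the core hands off to the shared SME2 unit.
__arm_new("za") __arm_locally_streaming
void outer_product_tile(const float *a_col, const float *b_row, float *c_tile)
{
    svbool_t pg = svptrue_b32();            // all-true predicate
    svzero_za();                            // clear the ZA accumulators

    svfloat32_t va = svld1_f32(pg, a_col);  // one slice of A
    svfloat32_t vb = svld1_f32(pg, b_row);  // one slice of B

    // FMOPA: ZA tile 0 += outer product of va and vb
    svmopa_za32_f32_m(0, pg, pg, va, vb);

    // Copy the accumulated tile out, one horizontal slice per iteration.
    for (uint32_t row = 0; row < svcntw(); ++row)
        svst1_hor_za32(0, row, pg, c_tile + row * svcntw());
}
```

To the calling thread this is an ordinary function call; the streaming-mode entry, ZA setup, and routing of the matrix work to the SME2 unit are handled by the compiler and hardware.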

Figure 2. CPU Cluster Architecture with SME2 unit

Within this shared architecture, the SME2 unit itself is functionally divided into three major blocks. First, the VPU decodes instructions received from the CPU cores, generates control signals, and contains the register banks. Second, the Matmul unit performs the core matrix computations that dominate AI workloads. Finally, the Memsys block manages data movement to and from the register banks and incorporates an L1 cache. Structures responsible for managing execution contexts from multiple CPU cores are also housed within the Memsys block.

Figure 3. SME2 unit Diagram

Building a Flexible and Scalable On-Device AI Solution

This article has explored SME2 and the architecture of the SME2 unit as a means of enhancing CPU-based AI performance. With the SME2 unit, AI functions such as object detection can achieve up to a 70% performance improvement over previous-generation CPUs that do not support SME2. This establishes a foundation for more flexible use of on-chip computing resources, including the CPU, GPU, and NPU, depending on the model characteristics and usage scenarios of on-device AI applications.

Such an approach enables AI processing strategies that are not dependent on a single accelerator, ultimately contributing to a higher-quality overall user experience. Furthermore, by leveraging the Arm ISA, Samsung ensures broad compatibility with a wide range of third-party applications, allowing the on-device AI ecosystem to continue expanding in both scope and diversity.

 

* All images shown are provided for illustrative purposes only and may not be an exact representation of the product. All images are digitally edited, modified, or enhanced.
 
* All product specifications reflect internal test results and are subject to variations by user's system configurations. Actual performance may vary depending on use conditions and environment.

1) Instruction Set Architecture

2) Single Instruction, Multiple Data

3) General Matrix–Matrix Multiplication

4) DynamiQ Shared Unit