In 2025, the power of generative AI is quite literally at our fingertips. And thanks to the on-device AI capabilities built into Samsung’s Exynos SoCs, we can start accessing that power more quickly, reliably, and securely. This is because what once required massive servers and constant internet access can now run directly on personal devices like smartphones. In that sense, on-device AI is much more than a technical milestone; it represents a fundamental shift in how we use AI.
On-device AI has the potential to offer major advantages over cloud-based models. It works faster and lets users access AI services without an internet connection. It keeps personal data private and secure by processing everything locally. And it even helps cut down the cost of internet usage and cloud services. But making large-scale generative AI run smoothly on smartphones is no easy feat, due to the form factor’s limited computing resources and memory capacity. It requires highly efficient inference technologies and model optimization techniques, such as model compression and quantization. Real-time operation of high-performance models also depends on model conversion tools and runtime software, as well as design technologies for high-performance, low-power neural network accelerators based on heterogeneous core architectures.
High-Performance, Low-Power NPU Based on Heterogeneous Core Architecture
The Transformer architecture serves as the backbone of large-scale generative AI, combining multi-head attention mechanisms with feed-forward networks. Within these two structures, a variety of linear and nonlinear operations are employed, such as matrix multiplication and the softmax¹ function, respectively. The proportion of these operations can vary depending on the specific application scenario of the generative AI model.
For this reason, effectively running generative AI models on-device requires support for both linear and nonlinear operations. A neural network accelerator based on a heterogeneous core architecture is also critical, as it can efficiently handle variations in the ratio of these operations, ensuring optimal performance across diverse workloads.
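To make this mix of operations concrete, the minimal NumPy sketch below walks through a single attention head and marks where the linear and nonlinear work occurs. The shapes and function names are illustrative assumptions only and are not tied to any particular model or SoC.

```python
import numpy as np

def softmax(x, axis=-1):
    # Nonlinear operation: exponentiate and normalize into a probability distribution
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(q, k, v):
    # Linear operation: matrix multiplication to form attention scores
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Nonlinear operation: softmax turns scores into attention weights
    weights = softmax(scores, axis=-1)
    # Linear operation: matrix multiplication with the value matrix
    return weights @ v

# Illustrative shapes only: 128 tokens, 64-dimensional head
q = k = v = np.random.randn(128, 64).astype(np.float32)
out = single_head_attention(q, k, v)   # shape (128, 64)
```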
Three key features are needed to configure a high-performance, low-power neural network accelerator for this type of application:
[1] High-Performance/Low-Power Compute Architecture for On-Device Environments
To meet the real-time processing requirements of on-device model execution, the architecture increasingly needs to deliver compute performance in the hundreds of TOPS while supporting low-precision formats below 16 bits. Although on-device systems may offer less raw compute power than cloud-based platforms, they can achieve higher energy efficiency by supporting lower-precision operations such as 4-bit and 8-bit processing.
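As a rough, back-of-the-envelope illustration of why bit width matters so much on-device, the snippet below compares the weight storage of a hypothetical 3-billion-parameter model at different precisions. The parameter count is an assumption chosen for the example, not a reference to any specific model.

```python
# Rough weight-storage footprint for a hypothetical 3-billion-parameter model
params = 3_000_000_000
for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{fmt}: {gib:.1f} GiB")
# FP16: 5.6 GiB, INT8: 2.8 GiB, INT4: 1.4 GiB
```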
[2] Diverse Heterogeneous-Core-Based Architecture
To efficiently process both the linear and nonlinear operations that comprise generative AI models, the accelerator integrates both tensor engines and vector engines — each tailored for different types of computation. The tensor engines are equipped with multiple MAC² arrays for high-speed linear operations, while the vector engines include SIMD³ units optimized for diverse nonlinear operations.
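The simplified sketch below shows how the operators of one Transformer block might be split across the two engine types. The operator names and the mapping are illustrative assumptions, not an Exynos programming interface.

```python
# Hypothetical operator trace for one Transformer block, tagged by operation type.
# Linear operations map naturally onto MAC-array tensor engines, while nonlinear
# operations map onto SIMD vector engines.
ops = [
    ("qkv_projection", "linear"),
    ("attention_scores", "linear"),
    ("softmax", "nonlinear"),
    ("attention_output", "linear"),
    ("ffn_up_projection", "linear"),
    ("gelu", "nonlinear"),
    ("ffn_down_projection", "linear"),
    ("layer_norm", "nonlinear"),
]

tensor_engine_queue = [name for name, kind in ops if kind == "linear"]
vector_engine_queue = [name for name, kind in ops if kind == "nonlinear"]

print("tensor engine:", tensor_engine_queue)
print("vector engine:", vector_engine_queue)
```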
[3] Shared Memory and Controller Architecture Across Multiple Compute Units
To minimize data transfer overhead between heterogeneous compute units, all processing engines are equipped with shared internal memory such as scratchpad SRAM for exchanging computation results. Additionally, they feature dedicated controller architectures that maximize the execution efficiency of each core.
Software Technologies for Computation and Memory Optimization to Enable Efficient Inference
Running large-scale AI models in on-device environments requires efficient inference, enabled by software optimization technologies that perform computations and store data within limited hardware resources. Representative examples include LoRA⁴ application technology and compiler technologies for neural network computation and memory optimization.
[1] LoRA Application Technology
The LoRA technique adapts a model to different tasks by training a small set of additional low-rank parameters while keeping the base model’s parameters fixed, which keeps each adaptation far smaller than a fully fine-tuned model. Typical use cases include building domain-specific language models, generating images in specific styles, and developing task-specific chatbots and AI agents. To take advantage of LoRA in on-device environments, LoRA application technology is essential.
In on-device environments, the LoRA technique allows for a clear separation between the fixed parameters of the target model and the updatable LoRA parameters. This makes it possible to adapt flexibly to a variety of tasks while keeping memory usage to a minimum.
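The minimal NumPy sketch below illustrates this separation for a single weight matrix, assuming an illustrative rank of 8 and two hypothetical task adapters; real deployments apply LoRA to many layers at once.

```python
import numpy as np

d, r = 1024, 8                                  # hidden size and LoRA rank (illustrative)
W = np.random.randn(d, d).astype(np.float32)    # fixed base weight, never updated on-device

def make_lora_adapter(rank=r, alpha=16.0):
    # Only these two small matrices are stored and swapped per task
    A = np.random.randn(rank, d).astype(np.float32) * 0.01
    B = np.zeros((d, rank), dtype=np.float32)
    return A, B, alpha

def lora_forward(x, W, adapter):
    A, B, alpha = adapter
    # Base path uses the shared fixed weights; the LoRA path adds a low-rank update
    return x @ W.T + (alpha / A.shape[0]) * (x @ A.T @ B.T)

chat_adapter = make_lora_adapter()        # hypothetical task adapters
translate_adapter = make_lora_adapter()

x = np.random.randn(4, d).astype(np.float32)
y_chat = lora_forward(x, W, chat_adapter)             # same base model,
y_translate = lora_forward(x, W, translate_adapter)   # different task adapters

# Storage per adapter: 2 * d * r values versus d * d for a full copy of W
print(2 * d * r, "parameters per adapter vs", d * d, "for a full weight copy")
```

With these illustrative sizes, each adapter holds 16,384 values against roughly a million for a full copy of the weight matrix, which is why swapping adapters is far cheaper than swapping models.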
[2] Compiler Techniques for Neural Network Computation and Memory Optimization
In on-device environments, compiler technologies for accelerating generative AI models mainly include parallel processing techniques using heterogeneous accelerators, and weight sharing techniques between sub-models.
The parallel processing techniques take advantage of the fact that linear and nonlinear operations, which constitute generative AI models, are handled by different types of processing units. By scheduling the execution of these different units in parallel, the techniques minimize the overall execution time.
When combined with neural network partitioning techniques that remove data dependencies, these methods reduce memory traffic and allow computations to proceed in parallel, significantly improving inference speed.
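The sketch below illustrates the idea using plain Python threads and NumPy arrays as stand-ins for the two engine types. The function names, the GELU nonlinearity, and the four-way partition are assumptions made for illustration, not actual scheduler or driver APIs.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def tensor_engine_matmul(x, w):
    # Stand-in for linear work scheduled onto a MAC-array tensor engine
    return x @ w

def vector_engine_gelu(y):
    # Stand-in for nonlinear work scheduled onto a SIMD vector engine
    return 0.5 * y * (1.0 + np.tanh(0.7978845608 * (y + 0.044715 * y**3)))

x = np.random.randn(512, 256).astype(np.float32)
w = np.random.randn(256, 256).astype(np.float32)

# Partition the input along the token axis to remove data dependencies between
# chunks, then pipeline: one chunk's nonlinearity can run on the "vector engine"
# while the next chunk's matrix multiplication runs on the "tensor engine".
chunks = np.array_split(x, 4, axis=0)
outputs, pending = [], None
with ThreadPoolExecutor(max_workers=2) as pool:
    for chunk in chunks:
        linear_future = pool.submit(tensor_engine_matmul, chunk, w)
        if pending is not None:
            outputs.append(pending.result())
        pending = pool.submit(vector_engine_gelu, linear_future.result())
    outputs.append(pending.result())

result = np.concatenate(outputs, axis=0)   # shape (512, 256)
```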
The weight sharing technique between sub-models is one of the compiler optimization technologies that helps overcome storage limitations in generative AI models. When parameters can be shared across sub-models within the overall model, the required storage space for the entire system can be significantly reduced, making the technique essential for this type of application.
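As a simplified sketch of the idea, the snippet below uses a hypothetical parameter store and two hypothetical sub-models that reference the same tensors instead of holding private copies; the names and shapes are illustrative only.

```python
import numpy as np

# Hypothetical shared parameter store; tensor names and shapes are illustrative
shared_params = {
    "embedding": np.random.randn(32000, 1024).astype(np.float32),
    "block0.ffn": np.random.randn(1024, 4096).astype(np.float32),
}

class SubModel:
    def __init__(self, param_names, store):
        # Hold references into the shared store; no weight data is duplicated
        self.params = {name: store[name] for name in param_names}

# Two sub-models (e.g. different stages of one generative pipeline) reuse the
# same underlying buffers, so the shared weights are stored only once on device.
sub_model_a = SubModel(["embedding", "block0.ffn"], shared_params)
sub_model_b = SubModel(["embedding", "block0.ffn"], shared_params)

assert sub_model_a.params["embedding"] is sub_model_b.params["embedding"]
```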
Compression and Quantization Techniques
Model compression and quantization are also essential for running large-scale generative AI models on-device, as they enable efficient and real-time operation within limited hardware resources. By reducing model size, accelerating computation, and minimizing energy consumption, these technologies play a crucial role in overcoming memory and computation constraints.
In particular, pruning and knowledge distillation are two key techniques that enable on-device execution of large-scale generative AI models through model compression. Pruning involves removing unnecessary or low-importance neurons or their connections within the model. This reduces the model size and computational load, significantly improving processing speed and energy efficiency. On the other hand, knowledge distillation transfers the predictive knowledge of a large teacher model to a smaller student model. This allows the student model to retain the performance of the more complex model while dramatically reducing the number of parameters. Both methods are essential for achieving real-time AI inference in on-device environments, and when used complementarily, they can deliver optimal results.
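Both techniques can be sketched in a few lines of NumPy, as shown below. The 50% sparsity target, the temperature value, and the random logits are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    # Pruning: zero out the lowest-magnitude weights (here, the bottom 50%)
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Knowledge distillation: the student is trained to match the teacher's
    # softened output distribution (cross-entropy against teacher probabilities)
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return -np.mean(np.sum(t * np.log(s + 1e-9), axis=-1)) * temperature**2

w = np.random.randn(256, 256).astype(np.float32)
w_pruned = magnitude_prune(w)
print("zeroed fraction:", float(np.mean(w_pruned == 0.0)))

teacher_logits = np.random.randn(8, 100).astype(np.float32)
student_logits = np.random.randn(8, 100).astype(np.float32)
print("distillation loss:", float(distillation_loss(student_logits, teacher_logits)))
```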
Quantization technology converts a neural network’s weights and activation values into lower precision formats, such as 8-bit or lower integers. This greatly reduces the resources required for computation and storage, enabling efficient processing of large-scale generative AI models with limited hardware resources. As a result, quantization has become crucial to enabling real-time inference in on-device environments.
Recently, generative AI models have increasingly adopted lower precision formats, applying 4-bit or lower quantization for weights and 8-bit or lower for activation values. Accordingly, there is growing development and deployment of models that support sub-4-bit precision, driving rapid changes in on-device AI execution environments.
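The minimal sketch below illustrates symmetric per-tensor quantization along those lines, with 4-bit weights and 8-bit activations. Production schemes typically use finer-grained per-channel or group-wise scales, so this should be read only as an illustration of the arithmetic.

```python
import numpy as np

def quantize(x, bits):
    # Symmetric per-tensor quantization to signed integers
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int32), scale

w = np.random.randn(256, 256).astype(np.float32)   # weights -> 4-bit
a = np.random.randn(16, 256).astype(np.float32)    # activations -> 8-bit

qw, sw = quantize(w, bits=4)
qa, sa = quantize(a, bits=8)

# Integer multiply-accumulate, then a single rescale back to floating point
y_int = qa @ qw.T
y_approx = y_int.astype(np.float32) * (sa * sw)
y_ref = a @ w.T

rel_err = np.abs(y_approx - y_ref).mean() / np.abs(y_ref).mean()
print(f"mean relative error: {rel_err:.3f}")
```

Coarse per-tensor 4-bit weights introduce a noticeable but often acceptable error; finer-grained scaling recovers much of the lost accuracy at a small metadata cost.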
Charting the Future of On-Device Generative AI
Samsung Electronics has continuously advanced these core on-device generative AI technologies — which range from hardware that supports heterogeneous computing to execution efficiency-enhancing software and algorithms that reduce computational load — with a high level of integration.
Leveraging these advancements, Samsung enabled on-device AI capabilities in the world's first AI-powered smartphones by supplying its cutting-edge mobile SoC, reinforcing its leadership in the global on-device AI market.
For Samsung and the market as a whole, this type of recognition signals the growing significance of on-device AI, not only for key industry players but also in the context of users’ daily experiences. Looking ahead, the company will continue to push the boundaries of on-device AI technology to bring smarter, faster, and more secure AI experiences to people around the globe.
* All images shown are provided for illustrative purposes only and may not be an exact representation of the product. All images are digitally edited, modified, or enhanced.
* All product specifications reflect internal test results and are subject to variations by user's system configurations. Actual performance may vary depending on use conditions and environment.
1) Softmax is a mathematical function that converts a vector of real numbers into a probability distribution. In NPUs, softmax is crucial for accentuating the correlations in the Transformer’s attention mechanism. Separately, it is also essential for calculating probabilities during classification tasks.
2) Multiply-Accumulate
3) Single Instruction Multiple Data
4) Low-Rank Adaptation