
Beyond the Cloud: A Deep Dive Into On-Device Generative AI

Large-scale generative AI models, especially LLMs and text-to-image models, have typically been limited to cloud server environments. However, growing demand for stronger privacy protection, reduced latency, and improved cost and energy efficiency is fueling interest in SoC-based, on-device generative AI. Shifting to this type of edge computation offers many benefits, but three key physical limitations must be addressed to make it possible:

[1] Limited Computation Budget 

Unlike cloud servers, edge devices rely on SoCs with constrained computational resources, making it difficult to deliver the trillions of operations per second (TOPS) that generative models require. The number of parallel compute units is limited and operating frequencies must be kept low, so reducing the computational load at a structural level is essential to enable high-performance inference on-device.

[2] Limited Memory I/O Bandwidth 

High-performance generative models require handling hundreds of megabytes to several gigabytes of parameters and intermediate activations. However, compared to servers, edge devices are typically limited by their small DRAM capacity and significantly lower external memory access speeds. As a result, frequent memory access during model execution can become a major bottleneck, leading to overall system performance degradation and increased energy consumption. 

[3] Limited Battery Power & Thermal Envelope 

Battery-powered mobile devices have strict power limitations, and excessive power consumption leads to increased heat, triggering thermal throttling that automatically reduces system performance. Due to these constraints, even if a high-performance model is deployed, sustained inference becomes difficult, so computational processes must be redesigned with a strong focus on energy efficiency. 

To address these challenges, Samsung focused not only on hardware-level optimizations but also on architectural refinements to model structure, computation patterns, and algorithm design — key factors that enabled exceptional performance of large-scale generative models on Exynos SoCs.


Low-Bit Quantization: Scaling Down Models for SoC Deployment

Low-bit quantization is a technique that significantly reduces the overall size and computational complexity of deep learning models by representing weights and activations using 8-bit, 4-bit, or even lower-bit integers instead of 32-bit floating-point values. This approach increases computational speed, reduces memory usage, and enables the use of power-efficient integer-based operations, making it highly effective in SoCs and edge devices with limited compute resources. 

Recently, new algorithms have made it possible to quantize models down to 4 bits or fewer while maintaining accuracy, positioning low-bit quantization as a key technology for running LLMs and generative models on-device. Through this approach, Samsung has achieved high power efficiency — measured in TOPS/W — compared to floating-point models, while also mitigating memory bandwidth bottlenecks. This progress has enabled generative models like Llama and Stable Diffusion XL to run at practical performance levels on Exynos SoCs.  

Figure: Benefits of Implementing Low-Bit Quantization
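To make the mechanics concrete, the sketch below shows symmetric per-tensor quantization in plain NumPy. It illustrates the general technique only, not Exynos AI Studio or Samsung's production flow; the function names, tensor size, and 4-bit setting are assumptions chosen for the example.

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, num_bits: int = 4):
    """Symmetric per-tensor quantization to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.max(np.abs(weights)) / qmax    # one scale factor for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale                           # in storage, two 4-bit values would be packed per byte

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# A 4-bit weight needs roughly 1/8 the storage of float32, easing both DRAM
# capacity and bandwidth pressure, and integer MACs are cheaper in silicon
# than floating-point ones.
w = np.random.randn(512, 512).astype(np.float32)
q, s = quantize_symmetric(w, num_bits=4)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))
```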

Weight Sparsity: A Model Optimization Technique for Reducing Memory I/O

Weight sparsity eliminates or ignores weights in a deep learning model that are either of low importance or have values close to zero, allowing the model to perform only the essential computations. By leveraging this sparsity, the total number of operations is reduced, and unnecessary memory fetches can be avoided, leading to significant reductions in memory I/O.

In the past, structured pruning, which removes entire channels or filters, was commonly used to help simplify model architecture, but it has the drawback of offering limited reduction in actual computation due to low channel- or filter-wise sparsity. In contrast, unstructured pruning, which removes unnecessary individual weight connections, has started gaining traction, and the industry is moving toward sparse-aware custom accelerators to translate this fine-grained sparsity into real performance improvements.  

The Exynos platform supports unstructured weight sparsity at the hardware level, providing a solution that reduces memory I/O. This enables optimized performance and low power consumption, especially in models where memory I/O is the primary performance bottleneck. To go beyond weight sparsity, Samsung is also researching techniques like activation sparsity. Activation sparsity occurs when many input values to layers become zero, allowing computations to be skipped. Unlike weight sparsity, which is static, activation sparsity is dynamic and requires different hardware handling.¹
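As a rough sketch of the underlying idea, the snippet below applies magnitude-based unstructured pruning, zeroing the smallest weights in a tensor. It illustrates the general technique only, not Samsung's toolchain; the thresholding rule and the 70% sparsity level are assumptions chosen for the example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5):
    """Magnitude-based unstructured pruning: zero the smallest weights.

    The resulting zeros can be stored in a compressed format (non-zero
    values plus indices), so they are neither fetched from DRAM nor
    multiplied, which is what reduces memory I/O on sparse-aware hardware.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

w = np.random.randn(2048, 2048).astype(np.float32)
w_sparse, mask = magnitude_prune(w, sparsity=0.7)
print("fraction of weights kept:", mask.mean())   # roughly 0.3
```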


Algorithm-Level Optimization: A New Approach to Structurally Improving Inference Speed

To move beyond the methods of compressing fixed model architectures or skipping computations, Samsung is researching and applying structural optimizations at the algorithmic level — an increasingly prominent approach for accelerating inference.

[1] Speculative Decoding for LLMs

Speculative decoding dramatically accelerates inference in LLMs by first using a lightweight model to rapidly generate multiple candidate tokens, which are then verified in batch by a larger model. This approach enables the prediction of multiple tokens with significantly fewer computations compared to the conventional method of invoking the large model for each token, greatly reducing overall inference latency. Notably, it can deliver responses up to 3–4 times faster without compromising output quality, which makes it a key technology for running LLMs on mobile or edge devices with limited compute resources.

Figure: Comparing Autoregressive Decoding to Speculative Decoding
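The sketch below shows one draft-then-verify step in a simplified greedy form. It illustrates the general technique rather than the Exynos implementation: `draft_model`, `target_model`, and the exact-match acceptance rule are placeholder assumptions, and production systems typically use a probabilistic acceptance test instead.

```python
import torch

def speculative_step(target_model, draft_model, tokens, k=4):
    """One draft-then-verify step (greedy variant, for illustration).

    A small draft model proposes k tokens one at a time; the large target
    model then checks all of them in a single forward pass and keeps the
    longest agreeing prefix, so several tokens can be accepted per
    large-model invocation. Both models are assumed to return logits of
    shape [batch, seq_len, vocab].
    """
    draft = tokens
    for _ in range(k):                                    # cheap proposals
        logits = draft_model(draft)[:, -1, :]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=1)

    proposed = draft[:, -k:]
    verify = target_model(draft)[:, -k - 1:-1, :].argmax(-1)  # one big pass

    accepted = 0
    while accepted < k and bool((verify[:, accepted] == proposed[:, accepted]).all()):
        accepted += 1

    if accepted < k:
        # The target model disagreed at position `accepted`; take its token instead.
        return torch.cat([tokens, proposed[:, :accepted],
                          verify[:, accepted:accepted + 1]], dim=1)
    return torch.cat([tokens, proposed], dim=1)           # every draft token accepted
```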

[2] Sliding Window Attention for LLMs

To address the quadratic increase in computation and memory usage when LLMs process long inputs, Samsung has implemented optimization algorithms like Sliding Window Attention (SWA). This technique limits the Self-Attention computation by allowing each token to interact only with a fixed-length window of neighboring tokens, rather than the entire sequence.

By doing so, the computational complexity of the LLM’s Transformer block can be reduced from O(N²) to O(N). This architecture is particularly well suited to long-context tasks such as summarization, as it enables efficient processing of extended sequences and supports practical deployment of on-device AI in mobile environments. Whereas speculative decoding reduces computation by predicting future inference paths, SWA reduces the computational burden structurally by simplifying the context itself.
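A minimal sketch of the banded, causal mask behind SWA is shown below; the window size and function name are assumptions chosen for the example rather than the Exynos implementation.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: each query attends only to itself and the previous
    `window - 1` tokens, so per-token attention cost is O(window) rather
    than O(seq_len)."""
    pos = torch.arange(seq_len)
    dist = pos[:, None] - pos[None, :]       # query index minus key index
    return (dist >= 0) & (dist < window)     # causal and inside the band

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
# Scores outside the band are set to -inf before the softmax (or, better,
# never computed at all), which is where the O(N²) -> O(N) saving comes from.
```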

[3] Step Distillation for Image-Generating Diffusion Models

Step distillation in diffusion models is an intelligent optimization technique designed to reduce the number of iterative noise removal steps required for high-quality image generation. Conventional diffusion models use a U-Net architecture to progressively denoise images over dozens to hundreds of steps. However, this process is computationally intensive and requires frequent memory access, making it challenging to implement in SoC or edge device environments.

To address this, step distillation reduces the inference process from dozens or hundreds of steps to fewer than 10, while maintaining comparable image quality. It can be applied without significant changes to the model’s architecture or parameters, making it well-suited for large-scale image generation models like Stable Diffusion, and it is particularly advantageous in SoC and edge environments where power efficiency and inference time optimization are critical.

Moreover, step distillation enables high-quality generative AI within limited compute resources and memory bandwidth, positioning it as a key enabling technology. Additional optimizations are also possible based on the characteristics of the U-Net architecture, such as executing layers whose outputs change significantly from step to step at every step, while running less dynamic layers only intermittently.
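The sketch below is a conceptual illustration of why the step count dominates cost; the denoiser interface, the simplified update rule, and the step counts are placeholder assumptions rather than a real diffusion scheduler.

```python
import torch

def sample(denoiser, timesteps, latent_shape=(1, 4, 64, 64)):
    """Generic iterative denoising loop: per-image cost scales with
    len(timesteps), so a student distilled to follow the teacher's
    trajectory in far fewer steps cuts latency almost proportionally."""
    x = torch.randn(latent_shape)                # start from pure noise
    for t in timesteps:
        noise_pred = denoiser(x, t)              # U-Net (or DiT) forward pass
        x = x - noise_pred / len(timesteps)      # stand-in for a real scheduler update
    return x

# Teacher: e.g. 50 denoising steps; distilled student: e.g. 6 steps.
teacher_steps = torch.linspace(1.0, 0.0, 50)
student_steps = torch.linspace(1.0, 0.0, 6)
# sample(distilled_unet, student_steps) would invoke the denoiser about 8x
# less often than sample(teacher_unet, teacher_steps) for comparable quality.
```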


Moving Toward Smarter On-Device AI Experiences


In response to the generative AI revolution, Samsung has driven on-device innovation by enhancing the AI capabilities of its Exynos SoCs through sustained architectural and algorithmic optimization. As the company prepares for the era of agentic AI, it will continue to research model compression techniques such as low-bit quantization and weight/activation sparsity, implementing them via Exynos AI Studio, an integrated toolchain.

At the algorithmic level, Samsung is advancing speculative decoding while researching and developing efficient implementations of cutting-edge model architectures, including MoE², Mamba³, and MM-DiT⁴, customized for edge-device environments.

Together, these software innovations mark a pivotal shift in how generative models are run in on-device environments. Samsung will continue to deliver hardware advancements and software innovation to further enhance on-device AI performance. This holistic approach will enable real-time, on-device generative AI to not just be practical, but in many cases, better.

* All images shown are provided for illustrative purposes only and may not be an exact representation of the product. All images are digitally edited, modified, or enhanced.

* All product specifications reflect internal test results and are subject to variations by user's system configurations. Actual performance may vary depending on use conditions and environment.


1) In a neural network, the fundamental operation is y = w × x. Weight sparsity occurs when w = 0, and activation sparsity occurs when x = 0. In either case, the computation can be skipped and y can be directly set to zero. However, because w is a constant and x is a variable, they require different hardware implementations.
2) MoE (Mixture of Experts) is a neural network architecture that selectively activates only a subset of expert models, improving computational efficiency while enabling effective scaling of model capacity.
3) Mamba is a sequence model designed to overcome the limitations of Transformers, capable of processing long sequences in linear time.
4) MM-DiT (Multimodal Diffusion Transformer) replaces the U-Net architecture in diffusion models with a Transformer-based structure. It divides an image into patches, treats each patch as a token, and processes them alongside text inputs — enabling high-quality image generation with multimodal understanding.