Until recently, the evolution of mobile cameras has been centered on the image sensor and the image signal processor (ISP). In this conventional architecture, the image sensor converts light into electrical signals, while the ISP corrects and processes those signals to generate the final image displayed on the screen. This pipeline has long defined the foundation of mobile camera systems.
Today, mobile cameras are undergoing a fundamental transformation beyond this traditional structure. With the full-scale integration of AI-based computer vision technologies, mobile cameras are no longer mere image-processing devices. They are evolving into intelligent systems capable of recognizing scenes and interpreting their semantic meaning. Image quality is no longer determined solely by sensor resolution or optical performance. Instead, AI algorithms and system-level design, operating throughout the entire capture pipeline before and after the shutter event, have become decisive factors.
A single photo is now produced through the coordinated operation of multiple AI models and algorithms. Even before the shutter is pressed, scene analysis and optimization are performed in parallel during the preview stage, with the entire process running in real time. As a result, the competitiveness of modern mobile cameras increasingly depends on how efficiently and effectively AI computer vision technologies are integrated and implemented within the system.
To address this paradigm shift, Exynos 2600 introduces, for the first time in the Exynos lineup, the VPS (Visual Perception System), establishing a dedicated AI computer vision subsystem for the camera. This architectural approach enables both real-time operation and high power efficiency, ultimately delivering differentiated camera performance. In this article, we explore how the structure of mobile camera systems is evolving and examine the technical strategies through which the VPS in Exynos 2600 brings this vision to life.
In the past, mobile camera systems were based on a serial pipeline that started at the image sensor, passed through the image processing subsystem, and ultimately delivered the output to the display. Starting from the Bayer raw data produced by the sensor, stages such as demosaicing, noise reduction, and color and contrast correction were performed sequentially, and only the final result was presented to the user.
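To make the serial model concrete, the sketch below strings together toy stand-ins for the stages named above. The implementations are illustrative placeholders in Python with NumPy, not the actual ISP algorithms; the point is the strict stage-by-stage flow in which only the final result reaches the display.

```python
# A minimal sketch of a serial ISP pipeline. Each stage is a toy placeholder.
import numpy as np

def demosaic(bayer: np.ndarray) -> np.ndarray:
    """Toy demosaic: replicate the single Bayer plane into three channels."""
    return np.repeat(bayer[..., np.newaxis], 3, axis=2)

def denoise(rgb: np.ndarray) -> np.ndarray:
    """Toy noise reduction: 3x3 box filter via shifted averaging."""
    padded = np.pad(rgb, ((1, 1), (1, 1), (0, 0)), mode="edge")
    acc = np.zeros_like(rgb, dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + rgb.shape[0], dx:dx + rgb.shape[1]]
    return acc / 9.0

def correct_color_contrast(rgb: np.ndarray) -> np.ndarray:
    """Toy global contrast stretch standing in for color/tone correction."""
    lo, hi = rgb.min(), rgb.max()
    return (rgb - lo) / max(hi - lo, 1e-6)

def serial_isp(bayer: np.ndarray) -> np.ndarray:
    # Each stage runs strictly after the previous one; only the final
    # result reaches the display, as in the serial model described above.
    return correct_color_contrast(denoise(demosaic(bayer)))

frame = np.random.rand(8, 8).astype(np.float32)  # stand-in Bayer raw frame
print(serial_isp(frame).shape)                   # (8, 8, 3)
```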
Recent mobile camera systems are evolving beyond this simple serial architecture toward designs that incorporate parallel processing and feedback structures. Rather than operating independently, each processing stage interacts with AI-based computer vision solutions and repeatedly refines its results. Through this approach, mobile cameras have advanced to deliver images and video with optimal image quality by reflecting shooting conditions and subject characteristics in real time.
Earlier mobile cameras performed image quality processing on a single image captured by the image sensor. In contrast, modern mobile cameras adopt multi-frame processing as their default operating model, leveraging continuously incoming video sequences. Multi-frame processing requires analyzing context across adjacent frames along the temporal axis and accurately identifying inter-frame motion. In this process, motion estimation algorithms that precisely separate global motion vectors from local motion vectors play a critical role.
Deep learning–based motion estimation algorithms precisely extract localized movements, such as hand motion. Based on these analysis results, the operating configuration of the ISP is dynamically generated, and optimal ISP settings for multi-frame processing are determined. This enables accurate fusion of multiple frames and effectively overcomes the limitations of conventional image quality enhancement approaches based on single-image processing.
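The sketch below illustrates one way such a global/local split can feed back into ISP configuration. The per-block motion vectors would come from a real, deep learning-based motion estimator; here they are synthetic, and the derived settings dictionary is a hypothetical stand-in for the dynamically generated ISP configuration.

```python
# Hedged sketch: split block motion into global and local components,
# then map the statistics to illustrative multi-frame fusion settings.
import numpy as np

def split_motion(block_vectors: np.ndarray):
    """Separate a field of per-block motion vectors (N, 2) into one
    global motion vector and per-block local residuals."""
    global_mv = np.median(block_vectors, axis=0)  # dominant camera motion
    local_mv = block_vectors - global_mv          # subject motion residuals
    return global_mv, local_mv

def derive_isp_settings(global_mv, local_mv):
    """Strong local motion -> blend frames less aggressively (to avoid
    ghosting); the keys and thresholds here are assumed examples."""
    local_mag = np.linalg.norm(local_mv, axis=1)
    return {
        "global_shift": global_mv.tolist(),  # used for frame alignment
        "fusion_weight": float(np.clip(1.0 - local_mag.mean() / 4.0, 0.2, 1.0)),
        "ghosting_risk_blocks": int((local_mag > 2.0).sum()),
    }

# Synthetic vectors: uniform camera shake plus one fast-moving subject.
vectors = np.tile([1.5, -0.5], (16, 1)) + np.random.randn(16, 2) * 0.1
vectors[3] += [6.0, 4.0]
g, l = split_motion(vectors)
print(derive_isp_settings(g, l))
```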
Another image quality technology in mobile cameras is CAX (Content Aware Preview/Video/Capture), which delivers region-optimized image quality based on real-time region-of-interest extraction through semantic segmentation. With CAX, detail-rich regions such as hair or eyebrows can maintain sharpness, while skin regions are processed independently to achieve a more natural appearance.
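The following sketch shows the region-optimized idea behind CAX in miniature: a semantic label map routes each region to a different toy filter. The label map and the sharpen/smooth filters are illustrative placeholders; in the real system, the labels would come from VPS's real-time semantic segmentation.

```python
# Hedged sketch of per-region processing driven by a segmentation map.
import numpy as np

HAIR, SKIN, OTHER = 0, 1, 2  # assumed label values for illustration

def sharpen(img):
    """Toy unsharp mask: boost deviation from a blurred copy."""
    blur = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
            + np.roll(img, 1, 1) + np.roll(img, -1, 1)) / 4.0
    return np.clip(img + 0.5 * (img - blur), 0.0, 1.0)

def smooth(img):
    """Toy smoothing filter standing in for natural skin rendering."""
    return (img + np.roll(img, 1, 0) + np.roll(img, -1, 0)) / 3.0

def cax_process(img, labels):
    out = img.copy()
    hair_mask = labels == HAIR
    skin_mask = labels == SKIN
    out[hair_mask] = sharpen(img)[hair_mask]  # keep detail in hair/eyebrows
    out[skin_mask] = smooth(img)[skin_mask]   # render skin more naturally
    return out                                # other regions pass through

img = np.random.rand(6, 6).astype(np.float32)       # stand-in luma plane
labels = np.random.randint(0, 3, size=(6, 6))       # stand-in label map
print(cax_process(img, labels).shape)
```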
In mobile cameras, information related to faces is one of the most critical elements. Real-time processing of data such as face location, eye-blink status, and changes in facial expression directly determines the overall quality of the captured result. Modern mobile cameras are evolving toward recognizing and interpreting facial information in real time, enabling capture results that better reflect user intent. Exynos 2600 addresses these requirements through its VPS-based AI face detection solution, supporting eye-blink detection and facial landmark detection¹. This enables continuous analysis of facial states from the preview stage through the moment of capture.
AI computer vision technologies based on face detection translate directly into improvements in user experience. A common frustration when taking photos is an unsatisfactory group shot in which some subjects have their eyes closed. The VPS integrated into Exynos 2600 recognizes facial states in real time, evaluating each subject's expression and eye-blink status. By selecting or compositing only the best frames for each individual, it delivers a satisfactory result in a single capture, eliminating the need for repeated shots. This is a representative example of how AI computer vision technologies intervene throughout the capture process to enhance output quality.
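The sketch below captures the selection logic in its simplest form. The per-face eye-open and expression scores would be produced by the face analysis described above; here they are synthetic, and the weighting is an assumed example rather than the actual scoring used in Exynos 2600.

```python
# Hedged sketch: pick, for each face, the burst frame where that person
# looks best. Real compositing would then merge the chosen face regions;
# this sketch only selects the source frame index per face.
import numpy as np

def pick_best_frames(eye_open: np.ndarray, expression: np.ndarray):
    """eye_open, expression: (num_frames, num_faces) scores in [0, 1].
    Returns, per face, the index of that person's best frame."""
    score = 0.7 * eye_open + 0.3 * expression  # illustrative weighting
    return score.argmax(axis=0)

rng = np.random.default_rng(0)
eye_open = rng.random((5, 3))    # 5 burst frames, 3 faces in the scene
expression = rng.random((5, 3))
print(pick_best_frames(eye_open, expression))  # best frame per face
```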
As a result, modern mobile camera systems are evolving into hybrid architectures that combine fixed-logic imaging systems with AI computer vision systems. The conventional imaging system is responsible for preprocessing for AI workloads and for generating images optimized for human perception, while the AI computer vision system performs semantic interpretation of scenes and subjects. The two systems are organically connected through feedback loops, forming a unified camera operation that delivers optimal photographic results.
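That feedback loop can be summarized in a few lines. Every function below is a conceptual placeholder rather than an actual Exynos interface: the imaging path renders each frame, the AI path analyzes it, and the analysis adjusts the settings applied to the next frame.

```python
# Conceptual sketch of the imaging/AI feedback loop, with toy stand-ins.
import numpy as np

def imaging_pipeline(raw, settings):
    # Stand-in for the fixed-logic ISP: apply a gain chosen by feedback.
    return np.clip(raw * settings["gain"], 0.0, 1.0)

def ai_analysis(frame):
    # Stand-in for semantic analysis: report brightness as "scene state".
    return {"brightness": float(frame.mean())}

def update_settings(settings, analysis):
    # Feedback: nudge gain toward a mid-gray target for the next frame.
    settings["gain"] *= 1.0 + 0.5 * (0.5 - analysis["brightness"])
    return settings

settings = {"gain": 1.0}
for _ in range(4):                    # preview frames arriving continuously
    raw = np.random.rand(4, 4) * 0.4  # a dim synthetic scene
    frame = imaging_pipeline(raw, settings)
    settings = update_settings(settings, ai_analysis(frame))
print(round(settings["gain"], 3))     # gain has adapted upward
```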
Implementing AI computer vision systems in mobile environments requires simultaneously addressing constraints related to battery consumption, thermal characteristics, and real-time responsiveness. Traditionally, these computer vision functions have been implemented using general-purpose processors such as CPUs, GPUs, and NPUs.
However, unlike large language models, computer vision algorithms typically employ relatively lightweight network structures and lower input dimensionality, while requiring sustained real-time processing in ultra-high-resolution, 60 fps video environments. As a result, efficient implementation through dedicated architectures has become increasingly important.
Reflecting these characteristics, Exynos 2600 introduces VPS, a dedicated subsystem designed around Samsung's proprietary AI computer vision algorithms. VPS delivers core solutions such as AI-based face detection, motion estimation, and real-time semantic segmentation, all running in real time in ultra-high-resolution, 60 fps scenarios. Compared with the previous generation, it achieves more than a 50% improvement in power efficiency across multiple camera solutions, reducing power consumption and latency at the same time.
VPS is designed to operate efficiently across the entire processing pipeline using proprietary algorithms from end to end. Through this approach, it achieves a balanced combination of low latency, minimized software overhead, and high power efficiency.
In addition, at the algorithm level, the networks have been made substantially lighter while maintaining high recognition accuracy. Compared with SOTA (State-of-the-Art)² research results, the algorithms achieve equivalent or superior accuracy. Operating at 60 fps on-device with low power consumption, they enable industry-leading performance in multi-frame-based image and video compositing.
Mobile cameras are no longer simple optical devices. They are evolving into high-quality intelligent cameras that combine AI computer vision systems with imaging systems. This evolution extends beyond photo and video capture, toward real-time perception and understanding of the surrounding environment.
Looking ahead, intelligent cameras will continue to evolve into multimodal technologies integrated with large language models, enabling more intuitive and richer user experiences. This trend will expand beyond smartphones to a wide range of wearable devices, becoming a naturally integrated technology across everyday life.
Through the VPS integrated into Exynos 2600, as introduced in this article, Samsung has secured core capabilities in AI computer vision systems as a key technology for the future. Building on this foundation, Samsung will continue to advance AI solutions optimized for mobile environments and expand its technologies toward next-generation multimodal AI.
1 Facial Landmark Detection is a computer vision technique that identifies the locations of predefined key points on a face to capture its structure and expressions.
2 SOTA refers to the highest level of performance and accuracy achieved by the most advanced methods reported in recent academic and industrial research at the time of evaluation.
* All images shown are provided for illustrative purposes only and may not be an exact representation of the product. All images are digitally edited, modified, or enhanced.
* All product specifications reflect internal test results and are subject to variation depending on the user's system configuration. Actual performance may vary depending on use conditions and environment.