
Inside the machine.

Consumer electronics companies have recently taken an interest in machine learning in a consolidated effort to provide new and compelling user experiences for mobile devices.

The best way to accelerate machine learning software on mobile is by employing the principle of heterogeneous computing where an optimal combination of on-chip processors work together to run artificial intelligence (AI) algorithms. Most of the current use cases for AI in mobile are related to computer vision, the science of using parallel compute for feature detection, object recognition and image classification.

Probably the best-suited processor currently able to handle parallel operations is the GPU (although that might change in the future). You might know the GPU as that thing which gets brought up in smartphone reviews whenever the editor feels the need to trot out the results of four or five benchmarks as the absolute indicator of how fast device A is compared to device B.

Unbeknownst to many, most mid-range and flagship mobile devices today integrate GPUs that are capable of handling much more than games or UIs. Their massively multithreaded architecture makes them perfect targets for compute APIs such as Vulkan, OpenCL or RenderScript.

Since GPU-based machine learning has now captured the headlines, it is only a matter of time before the great marketing hype machine is unleashed on the Janes and Joes of this world trying to make sense of all the technical jargon.

Before that happens, I’m going to take this opportunity to briefly touch on two concepts related to heterogeneous computing: cache coherency and shared virtual memory.

Cache coherency

Cache coherency is really about giving every processor a consistent view of the data without a round trip to external memory: for example, if the most recent copy of the data is inside the GPU cache, the CPU can read it directly from that location rather than forcing a flush to main memory and then accessing the data there.


AMD hUMA is an example of a cache coherent system

The main benefits of cache coherency can be seen at the fabric level. The fabric is the glue that ties together the various System-on-Chip (SoC) processors such as the CPU, graphics and video, ensuring that they can access memory in an efficient fashion.

Since the design complexity of a mobile chip has been increasing exponentially in the last decade, fabric architectures have had to find ways to keep up. Therefore, you might hear many industry experts today referring to the fabric as a Network-on-Chip (NoC); this is because modern SoC fabrics are very similar to (very congested) networks that try to route traffic from a source to a destination as quickly and error-free as possible.

Mobile SoCs include several coherent masters

One way to ease congestion inside the NoC is to implement a cache coherency mechanism at the hardware level instead of relying on a software implementation; there are several ways to add hardware-enforced cache coherency to the NoC, including directory-based approaches, snooping, and snarfing.

An example of a directory-based, cache coherent NoC from NetSpeed (Gemini)
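To make the snooping idea a little more concrete, here is a minimal sketch of a write-invalidate scheme in Python. It is a toy model under heavy assumptions, not how any real NoC works: the `Bus` and `Cache` classes, the two-state ("M" for modified, "S" for shared) line tracking and the `memory_reads` counter are all hypothetical names made up for illustration, and real protocols such as the ones behind AMBA ACE have more states and many more corner cases.

```python
# Toy write-invalidate snooping model; every name here is hypothetical.

class Bus:
    """Stands in for the fabric: knows every cache and the external memory."""
    def __init__(self):
        self.caches = []
        self.memory = {}        # address -> value (external memory)
        self.memory_reads = 0   # reads that had to go to external memory


class Cache:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.lines = {}         # address -> [state, value]; state is "M" or "S"
        bus.caches.append(self)

    def _peers(self):
        return [c for c in self.bus.caches if c is not self]

    def read(self, addr):
        if addr in self.lines:                      # local hit
            return self.lines[addr][1]
        for peer in self._peers():                  # miss: snoop the other caches
            if addr in peer.lines:
                state, value = peer.lines[addr]
                if state == "M":
                    # The owner supplies the data, memory is updated and the
                    # owner's copy is downgraded to shared.
                    self.bus.memory[addr] = value
                    peer.lines[addr][0] = "S"
                self.lines[addr] = ["S", value]
                return value
        self.bus.memory_reads += 1                  # nobody has it: go to memory
        value = self.bus.memory.get(addr, 0)
        self.lines[addr] = ["S", value]
        return value

    def write(self, addr, value):
        for peer in self._peers():                  # invalidate every other copy
            peer.lines.pop(addr, None)
        self.lines[addr] = ["M", value]


bus = Bus()
cpu = Cache("cpu", bus)
gpu = Cache("gpu", bus)

gpu.write(0x100, 42)            # the GPU produces a result in its own cache
first = cpu.read(0x100)         # the CPU gets it from the GPU cache
gpu.write(0x100, 43)            # the CPU copy is invalidated by this write
cpu_copy_valid = 0x100 in cpu.lines
second = cpu.read(0x100)        # the CPU sees the new value, not a stale one
```

In this sketch the CPU never has to read the value from external memory and never observes stale data, which is the property the hardware is supposed to guarantee without the software having to think about it.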

Having hardware-based cache coherency makes software development much easier since there is no need for software-enforced cache management. To implement cache coherency throughout the system, NoC architects have defined several standardized interfaces based on the protocols CPUs have been using for a long time; one example of such an interface is ARM’s AMBA ACE (AXI Coherency Extensions) protocol.

The layout of a typical mobile SoC, including the CPU, graphics, video, NoC and caches

And this brings me to my original point about marketing hype: cache coherency alone is not a magic bullet for compute-intensive applications. In fact, if the software is poorly written and data bounces a lot inside the SoC, then the unsuspecting developer will likely be faced with a performance and power hit as the SoC infrastructure tries to cope with continuous data flushes.
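A back-of-the-envelope way to see why bouncing data hurts is to count how often a single cache line has to change hands. The sketch below is a deliberately crude model, assuming that every write by a different processor forces the line to move across the fabric; the function name `count_line_transfers` and the access patterns are hypothetical, and real hardware costs depend on the protocol and the SoC.

```python
def count_line_transfers(writers):
    """Count how many times a cache line changes owner for a sequence of writers."""
    transfers = 0
    owner = None
    for agent in writers:
        if owner is not None and agent != owner:
            transfers += 1      # the line has to move to a different processor
        owner = agent
    return transfers


# Poorly structured code: the CPU and GPU take turns touching the same data.
interleaved = ["cpu", "gpu"] * 4
# Better structured code: each processor finishes its work before handing over.
batched = ["cpu"] * 4 + ["gpu"] * 4

interleaved_cost = count_line_transfers(interleaved)
batched_cost = count_line_transfers(batched)
print(interleaved_cost, batched_cost)
```

Both patterns perform the same eight writes, but the interleaved one moves the line seven times while the batched one moves it once. Coherent hardware keeps both versions correct; only the second one is cheap.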

Shared virtual memory

Another nice feature for computer vision-type applications is shared virtual memory, where the GPU and CPU share pointers to data. Shared virtual memory makes software development easier because there is no need to maintain two different views of the same memory space.

Block-level implementation of face detection: different parts of the computer vision algorithm get executed on the CPU and GPU
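The difference between sharing a pointer and keeping two views of the data can be illustrated with a loose analogy in plain Python. This is only a sketch inside a single process, not a GPU API: `cpu_stage` and `gpu_stage` are hypothetical stand-ins for the two halves of a vision pipeline, and a `memoryview` plays the role of the shared pointer.

```python
frame = bytearray(8)                # stands in for a camera frame in memory


def cpu_stage(buf):
    """Pretend CPU work: fill the frame with pixel values, in place."""
    for i in range(len(buf)):
        buf[i] = i


def gpu_stage(buf):
    """Pretend GPU work: double every pixel, in place."""
    for i in range(len(buf)):
        buf[i] = buf[i] * 2


# Shared pointer: both stages work on the very same bytes, nothing is copied.
shared_view = memoryview(frame)
cpu_stage(shared_view)
gpu_stage(shared_view)
after_shared = list(frame)

# Separate views: the second stage gets its own copy, so its results stay
# invisible to the owner of `frame` until somebody copies them back.
staging_copy = bytearray(frame)
gpu_stage(staging_copy)
after_copy = list(frame)
```

With the shared view, the output of both stages lands directly in `frame`. With the staging copy, `frame` is unchanged after the second pass, and the extra copy in each direction is exactly the overhead that shared virtual memory is meant to remove.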

One effective use case for this concept is the PowerVR Imaging Framework from Imagination Technologies. I won’t go into the details of the implementation here, but I encourage you to read the articles below and see how the framework performs in real world applications related to HDR photography for mobile devices, lane and collision detection in automotive, and VP9 decoding for smart TVs.

Zero-copy transfer between a camera and display: shared virtual memory applied to computer vision

Wrap up

The examples presented above act as a precursor to the full blown, Skynet-style AI that will soon fall upon us.

I believe the semiconductor industry will continue to make fantastic progress to prepare for the inevitable arrival of mainstream AI. We’ve already seen incredible year-over-year increases in performance efficiency, and new standards such as the HSA specification are emerging to fill the gap between software and hardware.

Now if only someone got to work faster on that version of the Terminator that turns off the lights in the living room and tucks the kids in at night. Or did that not make the studio cut of Judgment Day?