
All your cores are belong to us.

First we saw the gold rush of octa-core; now we’re hearing reports of deca-core and perhaps even dodeca-core processors. So. Many. Cores.

How did we get here?

Patrick Moorhead has written several pieces about the mobile industry’s appetite for self-destruction (in case you’re wondering, I’m trying to cram a Guns N’ Roses reference into every article I write). In his latest article, Why 8 And 10 CPU Cores In Smartphones Are A Bad Idea – An Auto Industry Lesson, he writes about the perils of multicore decadence (I did it again!); what follows is my personal take on the whole debate.

Too big to scale

If you ask me, the root of the problem is dead simple: poor CPU (and system) benchmarking. You see, some time ago the industry told consumers that big.LITTLE is the only sensible way to scale peak performance in mobile devices. It looks and sounds pretty in a polished YouTube video, but it quickly descends into utter hardware and software chaos: crank the core count up to 11, run some synthetic code for one minute – and voilà! TRUE REPRESENTATIVE PERFORMANCE!

Tech-savvy or not, any sensible human being would agree that this is absolutely not the way you should approach benchmarking. When was the last time you used your smartphone to run complex physics simulations on all eight CPU cores for five minutes straight? And the answer iiiiiis: that’s right, absolutely never! One self-proclaimed professional benchmark even has a human in flames for a logo – oh, the irony.

My cores bring all the boys to the yard

If achieving peak performance in mobile applications is one of the goals of SoC design – and don’t get me wrong, it is a very noble one – then there are many (and better) ways to do it.

For example, take the recent AMD Zen announcement. Did you see AMD introduce microprocessors with gazillions of cores? No, of course not; instead, the engineers focused on real-world IPC gains, enabled in part by simultaneous multithreading (SMT) across CPU configurations of two to four cores. And that is for a desktop-class microarchitecture.

[Image: AMD Zen IPC gains]

Another example of smart performance scaling is the new 64-bit MIPS I6400 CPU – it features the same SMT principles found in x86 CPUs. For a marginal increase in die size (roughly 10% for every pair of threads introduced), an SoC architect gets a 40-50% performance boost in multithreaded applications.
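To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in C++ (the baseline is an arbitrary unit, and the 45% uplift is simply the midpoint of the 40-50% range quoted above):

```cpp
#include <iostream>

// Back-of-the-envelope comparison of scaling via SMT vs. adding a core,
// using the figures quoted above: +10% area for +40-50% throughput per
// extra hardware thread, vs. +100% area for (at best) +100% throughput
// with a second physical core. All values are in arbitrary units.
int main() {
    const double base_area = 1.0; // one single-threaded core
    const double base_perf = 1.0;

    // SMT: second hardware thread on the same core (45% = midpoint of 40-50%)
    const double smt_area = base_area * 1.10;
    const double smt_perf = base_perf * 1.45;

    // Dual-core: add a full second core
    const double dc_area = base_area * 2.0;
    const double dc_perf = base_perf * 2.0;

    std::cout << "SMT perf/area:       " << smt_perf / smt_area << '\n'; // ~1.32
    std::cout << "Dual-core perf/area: " << dc_perf / dc_area << '\n';  // 1.00
}
```

In other words, the dual-threaded core delivers roughly a third more performance per square millimetre than the dual-core alternative – exactly the kind of ratio an SoC architect cares about.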

Core workouts

The beauty of adopting hardware multithreading lies in the efficiency gains delivered by SMT. Not only are single-threaded, multicore CPUs slow(er) to wake up and go to sleep, but they also require additional logic to handle power gating and coherency management. Turning on a hardware thread, on the other hand, is nearly instantaneous and consumes almost no power.

Do you need to open five browser tabs at once while playing music in the background and syncing your email account? Don’t jump directly to switching on additional cores and clusters. Turn on a second hardware thread first, then turn on a second physical core, then turn on a second thread on the second core, then (maybe) turn on a second cluster.
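Expressed as governor logic, the escalation ladder looks something like the sketch below. To be clear, this is purely illustrative C++: the Soc type, its hooks and the two-cluster, two-thread topology are assumptions for the example, not a real kernel or scheduler API.

```cpp
#include <cstdio>

// Hypothetical power-management hooks for the escalation order described
// above; the stubs just log what a governor would switch on, cheapest first.
struct Soc {
    void enableThread(int cluster, int core, int thread) {
        std::printf("on: cluster %d, core %d, thread %d (near-instant, ~no power)\n",
                    cluster, core, thread);
    }
    void enableCore(int cluster, int core) {
        std::printf("on: cluster %d, core %d (power gating + coherency work)\n",
                    cluster, core);
    }
    void enableCluster(int cluster) {
        std::printf("on: cluster %d (the slowest, most expensive step)\n", cluster);
    }
};

// Add capacity in order of increasing wake-up cost.
void scaleUp(Soc& soc, int demand) {
    if (demand >= 1) soc.enableThread(0, 0, 1); // 1. second thread on core 0
    if (demand >= 2) soc.enableCore(0, 1);      // 2. second physical core
    if (demand >= 3) soc.enableThread(0, 1, 1); // 3. second thread on core 1
    if (demand >= 4) soc.enableCluster(1);      // 4. only then, a second cluster
}

int main() { Soc soc; scaleUp(soc, 4); }
```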

[Image: MIPS I6400 – threads, cores and clusters]

Here is another fact to make you all warm and fuzzy on the inside: because Android is a thread-aware operating system, a dual-threaded CPU is seen as a dual-core processor. In a situation where you have a cluster of four dual-threaded MIPS I6400 CPUs, an application running on the chip can spread across eight virtual cores.
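You can even verify this from user space: standard thread APIs report logical processors, i.e. hardware threads. A minimal check (note that std::thread::hardware_concurrency() is only a hint per the C++ standard, but on Linux/Android it reflects the online logical CPUs):

```cpp
#include <iostream>
#include <thread>

// On a thread-aware OS, the logical processor count includes hardware
// threads: a cluster of four dual-threaded I6400 cores should report 8,
// exactly like two clusters of four single-threaded cores would.
int main() {
    const unsigned n = std::thread::hardware_concurrency();
    std::cout << "Logical cores visible to software: " << n << '\n';
}
```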

“I want eight cores in my phone” consumer fantasy: unlocked!

In addition, MIPS-based cores and clusters can be individually clocked at different frequencies; you could have two MIPS CPUs inside quad-core cluster #1 running at a top frequency of 1 GHz for low-power functionality while keeping one MIPS CPU in cluster #2 at 2 GHz for high-performance operation, creating a lean.MEAN (©?) configuration. You can find this principle elegantly implemented in the Ingenic M200 chip for wearables and the Samsung ARTIK SoC for IoT.
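On a Linux-based stack, this kind of asymmetric clocking is usually driven through the cpufreq sysfs interface. Here is a rough sketch – it assumes the standard sysfs paths, root permissions, and a hypothetical numbering where cpus 0-3 form cluster #1 and cpus 4-7 form cluster #2 (the exact layout varies per SoC):

```cpp
#include <fstream>
#include <string>

// Cap the maximum frequency of one logical CPU via Linux cpufreq.
// Values are in kHz; the write silently fails without root.
static void setMaxKHz(int cpu, long khz) {
    std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_max_freq");
    f << khz;
}

int main() {
    // Cluster #1 (cpus 0-3) capped at 1 GHz for low-power work...
    for (int cpu = 0; cpu < 4; ++cpu) setMaxKHz(cpu, 1000000);
    // ...cluster #2 (cpus 4-7) allowed up to 2 GHz for peak performance.
    for (int cpu = 4; cpu < 8; ++cpu) setMaxKHz(cpu, 2000000);
}
```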

Distilling all this techspeak down into tangible benefits:

  • SoC designers will be happy to see the processor stay within a reasonable thermal envelope while saving valuable die area. Remember: going dual-threaded means only a ~10% increase in area, while going dual-core automatically implies a 100% penalty in silicon.
  • Heavily multithreaded applications (and yes, benchmarks too) will run at full throttle, getting the same performance boost from a single cluster of dual-threaded, four-core CPUs as they would from two clusters of single-threaded, quad-core processors (see what we did there?).
  • Consumers only interested in core counts will be overjoyed to see an octa-core configuration built on an underlying architecture of four dual-threaded CPUs, instead of a much bigger and more power-hungry partition of two clusters of four CPUs – one of which has a power envelope of 2 W at 2 GHz!

It gets better.

Gary Sims argues in his article, Why 8 and 10 CPU cores in smartphones are a good idea – a lesson from the kitchen, that adding an additional cluster of Cortex-A53s is cheap from an area perspective. He is absolutely correct: given that a mobile SoC today measures close to 100 mm² (give or take), a quad-core cluster of power-efficient CPUs clocked at a reasonably low frequency (e.g. 1 GHz) represents an increase of only 2-3 mm² – a mere 2-3% of the die. In the grand scheme of things, that is indeed very little.

But before you start browsing through die shots to look for CPU sizes, close your eyes for one second and imagine two words: heterogeneous computing. Thanks to the recent progress in compute APIs, we are now on the verge of enabling true heterogeneous computing on multiprocessor mobile platforms.

The principle is simple: instead of running all tasks on a small number of processors while the others sit idle, you spread the workload across the entire chip, distributing each task to the hardware best suited for that specific function.

[Image: SoC architecture]

This will deliver huge gains in performance for real-world scenarios without the need to add more on-chip processors. Why run image processing kernels on the CPU when you can use the GPU subsystem via OpenCL/RenderScript/etc.? The main CPU then becomes simply a fallback mechanism for when the graphics core needs to do other, non-compute related work.
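In OpenCL terms, that “GPU first, CPU as fallback” device selection is only a few lines of boilerplate. A minimal sketch using the standard OpenCL C API (single platform assumed, error handling mostly omitted):

```cpp
#include <CL/cl.h>

// Pick a device for compute kernels: prefer the GPU, fall back to the CPU.
int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    cl_device_id device;
    // Prefer the GPU for data-parallel work such as image processing...
    cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    if (err != CL_SUCCESS) {
        // ...and fall back to the CPU only when no GPU is available
        // (or, in a smarter scheduler, when the GPU is busy rendering).
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr);
    }

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    // ... build the program and enqueue the image-processing kernel here ...
    clReleaseContext(ctx);
}
```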

I miss KISS

I will end here with a story from back in the day. When I started studying digital IC design in the second year of university, our professor got up during the first lecture and said he would teach us the most important principle behind every IC design. He then picked up a sharpie pen and wrote KISS in big letters on the whiteboard.

While impressed with his love of classic rock, we weren’t exactly sure how four guys in makeup held the key to designing perfect digital ICs; naturally, we asked him what other hip 1970s bands he was into and how those groups shaped design methodologies.

His answer: “Bands? I have no clue what you’re talking about. KISS means Keep It Simple, Stupid.”

KISS, mobile industry. We’re running out of Ancient Greek numerals.

DETROIT ROCK CITY!