
Benchmarking benchmarks.

I sometimes find certain mobile benchmarks confusing.

For me, the word benchmark implies an objective measurement of performance for a certain category of devices. But somehow along the way, objective morphed into synthetic and I’ve been lost at sea ever since.

I don’t really enjoy drawing analogies but I feel one is needed in this case. Imagine that your smartphone is like a digital camera (and to a certain extent, it is).

When making a purchasing decision for a new camera, some consumers tend to look only at the megapixel count. However, the megapixel count alone offers no indication of final image quality. Instead, a good digital camera is defined by a combination of factors, including the size and type of the main lens, the sensor, the image signal processor (ISP), and the software that runs on the device.

The megapixel count is to digital cameras what the number of CPU cores is to smartphones.

Unfortunately, the concept of measuring system performance today translates mostly to a series of focused, synthetic tests running for a short period of time. These tests try to push individual parts of the chip to the maximum by running complex yet isolated pieces of code at maximum frequency.

Even though some of these benchmarks generate useful and interesting results (if used correctly), they are often abused for pure publicity reasons, which leads to headlines like the one below:

Microsoft beats Apple’s iPad Air 2 to fastest tablet title

Here’s the problem though: the title of fastest tablet was awarded after running only one CPU-focused benchmark where solely multicore performance was considered.

This scenario represents only a fraction of typical mobile workloads; the chart below shows what an ideal app looks like and why it is important to evaluate performance over a long(er) period of time.

[Figure: The ideal app runs fast for a short period of time, then goes idle]

Typically, a phone will remain in what I like to call active standby for prolonged periods of time; in active standby, the phone runs some low-overhead processes in the background (e.g. sending and receiving emails, pushing notifications, etc.). When the screen is on, most interactions on a smartphone are burst-based (i.e. they run relatively fast, then the device goes back to sleep). In addition, typical workloads are not isolated at all; they are heterogeneous, requiring multiple parts of the chip to be active at the same time.

A high-quality game will never rely solely on CPU cores to run. Instead, it will distribute some tasks optimally across a high-performance multicore CPU while a super-GPU tackles all the graphics-intensive processing. Even browsing the web is a CPU+GPU workload; for example, newer browsers support multithreading and multi-window rendering techniques that run well on dual- or quad-threaded CPUs and multi-cluster GPUs.

[Figure: Computing energy in three states: standby, idle, and 3D gaming]

I’m going to give you one last example: the mundane act of taking a photograph with a smartphone camera. The diagram below shows how the different steps involved in capturing that perfect selfie are split across not one, not two, but three separate engines inside the mobile chip.

[Figure: A beautification filter runs on multiple engines inside the SoC]

We can see from the use cases presented above that mobile SoCs have gotten a whole lot smarter with each generation. So why is it that system benchmarking hasn’t kept up? Why are we seeing this emphasis on synthetic, CPU-centric code instead of real-world, heterogeneous workloads being integrated under one super-benchmark for mobile devices?

I’m no expert in this field by any means, but I envision the emergence of a new category of software packages for testing mobile devices, based on the user experience and various mobile-specific activities rather than pure synthetic code.

By putting the user experience and not specifications at the core of these tools, the workloads can be engineered to simulate several real-world tasks that are common among consumers.

My collection of stress tests would have to run longer than the two-to-five-minute benchmarks you typically encounter today, but not so long that it would be cumbersome to run. I’d probably settle for something in the space of two to three hours. During that time, I would run the tasks below while also logging battery draw for each (a rough sketch of such a harness follows the list):

  • Turn on a Wi-Fi and/or a cellular connection and send/receive bursts of data for 45 minutes.
  • Repeatedly run a standard, API-compliant graphics-intensive workload for 30 minutes and observe any performance throttling over time (thermal shutdown is likely to occur here). In addition, check for GPU-related image quality issues.
  • Attempt to run multiple image/video compute tasks on the GPU using OpenCL or RenderScript and measure performance; if certain (or all) kernels are not supported by the GPU driver, I would fall back on the CPU.
  • Attempt to play multi-standard video content (H.264, VP8, H.265, VP9, etc.) at different resolutions for 30-45 minutes and record the list of supported formats and the performance of the video hardware. If a video decoder is not present, I would use the CPU as a fallback mechanism.
  • Launch the default web browser and run a series of common tasks (load a page, open multiple tabs, access media-rich content, etc.)
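
To make this more concrete, here is a minimal sketch (in Kotlin, for Android) of what such a long-running harness could look like. It is an illustration, not a real benchmark: Workload, WorkloadRunner and the placeholder task bodies are hypothetical names I’ve made up for this sketch, while BatteryManager and SystemClock are standard Android APIs used to time each task and sample the battery charge counter (which not every device reports reliably).

```kotlin
import android.content.Context
import android.os.BatteryManager
import android.os.SystemClock

// Hypothetical container: a named task plus how long the harness keeps it running.
data class Workload(val name: String, val durationMs: Long, val body: () -> Unit)

class WorkloadRunner(context: Context) {
    private val battery =
        context.getSystemService(Context.BATTERY_SERVICE) as BatteryManager

    // Remaining charge in microampere-hours; some devices return a sentinel
    // value instead of a real reading, so treat the numbers as indicative only.
    private fun chargeCounter(): Int =
        battery.getIntProperty(BatteryManager.BATTERY_PROPERTY_CHARGE_COUNTER)

    fun run(workloads: List<Workload>) {
        for (w in workloads) {
            val chargeBefore = chargeCounter()
            val start = SystemClock.elapsedRealtime()
            var iterations = 0
            // Re-run the task body until its time slot is used up, so any
            // thermal throttling shows up as fewer iterations over time.
            while (SystemClock.elapsedRealtime() - start < w.durationMs) {
                w.body()
                iterations++
            }
            val drained = chargeBefore - chargeCounter()
            println("${w.name}: $iterations iterations, ~$drained uAh drained")
        }
    }
}

// Placeholder workloads standing in for the tasks in the list above.
fun exampleWorkloads() = listOf(
    Workload("network bursts", 45 * 60_000L) { /* send/receive a burst of data */ },
    Workload("graphics stress", 30 * 60_000L) { /* render one frame of a GPU test */ },
    Workload("video playback", 30 * 60_000L) { /* decode a short video segment */ }
)
```

The point of the sketch is the shape of the measurement rather than the numbers: each task runs for tens of minutes instead of tens of seconds, and the harness records both throughput over time and battery draw, which is exactly what a two-minute synthetic score cannot show. A real tool would pair this with external power measurement and proper result reporting.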

I am fully aware the list above is not exhaustive and that the methodology might be flawed, but I think the benchmarking industry needs to do more to offer consumers a more realistic representation of performance for mobile devices.

Hopefully someone out there is listening.