NOTICE: This content was originally posted to Google+, then imported here. Some formatting may be lost, links may be dead, and images may be missing.
I've been doing a lot of code optimization the past week or so, and it has made me run into some interesting issues. One of them is that faster code can actually lead to lower performance, and that a slower device can end up running the optimized code faster than a faster device does.
In a few comparison tests, this actually made the Galaxy S3 (Exynos) come out quite a bit faster than the Galaxy S4 (Snapdragon).
Code background
This specific piece of code has to do with DSLR Controller, which (amongst other things) shows your camera's view on your Android device. If you're not familiar with any of that, just imagine it displays the image from a webcam on screen.
To get this image on screen (simplified), the compressed frame needs to be transferred from the camera via USB or Wi-Fi (I/O), the compressed frame needs to be decompressed (CPU or GPU), a number of modifications/filters may be applied to the frame (CPU), and finally the frame needs to be scaled and displayed (GPU).
At any given time, multiple frames may be in processing. Some CPU-based operations can be parallelized and run multi-threaded - saturating even a quad-core - while other operations are serial in nature and thus run single-threaded. An operation may also be stalled because we're waiting for I/O, or for the GPU.
As CPU speed, GPU speed, and I/O transfer speed vary wildly between device combinations, it's impossible to exactly predict the pipeline state at any given time. So we have to make some smart choices (for example: there should never be more than one frame being processed at a time by the multi-threaded part) and separately optimize every part of the pipeline.
In this specific case, it is not sufficient to simply process X frames in as little time as possible; we also need to make sure the latency is as low as possible, which is very important to those who use DSLR Controller for video. Sometimes we have to trade off a little overall speed for lower latency.
Ultimately, what we are after is this: if the camera and I/O are fast enough to provide us with image data at 30 frames per second, we should also actually manage to put all of it on screen. This means we have at most about 33 milliseconds (1000 ms / 30) to process each frame (including decompressing it and putting it on screen). As such, every millisecond we can shave off somewhere in the pipeline is worth the effort.
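To make the budget concrete, here is a small sketch of the arithmetic. The per-stage costs are made-up numbers for illustration, not measurements from any device, and the stages are treated as strictly sequential for a single frame.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, to sustain the given frame rate."""
    return 1000.0 / fps


def remaining_budget_ms(fps, stage_times_ms):
    """Budget left after the given per-stage costs; a negative value means frames are dropped."""
    return frame_budget_ms(fps) - sum(stage_times_ms)


# Hypothetical per-stage costs in ms: transfer, decompress, filters, scale/display.
stages = [8.0, 12.0, 9.0, 3.0]
print(round(frame_budget_ms(30), 1))              # 33.3
print(round(remaining_budget_ms(30, stages), 1))  # 1.3
```

With numbers like these there is barely more than a millisecond of slack, which is why shaving a single millisecond off one stage matters.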
CPU speed scaling
To save battery, your Android device will scale CPU speed based on the processing power currently being used. There are many variables to this scaling and it can be set up in various ways, but generally speaking, if the software running on your device isn't using the CPU to its full capacity, the CPU won't be running at full speed.
As not all parts of the rendering pipeline are fully multi-threaded and able to saturate a quad-core, and we may be waiting for I/O or the GPU at any given time, typically a single core will be used at 75-100% load, while the other cores may have a much lower load.
Those other cores may, for example, sit at 0% usage for 45% of the time and at 100% usage for the remaining 55% of the time. This gives an average load of 55%. The latest CPUs are able to scale the speed of each core individually, and thus may scale down the speed of 3 of the 4 cores because they aren't being fully utilized.
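The 55% figure is a time-weighted average, and it is that average, not the 100% peaks, that a load-based governor reacts to. A small sketch of the arithmetic follows; the 80% threshold in `may_scale_down` is an assumed value chosen for illustration, not the setting of any particular device.

```python
def average_load(samples):
    """samples: (fraction_of_time, usage) pairs for one core, usage between 0.0 and 1.0."""
    total_time = sum(fraction for fraction, _ in samples)
    return sum(fraction * usage for fraction, usage in samples) / total_time


def may_scale_down(load, threshold=0.80):
    """Assumed rule: a core whose average load stays under the threshold may be slowed down."""
    return load < threshold


# The example from the text: idle 45% of the time, fully busy 55% of the time.
load = average_load([(0.45, 0.0), (0.55, 1.0)])
print(round(load, 2))        # 0.55
print(may_scale_down(load))  # True
```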
What happens
So imagine we're optimizing the part of the code that runs on all the cores. If it becomes slightly more efficient, it may actually trigger a downscale of CPU speed, which in turn causes that code to take longer to run. From the device's perspective this is perfectly fine, as it can still run all the scheduled code with less than 100% load.
However, as the "main" core is waiting for the multi-threaded piece of the pipeline to complete, this does actually degrade performance of the app as a whole.
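The effect can be reproduced with a toy model. This is only a sketch under invented assumptions: three frequency steps, a governor that jumps to full speed at 60% load and steps down below 50%, and a stage that costs a fixed number of megacycles per frame. Real governors are considerably more complicated.

```python
FREQS_MHZ = [800, 1200, 1600]   # hypothetical frequency steps
PERIOD_MS = 1000.0 / 30         # one frame period at 30 fps
UP, DOWN = 0.60, 0.50           # hypothetical governor thresholds


def simulate(work_mcycles, frames=10):
    """Return per-frame stage times (ms) for a core that starts at its top frequency."""
    level = len(FREQS_MHZ) - 1
    times = []
    for _ in range(frames):
        # megacycles / MHz gives seconds; convert to milliseconds
        stage_ms = work_mcycles / FREQS_MHZ[level] * 1000.0
        times.append(stage_ms)
        load = min(stage_ms / PERIOD_MS, 1.0)
        if load >= UP:
            level = len(FREQS_MHZ) - 1      # busy: jump back to full speed
        elif load < DOWN and level > 0:
            level -= 1                      # mostly idle: step down one level
    return times


before = simulate(28.0)   # original code: 28 Mcycles per frame
after = simulate(25.0)    # "optimized" code: 25 Mcycles per frame
print(round(sum(before) / len(before), 2))  # 17.5
print(round(sum(after) / len(after), 2))    # 18.23
```

In this model the original code keeps the core just busy enough to stay at full speed, while the cheaper code drops below the lower threshold, gets clocked down, overshoots the upper threshold, gets clocked up again, and so on. The average stage time goes up even though the work per frame went down, and it alternates between two values instead of staying constant.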
While this is certainly not generally true, in certain circumstances it can cause, for example, the S3 to run code faster than the S4: the S3 will be closer to its load limits, has a less aggressive default governor configuration, and will pretty much keep all cores running at full speed, while the S4's per-core speed will be jumping all over the place without ever settling. Aside from slightly lower FPS and higher latency, it also causes the FPS to be less stable.
Again, this can happen in certain circumstances. Generally speaking, the S4 will outperform the S3 running DSLR Controller. I just found it interesting to see this happen :)
Solutions
Unfortunately, the Android API does not provide us with any way to tell the system to keep the CPU running at full speed - and I don't think it would be a great idea if such an API were available. In the vast majority of situations it simply isn't needed, and the governor is pretty good at deciding the optimal speed for the CPU to run at, without any noticeable impact for the end-user (and Android simply isn't an RTOS; it's not fair to expect it to be).
Still, it leaves me with a problem where I may not be able to get the highest performance out of the device, while in this corner case this is actually what the end-user really wants.
A possible solution would be to detect if the user has root, and if they do, lock the cores at full speed (with the user's permission). This is actually possible and not all that hard. But it doesn't help non-root users.
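One way such a lock could look, sketched under assumptions: the Linux cpufreq sysfs files `cpuinfo_max_freq` and `scaling_min_freq` are standard, but whether a given Android kernel honours writes to them, and whether other system components overwrite the values again, varies per device. The helper `lock_commands` is hypothetical; it only builds the shell commands that would have to be run as root and does not execute anything.

```python
CPUFREQ = "/sys/devices/system/cpu/cpu{n}/cpufreq"


def lock_commands(core_count):
    """Build root shell commands that raise each core's minimum frequency to its maximum."""
    cmds = []
    for n in range(core_count):
        base = CPUFREQ.format(n=n)
        # Copy the hardware maximum into the governor's lower bound for this core.
        cmds.append(f"cat {base}/cpuinfo_max_freq > {base}/scaling_min_freq")
    return cmds


for cmd in lock_commands(4):
    print(cmd)
```

A real implementation would also need to remember the previous `scaling_min_freq` values and restore them when the app goes to the background, and cope with cores that are currently offline.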
A non-root solution would be to have the low-usage cores do random calculus while they aren't being used to process frames - this will keep the cores' usage high and prevent downscaling.
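The control flow of such a keep-busy worker could look like the sketch below. `CoreWarmer` is a hypothetical helper, not part of any Android API. Because of CPython's global interpreter lock, threads in this sketch would not actually load several cores at once; it only shows the pattern of spinning while the pipeline is idle and backing off while frames are being processed.

```python
import threading
import time


class CoreWarmer:
    """Burns CPU while `idle` is set; hypothetical helper for illustration only."""

    def __init__(self):
        self.idle = threading.Event()      # set while no frame is being processed
        self._stop = threading.Event()
        self.spins = 0                     # number of pointless iterations so far
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        x = 1
        while not self._stop.is_set():
            if self.idle.is_set():
                x = (x * 1103515245 + 12345) % 2147483648  # pointless arithmetic
                self.spins += 1
            else:
                self.idle.wait(0.001)      # real work owns the cores; back off

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

The pipeline would call `warmer.idle.set()` when the multi-threaded stage finishes a frame and `warmer.idle.clear()` just before it starts the next one. The cost is exactly what makes this ugly: the cores never get to rest, so battery drain goes up.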
Both are very ugly solutions I'd rather not implement, but I wouldn't be surprised if the option becomes available in DSLR Controller in the near future.
Interesting interactions between different elements to give a final performance value... All pretty non deterministic, of course, but it is interesting nevertheless... I tend to use the performance governor because as it ends sooner, the sooner the cpu goes to low freq/deep sleep, which is good for battery as well... with the performance governor this behaviour should be minimized, as cpu 0 and the rest behave identically... (oc, rooted)
Thanks chainfire, it explains a lot, and actually makes a lot of sense, but I'm sure glad you're the one trying to figure it all out! Keep up the good work buddy, don't know what we would do without you. AAAAAAA++++++++
Great write up!
Great info, thank you for sharing your knowledge :)
Thanks chainfire for sharing this, it's always interesting to read your findings
It never ceases to amaze me! 99% of it all I don't even understand properly! Very interesting though.
Again my saying proves true: Nothing is ever as easy as it looks....
Great read! I really appreciate getting a glimpse at how android code and hardware operate hand in hand. Love the app btw
Truly interesting topic!!! Congrats !
Wow!! Good job!! Thanks Chainfire
Shouldn't putting the other cores into a busy-wait loop be sufficient to keep them at full power? Or is the power management sophisticated enough to somehow detect those?
Wow. Would it be worth dividing the code to run on a multi-core CPU in the first place (or should I say, is it possible)?
Good job chainfire. Thank you.
Some notebooks/laptops throttle down too; can this happen there as well?
+Eduardo Ribeiro probably, but it's still unlikely to happen. I probably just hit a "sweet spot" :)
Good program
+Eduardo Ribeiro friends of mine had this problem with an Asus motherboard while gaming, because of this kind of dynamic underclocking
+Kevin C. That's pretty much what Chainfire mentioned near the end about artificially keeping the cores busy.
Probably one of the very few cases where Samsung's poor multicore power management approach is actually beneficial. (All cores are locked at the same frequency and voltage, and also, all cpuidle states except clock gating are disabled when more than one core is online. But at least in Chainfire's use case, those cores do clockgate when idle - a busyloop would prevent clockgating.)
+Andrew Dodd No, he mentioned "random calculus", which surprised me because I thought a busyloop would be sufficient, unless they were doing surprisingly detailed measurements of how busy the CPU is.
+Kevin C. that was just a figure of speech really. People have different interpretations on what a busy loop is, so let me just mention two possibilities:
(1) loop with short sleeps - I expect this will still clock down the cores to low speed unless you find the sweet spot experimentally (which will differ per device)
(2) loop that never sleeps - this will (probably) keep the core at full speed, but it will also drain battery like crazy. This is pretty much what I was indicating with random calculus. It's just dirty :)
+Sergiy Shulik not all code is possible to parallelize - and even if so, we likely still wouldn't be using the CPU at 100%, because we need to wait for I/O
Could you explain how other OSes (iOS, e.g.) manage to cope with this problem?
+Mayur Dhaka Windows does not cope with this problem either on motherboards that support dynamic frequency changes: you end up underclocked by 50% when playing games that do next to nothing with your CPU but are killing your GPU