Are you dubious that our hand-coded assembler optimisation was worthwhile?
(Be honest!)
We always value feedback on our presentations, but we were more than a little surprised to hear that some readers were doubtful about the performance gain from our hand optimisation of the bit-shifting code in our Candlelit Console kernel patchset.
We were always confident of the benefits, but since people have shown an interest, we decided to take some time out to do some benchmarks, and see just what was going on.
“The bit shifting calculation above is actually much easier in X86 assembler than it is in C.”
Candlelit Console, Exotic Silicon, 2022
Actually, the main point that we intended to make here was that the split register concept, which is fairly unique to X86 CPUs, is not really taken advantage of by C compilers, but that it can nevertheless be put to use in cases where we are dealing with several 8-bit values packed into a 32-bit word and want to modify the element stored in the lower bits. In our case, that element happened to be the blue channel of a 32-bit RGB value.
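Although the exact C from our test program isn't reproduced here, judging from the compiler output shown below the calculation was along these lines. This is a sketch only; the function name and packaging are ours:

#include <stdint.h>

/* Sketch only: reconstructed from the compiler output shown below. */
static uint32_t
halve_blue(uint32_t rgb)
{
        /* Keep red and green, and shift the 8-bit blue channel right by one. */
        return (rgb & 0xffff00) | ((rgb >> 1) & 0x7f);
}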
The ‘best’ code that the C compiler produced with a short test program doing bit shifts on random numbers seemed to be this:
movl %eax, %ecx # Copy the value returned from arc4random_uniform from eax to ecx
andl $16776960, %ecx # ecx & 0xffff00
shrl %eax # eax >> 1
andl $127, %eax # eax & 0x7f
leal (%rax,%rcx), %esi # esi = (rax + rcx)
It's difficult to believe that this could possibly be faster than a single opcode to shift the lower eight bits of the accumulator, even given the complex pipelining and caching that goes on within modern CPUs.
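For reference, here is a minimal sketch of how that single opcode could be written as GCC-style extended inline assembler from C. This is our own illustration, and the exact form and constraints used in the Candlelit Console patch may differ:

#include <stdint.h>

/* Sketch only: our illustration, not the actual patch code. */
static inline uint32_t
halve_blue_asm(uint32_t rgb)
{
        /* The %b0 operand modifier selects the low byte sub-register, e.g. %al,
           so only the blue channel is shifted; red and green are untouched. */
        __asm__ ("shrb %b0" : "+q" (rgb));
        return rgb;
}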
Testing methodology
We'll test the two approaches in two different ways. First, a synthetic benchmark to test the raw speed of the corresponding instructions outside of a wider program. Next, we'll benchmark the bit-shifting within the actual rasops code where we originally put it for testing.
Note that in the final version of the Candlelit Console kernel patch, the bit-shifting was moved from the putchar32() function, where it ran for every single character that was output, into the device color map initialisation code, which in turn was invoked each time our modified rasops_eraserows() function was called.
Whilst in putchar32() the overhead could plausibly be significant, in this new location any overhead would be almost undetectable on any reasonably modern hardware. Any performance gain from using assembler is therefore somewhat moot anyway, but it's the principle that we're discussing here.
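Purely to illustrate the idea, and emphatically not as a reproduction of the actual patch code, the difference is between doing the shift once for every character drawn and doing it once per color map entry whenever the map is rebuilt. The array name and size below are invented:

#include <stdint.h>

#define CMAP_ENTRIES 16        /* hypothetical size, not taken from the real code */

/* Sketch only: halve the blue channel of each packed 0x00RRGGBB color map entry. */
void
dim_colormap_blue(uint32_t *cmap)
{
        int i;

        /* The shift runs once per entry when the map is rebuilt,
           rather than once for every character that is drawn. */
        for (i = 0; i < CMAP_ENTRIES; i++)
                cmap[i] = (cmap[i] & 0xffff00) | ((cmap[i] >> 1) & 0x7f);
}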
Synthetic benchmark take one
Building a purely synthetic benchmark is fairly straightforward. We just need to loop a fixed number of times and shift the lower eight bits right by one bit on each iteration of the loop.
.section ".note", "a"
.p2align 2
.long 0x08
.long 0x04
.long 0x01
.ascii "OpenBSD\0"
.long 0x00
.p2align 2
.section .text
.globl _start
_start:
/* Clear the accumulator */
xor %rax, %rax
/* %rbx is our loop counter */
mov $0xFFFFFFFF, %rbx
/* Now loop doing right shifts on the lower 8 bits of the accumulator: */
_loop:
shrb %al
dec %rbx
cmp $0, %rbx
jnz _loop
/* Exit */
mov $0x01, %rax
syscall
After assembling and linking the above program, we can time how long it takes to run to completion using the shell:
# as -o benchmark1.o benchmark1.S
# ld --Bstatic --no-pie benchmark1.o
# time ./a.out
0m00.95s real 0m00.95s user 0m00.00s system
Hand written assembler: 0m00.95s real
Synthetic benchmark take two
Now let's repeat the benchmark using code that follows what we saw the C compiler emit:
.section ".note", "a"
.p2align 2
.long 0x08
.long 0x04
.long 0x01
.ascii "OpenBSD\0"
.long 0x00
.p2align 2
.section .text
.globl _start
_start:
/* Clear the accumulator */
xor %rax, %rax
/* %rbx is our loop counter */
mov $0xFFFFFFFF, %rbx
/* Now loop doing right shifts on the lower 8 bits of the accumulator: */
_loop:
movl %eax, %ecx
andl $0xffff00, %ecx
shrl %eax
andl $0x7f, %eax
leal (%rax, %rcx), %esi
movl %esi, %eax
dec %rbx
cmp $0, %rbx
jnz _loop
/* Exit */
mov $0x01, %rax
syscall
Do we really expect this to be faster?
# as -o benchmark2.o benchmark2.S
# ld --Bstatic --no-pie benchmark2.o
# time ./a.out
0m03.02s real 0m03.03s user 0m00.00s system
Assembler based on C compiler output: 0m03.02s real
Analysing the results
The difference is quite obvious: the single opcode version of the bit-shifting code is about three times as fast on the test machine.
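On the test machine, 0.95 seconds against 3.02 seconds works out at a ratio of a little over three to one.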
Testing on two more X86 machines gave almost identical results, with the speed ratio consistently at about three to one.
Considering these results, we conclude that the hand assembler optimisation easily wins the synthetic benchmark on all of the test hardware.
Assembler: 1
C compiler: 0
Real world benchmarks
But wait! That's a synthetic benchmark!
OK, synthetic benchmarks are, after all, synthetic.
Maybe real-world performance would be different. Maybe the %al opcode will stall a pipeline, maybe our extract of assembler code in the second example wasn't typical of what the C compiler would produce in real usage, or perhaps the code just isn't called often enough in putchar32() to matter.
Fair enough! Let's test the theory...
To do this, we'll use /bin/cat to display a very long text file on the terminal. Each character drawn causes a call to putchar32(), which invokes our bit-shifting code in the place where we originally put it for the first test of the color shifting effect in the Candlelit Console patchset.
We'll perform the tests in single user mode, repeat them a total of five times, and discard the timings from the first run because the text file would have been read from disk rather than from the buffer cache.
First, let's test performance without the bit-shifting code. Times listed are in seconds:
Kernel without bitshifting code (times in seconds)

Run #          real     user   system   notes
1              5.40     0      5.34     values discarded
2              5.27     0      5.28
3              5.27     0      5.28
4              5.32     0      5.33
5              5.37     0      5.38

Low:           5.27     0      5.28
High:          5.37     0      5.38
Mean average:  5.3075   0      5.3175
Now, with the inline assembler code added to putchar32():
Kernel with asm bitshifting code (times in seconds)

Run #          real     user   system   notes
1              5.54     0      5.49     values discarded
2              5.32     0      5.33
3              5.27     0      5.27
4              5.28     0      5.28
5              5.33     0      5.33

Low:           5.27     0      5.27
High:          5.33     0      5.33
Mean average:  5.3000   0      5.3025
Surprisingly, the benchmark runs very slightly faster with the extra opcode in place!
The conclusion that we can probably draw from this result is that the difference is so small that it falls within the run-to-run margin of error of our measurements. In other words, it's ‘in the noise’.
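Putting numbers on it, the difference between the mean real times is only about 0.14% (5.3000 seconds with the assembler code against 5.3075 seconds without), whereas the spread between the fastest and slowest runs of the unmodified kernel is already around 2% (5.27 to 5.37 seconds).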
Now, we'll test the C version of the code:
Kernel with C bitshifting code (times in seconds)

Run #          real     user   system   notes
1              5.40     0      5.33     values discarded
2              5.30     0      5.31
3              5.51     0      5.51
4              5.43     0      5.43
5              5.52     0      5.54

Low:           5.30     0      5.31
High:          5.52     0      5.54
Mean average:  5.4400   0      5.4475
The C implementation genuinely seems to be slower by about 2.5%.
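That figure comes from the mean real times: 5.4400 seconds for the C version against 5.3075 seconds for the kernel with no bit-shifting code at all is a slowdown of almost exactly 2.5%, and the gap to the 5.3000 seconds of the assembler version is slightly wider still.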
The C code is about 2.5% slower in real world usage than our hand-optimised assembler!
Closing thoughts
Our quick benchmarks seem to show that there really is performance to be gained by using hand-written in-line assembler in this particular case.
It's important to note, though, that moving the code from the putchar32() function to the device color map initialisation function effectively makes this a moot point, as the bit-shifting is then no longer being done on a character by character basis, and so performance of the frequently called putchar32() function should remain unaffected.
In general, micro-optimisations are best avoided if they detract in any significant way from the readability, clarity, or quality of the source code. In the case of the Candlelit Console patchset, a different approach eliminates the need for the in-line assembly, and we get the best of both worlds: performance and clearer source code. We left the assembly code in the final version of the patch mainly as an example for readers who are learning about kernel programming by reading it.
Nevertheless, there are times when absolute performance or code compactness is required, and in those moments there is often no substitute for hand-written assembler from experienced programmers.