ExoticSilicon.com - Console by candlelight addendum, benchmarking our inline assembler code

We always value feedback on our presentations, but we were more than a little surprised to hear that some readers were doubtful about the potential of the performance increase gained by our hand optimisation of the bit-shifting code in our Candlelit Console kernel patchset.

We were always confident of the benefits, but since people have shown an interest, we decided to take some time out to do some benchmarks, and see just what was going on.

Actually, the main point that we intended to make here was that the split register concept which is fairly unique to X86 CPUs is not really taken advantage of by C compilers, but that it can nevertheless be made use of in cases were we are dealing with several 8-bit values packed into a 32-bit word, and when you happen to want to modify the element stored in the lower bits. This happened to be the blue channel in our 32-bit RGB value.

The ‘best’ code that the C compiler produced with a short test program that did bit shifts on random numbers, seemed to be this:

movl %eax, %ecx # Copy the value returned from arc4random_uniform from eax to ecx

andl $16776960, %ecx # ecx & 0xffff00

shrl %eax # eax >> 1

andl $127, %eax # eax & 0x7f

leal (%rax,%rcx), %esi # esi = (rax + rcx)

It's difficult to believe that this could possibly be faster than a single opcode to shift the lower eight bits of the accumulator, even given the complex pipelining and caching that goes on within modern CPUs.

We'll test the two approaches in two different ways. First, a synthetic benchmark to test the raw speed of the corresponding instructions outside of a wider program. Next, we'll benchmark the bit-shifting within the actual rasops code where we originally put it for testing.

Note that in the final version of the Candlelit Console kernel patch, the bit-shifting was moved from the putchar32() function, where it was called for every single character output, into the device color map initialisation code, which was then called every time our modified rasops_eraserows() function was called.

Whilst in putchar32() the overhead could plausibly be significant, in this new location any overhead would be almost undetectable on any reasonably modern hardware. Any performance gain from using assembler is therefore somewhat moot anyway, but it's the principle that we're discussing here.

Building a purely synthetic benchmark is fairly straightforward. We just need to loop a fixed number of times and shift the lower eight bits right by one bit on each iteration of the loop.

.section ".note", "a"

.p2align 2

.long 0x08

.long 0x04

.long 0x01

.ascii "OpenBSD\0"

.long 0x00

.p2align 2

.section .text

.globl _start

_start:

/* Clear the accumulator */

xor %rax, %rax

/* %rbx is our loop counter */

mov $0xFFFFFFFF, %rbx

/* Now loop doing right shifts on the lower 8 bits of the accumulator: */

_loop:

shrb %al

dec %rbx

cmp $0, %rbx

jnz _loop

/* Exit */

mov $0x01, %rax

syscall

After assembling and linking the above program, we can time how long it takes to run to completion using the shell:

# as -o benchmark1.o benchmark1.S

# ld --Bstatic --no-pie benchmark1.o

# time ./a.out

0m00.95s real 0m00.95s user 0m00.00s system

Now let's repeat the benchmark using code that follows what we saw the C compiler emit:

.section ".note", "a"

.p2align 2

.long 0x08

.long 0x04

.long 0x01

.ascii "OpenBSD\0"

.long 0x00

.p2align 2

.section .text

.globl _start

_start:

/* Clear the accumulator */

xor %rax, %rax

/* %rbx is our loop counter */

mov $0xFFFFFFFF, %rbx

/* Now loop doing right shifts on the lower 8 bits of the accumulator: */

_loop:

movl %eax, %ecx

andl $0xffff00, %ecx

shrl %eax

andl $0x7f, %eax

leal (%rax, %rcx), %esi

movl %esi, %eax

dec %rbx

cmp $0, %rbx

jnz _loop

/* Exit */

mov $0x01, %rax

syscall

Do we really expect this to be faster?

# as -o benchmark2.o benchmark2.S

# ld --Bstatic --no-pie benchmark2.o

# time ./a.out

0m03.02s real 0m03.03s user 0m00.00s system

The difference is quite obvious, the single op-code version of the bit-shifting code is three times as fast on the test machine.

Testing on two more X86 machines, the speed ratio was consistently almost identical at about three to one.

OK, synthetic benchmarks are, after all, synthetic.

Maybe real-world performance would be different. Maybe the %al opcode will stall a pipeline, maybe our extract of assembler code in the second example wasn't typical of what the C compiler would produce in real usage, or perhaps the code just doesn't even get called often enough in putchar32() to even matter.

To do this, we'll use /bin/cat to display a very long text file on the terminal. Each character that is drawn and causes a call to putchar32(), will invoke our bit-shifting code in the place where we originally put it for the first test of the color shifting effect in the Candlelit Console patchset.

We'll perform the tests in single user mode, repeat them a total of five times, and discard the timings from the first run because the text file would have been read from disk rather than from the buffer cache.

First, let's test performance without the bit-shifting code. Times listed are in seconds:

Kernel without bitshifting code

Run #	real	user	system	notes
1	5.40	0	5.34	values discarded
2	5.27	0	5.28
3	5.27	0	5.28
4	5.32	0	5.33
5	5.37	0	5.38
Low:	5.27	0	5.28
High:	5.37	0	5.38
Mean average:	5.3075	0	5.3175

Now, with the inline assembler code added to putchar32():

Kernel with asm bitshifting code

Run #	real	user	system	notes
1	5.54	0	5.49	values discarded
2	5.32	0	5.33
3	5.27	0	5.27
4	5.28	0	5.28
5	5.33	0	5.33
Low:	5.27	0	5.27
High:	5.33	0	5.33
Mean average:	5.3000	0	5.3025

Surprisingly, the benchmark runs very slightly faster with the extra opcode in place!

The conclusion that we can probably draw from this result is that the difference is so small that it is within the margin of error that we are measuring on subsequent runs of the code. In other words, it's ‘in the noise’.

Now, we'll test the C version of the code:

Kernel with C bitshifting code

Run #	real	user	system	notes
1	5.40	0	5.33	values discarded
2	5.30	0	5.31
3	5.51	0	5.51
4	5.43	0	5.43
5	5.52	0	5.54
Low:	5.30	0	5.31
High:	5.52	0	5.54
Mean average:	5.4400	0	5.4475

Our quick benchmarks seem to show that there really is performance to be gained by using hand-written in-line assembler in this particular case.

It's important to note, though, that moving the code from the putchar32() function to the device color map initialisation function effectively makes this a moot point, as the bit-shifting is then no longer being done on a character by character basis, and so performance of the frequently called putchar32() function should remain unaffected.

In general, micro-optimisations are best avoided if they detract in any significant way from the readability, clarity, or quality of the source code. In the case of the Candlelit Console patchset, a different approach eliminates the need for the in-line assembly, and we get the best of both worlds, (performance and clearer source code). We left the assembly code in the final version of the patch mainly as an example for readers who are learning about kernel programming by reading it.

Nevertheless, there are times when absolute performance or code compactness is required, and in those moments, there is often no substitute for hand written assembler by experienced programmers.

You know where to find us!