“Obvious oversights. Textbook mistakes. Surely not in the OpenBSD console code?”
Finding and fixing bugs in the wscons UTF-8 implementation
Debugging session
Discovering one bug after another in the UTF-8 decoding logic in OpenBSD, then going on to fix other aspects of related code.
In this article, we'll take a look at the [sorry] state of affairs regarding UTF-8 support in the OpenBSD kernel, at least as of OpenBSD 7.2-release. It's not a pretty picture, but hopefully we can improve things.
In fact, we already have, because a bug report to the project's tech mailing list duly prompted a fix for some code which has been broken since it was introduced nine years ago. Ouch.
Still, the debugging process we went through here to discover the cause of the problems in the first place is worth sharing from the beginning, as the code in question was particularly bad with plenty of textbook mistakes. Who knows what you might find in your own investigations elsewhere.
A quick primer and the opening of the can of worms
Wouldn't it be useful to display all those non-latin scripts, mathematical symbols, and other interesting characters that live above codepoint 0xFF on the console?
$ echo "ÿĀ←↝"
ÿĀ←↝
It seems all too easy to be true
Internally, the ‘characters’ passed around the wscons and rasops code in OpenBSD are unsigned 32-bit integers, so we might assume that somebody has been here before us and done all of the work already. Certainly, if we load a font with more than 256 glyphs, and store a value of 256 or above in a character cell, we might expect the correct glyph to be displayed.
Well, this much is true. It is possible to load a font with more than 256 glyphs, and if you could poke the right values into the right bytes of memory so that the rasops putchar function was called with 0x169, then ‘ũ’, also known as ‘Latin small letter U with tilde’, would dutifully appear.
Unfortunately, this second requirement of sending the right values to the display code is somewhat difficult as of OpenBSD 7.2-release, due to a bug in the UTF-8 decoding logic.
But first, let's take a step backwards and make sure that we're all on the same, (code), page here, (bad pun intended).
What is UTF-8, and why should I care?
So as we've just said, internally the character values are unsigned 32-bit integers. That's fine as an internal representation, but throwing around 32 bits to represent each character, when the majority of English text only uses the first 128 character positions and could be represented with just 7 bits, is obviously not ideal. Add to that the practical need to be backwards compatible with regular ASCII encoding, and at least not too incompatible with existing 8-bit encodings that mirror ASCII for the low 7-bit characters and put a variety of different characters in the top 128 slots, and a naïve fixed-width encoding looks even less attractive.
This was the situation faced in the early 1990s, and amazingly, for once, common sense actually prevailed in the IT industry. A sensible, logical, and efficient encoding was developed which gained broad acceptance in the years that followed. This was UTF-8, and in 2023 it's a reasonable bet that any new software that wants maximum compatibility in interchange applications where ASCII or 8-bit national encodings were previously used, is either going to talk UTF-8 or at least not choke on it.
Whilst UTF-8 is a more complex encoding to handle than plain ASCII, compared with UTF-1, an earlier encoding which was somewhat bizarre and rather badly thought out from a technical point of view, implementing it is a breeze.
In summary, the idea is that programs can and should use any convenient internal representation of individual characters, or character strings that need some kind of processing done on them. But for data interchange, including sending output to a terminal emulator, the UTF-8 encoding system is usually the best choice. Terminal emulators that previously supported ASCII and various ‘national’ 8-bit character sets with limited sets of accented characters should be updated to correctly interpret UTF-8 data streams.
Other capable encodings certainly exist, such as UTF-16. But for general purpose applications, UTF-8 is now clearly the way to go.
Can we use UTF-8 on OpenBSD?
Certainly. At least that's the intention.
Within the X Window System, things work quite well. Xterm supports UTF-8, and so it's possible to just cat a UTF-8 encoded text file to the terminal and see the correct output. Assuming that your font of choice actually includes all of the glyphs, you'll see accented letters, mathematical symbols, line drawing characters, and more. Some of the lesser used and more obscure Unicode features might be missing support, but the actual decoding from UTF-8 into the correct Unicode codepoints works just fine.
But not everybody uses X, or at least not all of the time.
OpenBSD added support for using such unicode characters on the console many releases ago. UTF-8 isn't the default configuration, though.
A look at the current default of ISO 8859-1
By default the console uses ISO 8859-1, (except that the C1 control codes are not implemented, so we can't use, for example, 141 reverse linefeed or 148 destructive backspace).
This gives us a potential 192 printable characters, if we reserve 128-159 as controls. Actually 191 if we want to reserve 127 for delete as well. That's already beginning to sound like a limitation.
It's also possible to switch to one of the ‘national’ ASCII variants, (National Replacement Characterset or NRC), and replace some of the characters in the 32-127 range with some others that are already available in the 160-255 range.
So we can, for example, send the escape sequence ␛(A to the terminal to switch to the British NRC:
$ echo '#^[(A#'
#£
We could do this 25 years ago
This replaces the hash sign at position 35 with the pounds sterling currency symbol, which was available in ISO 8859-1 at position 163 anyway. Useful for legacy software that pre-dates both ISO 8859-1 and UTF-8, or for anybody running in a 7-bit environment, (which in 2023, and especially on the OpenBSD console, is going to be exactly nobody).
If this all seems about as much fun as flipping dip switches on a dot-matrix printer back in 1986, you would be right. Obviously the world has moved on, and so should we. After all, although NRCs as they are currently implemented in OpenBSD would actually allow us to reach beyond position 0xFF and map a character with a higher codepoint in to the 8-bit range, this is hardly a practical way of using a font with thousands of glyphs, (assuming that we had such a font to begin with - all but one of the fonts in the wsfonts directory lack even the glyphs for the box-drawing characters).
If only we could output UTF-8 sequences to the console, and have it interpret those bytes as codepoints above 0xFF. Then we could use such fonts directly, and in a way that software is likely to already support.
TOP TIP
The NRC translation tables are set in wsemul_vt100_chars.c, so if you wanted to modify one to create a custom NRC for your own personal use then that would be where to put your code.
Adding a brand new NRC is also possible, for that you'd need to decide on a suitable escape sequence, add code to recognise it, and call vt100_setnrc with the correct parameters.
Testing the current UTF-8 support
Since the console works in ISO 8859-1 by default, there are escape sequences to switch it in to and out of UTF-8 mode.
The escape sequence to enter UTF-8 mode is ␛%G, and to explicitly leave it we can send ␛%@, although various other escape sequences also have the effect of leaving UTF-8 mode.
This can be demonstrated from the shell. By default we are in ISO 8859-1, so to display ÿ, we can just write an 0xFF byte, (which is 0377 in octal):
$ echo "\0377"
ÿ
Oh, come on! You love octal!
$ echo "^[%G"
$ echo "\0377"
$
Hummm, extra linefeeds anybody?
Now, if we enter UTF-8 mode, as expected this no longer works.
The 0xFF byte is not a valid character in UTF-8, and should be ignored, (in the context of a terminal emulator - other applications may treat it as a fatal error condition). The correct way to display ÿ at codepoint 0xFF is with the two byte UTF-8 sequence 0xC3 0xBF.
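To see where those two bytes come from, here is a minimal stand-alone sketch of the bit manipulation involved. This is purely illustrative, (ordinary userland C, not taken from the kernel source):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t cp = 0xFF;			/* codepoint of 'ÿ'        */
	uint8_t b0 = 0xC0 | (cp >> 6);		/* 110 00011 -> 0xC3       */
	uint8_t b1 = 0x80 | (cp & 0x3F);	/* 10 111111 -> 0xBF       */

	/* Note: this simple packing only covers codepoints 0x80 - 0x7FF. */
	printf("0x%02X 0x%02X\n", b0, b1);	/* prints "0xC3 0xBF"      */
	return 0;
}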
The UTF-8 sequence mostly works as expected:
$ echo "\0303\0277"
ÿ
$
Success!
However, look again at the previous example where we sent 0xFF. Notice the two blank lines after the second echo command before the shell prompt re-appears.
There was also an extra newline in the output when we sent the correct UTF-8 sequence.
This is wrong. It might seem harmless, but it's a tell-tale sign that the kernel code is buggy. We'll see the bug that causes it shortly.
In the meantime, further compare the output of the following commands:
$ echo "foo\0377bar"
foobar
$
Spurious linefeed
$ echo "foobar"
foobar
$
No spurious linefeed
Now something definitely seems wrong. The output should be the same whether the 0xFF byte is there or not, but it's different.
What's more, it's not just a case of an incorrect glyph being printed - the spurious newline isn't even at the point where the rogue 0xFF has been placed in the datastream.
Here we are seeing a classic case of non-conformant input affecting the behaviour of a program processing the data in an unexpected way.
This is usually a bad thing, and seeing it should cause alarm bells to ring in a competent programmer's mind.
Of course, we're digressing somewhat here and skipping ahead in our investigation. Let's get back to testing the current UTF-8 implementation.
The UTF-8 sequence for the first codepoint beyond the 8-bit range, (i.e. codepoint 256, 0x100, or 0400 in octal), is 0xC4, 0x80.
Since we don't yet have a font with a glyph in this position, we would expect to see a question mark displayed instead. Why? Because sys/dev/rasops/rasops.c includes code to substitute ‘?’ for a character that is not present in the currently selected font. This code is in the function rasops_mapchar.
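Conceptually, (and glossing over the details of the real rasops_mapchar function), the substitution amounts to a range check against the font's firstchar and numchars fields:

/* Conceptual sketch only, not the actual rasops_mapchar() code. */
if (uc < font->firstchar || uc >= font->firstchar + font->numchars)
	uc = '?';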
However, testing shows that this doesn't happen:
$ echo "\0304\0200"
$ echo "foo\0304\0200bar"
foobar
$
No 0x100 for us, it seems
Sending the sequence for codepoint 0x100 does not produce the expected result, and the issue with an extra linefeed is also still there.
And so begins a long and tedious journey in to the kernel code to find out what exactly is going on.
Where to start looking
By the time the codepoint reaches the rasops putchar routines as the fourth parameter ‘uc’, it's basically just a linear offset in to the font data. There really isn't much to go wrong at that point.
Before then, wsemul_vt100_output_normal calls the rasops mapchar function. As we discussed above, if the character is not present in the font it's remapped to a question mark.
Since we're not seeing any output at all, with no glyph painted and not even any cursor movement over the space it should take up, the problem obviously lies in the wscons code and the parsing of the UTF-8 data, rather than in the graphically-orientated rasops code.
This is trivially testable, by inserting one line in sys/dev/wscons/wsemul_vt100.c, right at the beginning of wsemul_vt100_output_normal:
if (instate->inchar == '`') { instate->inchar = 0xFFFF; }
Convert all backticks to codepoint 0xFFFF
After re-compiling the kernel and re-booting in to the new kernel, sending a backtick to the console, (either with an echo command from the shell or just typing a backtick at the shell prompt), we get ÿ.
Surprised? Don't be. Remember that since we rebooted, by default the console is not operating in UTF-8 mode but in ISO 8859-1 instead. In this mode, the value 0xFFFF is just being logically and'ed with 0xFF, so we end up losing the high-order byte and seeing ÿ.
However, if we switch to UTF-8 mode with ␛%G, then displaying a backtick does indeed send the value 0xFFFF to the re-mapping function, which faithfully re-maps it to a ‘?’ character.
Now that we've seen wsemul_vt100_output_normal happily hand a codepoint above 0xFF to the rasops code, we need to find out why it's not being called with this character in the first place.
Buggy multibyte decoding
In our preliminary investigations, we saw that the wscons output code can indeed cope with codepoints above 0xFF, as can the rasops putchar routines.
The challenge now is to find out why those codepoints are not being decoded from the UTF-8 datastream.
How UTF-8 works in 60 seconds
Before we look at the UTF-8 decoding logic in the OpenBSD kernel, we need to understand how UTF-8 encoding actually works. Thankfully, this is quite simple.
UTF-8 is a variable length encoding. A single codepoint can be encoded using one, two, three, or four bytes.
The single byte case is obvious - codepoints between 0 and 127 are coded as a single byte with that value, which will obviously have bit 7 reset.
On the other hand, bytes in a UTF-8 datastream that have bit 7 set, if they are valid at all, will always be something to do with a multi-byte sequence. Furthermore, a multi-byte sequence should only contain bytes with bit 7 set, (and therefore nothing in the 0x00 - 0x7F range).
The first byte of a multi-byte sequence indicates the total number of bytes in the sequence by the number of high order bits that are set to 1, before the first 0 bit is encountered. Any remaining bits after that first 0 bit, encode the highest bits of the codepoint being encoded.
Second and subsequent bytes, (known as continuation bytes), of a multi-byte sequence always have bit 7 as 1, and bit 6 as 0, (a combination that would make no sense as the first byte of a multi-byte sequence), with the lowest 6 bits encoding the next lowest order bits of the codepoint.
Encountering a continuation byte without seeing a byte that has either bits 7 and 6 set, with bit 5 reset, or bits 7, 6, and 5 set, with bit 4 reset, or bits 7, 6, 5, and 4 set, with bit 3 reset, means that we are decoding an invalid UTF-8 stream.
Important!
When decoding the first byte of a multi-byte character in a UTF-8 datastream, the zero bit following the initial 1 bits is important, and must be checked for.
Simply checking how many of the high order bits are set, without checking that they are followed by a 0, is not sufficient and can lead to decoding errors.
This is one of the bugs that we will find in the OpenBSD kernel code.
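To make these rules concrete, here is a small stand-alone example of classifying the first byte of a sequence, including the check for the terminating zero bit. It's written as ordinary userland C purely for illustration, and is not the kernel code:

#include <stdio.h>

/*
 * Return the total length of the UTF-8 sequence introduced by byte 'b',
 * or 0 if 'b' cannot legally start a sequence, (i.e. it is a
 * continuation byte, or an invalid 11111xxx lead byte).
 */
static int
utf8_sequence_length(unsigned char b)
{
	if ((b & 0x80) == 0x00)		/* 0xxxxxxx: plain 7-bit ASCII  */
		return 1;
	if ((b & 0xE0) == 0xC0)		/* 110xxxxx: two byte sequence  */
		return 2;
	if ((b & 0xF0) == 0xE0)		/* 1110xxxx: three byte sequence*/
		return 3;
	if ((b & 0xF8) == 0xF0)		/* 11110xxx: four byte sequence */
		return 4;
	return 0;			/* 10xxxxxx or 11111xxx         */
}

int
main(void)
{
	printf("%d %d %d %d\n",
	    utf8_sequence_length(0xC3),		/* 2                       */
	    utf8_sequence_length(0xE0),		/* 3                       */
	    utf8_sequence_length(0x80),		/* 0, lone continuation    */
	    utf8_sequence_length(0xFF));	/* 0, never valid in UTF-8 */
	return 0;
}

Note how each test masks not only the leading one bits but also the zero bit that follows them, checking, for example, (b & 0xE0) == 0xC0 rather than just (b & 0xC0) == 0xC0.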
The UTF-8 decoder in the OpenBSD kernel
The code that implements the multi-byte UTF-8 decoding logic lives in wsemul_subr.c, right at the beginning in the wsemul_getchar function.
When UTF-8 support is disabled or not present in the kernel, this function is quite trivial because there is a one-to-one relationship between bytes input and characters output. As a result, no special decoding logic is required, and data is simply drained from a buffer.
This is not the case when processing UTF-8 encoded data, where a single codepoint can span anything from one to four bytes.
It's quite easy to see how the code is intended to work. The while loop processes characters from the buffer until it either runs out, or we have a valid character. This could be a single byte character, or a successfully re-constructed multi-byte character.
If a character is successfully returned in state->inchar, then the function itself is supposed to return 0. Otherwise it returns EILSEQ.
Except that there are several bugs here, all of which prevent it from working as intended.
The first bug is fairly obvious. Looking near the end of the function, we see some checks for bits 5, 4, 3, and 2 being set, and if they are then mbleft is increased. Clearly mbleft is a counter for the number of remaining bytes to process in this multi-byte sequence.
By pure co-incidence, this works for the values encoding codepoints 0x80 - 0xFF, because those codepoints always have bit 8 unset, (because they are 8 bit values which only use bits 0 - 7). Bit 8 of the codepoint would be stored in bit 2 of the first byte, as can be seen in the following table:
Unicode codepoint to encode    First UTF-8 byte    Second UTF-8 byte
0xFF                           0xC3 11000011       0xBF 10111111
0x100                          0xC4 11000100       0x80 10000000
Unfortunately, as soon as we try to process a two-byte sequence encoding a codepoint of 0x100 or above, the check for bit 2 being set returns true, and mbleft is increased by one. The effect of this will be that instead of returning the re-constructed value after the second byte, we'll loop again and try to apply the next byte in the buffer to the value of the codepoint that we are building.
We might at this point naïvely expect the following character to be swallowed, but whether by design or accident, when a continuation byte which should have its highest bit set doesn't in fact have it set, the code doesn't actually return EILSEQ, but instead just assigns this value to rc and then falls through to the following if statement that checks whether it's a single-byte character or a new multi-byte one.
So effectively, when we tried sending the 0xC4, 0x80 sequence to print character 256, the kernel tried to interpret it as a three-byte sequence instead of a two-byte sequence. The trailing newline caused the re-assembly of the multi-byte character to fail, but we fell through to code which then correctly processed the newline character.
Even more interesting things happen if there is no following character. If we output just the 0xC4, 0x80 sequence without a trailing newline using the -n option to echo, then the console will hang. The shell prompt will not re-appear until we press a key, at which point the entire prompt will be printed.
$ echo -n "^[%G\0304\0200"
A regular user can hang the OpenBSD console from the shell.
“Press any key to continue.”
Readers who are familiar with UTF-8 encoding will already have noted that we shouldn't be checking for bits 3 and 2 being set anyway, as these would encode five and six byte multi-byte characters. Such encoding was valid in the original UTF-8 proposal, but has long since been deprecated.
In any case, we should be checking bit 4 if and only if all of bits 7, 6 and 5 are set.
If we were to check bits 3 and 2 to implement the deprecated five and six byte sequences, then we should be checking them only when the bit to the immediate left is also set.
Assuming that we don't want to include support for the deprecated five and six byte sequences, then what we really want to do is to move the test for bit 5 in to the conditional if for bit 6, and add a check that bit 4 is zero, throwing EILSEQ if not.
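Expressed as code, the lead byte handling would end up looking something like the following. Note that this is only a sketch of the general approach, keeping four-byte sequences and rejecting anything longer, rather than the exact patch to wsemul_subr.c:

/*
 * Sketch of the corrected lead byte handling, assuming that the single
 * byte case, (bit 7 clear), has already been dealt with above.
 */
if ((c & 0x40) == 0x40) {			/* bit 6 set: a lead byte       */
	if ((c & 0x20) == 0x00) {		/* 110xxxxx: one byte follows   */
		mbleft = 1;
		tmpchar = c & 0x1F;
	} else if ((c & 0x10) == 0x00) {	/* 1110xxxx: two bytes follow   */
		mbleft = 2;
		tmpchar = c & 0x0F;
	} else if ((c & 0x08) == 0x00) {	/* 11110xxx: three bytes follow */
		mbleft = 3;
		tmpchar = c & 0x07;
	} else {				/* 11111xxx: invalid in UTF-8   */
		rc = EILSEQ;
	}
}

A sketch of saner lead byte handling, not a verbatim patch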
One bug down, many more to go
At this point, more things start to work. Characters in the range 256 - 2047 can now be displayed, although there is still another issue to fix which will cause erratic behaviour for the time being.
Nevertheless, if we repeat our earlier shell command to display character 0x100, it works:
$ echo "foo\0304\0200bar"
foo?bar
Progress is being made!
Since we can now send these higher codepoints to the rasops code, it's finally worthwhile actually installing a font with glyphs at codepoints beyond 0xFF.
Extra glyphs
Although there are plenty of fonts available to download from the internet, as well as programs to convert them to the right format to use on the console, for testing purposes I designed some new characters:
/* 0x100, 256, 100000000 - Centre dot */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x06, 0x00, /* .....**..... */
0x06, 0x00, /* .....**..... */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
/* 0x101, 257 - generic glyph */
0xee, 0xe0, /* ***.***.***. */
0xaa, 0xa0, /* *.*.*.*.*.*. */
0xaa, 0xa0, /* *.*.*.*.*.*. */
0xaa, 0xa0, /* *.*.*.*.*.*. */
0xee, 0xe0, /* ***.***.***. */
0x00, 0x00, /* ............ */
0x2e, 0x20, /* ..*.***...*. */
0x2a, 0x20, /* ..*.*.*...*. */
0x2a, 0x20, /* ..*.*.*...*. */
0x2a, 0x20, /* ..*.*.*...*. */
0x2e, 0x20, /* ..*.***...*. */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
/* 0x102, 258 - generic glyph */
0xee, 0xe0, /* ***.***.***. */
0xaa, 0xa0, /* *.*.*.*.*.*. */
0xaa, 0xa0, /* *.*.*.*.*.*. */
0xaa, 0xa0, /* *.*.*.*.*.*. */
0xee, 0xe0, /* ***.***.***. */
0x00, 0x00, /* ............ */
0x2e, 0xe0, /* ..*.***.***. */
0x2a, 0x20, /* ..*.*.*...*. */
0x2a, 0xe0, /* ..*.*.*.***. */
0x2a, 0x80, /* ..*.*.*.*... */
0x2e, 0xe0, /* ..*.***.***. */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
/* 0x103, 259, 100000011 - Wiggle */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x01, 0x80, /* .......**... */
0x03, 0x00, /* ......**.... */
0x06, 0x00, /* .....**..... */
0x03, 0x00, /* ......**.... */
0x01, 0x80, /* .......**... */
0x03, 0x00, /* ......**.... */
0x06, 0x00, /* .....**..... */
0x03, 0x00, /* ......**.... */
0x01, 0x80, /* .......**... */
0x03, 0x00, /* ......**.... */
0x06, 0x00, /* .....**..... */
0x0c, 0x00, /* ....**...... */
0x18, 0x00, /* ...**....... */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
0x00, 0x00, /* ............ */
These new glyphs can be added to the gallant12x22 font by simply appending the data at the end of the existing declaration for gallant12x22_data[], and changing the value of .numchars from 256 - ' ' to 260 - ' '.
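In outline, the change looks like this. The exact declarations in the real header may differ slightly, and everything other than .numchars is elided here:

/* Hypothetical sketch of the edit to the gallant12x22 font header. */
static u_char gallant12x22_data[] = {
	/* ... the existing glyph bitmaps ... */
	/* ... the four new glyph bitmaps shown above, appended here ... */
};

struct wsdisplay_font gallant12x22 = {
	/* ... */
	.numchars = 260 - ' ',		/* previously 256 - ' ' */
	/* ... */
};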
After re-compiling the kernel and rebooting into the new kernel, we are finally able to display some new characters that natively live beyond the 8-bit boundary!
Hold the celebrations
Unfortunately, our success will prove to be short-lived. Whilst we now have two-byte characters being decoded correctly, three and four byte characters are not.
Now that we've seen how to encode UTF-8 sequences by hand and send them to the console using octal codes in the shell, we'll automate the process slightly using a Perl script.
#!/usr/bin/perl
use Encode;
print (chr(27).'%G');
for ($n=2045; $n<2055; $n++) {
print (">".encode('UTF-8', chr($n))."<\n");
}
print (chr(27).'%@');
print ("\nDone\n");
Running this Perl script on a system running OpenBSD 7.2-release, with the changes to wsemul_subr.c described above to fix the interpretation of the bits in the byte that introduces a multi-byte character, results in the following output:
$ ./test_script.pl
>[0007fd]<
>[0007fe]<
>[0007ff]<
><
><
><
><
><
><
><
Done
<
>à <
Done
$
Output seen from the above Perl script
Screenshot
Here [0007fd] represents the glyph at codepoint 0x7FD, or 2045.
Note that 0x800, or 2048 is the first codepoint which requires three bytes to encode in UTF-8.
I actually created a font with an extra 129791 characters, filling the codepoints up to 0x1FBFE to test rendering of high codepoints. The screenshot shows the results of running the above perl script in all of their bitmapped glory.
Obviously we still have some more debugging to do.
Revisiting the UTF-8 decoder
The problem with the code this time is perhaps less obvious. Presumably the function must either be returning EILSEQ, or possibly returning a control character in state->inchar, because we don't get a glyph output at all when processing these three and four byte characters.
Luckily, reading through the code and substituting into the variables the appropriate values that we might expect to have whilst processing the second byte, (i.e. the first continuation byte), quickly shows up the bug.
Considering codepoint 2048 in UTF-8, we would have the three-byte sequence 0xE0, 0xA0, 0x80, which in binary is 11100000, 10100000, 10000000:
First byte    Second byte    Third byte
0xE0          0xA0           0x80
11100000      10100000       10000000
The first byte is processed just fine by our modified code, and will set mbleft to 2.
The second byte has bit 7 set but bit 6 clear, so we end up running the ‘else’ clause of the first nested if statement. Here the bits are added to tmpchar, where the ultimate value of the decoded codepoint is being accumulated, and mbleft is decreased.
So far, so good. But now we get to the buggy code.
If mbleft has reached zero, which at this point when processing two-byte sequences it would have done, we set rc to zero to indicate success, and promptly break out of the while loop to exit the whole function, (after writing some locally changed values back to the supplied pointers).
Unfortunately, if mbleft hasn't reached zero, we just fall off of the end of the ‘else’ clause, and continue processing. The same byte is then parsed by the next block of code, the block following the comment about deciding whether it's a multibyte sequence or not.
Oops! That's not good. Clearly we should have looped back to the beginning of the while loop to grab another byte from the input.
So what exactly happens if we continue following the program flow through this erroneous code path?
This is an important question to ask when finding a bug in existing code that has been present for a long time on production systems. Rather than just fixing the problem, we need to establish whether the bug is exploitable on machines out ‘in the wild’ in a harmful way or not.
Well in this case, the continuation byte, which has bit 7 set, obviously doesn't match with the next if statement, so it's not treated as a single byte character. The next test checks whether bit 6 is reset, which, of course, it should never be for the start of a multi-byte sequence. Since in this case it is, mbleft and tmpchar get set to zero, and rc is set to EILSEQ to indicate the error condition.
However, we don't exit the function at this point! Instead, we loop around again and grab the next byte, the third and last byte of the sequence. At this point, since mbleft is now zero, we don't even touch the ‘else’ clause of the first nested ‘if’, but instead end up directly back at the test for a 7-bit character or the start of a multi-byte one. The single character test fails, as we have bit 7 set, and we end up running the ‘abort the sequence’ code again, (even though we were not in a sequence anymore, at least according to mbleft being zero), due to bit 6 being reset.
Finally, we end up at the start of the while loop again, where we process the next character in the buffer, (if there is one), normally.
The fix this time is to add a check for still having bytes remaining in a multi-byte sequence, and jumping back to the beginning of the while loop using a continue statement if we do. This avoids running the same byte through the code below and ultimately aborting the multi-byte character that we were constructing:
if (mbleft > 0) { continue; }
The fix is so simple, and yet... elusive!
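To put that one-liner in context, a much-simplified skeleton of the decoding loop, (again, this is illustrative and not the verbatim wsemul_getchar code), looks like this:

/* Much simplified skeleton of the decoding loop - illustrative only. */
while (len > 0) {
	c = *buf++;
	len--;

	if (mbleft > 0) {
		/* A continuation byte: accumulate its low six bits. */
		tmpchar = (tmpchar << 6) | (c & 0x3F);
		mbleft--;
		if (mbleft == 0) {
			rc = 0;		/* complete character assembled  */
			break;
		}
		continue;		/* the fix: fetch the next byte  */
	}

	/* ... classify 'c' as a single byte character, a lead byte, */
	/* ... or an error, as discussed above.                      */
}

Without the continue, execution falls through to the classification code below and the partially built character is thrown away.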
Learning to count
Next bug - text in the wrong places
With the obvious bugs in the UTF-8 decoder fixed, we might naïvely assume that everything would ‘just work’.
This is, of course, ignoring the fact that the UTF-8 decoder is still far from perfect. It still allows overly-long encodings, and doesn't reject the UTF-16 surrogate codepoints, but the point is that well behaved software should display the correct text in the correct place.
Unfortunately, it doesn't.
The following perl program demonstrates the problem by outputting the UTF-8 sequences for characters with codepoints between 192 and 287:
#!/usr/bin/perl
use Encode;
print (chr(27)."%G");
for ($n=192; $n<288; $n++) {
    print (encode('UTF-8',chr($n)));
}
print (chr(27)."%@");
print ("\nDone\n");
Outputting 96 characters - what could go wrong?
Evidently, quite a lot!
As we can clearly see from the screenshot, there is extraneous output after the expected characterset.
Buffer management issues
To the trained eye, this is clearly an issue of confusing bytes with characters. Somewhere in the code, our ‘current position’ in a buffer is not being updated correctly. Instead of moving forward by two, three, or four bytes, we're moving forward by one.
Obviously there is more going on than just this, though, because if we were just moving through a buffer and skipping over or failing to skip over the correct number of characters, we'd see missing or repeated output along the way. That is not happening; instead we appear to be outputting the data correctly but somehow repeating the last bit of it.
Seeing this kind of program behaviour suggests that there are, in fact, two, (or more), pointers to the same data that should be kept in sync but, due to a bug, are not. Where possible, it's best to avoid writing software like this, but if it is necessary then being diligent and not letting this classic class of bug creep into your code is absolutely vital.
Just imagine if this same sort of error were to occur elsewhere, such as when copying between buffers that contained cryptographic keys. We could potentially write past the end of the target buffer, (rarely a good thing), or alternatively continue reading past the supposed end of data in the source buffer and gain access to a copy of data that we might not otherwise have, (because the spurious trailing copy hadn't been subsequently zeroed out along with the data that was actually intended to be in that buffer), and more. The mind boggles.
Looking closely at the garbage characters after the real output is also revealing. We see a lot of à which is 0xC3, and then Ä which is 0xC4. These are bytes that introduce two-byte UTF-8 characters. So when the data is erroneously dumped to the console a second time, it's not even being interpreted as UTF-8, but just dumped directly.
Interesting.
Fun and games trying to exploit it
This behaviour can be used to hang the console in a different way to what we saw before when we sent the 0xC4, 0x80 sequence. That basically just swallowed the next byte of input, but we can do better, (or worse, depending on your viewpoint), than this.
We'll test the situation with our local changes to the kernel code backed out. This means that any mis-behaviour we discover would be reproducible on a stock installation of this release.
The following perl script should repeatedly output the characters at codepoints 160 - 255:
#!/usr/bin/perl
use Encode;
no warnings;
print (chr(27)."%G");
for ($m=0; $m<256; $m++) {
    for ($n=160; $n<256; $n++) {
        print (encode('UTF-8',chr($n)));
    }
}
print (chr(27)."%@");
print ("\nDone\n");
Too much of a good thing - the console eventually hangs
Due to the buggy kernel code, the console hangs after it's output about 2048 characters.
This is different from the previous hang, because we need to interrupt it with ^C rather than pressing any key to generate a character.
Outputting the 0xC4, 0x80 sequence repeatedly, as the following perl script does, also causes an interesting effect:
#!/usr/bin/perl
no warnings;
print (chr(27)."%G");
for ($m=0; $m<2048; $m++) {
    print (chr(0xC4).chr(0x80));
}
print (chr(27)."%@");
print ("\nDone\n");
An amusing visual effect
When this script is run, the console will appear to hang. Typing any character will cause that character to repeat, and further characters can then be typed which will be concatenated to it and repeatedly sent to the console. At some point the repeating string is cleared and we start again. Eventually it will just hang as before, and we'll need to ^C ourselves out of it.
Although this is fun to see, especially on an unmodified -release kernel, it doesn't seem to be particularly exploitable.
Investigating the buffer management issue
Whilst our changes to the UTF-8 decoding logic in the wsemul_getchar function haven't fixed all of its shortcomings, it should at least now be internally consistent. In other words, it processes the intended number of bytes, and updates its own ‘len’ and ‘buf’ values correctly. Certainly whilst testing it with valid UTF-8 sequences, we wouldn't expect to see any problems.
So the buffer mis-management issue must lie elsewhere.
The getchar function is called from wsemul_vt100_output, and just by looking casually at the code we can see an interestingly named variable, ‘processed’, being incremented by exactly one for every character that is sent to wsemul_vt100_output_normal. The value of ‘processed’ is then eventually used as the return value for the function.
Even without studying the code further, it seems likely that this should be counting bytes and not characters.
This theory can be tested in a very quick and simple way by adding two lines after the processed++ statement following the call to wsemul_vt100_output_normal:
if (instate->inchar >= 160) { processed++; }
if (instate->inchar >= 2048) { processed++; }
This is an ugly hack, but good enough for testing
Obviously this is far from being the correct fix, but it will allow us to see if we're on the right track.
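A more principled fix would advance ‘processed’ by the number of bytes that the decoder actually consumed for each character. As a hypothetical illustration, (this helper does not exist in the OpenBSD source), the byte count for a given codepoint is easily derived:

#include <stdint.h>

/*
 * Hypothetical helper: the number of bytes a codepoint occupies when
 * encoded as UTF-8.  A real fix would add this, (or better, a byte
 * count reported by the decoder itself), to 'processed' instead of
 * always incrementing it by one.
 */
static int
utf8_encoded_length(uint32_t cp)
{
	if (cp < 0x80)
		return 1;
	if (cp < 0x800)
		return 2;
	if (cp < 0x10000)
		return 3;
	return 4;
}

For now, though, the two-line hack above is enough to test the theory.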
Testing the characterset program again, we get a much better result.
#!/usr/bin/perl
use Encode;
print (chr(27)."%G");
for ($n=192; $n<288; $n++) {
    print (encode('UTF-8',chr($n)));
}
print (chr(27)."%@");
print ("\nDone\n");
The same code we used earlier...
Now the characters are displayed as expected, and there is no spurious trailing output.
... but a very different result!
Displaying lots of un-interrupted multi-byte characters will still cause the console to hang, confirming that the testing code we just added to further increment ‘processed’ isn't sufficient to solve the problem.
Just as it was getting interesting...
Having got this far, we were quite close to discovering the root cause of the remaining problem and then hopefully moving on to something more interesting, such as adding some checks in the UTF-8 decoder for overly-long sequences.
However, at this point I received an email replying to a post I'd made a week or so earlier about the original issue to the OpenBSD development mailing list, and shortly after the UTF-8 decoding code in the OpenBSD CVS was replaced. Shortly after that, another commit addressed the buffer mis-management issue, so hopefully these bugs present in OpenBSD 7.2-release are now a thing of the past.
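For reference, the kind of extra validation alluded to above, (rejecting overly-long encodings and the UTF-16 surrogate range), only needs a couple of checks once the codepoint and its sequence length are known. The following is a generic sketch, and not the code that was eventually committed:

#include <errno.h>
#include <stdint.h>

/*
 * Generic sketch: validate a codepoint 'cp' that was decoded from a
 * sequence of 'nbytes' bytes.  Returns 0 if acceptable, EILSEQ if not.
 */
static int
utf8_validate(uint32_t cp, int nbytes)
{
	/* Smallest codepoint that genuinely needs 1, 2, 3, or 4 bytes. */
	static const uint32_t min_for_length[] =
	    { 0, 0x00, 0x80, 0x800, 0x10000 };

	if (nbytes < 1 || nbytes > 4)
		return EILSEQ;
	if (cp < min_for_length[nbytes])
		return EILSEQ;			/* overly-long encoding     */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return EILSEQ;			/* UTF-16 surrogate         */
	if (cp > 0x10FFFF)
		return EILSEQ;			/* beyond the Unicode range */
	return 0;
}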
Conclusions
During this debugging session, we saw quite a few examples of coding mistakes that fall in to the category of fairly basic and easily avoided textbook bugs.
The fact that we are finding these sorts of bugs at all, lurking in code that has been in an operating system for nine years, is surprising enough.
Add to that the fact that this is an operating system which is considered to have a policy of code-correctness and auditing, as well as a focus on security, and it becomes somewhat concerning.
The fact that it has not been possible to use codepoints above 0xFF on the console since the kernel UTF-8 support was added in 2013, suggests that few if any users of the system have even tried to do so, and any who have done so didn't care enough about the issue to mention it on the project's mailing lists.
Of the actual programming errors, we found:
Falling out of conditional blocks through to code that follows, when we should be skipping over it or looping.
Confusing byte-counts with character-counts when dealing with multi-byte characters.
Allowing two pointers to the same data to get out of sync with each other.
Using broken logic to test the bits of the first byte of a multi-byte UTF-8 sequence.
Testing that logic, if it was tested at all, was presumably only done with common values which just ‘happened to work’ due to bit 2 being reset, and not with lesser used codepoints above 0xFF, which would have quickly shown up the problem.
Professionals at Exotic Silicon quickly found all of the above issues once we started looking at the code.
We're IT specialists. And we can fix bugs in your code, too.