EXOTIC SILICON
“Crystal puts all of her bits in the right positions”
Decoding, encoding, and parsing UTF-8 streams in C code
Decoding and processing UTF-8 streams in C code?
Learn how to do it the right way, right here with Crystal Kolipe.
Last time we looked at UTF-8 decoding it was in the context of the OpenBSD console code, where Exotic Silicon found various bugs that effectively made it unusable in practice. To be fair, that code was written a long time ago when UTF-8 was still quite new, and since UTF-8 has never been used by default on the console, it was probably rarely looked at in the ten years or so before we noticed the issue.
But most code that deals with UTF-8 isn't kernel code at all; in fact, most of the C programmers who read our articles here at Exotic Silicon's research site probably rarely if ever touch kernel code.
Userspace applications are where UTF-8 decoding tends to be done, and there the requirements might be quite different.
So if you want to make sure that your UTF-8 handling software is as good as it can be with full functionality and no subtle bugs, stick around and see how!
Fun fact!
Kernel bugs remain
As of the time of writing, there are actually still some subtle bugs in the handling of UTF-8 in the OpenBSD kernel although the situation is much improved.
The buggy code that we identified was subsequently swapped out and replaced with code that correctly decodes multibyte UTF-8 characters. Yet sending specific malformed sequences whilst in UTF-8 mode will still cause the console to hang, even on a fresh installation of OpenBSD 7.4.
Presumably the remaining bugs lie elsewhere in the murky depths of the wscons code, where even I haven't yet ventured.
Userspace application considerations
In the kernel, the context in which we are using UTF-8 and our requirements are well-defined - we expect strictly compliant input for decoding to individual codepoints, and reject anything that fails this test.
But non-kernel applications might well have a need to parse - or even encode - a UTF-8 stream that breaks the rules. Additionally, when parsing such a stream a fine degree of error reporting is often desirable. Rather than a simple boolean flag for valid or invalid, letting the calling code know exactly how the input fails to meet the specification is sometimes essential for proper error handling or safe fixing up of bad input.
The easiest way to meet this need is to write some UTF-8 handling code from scratch.
Avoiding the C standard library functions
Of course, these days the C standard library includes functions to deal with multi-byte encodings, but we certainly won't be using any of those here for several reasons.
Firstly, the semantics of the C library routines are pretty awful, making code that uses them messy and inconvenient to work with. For example, mbstowcs requires us to pre-allocate the output buffer without knowing its exact size requirements ahead of time, because the library code obviously can't dynamically resize the output buffer for us.
If we call mbrlen() over the entire input to get the length first then we're basically decoding the same stream twice. The first time we decode and discard the actual output just to get the length value, then at some point, quite likely immediately after allocating the output buffer, we decode the exact same input again and save the result. Obviously this isn't great for performance or efficiency.
But the bigger problem is that the library routines handle the details of the UTF-8 stream internally in an opaque way and just report any decoding error indiscriminately with EILSEQ. This is a severe limitation and greatly restricts the depth to which we can parse and manipulate any UTF-8 stream that isn't strictly standards compliant - which in some cases includes the input merely being truncated, depending on exactly which functions you use.
Furthermore, the behaviour of the C library functions is dependent on locale. On OpenBSD this isn't much of an issue, but if you are writing portable code that needs to run on other systems it may well be. And if you're developing portable code on an OpenBSD machine where the issues are not readily obvious to the uninitiated, then buggy code may well result.
So in summary, the C library functions are useful if:
1. You're lazy.
2. You don't know how to write UTF-8 parsing code yourself.
We can't do much about laziness, but we can certainly teach you how to write a UTF-8 parser, so let's get on with it.
Overview of UTF-8 encoding
To ensure that our code is correct and reliable we obviously need a full understanding of the UTF-8 encoding scheme. Luckily, it's fairly simple.
Encoding of data to UTF-8 involves taking 32-bit, (see below), values and outputting each of them as a variable-length byte sequence or token representing that 32-bit value.
Decoding a UTF-8 stream is basically the reverse process, reading variable-length tokens one byte at a time, building up a 32-bit value, and outputting those 32-bit values once the last byte of the token has been reached and they've been fully re-constructed.
These 32-bit values are also known as runes in some environments.
Pedantic note
Actual size of encoded values
Strictly speaking, the values only need to be 21-bit rather than 32-bit - if and only if we limit ourselves to processing UTF-8 streams that strictly adhere to the specification.
In practice, however, such 21-bit values are most conveniently stored in a 32-bit data type anyway, so this detail can mostly be ignored and the same approach used regardless of whether we're limiting the range of codepoints to the official size of 21 bits or the practical 31-bit, (actually not 32-bit), limit of an unconstrained bitstream.
A key design decision that makes UTF-8 particularly convenient is that the variable-length tokens for all 7-bit ASCII characters - input bytes with values 0x00 - 0x7F - are just those single bytes with the same value. So pure 7-bit ASCII text is already a valid UTF-8 stream, and can be passed unchanged.
We only need to deal with tokens of two or more bytes when encoding input values of 0x80, (128 decimal), and above. Furthermore, every valid token of two or more bytes always consists exclusively of bytes with bit 7 set, so only byte values of 0x80 - 0xFF should appear in the output when encoding an input value of 0x80 or greater.
Convenient encoding side-benefits
The design of UTF-8 allows us to do some useful things in simple ways, such as:
  • Detecting with a high degree of confidence whether we're actually looking at a valid UTF-8 stream.
  • Reading a UTF-8 stream backwards.
  • Identifying newlines, tabs, and other C0 control characters without fully decoding the input.
  • Finding the nearest character boundary starting from any arbitrary point in the stream.
  • Calculating the size of the decoded output without fully decoding the input.
(See example code to implement these functions in ‘Parsing techniques for niche requirements’ further below.)
 
Furthermore, the second and subsequent bytes of a multi-byte token always fall in the range 0x80 - 0xBF. This is true not only of strictly standards compliant streams, but also of the vast majority of ‘mostly compliant but not quite’ data that is otherwise intended to be interpreted as UTF-8.
The first byte of such a token should fall between 0xC2 - 0xF4, with some further restrictions then being placed on the possible values for the following byte in some cases.
In reality, the mathematical logic behind the coding scheme allows us to place a fairly obvious and unambiguous meaning on tokens with first bytes in the range 0xC0 - 0xFD, with second and subsequent bytes anywhere in the normal 0x80 - 0xBF range.
As a result, a stream deliberately created with these ‘unofficial’ characteristics can be decoded in such a way that is highly likely to produce the intended result. In fact, early draft versions of the UTF-8 scheme explicitly allowed many of them and the current restrictions were only introduced later.
Bitstream format
The actual encoding scheme is best visualised in binary notation:
Byte type                     Bit position    Byte values
                              76543210
ASCII characters              0xxxxxxx        0x00 - 0x7F
Continuation bytes            10xxxxxx        0x80 - 0xBF
2 byte sequence introducer    110xxxxx        0xC0 - 0xDF
3 byte sequence introducer    1110xxxx        0xE0 - 0xEF
4 byte sequence introducer    11110xxx        0xF0 - 0xF7
5 byte sequence introducer    111110xx        0xF8 - 0xFB
6 byte sequence introducer    1111110x        0xFC - 0xFD
Unassigned                    11111110        0xFE
Unassigned                    11111111        0xFF
Written in this way, it's clear to see that the number of leading ones in the binary representation determines the purpose of each byte.
Single byte ASCII characters store their 7 bits of data unchanged in bits 0-6.
Multiple byte tokens begin with one of the introducer sequences above, which also provide the first 1 - 5 highest bits of the final value, and then the rest of those bits are stored in the lower 6 bits of one or more continuation bytes.
The upper limit on the number of bits that a single token can store is therefore 31, with one bit in the sequence introducer followed by 30 more spread across five continuation bytes.
The representation above describes the logic of UTF-8 encoding without the arbitrary restrictions on valid codepoints that were introduced some time after the initial adoption of the standard. With these restrictions in place, the five and six byte sequence introducers become invalid bytes and 4 byte sequences that would represent code points above 0x10FFFF are also considered invalid.
Unfortunately, whilst this technique of storing the lower-order bits of the codepoint directly in the lower six bits of continuation bytes has the big advantages of simplicity and legibility when written out in binary notation, (and also, in fact, when written in octal), it does have one major drawback.
The issue of concern is that it is possible to encode all but the longest codepoints in multiple ways. Tokens of varying length can be created with all of the low-order bits of the introducer set to zero and possibly one or more ‘empty’ continuation bytes, (0x80), both of which contribute only leading zero bits to the codepoint being built up.
For example, the highest codepoint that can be represented by a two byte sequence is 0x7FF, with all eleven bits set:
11011111 10111111
correct, standard-compliant representation of 0x7FF
However, the following three byte sequence also logically decodes to the same value 0x7FF:
11100000 10011111 10111111
over-long encoding of 0x7FF
Important detail!
Correct over-long encoding detection
Note that it is not possible, as might be naively assumed, to reliably detect such over-long encodings simply by observing that the low bits of the introducer byte are all zero. Some codepoints with all of the low bits in the introducer byte set to zero are in fact the shortest possible encodings and are therefore valid.
This happens because when moving from an n-byte encoding to an (n+1)-byte encoding, the number of available bits in the continuation bytes alone always increases by six. This value of six is, in turn, always more than the number of significant low bits previously used in the introducer byte.
So using such a simple check would create false positives in some cases, such as the encoding of 0xFFF:
11100000 10111111 10111111
Correct and shortest possible UTF-8 encoding of 0xFFF, which has no low bits set in the introducer, but which cannot be represented by a shorter two byte sequence.
Instead, over-long encodings can be correctly identified by first fully decoding them to a codepoint value and then checking that the decoded value is indeed greater than the maximum value that can possibly be encoded by the next shortest sequence.
Very early drafts of the predecessor of UTF-8 actually avoided the issue of multiple possible encodings for the same codepoint by adding an arbitrary offset on to multi-byte tokens, effectively increasing the lowest possible value that they could encode to one above the maximum of the previous-sized encoding. However, this aspect of the design was, perhaps unfortunately, removed from the final specification in favour of the more direct approach described above, with the proviso that overly long encodings of codepoints that could also be represented in fewer bytes were invalid.
This change slightly simplified the encoding process and also arguably made manual reading of the UTF-8 bytestream easier, although the benefit of the latter is only really apparent during debugging or other troubleshooting processes. With regards to decoding, the change effectively just traded one form of complexity for another, because now it became necessary for decoders to check for overly long encodings which simply did not exist in the earlier draft.
Useful formulas
Number of bits represented by a token
For all multi-byte tokens, the number of bits represented in an 𝓍 byte sequence is:
7 - 𝓍 + 6 × (𝓍 - 1)
Note that this calculation only applies to multibyte tokens with lengths of two to six bytes.
Using a value of 𝓍=1 in the above formula for a single byte token will return an incorrect value of 6 instead of the correct value 7.
Highest value that can be stored by a token
Extending the above formula to form a bit shift, we can calculate the highest value that can be stored in the same token.
In C, we could calculate it with:
((1 << (7 - x + 6 * (x - 1))) - 1)
Again, this calculation only applies to multibyte tokens with lengths of two to six bytes.
Lowest permitted bound
The lowest permitted codepoint in a sequence of y bytes is the first value which does not fit in a sequence of (y-1) bytes.
Unfortunately this means that we have to special case y=2, as single byte sequences are in fact seven bits and not six as calculated by the first formula.
Nevertheless, the lower bound for a y byte sequence where (y > 2), can be calculated with:
(1 << (8 - y + 6 * (y - 2)))
The same calculation can also be written as:
(1 << ((7 - y + 6 * (y - 1)) - 5))
In practice, since single byte values don't follow the otherwise logical progression of values and strictly standard-compliant streams only contain multi-byte tokens that don't exceed a length of four bytes, it's often, (but not always!), more practical to pre-calculate these values and include them as magic numbers in conditional statements in the code.
Worked example
Encoding a check mark (✓)
The character, ‘Check Mark’, has codepoint 0x2713, which is 10003 in decimal.
Expressed as binary, this is 10011100010011.
The highest 1 bit is in position 13, and we have a total of 14 bits, so this codepoint will require three bytes to encode it as a UTF-8 token.
Since the continuation bytes always have six bits, we can group the bits to be encoded like this: 10 011100 010011.
The three byte introducer is 1110xxxx, so we pad the 14-bit value with two leading zero bits for correct alignment.
Therefore the final encoding is:
11100010 10011100 10010011
 
E2 9C 93
Correct and shortest possible UTF-8 encoding of check mark ✓, codepoint 0x2713.
Reader exercise
For fun, try decoding the codepoint that is represented by the bits shown in the background picture!
Answer, (and textual description), at the bottom of the page.
First steps towards decoding a UTF-8 stream - parsing and validating
As previously mentioned, there are several different requirements that we might have for parsing a UTF-8 stream. For example, we might have the entire input available in one go, or we might be parsing character by character. Additionally, we might or might not know the total length of input at the start of decoding.
None of this really makes any difference to the actual decoding algorithm used, but obviously the code for the specific implementation will differ depending on whether it needs to store and recall its current state in the middle of decoding a multi-byte token, whether it needs to allocate an increasing amount of memory as it goes along, and so on.
So first of all, we'll look at what is probably the simplest case of decoding - parsing a known amount of data that is already in memory without writing the output anywhere, to validate that the input is indeed a valid UTF-8 stream.
By writing this code we'll create a framework around which to write a full decoder function later on.
Remember that the focus here is to write a comprehensive UTF-8 parser with good error handling rather than create the smallest or fastest implementation. If you are looking for performance, though, don't worry - I'll show you some advanced techniques in assembler further below that can significantly increase performance over the equivalent C code in many cases.
Top tip!
Avoiding common pitfalls
Aside from a lack of detailed error reporting, there are two main issues with poor coding style that I've seen in various sub-optimal implementations of UTF-8 decoding logic:
  • Looking ahead in the stream by an arbitrary number of bytes when encountering a multi-byte token instead of keeping state and parsing the input byte-by-byte.
  • Using explicit checks for the validity of the first continuation byte after the introducer instead of decoding the whole multi-byte token and then checking its value.
Of course it is possible to write bug-free, valid code using either or both of those approaches, but in my experience it tends to make the code more difficult to audit and also less flexible for making changes.
Specifically, modifying such code to alter its behaviour when it encounters particular types of non-conformance in the input, such as over-long encodings, encodings for codepoints above 0x10FFFF, or codepoints that represent the UTF-16 surrogates, can be error prone.
Correct bounds checking to avoid reading past the end of the input is trivial when the code simply loops over it once and only ever processes the current byte. In contrast, looking ahead in the stream means doing either implicit or explicit checks that we are at least one, two, three, or more bytes before the end depending on how far ahead we want to look.
Explicit checks for certain combinations of introducer and continuation bytes that would encode overly-long sequences tend to add a lot of arbitrary magic numbers to the source code. These are time consuming and tedious to check for correctness if the code ever breaks due to other changes.
For example, a decoder which is only intended to decode strictly compliant UTF-8 streams might use a conditional statement such as the following to identify the start of a two-byte token:
if (byte >= 0xc2 && byte <= 0xdf) {
/* Parse a two byte sequence */
}
Sufficient for fully compliant UTF-8 streams
Values of 0xC0 and 0xC1 can also introduce logically valid two byte sequences, but only for codepoints which should be encoded as single bytes.
As a result, the use of 0xC2 as the lower bound here instead of 0xC0 works and can be used as a way to enforce the rejection of such over-long encodings. Although it might not be immediately obvious to anybody reading the code at a later date why that has been done, the change required to accept the two byte over-long encodings is trivial:
if (byte >= 0xc0 && byte <= 0xdf) {
/* Parse a two byte sequence */
}
Handles over-long two byte encodings
However for three and four byte sequences it's more complicated because the lowest value introducer for each token length, (0xE0 for three bytes and 0xF0 for four bytes), can form overly long sequences when formed by some values of continuation byte and not with others.
So the check becomes more complicated, with more magic numbers. Additionally, we either have to read ahead in the input to see what the first continuation byte is, ensuring that we don't read past the end of the input buffer if the introducer was actually the last byte and isn't followed by anything; or, if we're parsing one input byte at a time and keeping state, we need to set a flag to indicate this special condition of being in the middle of decoding a possibly invalid sequence, and then check that flag on the next iteration of the loop.
Much easier and less prone to error is to decode the possibly over-long sequence in its entirety first, then check that the decoded value is greater than the maximum that could be encoded by a shorter sequence.
Note that this doesn't contradict what I said above about pre-computing the bit-sizes of, and lowest permitted bounds of the various lengths of multi-byte token. The point here is to check the reconstructed codepoint value for compliance and not individual bytes, be that with pre-computed magic numbers or algorithmically each time.
Annotated source code
Code licensing
The code in the example routines presented on this page is licensed under the ESASSL.
Here is the annotated C source code for a function that does a comprehensive validation of a UTF-8 stream.
On success it returns zero, and if an error is found it returns an error code from the following table and sets the supplied variable errpos to the position in the input where the error occurred.
Error codes
 1   Bit 0 - String contains an invalid byte such as 0xFE or 0xFF.
 2   Bit 1 - String contains an over-long encoding.
 4   Bit 2 - String contains an invalid codepoint, (surrogate half).
 8   Bit 3 - String contains an invalid codepoint, (> 0x10FFFF).
16   Bit 4 - String contains non-continuation byte when continuation byte was expected.
32   Bit 5 - String contains continuation byte when non-continuation byte was expected.
64   Bit 6 - String ends in the middle of a multibyte character.
int utf8_validate(unsigned char * buffer, unsigned int len, unsigned int * error_pos)
{
The function declaration is straightforward enough. We pass a pointer to the UTF-8 stream to be decoded and pass its length separately, since the stream isn't necessarily null-terminated and may contain embedded 0x00 bytes.
The error_pos argument should be a pointer to an already allocated integer variable that the validator can use to return the position of the first error encountered, (if any). Its value is undefined if no error occurs.
unsigned int i;                /* Loop counter */
unsigned int codepoint_build;  /* Codepoint value being built up over several bytes of a multi-byte token */
unsigned char bitops;          /* Copy of the byte currently being processed, used to count the leading set bits */
unsigned char state;           /* Zero if we are not already in a multibyte character, otherwise the number of bytes left */
unsigned char leading_ones;    /* Number of leading one bits in the current byte */
unsigned char flag_byte_done;  /* Flag set once we have processed the current byte */
unsigned char flag_overlong;   /* Flag set if a decoded multi-byte token was longer than it needed to be */
unsigned char multibyte_size;  /* Size of the multibyte token currently being processed, (if processing such a token) */
#define CUR_BYTE *(buffer+i)
#define STORE_ERRPOS if (error_pos != NULL) { *error_pos = i; }
Two defines purely to make the source code more readable.
CUR_BYTE gives us the value of the byte currently being processed.
STORE_ERRPOS writes the current position to the supplied error_pos variable, unless no variable was supplied, (by the caller passing NULL instead).
state=0;
At the start of processing we are not yet within a multi-byte token.
for (i=0; i < len; i++) {
flag_byte_done=0;
This is the main loop over the input.
For each character, we reset flag_byte_done and then set it once we've processed it in some way to avoid falling through to code below.
bitops=CUR_BYTE;
for (leading_ones=0; (bitops & 0x80)==0x80; leading_ones++, bitops=bitops << 1) ;
Here we count the number of leading set bits in each byte. This is the fundamental way that the different types of byte are identified in UTF-8 encoding, and we will use the value of leading_ones in several places further on.
The actual count is done by copying the value to the temporary variable bitops and repeatedly shifting it left by one bit until the high bit is zero.
Note that this technique correctly counts the leading set bits even in the case of 0xFF where all bits are set, because as the original number is shifted left, the low-order bits that are introduced to fill the space are set to zero.
if (state > 0 && leading_ones != 1) {
STORE_ERRPOS
PRINTF ("Non-continuation byte within multibyte sequence at string pos %u\n", i);
return (16);
}
Now we start a series of tests for each possible situation. The first is simple enough, if we are already processing a multi-byte token, (state > 0), then only continuation bytes are valid. ASCII characters which would have zero leading ones, and further sequence introducers which would have two or more, are not valid, so if we encounter them then we return error code 16.
Top tip!
Debug output and the PRINTF() macro
Notice that the code extract above includes a call to PRINTF() in uppercase. This isn't a typo, but a convenient macro, which we'll use throughout and should be defined as follows near the start of your source code:
#ifdef DEBUG
#define PRINTF printf
#else
#define PRINTF(...) {}
#endif
This handy macro allows you to do two things with ease:
  • Include or exclude all of the debug output at compile time, simply by building with or without DEBUG defined.
  • Convert existing printf() calls into debug-only PRINTF() calls, (and back again), just by changing the case of the function name.
The second point is particularly easy to do in editors based on vi, using the tilde key in command mode.
Although more comprehensive debugging aids and output methods do exist, this simple technique is useful for smaller and uncomplicated projects where you don't really need multiple levels of debug output.
if (state > 0) {
PRINTF ("Continuation byte at string pos %u\n", i);
codepoint_build = codepoint_build << 6;
codepoint_build |= CUR_BYTE & 0x3F;
state--;
flag_byte_done = 1;
At this point, if state is still non-zero then we know that we are actually processing a continuation byte in the range 0x80 - 0xBF.
The data may or may not be valid since it could be encoding an over-long encoding, but at the moment we don't care because it's certainly decodable to a codepoint and we'll check that codepoint once it's completely decoded.
Continuation bytes always contribute exactly their lower six bits to the codepoint being built up. The way we do it here is to shift whatever we already have left by six bits, then store the new bits in the lowest bits of codepoint_build. If there is another continuation byte to follow, then the bits that we've just written will be moved as required.
Of course we could just calculate the final position of each set of six bits from each continuation byte and store it in the correct location to begin with. In that case, the code above would need to change to something like this instead:
if (state > 0) {
PRINTF ("Continuation byte at string pos %u\n", i);
codepoint_build |= (CUR_BYTE & 0x3F) << (6*(state-1));
state--;
flag_byte_done = 1;
But note that with this alternate code in place it would also be necessary to change the code that handles the sequence introducer byte which we'll see shortly.
For now, we'll stay within the (state > 0) conditional block and check for having just completed reassembling a multibyte token.
Remember that we've just decreased state by one, so the following check for state==0 makes sense even though we're inside the (state > 0) conditional block.
/*
* Finished decoding a multibyte? Then check that the codepoint is valid.
*/
if (state==0) {
/*
* Overly long encoding?
*/
flag_overlong=0;
if (multibyte_size == 2 && codepoint_build < 0x80) {
flag_overlong = 1;
}
if (multibyte_size == 3 && codepoint_build < 0x800) {
flag_overlong = 1;
}
if (multibyte_size == 4 && codepoint_build < 0x10000) {
flag_overlong = 1;
}
PRINTF ("%sultibyte, (%u bytes), codepoint 0x%x, (decimal %u) ending at string pos %u\n", (flag_overlong==1 ? "Overlong m" : "M"),
multibyte_size, codepoint_build, codepoint_build, i);
This nested conditional for state==0 handles all of the possible outcomes following the re-construction of a multi-byte token.
First we check for it being an over-long encoding. We're doing that here with a series of explicit checks for better code readability, but later on when we write a full decoder we'll change this to do it algorithmically, partly because we'll want to support token lengths up to six bytes instead of just up to four.
If we see an over-long encoding, we just set a flag which will be checked in the next block of code.
if (flag_overlong == 1) {
STORE_ERRPOS
return (2);
}
if (codepoint_build > 0x10FFFF) {
PRINTF ("Codepoints > 0x10FFFF are invalid.\n");
STORE_ERRPOS
return (8);
}
if (codepoint_build >= 0xD800 && codepoint_build <= 0xDFFF) {
PRINTF ("Codepoint encodes a UTF-16 surrogate, and is therefore invalid in UTF-8.\n");
STORE_ERRPOS
return (4);
}
}
} /* Closes (state > 0) conditional */
To finish off the multi-byte token handling, we check for three specific issues, the overlong encoding flag that we just discussed, high codepoints above 0x10FFFF, and codepoints that encode the UTF-16 surrogates.
If any of these conditions is encountered, we return with the corresponding error code.
if (flag_byte_done == 0) {
if (leading_ones == 1) {
PRINTF ("Continuation byte without multibyte start byte\n");
STORE_ERRPOS
return (32);
}
If we haven't matched anything so far, yet the current byte is a continuation byte, (leading_ones == 1), then it can't have been part of a valid multi-byte sequence following an introducer.
In this case, we return error code 32.
if (leading_ones > 4) {
PRINTF ("Invalid byte 0x%02x in UTF-8 stream at position %u\n", CUR_BYTE, i);
STORE_ERRPOS
return (1);
}
Equally, if the number of leading ones exceeds four then we are either looking at a five or six byte introducer, or possibly an 0xFE or 0xFF byte.
Logically, 0xFE and 0xFF might be thought of as seven and eight byte sequence introducers, although they have never been assigned to such a role in any official version of the UTF-8 standard. Nevertheless, since our bit-counting code counts them in this way, the (leading_ones > 4) conditional above will conveniently catch 0xFE and 0xFF.
if (leading_ones > 1) {
flag_byte_done = 1;
state = multibyte_size = leading_ones;
/*
* Decrement state by one, as the number of continuation bytes is one less than the total.
*/
state--;
/*
* The leading ones are always followed by a zero.
* Therefore, the number of lower bits that contain codepoint data is (7 - leading_ones).
* We can create the necessary bitmask by right shifting 0xFF by leading_ones places.
*/
PRINTF ("Multibyte start at string pos %u\n", i);
codepoint_build = (CUR_BYTE & (0xFF >> leading_ones));
}
}
At this point we've already checked for introducers of tokens beyond four bytes in length, so if leading_ones is two or more then we have the start of a multi-byte token that we want to decode, (but which could still turn out to be invalid for other reasons).
We store the number of leading ones, (which directly represents the overall number of bytes in the sequence), in multibyte_size so that we can refer to it once the sequence is complete and check that it isn't an over-long encoding.
The same value is also initially stored in state, but since this represents the number of remaining continuation bytes and doesn't include the introducer, we immediately decrement state by one to get the correct value.
Any low bits of the introducer which form part of the final codepoint are copied into the lower bits of codepoint_build, and will be shifted left as required every time we encounter a continuation byte.
As with the continuation bytes, it's also possible to write the bits to the correct positions directly:
codepoint_build = (CUR_BYTE & (0xFF >> leading_ones)) << (6 * state);
Doing it this way involves more calculations overall, though, including multiplication which is not by a factor of two.
Top tip!
Calculating the bitmask
Up to five of the least significant bits of the introducer contain codepoint data.
Since the final re-constructed codepoint value will be used directly without any masking of its significant bits, we need to mask off the variable number of leading set bits in the introducer byte here.
This applies in the same way regardless of whether we initially store these bit values directly in the corresponding low bits of codepoint_build and let the code we saw earlier shift them left by six bits each time it encounters a valid continuation byte, or whether we store them directly in their final position.
To create a suitable mask, we can simply take the value 0xFF and shift it right by the number of leading ones. For example, for a three byte sequence introducer we shift by three bits:
11111111 0xFF
00011111 (0xFF >> 3) = 0x1F
11101010 Typical three byte sequence introducer
00011111 Mask
00001010 Bits containing codepoint data
Note that strictly speaking, the bits that represent actual codepoint data would be correctly masked by (0xFF >> (leading_ones + 1)):
00001111 (0xFF >> 4) = 0x0F
00001010 Bits containing codepoint data
However, by definition the bit immediately following the leading set bits must be a zero, so including it in the mask as a set bit will not affect the value extracted from the introducer and doing it this way allows us to make a slight optimisation by avoiding the calculation of (leading_ones + 1).
if (flag_byte_done == 0) {
/*
* Must be a regular ASCII character
*/
PRINTF ("Regular ASCII character at string pos %u\n", i);
flag_byte_done=1;
}
Having dealt with every possible instance of multi-byte introducers and continuation bytes, if flag_byte_done is still zero then we're looking at a regular ASCII character, 0x00 - 0x7f.
In a decoder the ASCII value would just be written out directly as the 32-bit codepoint value, (with possible special handling of 0x00 and other control characters). Since we're only validating the input and not writing the decoded output anywhere there is no special processing required and we just set flag_byte_done to skip the integrity check below.
if (flag_byte_done == 0) {
PRINTF ("Logic error in utf-8 parser!\n");
exit(1);
}
This final block of code should never run with the program in its current form. It was used during development, (together with a now removed check for CUR_BYTE < 0x80 in the previous block), to catch errors in the decoding logic before everything was complete and fully debugged.
Now that the function runs successfully and produces the correct results we could indeed remove these lines, along with the line that sets flag_byte_done when an ASCII character is found.
However, as well as being included here for completeness as an example of how to add such debugging code to your own programs, this check guards against the assumption that the function will always produce the expected results - an assumption that could be broken by any of the following:
Whilst it's not really practical to make any guarantees about programs running on faulty hardware or that have been compiled with buggy compilers, we do plan to make further changes and additions to this code. During this process it's useful to keep the integrity check above enabled.
}
This closes the big for (i=0; i < len; i++) loop.
if (state != 0) {
PRINTF ("End of string within multibyte character - expecting %u more bytes\n", state);
STORE_ERRPOS
return (64);
}
PRINTF ("Valid UTF-8 string of %u bytes\n", len);
return (0);
#undef CUR_BYTE
#undef STORE_ERRPOS
}
All that is left to do once we've reached the end of input is to check that we didn't finish within an uncompleted multi-byte token.
If we did have an uncompleted token then we return error 64, otherwise the validation was successful and the function returns 0.
Bit counting
Look again at the line in the code above that counts the number of high set bits:
for (leading_ones=0; (bitops & 0x80)==0x80; leading_ones++, bitops=bitops << 1) ;
Now as a C programmer you might be thinking that doing this bit counting loop for each byte would be painfully slow compared to a set of explicit comparisons.
However, in practice, bit shifts and bit operations in general are very cheap on modern CPUs.
When compiled with a modern optimising compiler, the above C code was seen to end up in X86 assembly as something like the following:
xorl %r14, %r14 # leading_ones=0
test %bl, %bl # if ((bitops & 0x80) == 0) ...
jns done # ... then { goto done; }
loop:
addb $1, %r14b # leading_ones++
addb %bl, %bl # bitops = bitops * 2
js loop # if (bitops < 0) { goto loop; }
done:
(If you're having trouble visualising how this computes the same values for leading_ones as the C code, don't worry - I've included a version of the assembly code logic expressed in C further below.)
In synthetic benchmarks this was indeed slightly slower than a set of explicit checks. On the test machine, typically by 7% and at worst around 9% - 10% on pure random 32-bit data, but only about 1% slower when the input had a strong bias towards ASCII text. This latter mix is obviously kinder to the bit-shifting algorithm since the loop exits immediately when the high bit is 0. We also theorise that the explicit comparisons version would benefit more from speculative execution in modern CPUs than the bit-shifting algorithm.
Now, a 1% slow-down might not sound very encouraging, but since the set high bit counting is just a small part of the overall UTF-8 decoding process, the drop in performance would likely be almost undetectable in any practical real world application. Certainly, there are benefits to accepting a small performance hit in return for keeping the source code easy to audit and easy to modify.
But of course, I wouldn't be working here at Exotic Silicon's research department if that was all I could offer you! So how about we go off on a tangent here, spice things up a bit, kick some dirt in the face of the C compiler, and squeeze every last drop of performance out of our hardware with some hand written assembler?
Assembler optimisations
Amazingly, considering the leading bit counting code in isolation, we can get a boost of approximately 12% - 19% on ASCII-orientated data, and - wait for it - a cool 500% speed increase on random data! Yes, that's not a typo, the hand coded assembler code actually runs in less than a fifth of the time that the compiler optimised C version takes when processing random data on the test machine!
Curiosity
Deterministic running time
What's also interesting is that the hand coded assembly runs in almost constant time irrespective of the input data, and so is hardly affected at all by the mix of ASCII and non-ASCII characters.
This kind of deterministic behaviour is useful from a security perspective for avoiding unwanted leaking of information about the type of data being processed via covert timing side-channel attacks.
The key here is knowing that many CPU architectures have specific instructions to do things like find the highest bit that's set, the number of zero bits, and so on.
In the case of X86, there isn't a specific single instruction that would directly count the number of leading ones in an 8-bit value. However, it is possible to count the number of leading zero bits in a 16-bit, 32-bit, or 64-bit value, using the lzcnt opcode.
This can obviously be used in combination with a logical not operation to count the leading one bits in a particular value, as we can just flip every bit of that value first and then count the leading zeros.
Since we can't use this opcode on an 8-bit value, we obviously need to move it into the high bits of a larger one and run the lzcnt opcode over that. The naïve approach would be to assume that moving to a 16-bit value would be fastest since we don't need to shift the bits by as many places as we would do for a 32-bit or 64-bit value. However in practice various factors influence the performance of the code, and dealing with 32-bit quantities is often faster than using 16 bits.
Overall then, the assembler code looks something like this, with the input byte in the lower bits of %ecx, and the number of leading ones being written to %r12d:
shll $24, %ecx # Shift the value in %ecx left by 24 bits.
notl %ecx # Invert the entire 32-bit value in %ecx.
lzcntl %ecx, %r12d # Store the number of leading zero bits in %r12d.
Within our C code, the assembler is represented slightly differently, as we want to let the compiler chose which registers to use, as well as load the values to and from the C variables on entry and exit:
asm volatile ("shl $24, %0; not %0; lzcntl %0, %1" : "+r" (bitops) ,"+r" (leading_ones)::);
The benchmark reference code is essentially just a loop containing the relevant line from the UTF-8 validating routine, that iterates over a buffer of random data:
for (i=0; i < BSIZE; i++) {
bitops=inbuf[i];
for (leading_ones=0; (bitops & 0x80)==0x80; leading_ones++, bitops=bitops << 1) ;
bits[leading_ones]++;
}
Here is the optimised assembler code using lzcnt for the actual bit counting only:
for (i=0; i < BSIZE; i++) {
bitops=inbuf[i];
asm volatile ("shl $24, %0; not %0; lzcntl %0, %1" : "+r" (bitops) ,"+r" (leading_ones)::);
bits[leading_ones]++;
}
In the version above, the code for looping over the input and writing the results to the bits[] array is still done in C.
This allows us to determine a fairly realistic value for the potential performance increase available from using hand-optimised assembler whilst still being able to fit the rest of the UTF-8 decoding logic in C around the new code. In other words, what we could achieve without re-writing the whole UTF-8 decoding function itself in assembler.
However, if we are actually only interested in counting the leading set bits in each byte of the input and producing a tally of those values in the bits[] array, then we can do even better. Taking advantage of the CPU's ability to process 64-bit values, we can instead write the whole loop in assembler like this:
asm volatile (
"mov %1, %%r13; add $32, %%r13;" /* Store the address of bits[8] in %r13 for quick access later */
"shrq $3, %2;" /* BSIZE *= sizeof(uint64_t) */
"outerloop:"
"dec %2; mov (%0,%2,8), %%rdx;" /* %rdx now contains 64 bits of useful data */
"not %%rdx;" /* Only invert once for the whole 64 bits */
"mov $8, %%rcx;" /* %rcx is the inner loop counter */
"innerloop:"
"lzcntq %%rdx, %%r15;" /* Count leading zeros, might go beyond 8 and into the next byte */
"cmp $8, %%r15;" /* Test for 8 */
"jl skip;" /* If < 8 then skip forward to the regular handling code */
"incl (%%r13);" /* Otherwise, hard-code an increase to bits[8]... */
"shl $8, %%rdx;" /* ... and unroll the loop code for speed ... */
"dec %%rcx; test %%rcx, %%rcx;" /* ... rather than jump to the identical code below */
"jnz innerloop;"
"test %2, %2; jnz outerloop;jmp out;"
"skip: leaq (%1,%%r15,4), %%r14;" /* Store the address of bits[%r15] in %r14 */
"incl (%%r14);" /* Increase bits[%r15], (pay attention, that's _not_ a typo!) */
"shl $8, %%rdx;" /* Move to the next byte of the 8 in the inner loop */
"dec %%rcx; test %%rcx, %%rcx;" /* Check inner loop counter */
"jnz innerloop;" /* Repeat inner loop */
"test %2, %2; jnz outerloop;out:" /* Repeat outer loop */
:: "r" (inbuf), "r" (bits), "r" ((uint64_t)BSIZE) : "rcx", "rdx", "r15", "r14", "r13", "memory");
At this point we've obviously moved away from UTF-8 decoding somewhat and into the realm of synthetic benchmarks. Although this version was around 18% faster than the first assembler version in casual tests on random data, if we tried to integrate the rest of the UTF-8 decoding function into the code above, (which would be a non-trivial task anyway), it's very likely that much of the performance gained by doing 64-bit reads would be lost.
However, before dismissing this as a purely academic exercise, it's worth noting that this algorithm can actually be used to calculate the size of the decoded data given any valid UTF-8 input stream, without the overhead of actually doing the decoding!
This is obviously useful if we want to allocate exactly the required amount of memory before starting the real decoding process.
The formula needed is simple:
number of output codepoints = bits[0] + bits[2] + bits[3] + bits[4] + bits[5] + bits[6]
We don't need to use the count of continuation bytes as they are included in the initial sequence introducing byte.
Of course, we could verify that the number of continuation bytes matches the expected number implied from the initial leading bytes:
bits[1] == bits[2] + 2*bits[3] + 3*bits[4] + 4*bits[5] + 5*bits[6]
None of this guarantees that the bitstream is valid, as it could contain the correct number of continuation bytes but just not in the correct places after each multi-byte token introducer. But the second formula above does immediately catch certain errors and is a very fast calculation to perform if we already have the bits[] array correctly populated.
Naturally, the value computed by the first equation for the size of the output buffer will also only be correct if the UTF-8 stream is actually valid, otherwise it will be meaningless.
This doesn't cause much of a problem in practice though, even considering that the decoding logic should always be written such that invalid input cannot cause data to be written beyond the end of the allocated output buffer.
If the decoding function always stops immediately when it encounters invalid input, the data won't overflow an output buffer sized with the first formula above.
In a function that does try to correct broken input, we need to be careful if we do that in any way that could extend the output beyond the pre-calculated size. For example, if continuation bytes with exactly one leading bit set were passed through unchanged to the output, (which would probably be a bad idea in most cases anyway), then the output buffer may indeed overflow if it was sized with that formula and further steps are not taken to detect such an overflow.
In summary, the ability to count the leading bits of each byte very quickly is potentially useful in situations where re-sizing the output buffer is an expensive operation, and we want to allocate the correct amount of memory up front without fully decoding the whole stream twice.
Benchmark code
If you want to compare the speeds of different C and assembler implementations of the function on your own hardware, the source code for a benchmark program to do exactly that is included in the download link at the bottom of the page in the ‘downloads’ section.
The benchmark tests a total of seven algorithms, four implemented in C and three in hand written X86 assembler. If you are running on a platform other than X86 then the assembler routines can be disabled leaving only those implemented in C.
When invoked, the benchmark code fills a buffer with random 32-bit values which can either be truly random or biased towards ASCII character codes using a compile-time define. (Technically 127 is not usually a printable character, but we included it anyway.)
The leading set bits are counted a total of fifty times using a randomly selected algorithm on each iteration. The random ordering of algorithms is intended to minimise any bias caused by side-effects of the previously run algorithm, (such as data being fetched from cache instead of main memory). A summary showing the fastest run time for each algorithm is then displayed.
To perform a comprehensive benchmark of your own hardware, first compile the program without SIMULATE_MOSTLY_ASCII defined and run it. On the reference hardware, the assembler routines were considerably faster in this configuration, although your results might vary. Next, re-compile the benchmark code after defining SIMULATE_MOSTLY_ASCII and run it again.
This time the buffer will be filled with approximately 94% of bytes being in the range of ASCII characters 32 - 127, and yet again, fifty iterations over the new buffer are performed. Now the difference between C and assembler implementations is likely to be much less, although in our tests the fastest computation was still always achieved with the last of the three hand written assembler routines.
The benchmark code includes some basic internal consistency checks intended to catch errors caused by broken compilers, broken hardware or changes to the code that introduce bugs, and it exits with a fatal error if it detects that a second or subsequent run over the same input data produces different results to the first run.
Refer to the comments in the source code for further details.
Assembler algorithm expressed in C
For those readers who were having trouble visualising how the first extract of assembler code computes the values for leading_ones, here is a version of the assembly code logic expressed in C.
Whilst this would be a rather cumbersome way to perform the task in a C program, the same method is very compact when written in X86 assembler.
#include <stdio.h>
int main()
{
unsigned char leading_ones;
char bitops;
unsigned int i;
for (i=0; i < 256; i++) {
bitops=i;
leading_ones=0;
printf ("Input 0x%02x, ", i);
if ((bitops & 0x80) == 0) {
goto done;
}
loop:
leading_ones++;
bitops = 2 * bitops;
if (bitops<0) {
goto loop;
}
done:
printf ("leading ones: %d\n", leading_ones);
}
return (0);
}
Decoding a UTF-8 stream
After our brief diversion into assembly language, let's return to decoding UTF-8 in C.
We saw earlier how to parse a UTF-8 stream and verify that it strictly complied with the standard. This essentially involved decoding the codepoints and discarding the output.
Now we'll use that code as the basis for a full decoder.
The main differences of the new function compared with the existing validator code will be:
Data structure
Taking the above points in reverse order, let's look at the structure that we'll be passing to the decoder. In fact, we'll use exactly the same structure for the encoder a bit later on.
struct st_utf8_params {
unsigned char * buffer_encoded; /* Encoded data buffer */
uint32_t * buffer_decoded; /* Decoded data buffer */
unsigned int buffer_encoded_len; /* Encoded data buffer length */
unsigned int buffer_decoded_len; /* Decoded data buffer length */
unsigned int max_token_len; /* Maximum token length to accept */
unsigned int error_pos; /* Position of fatal error */
unsigned int ignore; /* Bit flags for errors to ignore */
unsigned int error_flags; /* Bit flags for errors encountered and ignored */
unsigned int error_len; /* Number of bytes involved in an error, (used by pretty error printer) */
};
Some of these elements will be used to supply values to the function, and others will be used to return values back to the caller.
For the decoder, the input is expected to be in buffer_encoded, with its length indicated by buffer_encoded_len. The function itself will allocate memory as required for the decoded data and store the pointer to it in buffer_decoded, along with its length in buffer_decoded_len.
If buffer_decoded_len contains a non-zero value when the decoder function is called then that value will be used as a hint to size the output buffer, but supplying such a hint will not be required and the function will also be capable of sizing the buffer completely automatically if buffer_decoded_len is set to zero.
Before calling the decoder we will also need to set max_token_len to indicate the largest multi-byte tokens that we want to accept as being valid. For standards compliant streams this would be set to four. As discussed previously, streams which are not strictly compliant often contain tokens of up to six bytes, which we could accept and decode by setting max_token_len to six.
Note that the maximum token length is related to, but distinctly separate from the concept of allowing or disallowing codepoints above 0x10FFFF.
It is possible to encounter a stream with either of the following characteristics:
  • Restricted to tokens with a maximum length of four bytes, but still encoding some codepoints between 0x10FFFF and 0x1FFFFF as four-byte tokens.
  • Restricted to codepoints within the 0x000000 - 0x10FFFF range, but with some of these codepoints encoded as over-long encodings of five or six bytes.
The last value that will be used as an input by the decoder function is ignore. Any bits set in this value correspond to types of error that we want the decoder to ignore and continue past. The error numbers will be based on those that we used in the validator function.
For example, if the validator encountered a byte with value 0xFF, it set bit 0 in the return value to indicate an ‘invalid byte encountered’ error and returned without processing any more of the input.
In the case of the decoder, it might be desirable to ignore such bytes and continue processing the input stream. To enable that behaviour, we would set bit 0 in the ignore parameter. Since it is interpreted as a bitfield, we can set as many bits as we like and ignore multiple errors as required.
The other elements of the structure will be filled by the decoder, and these all relate to errors found whilst processing the stream.
If a fatal error causes the function to return before the end of the input, the position of the byte that caused the error will be stored in error_pos just as it was in the validator. Errors which are ignored will not set error_pos, so if a fatal error occurs along with one or more non-critical errors then the value of error_pos will always indicate the fatal error.
The non-fatal errors are reported in error_flags, as a bitfield which is analogous to that used for the ignore parameter. In general, only bits which are set in ignore will ever be set in error_flags. There is one exception to this which we will cover later on.
Finally, we have an error_len value, which indicates the number of bytes involved in an error, (for example a single 0xFF byte would be a one byte error, but a four byte token that encodes an over-long sequence would have all four bytes involved). This error_len value will be used in a separate error printing routine to highlight exactly which bytes caused the error.
New error codes
The error codes that we defined for utf8_validate() will be used to report the same errors in the decoder. We'll also introduce two new error codes to report situations that didn't occur in the validation code:
128  Bit 7  Memory allocation error.
256  Bit 8  Invalid parameters supplied.
Setting these bits in the ignore flag will have no effect, as it makes little or no sense to continue after these errors.
Decoder source code
Here is the annotated source code for the decoder function.
int utf8_decode(struct st_utf8_params * params)
{
This time we are passing all of the required values within a single structure.
unsigned int i;
unsigned int codepoint_build;
unsigned int outbuf_size; /* Amount of memory currently allocated to the output buffer */
unsigned char bitops;
unsigned char state;
unsigned char leading_ones;
unsigned char flag_byte_done;
unsigned char flag_overlong;
unsigned char multibyte_size;
uint32_t * new_outbuf; /* Temporary pointer used when re-allocating the output buffer */
Most of the local variables here serve the same purpose as they did in the validator function. The two new ones relate to the management of the output buffer, which obviously didn't exist in the validator.
Note that outbuf_size is distinct from params->buffer_decoded_len.
Whereas buffer_decoded_len will store the actual number of valid elements in the output buffer, (in other words the current position to append data to), outbuf_size will store the amount of memory currently allocated which will always need to be more than is currently in use.
#define CUR_BYTE *(params->buffer_encoded + i)
#define STORE_ERRPOS { params->error_pos = i; }
#define MAX_OUTBUFFER_SIZE (256 * 1024 * 1024) /* 256 Mb */
#define REALLOC_INCREASE (64 * 1024) /* 64 Kb */
The first two of these macros also serve the same purpose here as they did in utf8_validate(), but be aware that the actual definitions have changed.
Also note that the value of REALLOC_INCREASE should always be a multiple of sizeof(uint32_t). See below for further details.
CUR_BYTE is functionally the same, with just a change of variable name from buffer to buffer_encoded.
STORE_ERRPOS no longer checks for error_pos being NULL, as it is now params->error_pos and as part of the supplied structure it always exists.
MAX_OUTBUFFER_SIZE is used to constrain the amount of memory that can be allocated to the output buffer.
REALLOC_INCREASE is a fairly arbitrary constant specifying the size of increments to use if the initial memory allocation is insufficient.
params->error_flags=0;
params->error_pos=0;
params->error_len=0;
These are values that will be returned to the calling function.
Since error_flags is a bitfield that will be built up one bit at a time if and when we encounter any non-fatal errors, we need to initialize it to zero before use.
The other return values for position and length will be explicitly set before returning if an error occurs during the actual decoding, so we don't need to initialize them. But if we happen to return early before even starting to decode the input, due to an initial memory allocation error or invalid parameters, then it makes sense to have set them both to zero.
In the case of successful decoding with no fatal errors these default values of zero will also be returned.
We don't set a default value of zero for buffer_decoded_len here because that variable is also optionally used to supply a hint value for the size of the output buffer and we haven't read that value yet.
if (params->max_token_len > 6 || params->max_token_len < 2 || params->buffer_encoded_len==0) {
params->buffer_decoded_len=0;
return (256);
}
Here we check that the supplied parameters are within a sensible range.
The maximum multi-byte token length is constrained to values between 2 and 6, and zero-length input is also rejected.
If we return due to error here then we set buffer_decoded_len to zero, overwriting any supplied hint value.
#define INITIAL_SIZE (params->buffer_decoded_len == 0 ? params->buffer_encoded_len : params->buffer_decoded_len)
if (INITIAL_SIZE > MAX_OUTBUFFER_SIZE / sizeof(uint32_t)) {
params->buffer_decoded_len=0;
PRINTF ("Initial memory allocation exceeds maximum configured size of %u bytes\n", MAX_OUTBUFFER_SIZE);
return (128);
}
outbuf_size = sizeof(uint32_t) * INITIAL_SIZE;
#undef INITIAL_SIZE
PRINTF ("Allocating %u bytes for output buffer\n", outbuf_size);
params->buffer_decoded=malloc(outbuf_size);
if (params->buffer_decoded == NULL) {
PRINTF ("Initial memory allocation error\n");
params->buffer_decoded_len=0;
return (128);
}
This is the initial memory allocation code.
Since UTF-8 is a variable length encoding, we don't generally know in advance of decoding a stream how many codepoints the input will actually decode to because we don't know anything yet about the frequency and mix of multi-byte tokens. By extension, we usually don't know the exact size required for the output buffer.
We can easily calculate an upper bound for the size, though, as the largest possible number of output codepoints is just the same as the number of bytes in the input. This largest possible value occurs when the input is entirely ASCII characters with no multi-byte tokens and each input byte decodes to exactly one 32-bit integer.
Remember that whilst the input is based on 8-bit bytes, (octets), the output consists of 32-bit integer values.
However large the output ends up being in codepoints, we need to multiply this value by sizeof(uint32_t) to allocate enough bytes to hold it.
On almost all systems, an unsigned 32-bit integer is stored in four bytes, although nothing actually guarantees this so the correct way to find the size is to use sizeof().
Knowing this, there are basically two ways that we can approach the task of buffer memory allocation:
In reality, the size of a typical UTF-8 data stream is likely to be an order of magnitude smaller than the amount of memory available on a modern system, so in many or even most cases the second option is perfectly reasonable and has the added advantage of simplicity.
This is the approach that the code above takes if buffer_decoded_len is set to zero.
However, it is also possible that we might have already done some pre-processing of the input external to this function and that we either know or have some idea of how big the output will be. In that case it would be useful to communicate this information to utf8_decode() so that we could reduce or avoid the temporary over-allocation of memory.
Such a hint, (expressed as a value in codepoints, not bytes), can be supplied in buffer_decoded_len and if it is then it will be used instead of the size of the input in the calculation of outbuf_size.
Note that the formula used to bounds check the value of INITIAL_SIZE against MAX_OUTBUFFER_SIZE, (which is in bytes), uses division instead of multiplication:
(INITIAL_SIZE > MAX_OUTBUFFER_SIZE / sizeof(uint32_t))
Doing this avoids the possibility of a multiplication causing wrap around, (also commonly called ‘overflow’), which could occur if we used a construction such as this:
(INITIAL_SIZE * sizeof(uint32_t) > MAX_OUTBUFFER_SIZE)
Although in practice the values we're using here typically won't wrap on a system where sizeof returns a 64-bit value, as always future changes to the code could invalidate this assumption so it's generally better to use the division approach in these situations.
And yes, the C standard library on any modern system does typically include functions such as calloc() and reallocarray() which return an error on overflow, so we could avoid this very specific risk by using those in place of malloc here. However, the existence of these alternative functions is not a good substitute for common sense and understanding the risk of overflow anywhere else in the code that we also use multiplication in relation to memory management and data buffers.
If the actual memory allocation fails, we return error code 128.
state=0;
params->buffer_decoded_len=0;
Here we set the initial state of not being within a multi-byte token, just as we did in the validator.
We can also set buffer_decoded_len to zero ready to use as our output length indicator, now that we've finished with any value that was originally supplied in it.
for (i=0; i < params->buffer_encoded_len; i++) {
At this point we enter the main loop that will iterate over the input.
if (params->buffer_decoded_len * sizeof(uint32_t)==outbuf_size) {
if (outbuf_size + REALLOC_INCREASE > MAX_OUTBUFFER_SIZE) {
free (params->buffer_decoded);
params->buffer_decoded_len=0;
params->error_len=1;
STORE_ERRPOS
return (128);
}
new_outbuf=realloc(params->buffer_decoded, outbuf_size + REALLOC_INCREASE);
if (new_outbuf == NULL) {
free (params->buffer_decoded);
params->buffer_decoded_len=0;
params->error_len=1;
STORE_ERRPOS
return (128);
}
PRINTF ("Reallocated output buffer from %u to %u\n", outbuf_size, outbuf_size + REALLOC_INCREASE);
params->buffer_decoded=new_outbuf;
outbuf_size += REALLOC_INCREASE;
}
The first thing we have in the loop is more buffer memory management code, this time to increase the allocation if the output of previous loops has already filled it. Naturally this will never run if the function is called without a size hint in buffer_decoded_len, as the output buffer is already at least the size that it needs to be.
Of course, if the output buffer was always allocated to be the maximum possible size that could be required for the input, in other words the hinting code mentioned previously hadn't been included, then this entire block of code would also be un-necessary. Remember what I said about the added advantage of simplicity?
Care should be taken to ensure that the value of MAX_OUTBUFFER_SIZE is not configured beyond (0xFFFFFFFF - REALLOC_INCREASE), as doing so would allow the value which is eventually passed to realloc() to wrap. Since MAX_OUTBUFFER_SIZE is a compile time option and virtually no real world usage would require an output buffer anywhere near this size, (just under 4 Gb), the code doesn't perform any further checks to enforce this upper limit.
The logic in the outer conditional is also the reason that REALLOC_INCREASE must be configured as a multiple of sizeof(uint32_t) for reliable operation. Since we check specifically for equality with outbuf_size rather than passing beyond its value, we need to ensure that outbuf_size is only increased by the same multiple that we use for buffer_decoded_len, otherwise the comparison might never match and we could easily end up writing beyond the allocated memory.
flag_byte_done=0;
bitops=CUR_BYTE;
for (leading_ones=0; (bitops & 0x80)==0x80; leading_ones++, bitops=bitops << 1) ;
Now finally we get to the start of the actual UTF-8 parsing code. These three lines are unchanged from the validator function.
if (state > 0 && leading_ones != 1) {
PRINTF ("Non-continuation byte 0x%02x where continuation byte expected within multibyte sequence at string pos %u\n", CUR_BYTE, i);
if ((params->ignore & 0x10) == 0) {
params->error_len=1;
STORE_ERRPOS
return (16);
}
params->error_flags |= 0x10;
/*
* If error 0x10 is being ignored then we discard the multi-byte sequence that was being constructed.
* Additionally, we maintain flag_byte_done=0 so that the code below will process it in some way.
* This could be as a regular ASCII character or as a new sequence introducer.
*/
state=0;
}
The basic order of operations in this decoding function is the same as it was in the validator.
First we deal with being part way in to a multi-byte token and encountering a non-continuation byte where we would have expected a continuation byte.
The main difference here, (apart from the slightly more verbose debugging message), is in the error handling. Whereas before we just automatically returned to the calling function with error code 16, now we only do that if the corresponding bit in params->ignore is clear. Otherwise, we set the same bit in params->error_flags to indicate a non-fatal error, and continue on almost as if nothing happened.
In the case of treating this as a non-fatal error, we do reset state to zero. This ensures that we won't run the code that handles completed multi-byte sequences, as that is within the following (state > 0) conditional. Effectively, setting state to zero here cancels the multi-byte sequence that we were already in.
Of course, what we now do with the unexpected byte, (as well as any that were part of the now discarded multi-byte token being constructed), is an open question. There are several logical options, including at least:
As written, the block of code above will take the last option and allow the following code to process the byte as either a regular ASCII character or as a new sequence introducer.
Alternatives
If we wanted to completely discard the byte instead, that could be done simply by setting flag_byte_done:
[...]
state=0;
flag_byte_done=1;
}
[...]
Replacing the unexpected byte with the replacement character can be done with just one more line of code:
[...]
state=0;
*(params->buffer_decoded+params->buffer_decoded_len++)=0xfffd;
flag_byte_done=1;
}
[...]
As can replacing it with one character from a range:
[...]
state=0;
*(params->buffer_decoded+params->buffer_decoded_len++)=0xf0000 + CUR_BYTE;
flag_byte_done=1;
}
[...]
Obviously adding this kind of flexibility as a selectable option at runtime quickly begins to increase the complexity of the code, and in many cases it's un-necessary as the desired action can be fixed at compile time.
Ultimately, if the program calling this decoding function has very specific requirements for handling such errors then it can always leave the corresponding bit of the ignore flag unset, which will cause the decoder to stop and return a fatal error along with the precise location of the error in the input stream. The calling code can then do its own processing of the bytes involved before calling the decoder again with the start of the input positioned after the erroneous bytes to resume regular decoding.
Note that within the decoder we can easily treat invalid bytes differently depending not only on their value but also where they were encountered. For example, if an 0xFF byte is found as part of a multi-byte sequence in place of an expected continuation byte in the range 0x80 - 0xBF, then we might want to completely discard it. On the other hand, if 0xFF occurs on its own within regular ASCII text, then we might choose to replace it with the replacement character instead.
Finally, before we move on, note that the number of bytes that were in the partially completed multi-byte sequence that we discarded can be calculated with (multibyte_size - state), before resetting state to zero.
Important!
Writing multiple bytes per iteration
Be aware that modifying the code so that it writes more than one value to the output during a single iteration of the main loop would require changes to the memory management code. Currently that code only enlarges the buffer when it's completely full, so by extension it only guarantees that there is space for at least one more output value.
Making this change is left as an exercise for the reader. Failure to do it correctly will result in code which works most of the time but fails in subtle ways depending on the nature of the input data and the exact position of any invalid bytes in it. Have fun!
if (state > 0) {
PRINTF ("Continuation byte 0x%02x at string pos %u\n", CUR_BYTE, i);
codepoint_build = codepoint_build << 6;
codepoint_build |= CUR_BYTE & 0x3F;
state--;
flag_byte_done = 1;
Here we process correctly encountered continuation bytes in the same way as we did in the validator, pushing the lower six bits into codepoint_build after shifting what we already had left by six bits.
/*
* Finished decoding a multibyte? Then check that the codepoint is valid.
*/
if (state==0) {
/*
* Overly long encoding?
*/
flag_overlong=0;
if (codepoint_build < (multibyte_size == 2 ? 0x80 : (1 << (8 - multibyte_size + 6 * (multibyte_size - 2))))) {
flag_overlong = 1;
}
Previously we used a series of explicit checks to decide whether to set flag_overlong or not.
That was arguably more readable than the algorithmic approach shown above, as we could see the specific values that were being tested for, 0x80, 0x800, and 0x10000. However we now want to process sequences up to 6 bytes which would add two more distinct if statements, and at this point the calculation above starts to look more compact and easier to verify as being correct. After all, the discrete values being tested become longer as the number of bytes increases, making it somewhat less obvious if one of them was accidentally changed during code editing.
PRINTF ("%sultibyte, (%u bytes), codepoint 0x%x, (decimal %u) ending at string pos %u\n", (flag_overlong==1 ? "Overlong m" : "M"),
multibyte_size, codepoint_build, codepoint_build, i);
if (flag_overlong == 1) {
if ((params->ignore & 0x02) == 0) {
params->error_len=multibyte_size;
STORE_ERRPOS
return (2);
}
params->error_flags |= 0x02;
}
There is a slight difference in the handling of overlong encodings as fatal errors here, as we now set error_len to the size of the multi-byte token before returning.
If we are ignoring overlong encoding errors then all we do is set the corresponding bit of error_flags and continue - the code further below will happily process the decoded token.
if (codepoint_build > 0x10FFFF) {
PRINTF ("Codepoints > 0x10FFFF are invalid.\n");
if ((params->ignore & 0x08) == 0) {
params->error_len=multibyte_size;
STORE_ERRPOS
return (8);
}
params->error_flags |= 0x08;
}
if (codepoint_build >= 0xD800 && codepoint_build <= 0xDFFF) {
PRINTF ("Codepoint encodes a UTF-16 surrogate, and is therefore invalid in UTF-8.\n");
if ((params->ignore & 0x04) == 0) {
params->error_len=multibyte_size;
STORE_ERRPOS
return (4);
}
params->error_flags |= 0x04;
}
Likewise with codepoints beyond 0x10ffff and the UTF-16 surrogates. In the case of fatal errors, we set error_len as well as the error position, otherwise we just set the soft error flags in error_flags and continue.
CESU-8
Of course, it's possible that the input might decode to produce two sequential codepoints that together represent a valid surrogate pair in UTF-16. An otherwise valid UTF-8 stream that intentionally contains such pairs of codepoints is known as CESU-8.
Although it would be possible to further decode such surrogate pairs directly within the UTF-8 decoder, this would complicate the codebase somewhat. A typical approach might involve the following steps:
Then, whenever the above mentioned flag is already set, do the following:
Implementing this is left as an exercise for the reader.
*(params->buffer_decoded+params->buffer_decoded_len++)=codepoint_build;
}
}
At this point we have successfully reconstructed a multi-byte token, and can write it to the output buffer.
If any of errors 2, 4, or 8 are being ignored then the token may have been constructed from invalid input, but we still write what we have to the output buffer and do not discard it.
This also closes the (state > 0) conditional block.
if (flag_byte_done == 0) {
If the previous code didn't set flag_byte_done, then either we were not processing a multi-byte token, or we were doing so but then aborted its reconstruction because of seeing a non-continuation byte.
if (leading_ones == 1) {
PRINTF ("Continuation byte without multibyte start byte\n");
if ((params->ignore & 0x20) == 0) {
params->error_len=1;
STORE_ERRPOS
return (32);
}
params->error_flags |= 0x20;
/*
* If error 0x20 is being ignored, then unexpected continuation bytes are silently dropped and the decoding continues.
*/
flag_byte_done=1;
}
If the code path reaches here and the current byte is a continuation byte then it must have occurred outside of a valid sequence, otherwise the code above would have set flag_byte_done.
Just like for the previous errors that we've tested for, if the corresponding bit in ignore is not set then we return with an error code having set error_len to the appropriate length which in this case is one.
In the case that this error is set to be ignored then we flag it in error_flags and also set flag_byte_done so that the spurious continuation byte isn't further processed by the following code.
Of course, instead of simply dropping the continuation byte we could easily insert the replacement character in the output with just one extra line of code:
*(params->buffer_decoded+params->buffer_decoded_len++)=0xfffd;
if (leading_ones > params->max_token_len) {
PRINTF ("Invalid byte 0x%02x in UTF-8 stream at position %u\n", CUR_BYTE, i);
if ((params->ignore & 0x01) == 0) {
params->error_len=1;
STORE_ERRPOS
return (1);
}
params->error_flags |= 0x01;
flag_byte_done=1;
}
Now we check for multi-byte introducers that would introduce a sequence that exceeded the maximum length that we configured in max_token_len.
If this error is ignored, then we silently drop the introducer and continue decoding. Any following continuation bytes that were part of such a longer sequence will be processed independently on subsequent iterations of the loop by the code in the previous block, and the behaviour of the routine in that case will depend on whether error 0x20 is being ignored or not.
}
if (flag_byte_done == 0) {
Re-check the flag as we don't want to run the code below if we've already done something useful with the input.
if (leading_ones > 1) {
flag_byte_done = 1;
state = multibyte_size = leading_ones;
/*
* Decrement state by one, as the number of continuation bytes is one less than the total.
*/
state--;
/*
* The leading ones are always followed by a zero.
* Therefore, the number of lower bits that contain codepoint data is (7 - leading_ones).
* We can create the necessary bitmask by right shifting 0xFF by leading_ones places.
*/
PRINTF ("Multibyte start at string pos %u\n", i);
codepoint_build = (CUR_BYTE & (0xFF >> leading_ones));
}
}
If none of the conditions up to now have matched the input byte, then it's either an introducer for a multi-byte token or a regular ASCII character.
This block of code deals with the case of it being a multi-byte introducer, and is identical to the equivalent code in the validator.
We can assume that such an introducer found here is in a valid place to begin a new multi-byte sequence because error 0x10, (16), would have been returned if such a byte had occurred within an existing sequence, (where we would be expecting a continuation byte instead). If the 0x10 bit is set in ignore then instead of returning an error, the code will have dropped the existing sequence and will now allow this new sequence introducer to be parsed as if it was simply within other ASCII text.
As before, the number of leading set bits directly indicates the size of the token in bytes, so we store this value in both multibyte_size and state. The value in state will be decremented each time we process a byte and will therefore serve as our inner loop counter. The value in multibyte_size will not be decremented and will be referenced in places where we need to know the overall size of the token, such as checking for over-long encodings after successfully decoding it, or otherwise for setting error_len.
if (flag_byte_done == 0) {
/*
* Must be a regular ASCII character
*/
PRINTF ("Regular ASCII character at string pos %u\n", i);
flag_byte_done=1;
*(params->buffer_decoded+params->buffer_decoded_len++)=CUR_BYTE;
}
Regular ASCII characters are simply written to the output buffer directly, with no further processing except for casting the byte value to 32 bits.
if (flag_byte_done == 0) {
PRINTF ("Logic error in utf-8 parser!\n");
exit(1);
}
}
This closes the main i loop, after the same code integrity check that we saw in the validator.
if (state != 0) {
PRINTF ("End of string within multibyte character - expecting %u more bytes\n", state);
if ((params->ignore & 0x40) == 0) {
params->error_len=multibyte_size-state;
i--;
STORE_ERRPOS
return (64);
}
params->error_flags |= 0x40;
}
If the input ends unexpectedly in the middle of a multibyte token, this is considered an error just as it was in the validator.
Like the other error conditions, it can be ignored and treated as non-fatal by setting the appropriate bit, (bit 6), in ignore.
However, there is also a subtle but significant change in the error handling here compared to the validator when this condition is treated as a fatal error. Previously, the position returned by STORE_ERRPOS was one byte beyond the end of the input, indicating the position of the next expected continuation byte. This made sense for the validator because it only returned the position of the error and not the number of bytes that were involved.
Since here in the decoder we are also returning a value for error_len, it makes more sense to store the position of the last actual byte in the incomplete sequence. Doing this will allow us to easily highlight the entire incomplete sequence in a special error printing routine that we'll see shortly.
PRINTF ("%salid UTF-8 string of %u bytes, decoded to %u codepoints\n", (params->error_flags == 0 ? "V" : "Parsable but strictly inv"),
params->buffer_encoded_len, params->buffer_decoded_len);
The summary information is slightly more detailed than before, now including the number of decoded codepoints as well as an indication of whether any non-fatal errors occurred and were ignored.
#define OVER_ALLOC (outbuf_size - params->buffer_decoded_len * sizeof(uint32_t))
#define DECODED_BYTES (params->buffer_decoded_len * sizeof(uint32_t))
PRINTF ("Total output size is %lu, allocated buffer is %u. Over-allocation was %lu bytes.\n", DECODED_BYTES, outbuf_size, OVER_ALLOC);
if (OVER_ALLOC > 32 * 1024) {
PRINTF ("Reducing output buffer to %lu... ", DECODED_BYTES);
new_outbuf=realloc(params->buffer_decoded, DECODED_BYTES);
if (new_outbuf == NULL) {
/*
* The re-allocation is always going to be a _reduction_, and never an increase.
* It's possible that on a memory-constrained system a large buffer could not be re-allocated and the call to realloc()
* might fail. In this case we can still return normally because the original oversized buffer still exists, but we
* set bit 7 in the error flags to indicate that this non-critical error occurred.
*/
PRINTF ("failed!?\n");
params->error_flags |= 128;
return (0);
}
PRINTF ("OK!\n");
params->buffer_decoded=new_outbuf;
}
return (0);
At this point we have the required decoded output in the output buffer, and could simply return to the calling function.
However, in some cases the amount of memory allocated to the output buffer will be considerably more than is actually used. This is possible because, unless the function was called with an accurate sizing hint for the output buffer, an estimated size was calculated assuming the worst possible case of an input stream consisting entirely of single byte tokens encoding ASCII values.
For input that only contains the occasional non-ASCII codepoint, the overhead is negligible. On modern systems where memory is plentiful, it might not actually matter at all. If the decoded output is going to be further processed or written to disk immediately then the excessive memory allocation is only required for a brief period of time anyway.
Nevertheless, for completeness we might as well see how to reduce the allocation to the actual size of the data stored, and that is what this block of code does.
Two new macros are defined here:
OVER_ALLOC is the amount of memory allocated but not used.
DECODED_BYTES is the size in bytes of the useful data.
We use these values to print some statistics, then proceed to call realloc() with the new size if the difference is greater than 32K. This value of 32K is completely arbitrary and simply serves as a small optimisation to avoid the overhead of the reallocation where the savings are minimal.
Since we are reducing the allocation rather than increasing it, the call to realloc() is unlikely to fail. However, if it did, then the existing allocation would remain valid, so we can still return successfully. In that case, we set bit 7 of error_flags to indicate to the caller that this happened.
Note that, in contrast to the uses of the other bits in error_flags, we set bit 7 here regardless of whether the corresponding bit is set in ignore or not. This is the exception that was mentioned earlier when we introduced the new error codes.
This handling of bit 7 makes sense, as a failed realloc() here is always a non-fatal error, and can always be ignored. In contrast, the other four places where memory allocation errors can occur in this function are always fatal and can never be ignored.
The calling code doesn't even need to check bit 7 in error_flags or take any special action, as the returned data in the output buffer is the same either way. It's a purely informational flag.
#undef OVER_ALLOC
#undef CUR_BYTE
#undef STORE_ERRPOS
#undef MAX_OUTBUFFER_SIZE
#undef REALLOC_INCREASE
}
Finally, we undefine the compiler macros that are only used locally within this function, and at this point we're done.
Safely dealing with overlong encodings
As mentioned above, overlong encodings are simply regular multi-byte encodings that follow all of the rules and logic for encoding the bits, except that they include additional zero bits to the left of the actual codepoint value.
Encoding these additional zero bits causes the initial multi-byte indicating byte to be increased, and zero or more additional 0x80 bytes to be inserted immediately after it.
Of course, since the UTF-8 stream is just encoding text, technically there is nothing dangerous per se about representing any particular codepoint with an arbitrary number of additional leading 0x80 bytes. Furthermore, if a decoder has been written with the intention of decoding such sequences, it's fairly obvious that there is only one logical way that such a multi-byte character could be decoded to a codepoint value so there is almost no possibility of ambiguity in the output.
Therefore, as long as the decoding from UTF-8 to UCS codepoints is performed as the first step of processing before any other, then overlong encodings can be handled without directly creating software reliability and security issues with the subsequent processing of the decoded data.
However, there are several non-obvious details that need to be considered to make sure that this requirement of no prior processing being done is actually met in practice.
These details mostly concern the fact that ASCII characters are no longer guaranteed to appear as themselves in the input stream, and can instead also be represented as multibyte tokens. Therefore, detection of special characters that need escaping to prevent being interpreted as in-band control characters must be performed after the decoding step and not before.
This includes checking for encoding of the null byte, (which is effectively just another control character), as a multi-byte token in the input, and handling this in an appropriate way depending on the specific application.
If the end of the raw input stream is being determined only by an actual 0x00 byte, whilst representations of a null byte as multi-byte tokens are allowed to be written to the output without signaling the end of the decoding process, then the output data stream may contain embedded 0x00 bytes. In this case, any code which further processes the decoded data must be aware of this possibility and either be passed the size of the input or determine the end of it in some way other than looking for a trailing 0x00 byte.
Note: the format described above may indeed be encountered in practical applications and is sometimes known as modified UTF-8.
This method of encoding embedded 0x00 bytes as multi-byte tokens could plausibly be used as a convenient way to transport arbitrary 31-bit binary data over an 8-bit channel as a pseudo UTF-8 stream whilst retaining the ability to signal the end of the input with 0x00, although it's difficult to think of a practical use for such a scheme.
Whilst correct handling of encoded control characters solves most of the reliability and security concerns, there are still further issues to consider.
A UTF-8 stream with overly long encodings potentially creates a covert side-channel where extra arbitrary information can be hidden and later recovered by software that is aware of its presence. Furthermore, once such a stream is decoded to codepoint values, re-encoding of the same unchanged data as strict UTF-8 will destroy the side-channel data.
This process could be abused by a malicious third party as a way to export secret data such as encryption keys from a compromised system over a network within a seemingly innocent datastream, recover that data remotely with a high degree of anonymity, and ensure that no lasting trace of the side-channel was stored on the receiving system.
More generally, any re-encoding of the data as UTF-8 cannot be guaranteed to match the original input stream, so even if there was no malicious use of a side-channel there could be other potential issues if the code makes that assumption.
Overall, safe handling of overlong encodings is possible as long as due consideration is given to the issues outlined above.
However, best practice would be to limit this relaxation of the standard to situations where it is actually required, such as processing data created by known non-conformant UTF-8 encoders.
Where UTF-8 datastreams are used for data interchange between current systems and there is no guarantee about the software at the remote end of the connection, best practice would usually be to require strict compliance and to reject overlong encodings as errors.
Encoding UTF-8
Now that we've seen how to implement a decoder, the next logical step is to encode a UTF-8 stream from a set of 32-bit codepoints.
The encoding process is essentially the reverse of decoding, so many of the steps that we go through will have corresponding code in the decoder function.
Apart from a reversal of the logic, though, there are two other key differences:
Additionally, our approach to memory allocation will be slightly more complicated, for reasons that we'll discuss shortly. This is more of a design decision, though, and could be avoided, with a simple approach similar to what we saw earlier used instead.
We'll be using the same data structure as before, although we'll now supply the input in buffer_decoded, with its length in buffer_decoded_len, and the encoded output will appear in buffer_encoded. Essentially, the usage of the encoded and decoded variables is just being swapped over.
Using a single structure for both functions is convenient and has the benefit of allowing us to pass the output of the encoder directly to the decoder without having to copy various values around.
A value for error_len will be correctly set by the encoder if it encounters a fatal error, but this value is effectively superfluous as each of the errors that can occur has a fixed length of either zero or one codepoints.
The encoder will happily accept values of five and six for max_token_len, and is capable of producing streams containing tokens longer than the maximum of four bytes permitted in strictly standards-compliant UTF-8. However, it will intentionally only actually encode such long tokens if bit 3 of ignore is set. Otherwise, codepoints beyond 0x10FFFF in the input will be treated as errors, and encoding will stop.
Handling of the surrogate halves is done in a similar way, with encoding stopping when a value in this range is encountered, unless bit 2 of ignore is set.
Bits 2 and 3 are the only two bits which have any effect when set in ignore.
Errors are represented by the same bits that they were allocated to in the validator and decoder functions, and the set of possible errors returned by the encoder is therefore a subset of those that could be returned by those functions.
Encoder errors
Value 1, (bit 0) - Encoding requires a token longer than max_token_len.
Value 4, (bit 2) - Input contains an invalid codepoint, (surrogate half).
Value 8, (bit 3) - Input contains an invalid codepoint, (> 0x10FFFF).
Value 128, (bit 7) - Memory allocation error.
Value 256, (bit 8) - Invalid parameters supplied.
Encoder source code with annotations
Here we go with the annotated source code for the encoder function!
int utf8_encode(struct st_utf8_params * params)
{
#define MAX_OUTBUFFER_SIZE (256 * 1024 * 1024) /* 256 Mb */
#define HIGH_BITS(i) ((0xFF << (8-i)) & 0xFF)
#define MAX_MAX_TOKEN_LEN 6
As with the decoder, we're passing all of the parameters in a single structure.
We also have three defines, two of which are new:
MAX_OUTBUFFER_SIZE is the same as it was for the decoder
HIGH_BITS(i) calculates the high bits to set in the introducer byte of an i byte sequence
MAX_MAX_TOKEN_LEN is the maximum length acceptable for max_token_len
The two new macros probably deserve a more detailed explanation.
HIGH_BITS() returns a byte with i leading set bits and all of the lower bits reset to zero. This value can then be logically or'ed with some of the highest bits of the codepoint value to generate the correct sequence introducer.
For example, the codepoint 0x567 is 10101100111 in binary, and is correctly encoded in UTF-8 as a two byte sequence. Therefore, we would want to combine the following two values:
11000000 - result of HIGH_BITS(2)
00010101 - top five bits of the codepoint
11010101 - resulting introducer byte
MAX_MAX_TOKEN_LEN is a compile-time setting to limit the largest value of max_token_len that the function will accept at runtime.
Although the vast majority of strictly non-compliant but nevertheless quasi-UTF-8 streams don't use sequences beyond six bytes, the encoding logic below could easily cope with seven and eight byte sequences as well. However, to do this it would require some values to be represented in 64 bits rather than 32 bits.
For this reason, and since the value of MAX_MAX_TOKEN_LEN is used in more than one place in the code, we define its value as a macro rather than hard-coding it as 6.
This provides an easy path for future code changes that would allow the program to be compiled with larger values of MAX_MAX_TOKEN_LEN, and to only use 64-bit data types in such a configuration.
Making this change is left as an exercise for the reader.
unsigned int cur_alloc; Amount of memory currently allocated to the output buffer
unsigned int i; Loop counter
unsigned int j; Secondary loop counter
unsigned int mb; Tertiary loop counter, (‘multi-byte’)
unsigned int new_alloc; Proposed size for output buffer when re-allocating memory
unsigned int max_codepoint[MAX_MAX_TOKEN_LEN + 1]; Array of lowest codepoint values that are invalid for each token length
unsigned char * new_outbuf; Temporary pointer used by the memory re-allocation code
params->error_flags=0;
params->error_len=0;
params->error_pos=0;
params->buffer_encoded_len=0;
All of the return values in the supplied structure are set to a default of zero.
Note that error_len is fairly meaningless for the encoder as errors are always zero codepoints or one codepoint long.
if (params->buffer_decoded_len == 0 || params->max_token_len < 2 || params->max_token_len > MAX_MAX_TOKEN_LEN) {
return (256);
}
Here we do an initial check that the supplied parameters are sensible.
If the supplied buffer is empty, or the maximum token length to encode is non-sensical, then we return immediately.
if (params->buffer_decoded_len > MAX_OUTBUFFER_SIZE / params->max_token_len) {
return (128);
}
cur_alloc=params->buffer_decoded_len * params->max_token_len;
params->buffer_encoded=malloc(cur_alloc);
if (params->buffer_encoded==NULL) {
return (128);
}
This is the memory allocation code.
In contrast to the decoder function where we used some logic to minimise memory usage, here we take a simple approach of just allocating enough memory to cover the worst case of every input codepoint requiring a token of the maximum permitted length.
Whilst it certainly would be possible to re-size the output buffer dynamically, and even use the mix of input bytes already processed to estimate the requirements of the data that is left, in reality modern systems are not usually memory constrained to the point where the added code complexity and processing at runtime would make it worthwhile.
If the memory requirement exceeds MAX_OUTBUFFER_SIZE or allocation fails, we return immediately with error code 128.
Of course, if the calling function is aware that the highest codepoint value in the input will only require a token of two or three bytes rather than the maximum of four permitted in a standards compliant UTF-8 stream, then it can also invoke the encoder with params->max_token_len set to the smaller value and memory usage will be reduced in that case.
for (i=2; i <= params->max_token_len; i++) {
max_codepoint[i] = (1 << (7-i + 6*(i-1)));
}
max_codepoint[0]=0;
max_codepoint[1]=128;
Next we build a lookup table in max_codepoint[] which actually contains the first invalid codepoint for each multibyte token length. These values are obviously one more than the actual maximum valid codepoint in each case, but the name max_codepoint was convenient here.
A 32-bit unsigned integer is sufficient to hold the calculated value for a six-byte token, which is the largest commonly found even in sequences which are not strictly UTF-8 compliant.
In the unlikely case of needing to process seven and eight byte tokens, then max_codepoint[] should be declared as a 64-bit type such as uint64_t.
Additionally, the constant 1 should be cast to a uint64_t in the calculation above:
max_codepoint[i] = ((uint64_t)1 << (7-i + 6*(i-1)));
#define CUR_CHAR params->buffer_decoded[i]
CUR_CHAR is analogous to CUR_BYTE in the decoder: it's simply the current input value being processed.
for (i=0; i < params->buffer_decoded_len; i++) {
This is the main loop over the input buffer.
if (CUR_CHAR > 0x10FFFF) {
if ((params->ignore & 0x08) == 0) {
params->error_len=1;
params->error_pos=i;
return (8);
}
params->error_flags |= 0x08;
}
If we encounter a codepoint above 0x10FFFF, then we either return with a fatal error code 8 or simply set bit 3 in error_flags and continue, depending on how bit 3 of ignore was set.
if (CUR_CHAR >= 0xD800 && CUR_CHAR <= 0xDFFF) {
if ((params->ignore & 0x04) == 0) {
params->error_len=1;
params->error_pos=i;
return (4);
}
params->error_flags |= 0x04;
}
Likewise for the codepoints representing the UTF-16 surrogates, except that here we return error code 4, and it's bit 2 of error_flags and ignore.
for (j=1; j <= params->max_token_len; j++) {
if (CUR_CHAR < max_codepoint[j]) {
break ;
}
}
Here we loop through the max_codepoint[] array checking the current codepoint value against the previously calculated maximum codepoints for each token length.
If CUR_CHAR is within the bounds of a token length that we're configured to allow, then the loop exits with j set to that length value.
Otherwise the loop will exit with j == (max_token_len + 1), which we check for next.
if (j==params->max_token_len+1) {
PRINTF ("Input codepoint 0x%x, (%u), exceeds maximum of 0x%x, (%u), encodable with %d bytes\n", CUR_CHAR, CUR_CHAR,
max_codepoint[params->max_token_len]-1, max_codepoint[params->max_token_len]-1, params->max_token_len);
params->error_pos=i;
params->error_len=1;
return (1);
}
If the codepoint requires more bytes to encode than we configured with the supplied value of max_token_len, then this is a fatal error which cannot be ignored, so we return error code 1.
PRINTF ("Input codepoint of 0x%x, (%u), encodable with %u bytes: ", CUR_CHAR, CUR_CHAR, j);
Output some general debugging information about the encoding.
if (j==1) {
params->buffer_encoded[params->buffer_encoded_len++]=CUR_CHAR;
PRINTF ("%d\n", params->buffer_encoded_len);
}
If the codepoint can be represented by a single byte token, we just write that value to the output buffer and increase buffer_encoded_len.
if (j > 1) {
params->buffer_encoded[params->buffer_encoded_len++]=(CUR_CHAR >> (6 * (j - 1))) | HIGH_BITS(j);
PRINTF ("Bytes: %02x", params->buffer_encoded[params->buffer_encoded_len-1]);
for (mb = j-1; mb > 0; mb--) {
params->buffer_encoded[params->buffer_encoded_len++]=(((CUR_CHAR >> (6 * (mb - 1))) & 0x3f) | 0x80) ;
PRINTF (" %02x", params->buffer_encoded[params->buffer_encoded_len-1]);
}
PRINTF ("\n");
}
The multiple byte case is only slightly more complicated.
The first byte is the combination of the HIGH_BITS() value that we discussed earlier, logically or'ed with however many of the most significant bits of codepoint data can fit in the rest of the token introducer. To get those bits, we simply shift the codepoint value right by six bits for each continuation byte, and the number of continuation bytes will always be one less than the total sequence length. This gives us the shift amount used in the code above: (6 * (j - 1)).
The second and subsequent bytes are all regular continuation bytes, so they each have bit 7 set, bit 6 reset, and contain six bits of codepoint data in bit 0 - bit 5. These are calculated in a loop which counts down from one less than the total sequence length to one. The formula used for the bit shifting is the same as before and uses the loop counter minus one as its argument. This ensures that the final byte of the sequence uses the lowest bits of codepoint data unshifted.
For example, for a four byte sequence we will have three continuation bytes. The loop will start at three and count down to one, with the values 2, 1, and 0 being passed to the bit-shifting formula. This results in the original codepoint value being shifted by 12 bits, 6 bits, and then 0 bits for the three continuation bytes.
These shifted values are then logically and'ed with 0x3f to ensure that bit 6 is not erroneously set, and finally or'ed with 0x80 to set bit 7 as required to indicate a valid continuation byte.
As each byte is written to the output buffer, buffer_encoded_len is increased by one.
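To see the same shifting logic in isolation, here is a minimal sketch of a single-token encoder, (encode_token is our own name, and the mask expression ((0xFF << (8 - j)) & 0xFF) merely stands in for the HIGH_BITS() macro, so this is not the article's encoder itself):

```c
#include <stdint.h>

/*
 * Encode one codepoint as a j-byte token, (j in the range 2 - 6),
 * writing j bytes to out[]. The mask ((0xFF << (8 - j)) & 0xFF)
 * produces the introducer high bits, equivalent to HIGH_BITS().
 */
void encode_token(uint32_t codepoint, int j, unsigned char *out)
{
	int mb;

	/* Introducer: the high bits pattern or'ed with the top payload bits. */
	out[0] = (codepoint >> (6 * (j - 1))) | ((0xFF << (8 - j)) & 0xFF);

	/* Continuation bytes: six payload bits each, prefixed with binary 10. */
	for (mb = j - 1; mb > 0; mb--) {
		out[j - mb] = ((codepoint >> (6 * (mb - 1))) & 0x3F) | 0x80;
	}
}
```

Encoding codepoint 0x1F496 with j set to four produces the bytes 0xF0 0x9F 0x92 0x96.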
}
#undef CUR_CHAR
Close the main loop, and if we've finished then undefine CUR_CHAR as we won't need it anymore.
#define OVER_ALLOC (cur_alloc - params->buffer_encoded_len)
PRINTF ("Total output size is %u, allocated buffer is %u. Over-allocation was %u bytes.\n", params->buffer_encoded_len, cur_alloc, OVER_ALLOC);
if (OVER_ALLOC > 32 * 1024) {
PRINTF ("Reducing output buffer to %u... ", params->buffer_encoded_len);
new_outbuf=realloc(params->buffer_encoded, params->buffer_encoded_len);
if (new_outbuf == NULL) {
PRINTF ("failed!?\n");
params->error_flags |= 128;
return (0);
}
PRINTF ("OK!\n");
params->buffer_encoded=new_outbuf;
}
Now that we have finished parsing and encoding the input, we know the exact size of the encoded version and can reduce the memory allocation to match that value.
This code is almost identical to the corresponding code in the decoder.
#undef OVER_ALLOC
#undef HIGH_BITS
#undef MAX_OUTBUFFER_SIZE
return (0);
}
Finally we undefine the other preprocessor macros used in this function and return zero to the caller to indicate success.
Pretty error printer and other bonus functions
Both the encoding and decoding functions return detailed information about errors that occur. That includes the position in the input stream and the number of bytes or codepoints involved.
In the case of the decoder, the error relates to the encoded bytestream that it was interpreting and we can write a function to print out the bytes surrounding the error and highlight those actually involved in it.
The output from such a pretty error printing routine might look something like this:
r    c    J    ?    ?    ?    ?    ?    ?    d    <    8    L    &    g    z
0x72 0x63 0x4a 0xfc 0x93 0x8b 0x87 0x8f 0xbf 0x64 0x3c 0x38 0x4c 0x26 0x67 0x7a
               ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^
In this example, a codepoint beyond 0x10FFFF has been detected, so decoding has stopped, and the error has been highlighted.
This routine, along with a few others to allow testing and exercising of the encoding and decoding functions, is included in the source code downloadable from the links in the download section below.
Here is a sample of output from the utf8_demo() function, which was set to encode random data including codepoints beyond 0x10FFFF, followed by calls to the validator and decoder functions which were configured to reject anything that wasn't a strictly conforming UTF-8 stream.
Preparing 390 Kb of random data, (100000 codepoints).
Codepoints up to 0x7FFFFFFF, not including surrogates.
Starting encoding
Encoder returned: OK
Encoded length: 120841
The following error flags were set:
Invalid byte NO
Over-long encoding NO
Surrogate half NO
Codepoint above 0x10FFFF YES
Unexpected non-continuation byte NO
Unexpected continuation byte NO
Input ends within multibyte token NO
Memory allocation error NO
Invalid parameters supplied NO
Total types of errors ignored: 1
Starting validation
Validator returned: Invalid byte
Starting decoding
Decoder returned: Codepoint above 0x10FFFF
Decoded length: 61
r    c    J    ?    ?    ?    ?    ?    ?    d    <    8    L    &    g    z
0x72 0x63 0x4a 0xfc 0x93 0x8b 0x87 0x8f 0xbf 0x64 0x3c 0x38 0x4c 0x26 0x67 0x7a
               ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^
No errors were ignored.
Decoder exited on error.
The source code for these additional functions is fairly straightforward, so we're not going to look at it in detail here. Download the source archive if you're interested to see how they work.
Parsing techniques for niche requirements
Back when we looked at an overview of UTF-8 encoding, I mentioned some tasks that are made fairly easy by its design.
Specifically, I mentioned:
  • Detecting with a high degree of confidence whether we're actually looking at a valid UTF-8 stream.
  • Reading a UTF-8 stream backwards.
  • Identifying newlines, tabs, and other C0 control characters without fully decoding the input.
  • Finding the nearest character boundary starting from any arbitrary point in the stream.
  • Calculating the size of the decoded output without fully decoding the input.
Now that we've learned the basics of UTF-8 encoding and decoding, let's return to these concepts and look at each one in more detail.
We already covered the last point - calculating the size of the decoded output - in the section about assembler optimisations. If you skipped that part because assembler isn't your thing, then basically we just counted the leading set bits in each byte and created a tally of the values that we then put into the following formula:
number of output codepoints = bits[0] + bits[2] + bits[3] + bits[4] + bits[5] + bits[6]
Of course, the result is only meaningful if the input is valid UTF-8, but it is very fast to calculate, (especially in assembler), and can be used amongst other things as an initial estimate of the required output buffer size for a decoder.
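The same count can also be produced in plain C without tallying each leading-bit value separately: every token, whatever its length, begins with exactly one non-continuation byte and decodes to exactly one codepoint, so it's enough to count the bytes that are not continuation bytes. Here is an equivalent sketch, (utf8_codepoint_count is our own name, and valid input is assumed):

```c
#include <stddef.h>

/*
 * Count the codepoints in a buffer assumed to contain valid UTF-8.
 * Continuation bytes have the form 10xxxxxx; every other byte starts a
 * token, and each token decodes to exactly one codepoint.
 */
size_t utf8_codepoint_count(const unsigned char *buffer, size_t len)
{
	size_t i, count;

	count = 0;
	for (i = 0; i < len; i++) {
		if ((buffer[i] & 0xC0) != 0x80) {
			count++;
		}
	}
	return (count);
}
```

For the seven-byte input 0x61 0xC3 0xA9 0xF0 0x9F 0x92 0x96, (the letter a, an e-acute, and a four-byte emoji), this returns three.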
Taking another item off the list - identifying C0 control characters - this actually doesn't require any special processing beyond what would be used for ASCII text, and is just a convenient side effect of the fact that UTF-8 encodes these codepoints as single bytes which will never appear as part of another token.
This does assume that the eventual decoding process will either reject any overlong encodings of these control characters or otherwise be fully aware of them, treat them specially and definitely not interpret them as controls, (for example, it could skip over them, or replace them with the replacement character in the output).
If overlong encodings of C0 control characters are required to be decoded and interpreted as those same controls, then it will unfortunately be necessary to fully decode the input in order to be sure of finding them.
That just leaves three items on the list, which we'll look at in turn now.
Detecting a valid UTF-8 stream
The most reliable way to detect a valid UTF-8 stream is obviously to fully decode it, as we did in utf8_validate(). As that function can be implemented with minimal overhead and gives us detailed information about any errors found, it's probably the best choice in most cases.
However there may be times when we want to do the absolute minimum of processing and still quickly reject obviously non-compliant input.
One way to do this without the overhead of full decoding is to assume initially that the stream is indeed valid, then error out if any one of the following three conditions are detected:
  • Bytes 0xFE or 0xFF occurring anywhere
  • Bytes 0x80 - 0xBF occurring after an ASCII character
  • Two or more consecutive 0xC0 - 0xFF bytes
The first two will quickly catch most random or non-textual data, and also most 8-bit text encodings that are based on ASCII but include printable characters in the 0x80 - 0xBF range.
The last condition is useful for increasing the detection rate of ASCII-like encodings which either don't define 0x80 - 0xBF or use them as controls.
Implementing this in C is trivial:
int utf8_fast_reject(unsigned char * buffer, unsigned int len)
{
int flag_prev_ascii;
int flag_prev_introducer;
unsigned int i;
flag_prev_ascii=0;
flag_prev_introducer=0;
for (i=0; i < len; i++) {
/* Reject 0xFE, 0xFF, and any continuation byte following an ASCII byte. */
if ((*(buffer+i) & 0xFE) == 0xFE || (flag_prev_ascii && (*(buffer+i) & 0xC0) == 0x80)) {
return (1);
}
/* Reject two or more consecutive bytes in the range 0xC0 - 0xFF. */
if (((*(buffer+i) & 0xC0) == 0xC0) && flag_prev_introducer) {
return (2);
}
flag_prev_ascii=(*(buffer+i) & 0x80) ? 0 : 1;
flag_prev_introducer=(*(buffer+i) & 0xC0)==0xC0 ? 1 : 0;
}
return (0);
}
Nearest character boundary
This just involves searching forwards and backwards for any non-continuation byte, so basically anything in the range 0x00 - 0x7F or 0xC0 - 0xFF.
The upper bound obviously depends on whether you are processing the stream as strictly compliant UTF-8 or relaxing the parsing rules to include longer tokens. In either case, if the input is known to be valid according to whatever subset of the rules you are following, an upper bound of 0xFF can be used for such a conditional without further problems.
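The backwards half of the search can be sketched in a few lines, (utf8_boundary_before is our own name, and the input is assumed to be valid UTF-8):

```c
#include <stddef.h>

/*
 * Step backwards from an arbitrary offset within a buffer of valid
 * UTF-8 to the nearest token boundary at or before that offset.
 * Continuation bytes have the form 10xxxxxx, so anything else, (0x00 -
 * 0x7F or 0xC0 - 0xFF), starts a token.
 */
size_t utf8_boundary_before(const unsigned char *buffer, size_t pos)
{
	while (pos > 0 && (buffer[pos] & 0xC0) == 0x80) {
		pos--;
	}
	return (pos);
}
```

Searching forwards is the mirror image, incrementing pos until a non-continuation byte or the end of the buffer is reached.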
Decoding UTF-8 backwards
This is probably the most interesting special case from a coding perspective.
What we want to do is read the encoded byte sequence one byte at a time, starting from the last byte and working backwards. The output should be the set of decoded codepoints also in reverse order.
Whilst single byte ASCII characters can just be read directly in reverse and produce the expected result, the decoding logic for multi-byte tokens will need to be different.
When reading a valid UTF-8 stream in reverse, the continuation bytes of any multi-byte tokens will appear before the sequence introducer. Therefore, we won't know until the end of the sequence how many bytes it should contain. The six bits that we extract from each continuation byte will belong in increasingly higher order positions in the codepoint being built up, so they can be written directly to the correct position even without knowing the sequence length when we start.
Error checking also needs to change. Some of the failure modes possible when reading the stream forwards no longer exist - continuation bytes are now valid at any point - and there are some new error conditions too, such as finding a sequence introducer without first seeing any continuation bytes.
Obviously if a stream is invalid when reading it in reverse, it would also be invalid when reading forwards. The difference is that the error is detected in a different way. For example, when reading forwards a truncated multi-byte token would be detected by finding a non-continuation byte where another continuation byte was expected. When reading backwards the same error will be detected by reading a sequence introducer that represents a larger number of continuation bytes than we have already read. It's the same part of the stream that is in error, but the way it's detected is different.
The annotated source code follows. Since the memory management code is identical to that in the regular decoding function, it's not repeated here for brevity. The commentary starts at the opening of the main i loop, and the initial buffer enlargement code within it has also been replaced with a placeholder comment.
for (i=0; i < params->buffer_encoded_len; i++) {
/*
* Code to perform bounds checking and enlargement of the output buffer has
* been removed for brevity. Refer to the corresponding code in utf8_decode().
*/
flag_byte_done=0;
bitops=CUR_BYTE;
for (leading_ones=0; (bitops & 0x80)==0x80; leading_ones++, bitops=bitops << 1) ;
We start the same way as before, counting the leading set bits in the current byte.
if (state > 0) {
First we will check for a number of conditions that would complete a multi-byte sequence that has already been started, (after encountering an initial continuation byte).
if (leading_ones == 0) {
PRINTF ("ASCII byte 0x%02x encountered during multi-byte sequence at string pos %u\n", CUR_BYTE, i);
if ((params->ignore & 0x200) == 0) {
params->error_len=1;
STORE_ERRPOS
return (512);
}
params->error_flags |= 0x200;
state=0;
}
An ASCII character would be invalid here, since when reading backwards a multi-byte sequence needs to end with the correct sequence introducer. Although we could just reject anything without the correct number of leading bits in a generic way using a single bit in the error bitfield, it's more useful to identify this specific case with its own dedicated error code. In this case, we've used 512 since bit 9 is the lowest unused error flag bit.
This error can be ignored by setting bit 9 of ignore. In that case, the token being constructed is discarded, and the current ASCII byte will be processed as a regular character by the code later on.
if (leading_ones == 1) {
if (state == params->max_token_len - 1) {
PRINTF ("Continuation byte forms a sequence longer than the configured maximum length\n");
params->error_len=state;
STORE_ERRPOS
return(1024);
}
codepoint_build |= ((CUR_BYTE & 0x3F) << state * 6);
state++;
flag_byte_done=1;
}
A continuation byte during a multi-byte sequence simply adds its lower bits to the codepoint value being built up.
The position is easy to calculate because by reading the stream backwards we encounter the least significant 6-bit group first, and then they progress upwards to more significant bit positions. Therefore, the number of bit places to shift is a simple multiple of state by six.
If we've already encountered enough continuation bytes to form a sequence of the maximum length permitted by max_token_len when combined with the correct introducer, then finding another continuation byte is considered a fatal error.
if (leading_ones > 1 && leading_ones != state + 1) {
if ((params->ignore & 0x800) == 0) {
params->error_len=1;
STORE_ERRPOS
return (2048);
}
params->error_flags |= 0x800;
state=0;
flag_byte_done=1;
}
Encountering a sequence introducer that doesn't match the number of continuation bytes already seen is another error condition.
If this error is ignored, we discard the codepoint that was being constructed and do no further processing of the current byte, but otherwise continue decoding the stream.
if (leading_ones > 1 && leading_ones == state + 1) {
Now we deal with the case of a corresponding sequence introducer after the correct number of leading continuation bytes.
This will lead to a correctly re-constructed codepoint, although that codepoint might still be outside of the permitted range.
codepoint_build |= ((CUR_BYTE & (0xFF >> leading_ones)) << 6 * state);
state++;
flag_byte_done=1;
First, we complete the codepoint being built up, by adding the appropriate low bits from the sequence introducer to the value that we already have.
Then we increase state by one so that it holds the actual correct length of the whole sequence, and set flag_byte_done, since the current byte was at least valid within the broad terms of UTF-8 encoding logic, and we certainly don't want to process it in any other way.
flag_overlong=0;
if (codepoint_build < (state == 2 ? 0x80 : (1 << (8 - state + 6 * (state - 2))))) {
flag_overlong = 1;
}
PRINTF ("%sultibyte, (%u bytes), codepoint 0x%x, (decimal %u) ending at string pos %u\n", (flag_overlong==1 ? "Overlong m" : "M"),
state, codepoint_build, codepoint_build, i);
if (flag_overlong == 1) {
if ((params->ignore & 0x02) == 0) {
params->error_len=state;
STORE_ERRPOS
return (2);
}
params->error_flags |= 0x02;
}
Next we set a flag if the token was an overlong encoding, and deal with that case in the same way that we did in the regular decoder function.
This error can be ignored, in which case we continue to process the token.
if (codepoint_build > 0x10FFFF) {
PRINTF ("Codepoints > 0x10FFFF are invalid.\n");
if ((params->ignore & 0x08) == 0) {
params->error_len=state;
STORE_ERRPOS
return (8);
}
params->error_flags |= 0x08;
}
if (codepoint_build >= 0xD800 && codepoint_build <= 0xDFFF) {
PRINTF ("Codepoint encodes a UTF-16 surrogate, and is therefore invalid in UTF-8.\n");
if ((params->ignore & 0x04) == 0) {
params->error_len=state;
STORE_ERRPOS
return (4);
}
params->error_flags |= 0x04;
}
Likewise, checks for high codepoints and those that encode the UTF-16 surrogates use the same code as before.
/*
* Write the reconstructed codepoint to the output buffer.
*/
*(params->buffer_decoded+params->buffer_decoded_len++)=codepoint_build;
state=0;
}
Unless we error'ed out due to one of the above situations, we write the codepoint to the output buffer.
In this case, flag_byte_done has already been set, so none of the rest of the code below within the loop will run and the function will continue with the next input byte.
}
This closes the state > 0 conditional. The following code deals with bytes received when not already decoding a multi-byte token, or when one of errors 512 or 2048 above causes the processing of a multi-byte token to stop and the current byte to be re-interpreted on its own merits.
if (state == 0 && flag_byte_done==0) {
If we're not dealing with an existing multi-byte sequence, and we haven't already processed this byte...
if (leading_ones == 1) {
codepoint_build = (CUR_BYTE & 0x3F);
state=1;
flag_byte_done=1;
}
A continuation byte at this point would introduce a new multi-byte token, so we store its lower six bits in codepoint_build, set state and set flag_byte_done.
if (leading_ones > 1) {
PRINTF ("Sequence introducer encountered without continuation bytes.\n");
if ((params->ignore & 0x800) == 0) {
params->error_len=1;
STORE_ERRPOS
return (2048);
}
params->error_flags |= 0x800;
flag_byte_done=1;
}
A lone sequence introducer is meaningless and therefore an error.
If ignored, we just completely skip the byte and write nothing to the output buffer for it.
if (leading_ones == 0) {
PRINTF ("Regular ASCII character at string pos %u\n", i);
flag_byte_done=1;
*(params->buffer_decoded+params->buffer_decoded_len++)=CUR_BYTE;
}
If we've reached here and have an ASCII byte, then it's in a valid location and we just write it to the output buffer.
}
Close the second main conditional for (state == 0 && flag_byte_done==0).
}
Close the main i loop.
if (state != 0) {
PRINTF ("End of string within multibyte character.\n");
if ((params->ignore & 0x40) == 0) {
params->error_len=state;
i--;
STORE_ERRPOS
return (64);
}
params->error_flags |= 0x40;
}
PRINTF ("%salid UTF-8 string of %u bytes, decoded to %u codepoints\n", (params->error_flags == 0 ? "V" : "Parsable but strictly inv"),
params->buffer_encoded_len, params->buffer_decoded_len);
Finally we check for the input ending within a multi-byte sequence.
This code is almost identical to the same code in the regular decoding function, except that the value of error_len is just set to state instead of being calculated as (multibyte_size - state).
Decoding the bits in the background picture
Answers!
Reader exercise solution
If you've read and understood everything up to here, then solving the challenge I set you in the first ‘worked example’ section and decoding the bits shown in the background image should be trivial.
Nevertheless, here is the answer and a quick run-through so that you can check your method and results:
The bits that we can see in the picture are 11110000 10011111, (on the spool of tape), and 010010 010110, (in front of the stapler).
This is a four byte sequence, and the bits under the stapler are from the last two continuation bytes which have already had the 10 prefix stripped off.
The bits that are still on the tape are complete bytes, and the first eight are the multi-byte token introducer.
So to decode the value we simply take the last three bits of the first byte on the tape, the last six bits of the second byte, and all of the bits in front of the stapler.
This gives us 000 011111 010010 010110, which would be written as 00000001 11110100 10010110 when grouped into 8-bit bytes.
The codepoint is therefore 0x1F496, or ‘sparkling heart’ 💖.
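The same bit arithmetic is easy to check in a few lines of C, (decode_four is our own throwaway name for this check):

```c
/*
 * Reassemble a codepoint from a four byte UTF-8 token: three payload
 * bits from the introducer, then six bits from each continuation byte.
 */
unsigned int decode_four(const unsigned char s[4])
{
	return (((unsigned int)(s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12)
	    | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F));
}
```

Called with the bytes 0xF0 0x9F 0x92 0x96 from the picture, this returns 0x1F496.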
Downloads
Use our signify key to check the hashes of the extracted files.
Summary
Today we've covered a lot of material, starting with a broad overview of the UTF-8 encoding scheme, then going on to build a comprehensive validator function before finally seeing the programming required to encode and decode valid UTF-8 streams.
We've also looked at some of the finer details of the encoding, including good practices for error detecting and handling, as well as some of the more common non-standard extensions to the scheme.
Finally, we looked at some examples of special handling of UTF-8 data, such as reading it backwards.
Now it's your turn to put all of this valuable new knowledge to good use and improve the UTF-8 handling in your own programs! Good luck!