The code in the example routines presented on this page is licensed under the
ESASSL.
Here is the annotated C source code for a function that does a comprehensive validation of a UTF-8 stream.
On success it returns zero, and if an error is found it returns an error code from the following table and sets the supplied variable error_pos to the position in the input where the error occurred.
Error codes
1  | Bit 0 | String contains an invalid byte such as 0xFE or 0xFF.
2  | Bit 1 | String contains an over-long encoding.
4  | Bit 2 | String contains an invalid codepoint, (surrogate half).
8  | Bit 3 | String contains an invalid codepoint, (> 0x10FFFF).
16 | Bit 4 | String contains a non-continuation byte where a continuation byte was expected.
32 | Bit 5 | String contains a continuation byte where a non-continuation byte was expected.
64 | Bit 6 | String ends in the middle of a multibyte character.
int utf8_validate(unsigned char * buffer, unsigned int len, unsigned int * error_pos)
{
The function declaration is straightforward enough. We pass a pointer to the UTF-8 stream to be decoded and pass its length separately, since the stream isn't necessarily null-terminated and may contain embedded 0x00 bytes.
The error_pos argument should be a pointer to an already allocated integer variable that the validator can use to return the position of the first error encountered, (if any). Its value is undefined if no error occurs.
unsigned int i;                 /* Loop counter */
unsigned int codepoint_build;   /* Codepoint value being built up over several bytes of a multi-byte token */
unsigned char bitops;           /* Copy of the byte currently being processed, used to count the leading set bits */
unsigned char state;            /* Zero if we are not already in a multibyte character, otherwise the number of bytes left */
unsigned char leading_ones;     /* Number of leading one bits in the current byte */
unsigned char flag_byte_done;   /* Flag set once we have processed the current byte */
unsigned char flag_overlong;    /* Flag set if a decoded multi-byte token was longer than it needed to be */
unsigned char multibyte_size;   /* Size of the multibyte token currently being processed, (if processing such a token) */
#define CUR_BYTE *(buffer+i)
#define STORE_ERRPOS if (error_pos != NULL) { *error_pos = i; }
Two defines purely to make the source code more readable.
CUR_BYTE gives us the value of the byte currently being processed.
STORE_ERRPOS writes the current position to the supplied error_pos variable, unless no variable was supplied, (by the caller passing NULL instead).
state=0;
At the start of processing we are not yet within a multi-byte token.
for (i=0; i < len; i++) {
flag_byte_done=0;
This is the main loop over the input.
For each character, we reset flag_byte_done and then set it once we've processed it in some way to avoid falling through to code below.
bitops=CUR_BYTE;
for (leading_ones=0; (bitops & 0x80)==0x80; leading_ones++, bitops=bitops << 1) ;
Here we count the number of leading set bits in each byte. This is the fundamental way that the different types of byte are identified in UTF-8 encoding, and we will use the value of leading_ones in several places further on.
The actual count is done by copying the value to the temporary variable bitops and repeatedly shifting it left by one bit until the high bit is zero.
Note that this technique works to correctly count the set bits even in the case of 0xFF where all bits are set, because as the original number is shifted left the low-order bits that are introduced to fill the space are set to zero.
if (state > 0 && leading_ones != 1) {
STORE_ERRPOS
PRINTF ("Non-continuation byte within multibyte sequence at string pos %u\n", i);
return (16);
}
Now we start a series of tests for each possible situation. The first is simple enough: if we are already processing a multi-byte token, (state > 0), then only continuation bytes are valid. ASCII characters, which would have zero leading ones, and further sequence introducers, which would have two or more, are not valid, so if we encounter either then we return error code 16.
Top tip!
Debug output and the PRINTF() macro
Notice that the code extract above includes a call to PRINTF() in uppercase. This isn't a typo, but a convenient macro, which we'll use throughout and should be defined as follows near the start of your source code:
#ifdef DEBUG
#define PRINTF printf
#else
#define PRINTF(...) {}
#endif
This handy macro allows you to do two things with ease:
- Activate or silence all of the debug output by defining or undefining DEBUG at the beginning of your source code.
- When disabled globally with DEBUG undefined, quickly enable a single PRINTF() by case-inverting it to a regular printf().
The second point is particularly easy to do in editors based on vi, using the tilde key in command mode.
Although more comprehensive debugging aids and output methods do exist, this simple technique is useful for smaller and uncomplicated projects where you don't really need multiple levels of debug output.
if (state > 0) {
PRINTF ("Continuation byte at string pos %u\n", i);
codepoint_build = codepoint_build << 6;
codepoint_build |= CUR_BYTE & 0x3F;
state--;
flag_byte_done = 1;
At this point, if state is still non-zero then we know that we are actually processing a continuation byte in the range 0x80 - 0xBF.
The data may or may not be valid since it could be part of an over-long encoding, but at the moment we don't care because it's certainly decodable to a codepoint and we'll check that codepoint once it's completely decoded.
Continuation bytes always contribute exactly their lower six bits to the codepoint being built up. The way we do it here is to shift whatever we already have left by six bits, then store the new bits in the lowest bits of codepoint_build. If there is another continuation byte to follow, then the bits that we've just written will be moved as required.
Of course we could just calculate the final position of each set of six bits from each continuation byte and store it in the correct location to begin with. In that case, the code above would need to change to something like this instead:
if (state > 0) {
PRINTF ("Continuation byte at string pos %u\n", i);
codepoint_build |= (CUR_BYTE & 0x3F) << (6*(state-1));
state--;
flag_byte_done = 1;
But note that with this alternate code in place it would also be necessary to change the code that handles the sequence introducer byte which we'll see shortly.
For now, we'll stay within the (state > 0) conditional block and check for having just completed reassembling a multibyte token.
Remember that we've just decreased state by one, so the following check for state==0 makes sense even though we're inside the (state > 0) conditional block.
/*
* Finished decoding a multibyte? Then check that the codepoint is valid.
*/
if (state==0) {
/*
* Overly long encoding?
*/
flag_overlong=0;
if (multibyte_size == 2 && codepoint_build < 0x80) {
flag_overlong = 1;
}
if (multibyte_size == 3 && codepoint_build < 0x800) {
flag_overlong = 1;
}
if (multibyte_size == 4 && codepoint_build < 0x10000) {
flag_overlong = 1;
}
PRINTF ("%sultibyte, (%u bytes), codepoint 0x%x, (decimal %u) ending at string pos %u\n", (flag_overlong==1 ? "Overlong m" : "M"),
multibyte_size, codepoint_build, codepoint_build, i);
This nested conditional for state==0 handles all of the possible outcomes following the re-construction of a multi-byte token.
First we check for it being an over-long encoding. We're doing that here with a series of explicit checks for better code readability, but later on when we write a full decoder we'll change this to do it algorithmically, partly because we'll want to support token lengths up to six bytes instead of just up to four.
If we see an over-long encoding, we just set a flag which will be checked in the next block of code.
if (flag_overlong == 1) {
STORE_ERRPOS
return (2);
}
if (codepoint_build > 0x10FFFF) {
PRINTF ("Codepoints > 0x10FFFF are invalid.\n");
STORE_ERRPOS
return (8);
}
if (codepoint_build >= 0xD800 && codepoint_build <= 0xDFFF) {
PRINTF ("Codepoint encodes a UTF-16 surrogate, and is therefore invalid in UTF-8.\n");
STORE_ERRPOS
return (4);
}
}
} /* Closes (state > 0) conditional */
To finish off the multi-byte token handling, we check for three specific issues, the overlong encoding flag that we just discussed, high codepoints above 0x10FFFF, and codepoints that encode the UTF-16 surrogates.
If any of these conditions is encountered, we return with the corresponding error code.
if (flag_byte_done == 0) {
if (leading_ones == 1) {
PRINTF ("Continuation byte without multibyte start byte\n");
STORE_ERRPOS
return (32);
}
If we haven't matched anything so far, yet the current byte is a continuation byte, (leading_ones == 1), then it can't have been part of a valid multi-byte sequence following an introducer.
In this case, we return error code 32.
if (leading_ones > 4) {
PRINTF ("Invalid byte 0x%02x in UTF-8 stream at position %u\n", CUR_BYTE, i);
STORE_ERRPOS
return (1);
}
Equally, if the number of leading ones exceeds four then we are either looking at a five or six byte introducer, or possibly an 0xFE or 0xFF byte.
Logically, 0xFE and 0xFF might be thought of as seven and eight byte sequence introducers, although they have never been assigned to such a role in any official version of the UTF-8 standard. Nevertheless, since our bit-counting code counts them in this way, the (leading_ones > 4) conditional above will conveniently catch 0xFE and 0xFF.
if (leading_ones > 1) {
flag_byte_done = 1;
state = multibyte_size = leading_ones;
/*
* Decrement state by one, as the number of continuation bytes is one less than the total.
*/
state--;
/*
* The leading ones are always followed by a zero.
* Therefore, the number of lower bits that contain codepoint data is (7 - leading_ones).
* We can create the necessary bitmask by right shifting 0xFF by leading_ones places.
*/
PRINTF ("Multibyte start at string pos %u\n", i);
codepoint_build = (CUR_BYTE & (0xFF >> leading_ones));
}
}
At this point we've already checked for introducers of tokens beyond four bytes in length, so if leading_ones is two or more then we have the start of a multi-byte token that we want to decode, (but which could still turn out to be invalid for other reasons).
We store the number of leading ones, (which directly represents the overall number of bytes in the sequence), in multibyte_size so that we can refer to it once the sequence is complete and check that it isn't an over-long encoding.
The same value is also initially stored in state, but since this represents the number of remaining continuation bytes and doesn't include the introducer, we immediately decrement state by one to get the correct value.
Any low bits of the introducer which form part of the final codepoint are copied in to the lower bits of codepoint_build, and will be shifted left as required every time we encounter a continuation byte.
As with the continuation bytes, it's also possible to write the bits to the correct positions directly:
codepoint_build = (CUR_BYTE & (0xFF >> leading_ones)) << (6 * state);
Doing it this way involves more calculations overall, though, including a multiplication by a factor which is not a power of two and so can't be reduced to a simple shift.
Top tip!
Calculating the bitmask
Up to five of the least significant bits of the introducer contain codepoint data.
Since the final re-constructed codepoint value will be used directly without any masking of its significant bits, we need to mask off the variable number of leading set bits in the introducer byte here.
This applies in the same way regardless of whether we initially store these bit values directly in the corresponding low bits of codepoint_build and let the code we saw earlier shift them left by six bits each time it encounters a valid continuation byte, or whether we store them directly in their final position.
To create a suitable mask, we can simply take the value 0xFF and shift it right by the number of leading ones. For example, for a three byte sequence introducer we shift by three bits:
11111111 0xFF
00011111 (0xFF >> 3) = 0x1F
11101010 Typical three byte sequence introducer
00011111 Mask
00001010 Bits containing codepoint data
Note that strictly speaking, the bits that represent actual codepoint data would be correctly masked by (0xFF >> (leading_ones + 1)):
00001111 (0xFF >> 4) = 0x0F
00001010 Bits containing codepoint data
However, by definition the bit immediately following the leading set bits must be a zero, so including it in the mask as a set bit will not affect the value extracted from the introducer and doing it this way allows us to make a slight optimisation by avoiding the calculation of (leading_ones + 1).
if (flag_byte_done == 0) {
/*
* Must be a regular ASCII character
*/
PRINTF ("Regular ASCII character at string pos %u\n", i);
flag_byte_done=1;
}
Having dealt with every possible instance of multi-byte introducers and continuation bytes, if flag_byte_done is still zero then we're looking at a regular ASCII character, 0x00 - 0x7f.
In a decoder the ASCII value would just be written out directly as the 32-bit codepoint value, (with possible special handling of 0x00 and other control characters). Since we're only validating the input and not writing the decoded output anywhere there is no special processing required and we just set flag_byte_done to skip the integrity check below.
if (flag_byte_done == 0) {
PRINTF ("Logic error in utf-8 parser!\n");
exit(1);
}
This final block of code should never run with the program in its current form. It was used during development, (together with a now removed check for CUR_BYTE < 0x80 in the previous block), to catch errors in the decoding logic before everything was complete and fully debugged.
Now that the function runs successfully and produces the correct results we could indeed remove these lines, along with the line that sets flag_byte_done when an ASCII character is found.
However, as well as being included here for completeness as an example of how to add such debugging code to your own programs, the assumption that the function will always produce the expected results could be broken by any of the following:
- Future changes to the code
- Over-zealous use of compiler optimisation flags
- Bugs in the C compiler, OS code, or system libraries
- Hardware failures
Whilst it's not really practical to make any guarantees about programs running on faulty hardware or that have been compiled with buggy compilers, we do plan to make further changes and additions to this code. During this process it's useful to keep the integrity check above enabled.
}
This closes the big for (i=0; i < len; i++) loop.
if (state != 0) {
PRINTF ("End of string within multibyte character - expecting %u more bytes\n", state);
STORE_ERRPOS
return (64);
}
PRINTF ("Valid UTF-8 string of %u bytes\n", len);
return (0);
#undef CUR_BYTE
#undef STORE_ERRPOS
}
All that is left to do once we've reached the end of input is to check that we didn't finish within an uncompleted multi-byte token.
If we did have an uncompleted token then we return error 64, otherwise the validation was successful and the function returns 0.
Here is the annotated source code for the decoder function.
int utf8_decode(struct st_utf8_params * params)
{
This time we are passing all of the required values within a single structure.
unsigned int i;
unsigned int codepoint_build;
unsigned int outbuf_size;       /* Amount of memory currently allocated to the output buffer */
unsigned char bitops;
unsigned char state;
unsigned char leading_ones;
unsigned char flag_byte_done;
unsigned char flag_overlong;
unsigned char multibyte_size;
unsigned int * new_outbuf;      /* Temporary pointer used when re-allocating the output buffer */
Most of the local variables here serve the same purpose as they did in the validator function. The two new ones relate to the management of the output buffer, which obviously didn't exist in the validator.
Note that outbuf_size is distinct from params->buffer_decoded_len.
Whereas buffer_decoded_len will store the actual number of valid elements in the output buffer, (in other words the current position to append data to), outbuf_size will store the amount of memory currently allocated, which will always be at least as much as is currently in use.
#define CUR_BYTE *(params->buffer_encoded + i)
#define STORE_ERRPOS { params->error_pos = i; }
#define MAX_OUTBUFFER_SIZE (256 * 1024 * 1024) /* 256 MB */
#define REALLOC_INCREASE (64 * 1024) /* 64 KB */
The first two of these macros also serve the same purpose here as they did in utf8_validate(), but be aware that the actual definitions have changed.
Also note that the value of REALLOC_INCREASE should always be a multiple of sizeof(uint32_t). See below for further details.
CUR_BYTE is functionally the same, with just a change of variable name from buffer to buffer_encoded.
STORE_ERRPOS no longer checks for error_pos being NULL, as it is now params->error_pos and as part of the supplied structure it always exists.
MAX_OUTBUFFER_SIZE is used to constrain the amount of memory that can be allocated to the output buffer.
REALLOC_INCREASE is a fairly arbitrary constant specifying the size of increments to use if the initial memory allocation is insufficient.
params->error_flags=0;
params->error_pos=0;
params->error_len=0;
These are values that will be returned to the calling function.
Since error_flags is a bitfield that will be built up one bit at a time if and when we encounter any non-fatal errors, we need to initialize it to zero before use.
The other return values for position and length will be explicitly set before returning if an error occurs during the actual decoding, so we don't need to initialize them. But if we happen to return early before even starting to decode the input, due to an initial memory allocation error or invalid parameters, then it makes sense to have set them both to zero.
In the case of successful decoding with no fatal errors these default values of zero will also be returned.
We don't set a default value of zero for buffer_decoded_len here because that variable is also optionally used to supply a hint value for the size of the output buffer and we haven't read that value yet.
if (params->max_token_len > 6 || params->max_token_len < 2 || params->buffer_encoded_len==0) {
params->buffer_decoded_len=0;
return (256);
}
Here we check that the supplied parameters are within a sensible range.
The maximum multi-byte token length is constrained to values between 2 and 6, and zero-length input is also rejected.
If we return due to error here then we set buffer_decoded_len to zero, overwriting any supplied hint value.
#define INITIAL_SIZE (params->buffer_decoded_len == 0 ? params->buffer_encoded_len : params->buffer_decoded_len)
if (INITIAL_SIZE > MAX_OUTBUFFER_SIZE / sizeof(uint32_t)) {
params->buffer_decoded_len=0;
PRINTF ("Initial memory allocation exceeds maximum configured size of %u bytes\n", MAX_OUTBUFFER_SIZE);
return (128);
}
outbuf_size = sizeof(uint32_t) * INITIAL_SIZE;
#undef INITIAL_SIZE
PRINTF ("Allocating %u bytes for output buffer\n", outbuf_size);
params->buffer_decoded=malloc(outbuf_size);
if (params->buffer_decoded == NULL) {
PRINTF ("Initial memory allocation error\n");
params->buffer_decoded_len=0;
return (128);
}
This is the initial memory allocation code.
Since UTF-8 is a variable length encoding, we don't generally know in advance of decoding a stream how many codepoints the input will actually decode to because we don't know anything yet about the frequency and mix of multi-byte tokens. By extension, we usually don't know the exact size required for the output buffer.
We can easily calculate an upper bound for the size, though, as the largest possible number of output codepoints is just the same as the number of bytes in the input. This largest possible value occurs when the input is entirely ASCII characters with no multi-byte tokens and each input byte decodes to exactly one 32-bit integer.
Remember that whilst the input is based on 8-bit bytes, (octets), the output consists of 32-bit integer values.
However large the output ends up being in codepoints, we need to multiply this value by sizeof(uint32_t) to allocate enough bytes to hold it.
On almost all systems, an unsigned 32-bit integer is stored in four bytes, although nothing actually guarantees this so the correct way to find the size is to use sizeof().
Knowing this, there are basically two ways that we can approach the task of buffer memory allocation:
- Allocate a small amount of memory and increase it as required.
- Allocate the maximum amount that might be required, (and possibly subsequently reduce it).
In reality, the size of a typical UTF-8 data stream is likely to be an order of magnitude smaller than the amount of memory available on a modern system, so in many or even most cases the second option is perfectly reasonable and has the added advantage of simplicity.
This is the approach that the code above takes if buffer_decoded_len is set to zero.
However, it is also possible that we might have already done some pre-processing of the input external to this function and that we either know or have some idea of how big the output will be. In that case it would be useful to communicate this information to utf8_decode() so that we could reduce or avoid the temporary over-allocation of memory.
Such a hint, (expressed as a value in codepoints, not bytes), can be supplied in buffer_decoded_len and if it is then it will be used instead of the size of the input in the calculation of outbuf_size.
Note that the formula used to bounds check the value of INITIAL_SIZE against MAX_OUTBUFFER_SIZE, (which is in bytes), uses division instead of multiplication:
(INITIAL_SIZE > MAX_OUTBUFFER_SIZE / sizeof(uint32_t))
Doing this avoids the possibility of a multiplication causing wrap around, (also commonly called ‘overflow’), which could occur if we used a construction such as this:
(INITIAL_SIZE * sizeof(uint32_t) > MAX_OUTBUFFER_SIZE)
Although in practice the values we're using here typically won't wrap on a system where size_t is a 64-bit type, as always future changes to the code could invalidate this assumption, so it's generally better to use the division approach in these situations.
And yes, the C standard library on any modern system does include calloc(), and many systems additionally provide reallocarray(), both of which return an error on overflow, so we could avoid this very specific risk by using one of those in place of malloc() here. However, the existence of these alternative functions is not a good substitute for common sense and understanding the risk of overflow anywhere else in the code where we also use multiplication in relation to memory management and data buffers.
If the actual memory allocation fails, we return error code 128.
state=0;
params->buffer_decoded_len=0;
Here we set the initial state of not being within a multi-byte token, just as we did in the validator.
We can also set buffer_decoded_len to zero ready to use as our output length indicator, now that we've finished with any value that was originally supplied in it.
for (i=0; i < params->buffer_encoded_len; i++) {
At this point we enter the main loop that will iterate over the input.
if (params->buffer_decoded_len * sizeof(uint32_t)==outbuf_size) {
if (outbuf_size + REALLOC_INCREASE > MAX_OUTBUFFER_SIZE) {
free (params->buffer_decoded);
params->buffer_decoded_len=0;
params->error_len=1;
STORE_ERRPOS
return (128);
}
new_outbuf=realloc(params->buffer_decoded, outbuf_size + REALLOC_INCREASE);
if (new_outbuf == NULL) {
free (params->buffer_decoded);
params->buffer_decoded_len=0;
params->error_len=1;
STORE_ERRPOS
return (128);
}
PRINTF ("Reallocated output buffer from %u to %u\n", outbuf_size, outbuf_size + REALLOC_INCREASE);
params->buffer_decoded=new_outbuf;
outbuf_size += REALLOC_INCREASE;
}
The first thing we have in the loop is more buffer memory management code, this time to increase the allocation if the output of previous loops has already filled it. Naturally this will never run if the function is called without a size hint in buffer_decoded_len, as the output buffer is already at least the size that it needs to be.
Of course, if the output buffer was always allocated to be the maximum possible size that could be required for the input, in other words the hinting code mentioned previously hadn't been included, then this entire block of code would also be un-necessary. Remember what I said about the added advantage of simplicity?
Care should be taken to ensure that the value of MAX_OUTBUFFER_SIZE is not configured beyond (0xFFFFFFFF - REALLOC_INCREASE), as doing so would allow the value which is eventually passed to realloc() to wrap. Since MAX_OUTBUFFER_SIZE is a compile time option and virtually no real world usage would require an output buffer anywhere near this size, (just under 4 GB), the code doesn't perform any further checks to enforce this upper limit.
The logic in the outer conditional is also the reason that REALLOC_INCREASE must be configured as a multiple of sizeof(uint32_t) for reliable operation. Since we check specifically for equality with outbuf_size rather than passing beyond its value, we need to ensure that outbuf_size is only increased by the same multiple that we use for buffer_decoded_len, otherwise the comparison might never match and we could easily end up writing beyond the allocated memory.
flag_byte_done=0;
bitops=CUR_BYTE;
for (leading_ones=0; (bitops & 0x80)==0x80; leading_ones++, bitops=bitops << 1) ;
Now finally we get to the start of the actual UTF-8 parsing code. These three lines are unchanged from the validator function.
if (state > 0 && leading_ones != 1) {
PRINTF ("Non-continuation byte 0x%02x where continuation byte expected within multibyte sequence at string pos %u\n", CUR_BYTE, i);
if ((params->ignore & 0x10) == 0) {
params->error_len=1;
STORE_ERRPOS
return (16);
}
params->error_flags |= 0x10;
/*
* If error 0x10 is being ignored then we discard the multi-byte sequence that was being constructed.
* Additionally, we maintain flag_byte_done=0 so that the code below will process it in some way.
* This could be as a regular ASCII character or as a new sequence introducer.
*/
state=0;
}
The basic order of operations in this decoding function is the same as it was in the validator.
First we deal with being part way in to a multi-byte token and encountering a non-continuation byte where we would have expected a continuation byte.
The main difference here, (apart from the slightly more verbose debugging message), is in the error handling. Whereas before we just automatically returned to the calling function with error code 16, now we only do that if the corresponding bit in params->ignore is clear. If that bit is set, we set the same bit in params->error_flags to indicate a non-fatal error, and continue on almost as if nothing happened.
In the case of treating this as a non-fatal error, we do reset state to zero. This ensures that we won't run the code that handles completed multi-byte sequences, as that is within the following (state > 0) conditional. Effectively, setting state to zero here cancels the multi-byte sequence that we were already in.
Of course, what we now do with the unexpected byte, (as well as any that were part of the now discarded multi-byte token being constructed), is an open question. There are several logical options, including at least:
- Completely discard it and write nothing to the output in its place.
- Replace it with a specific character such as the replacement character 0xFFFD.
- Replace it with one character out of a range of characters in a private use block such as 0xF0000 - 0xF00FF.
- Interpret it as if it had been encountered outside of an existing multi-byte token.
As written, the block of code above will take the last option and allow the following code to process the byte as either a regular ASCII character or as a new sequence introducer.
If we wanted to completely discard the byte instead, that could be done simply by setting flag_byte_done:
[...]
state=0;
flag_byte_done=1;
}
[...]
Replacing the unexpected byte with the replacement character can be done with just one more line of code:
[...]
state=0;
*(params->buffer_decoded+params->buffer_decoded_len++)=0xfffd;
flag_byte_done=1;
}
[...]
As can replacing it with one character from a range:
[...]
state=0;
*(params->buffer_decoded+params->buffer_decoded_len++)=0xf0000 + CUR_BYTE;
flag_byte_done=1;
}
[...]
Obviously adding this kind of flexibility as a selectable option at runtime quickly begins to increase the complexity of the code, and in many cases it's un-necessary as the desired action can be fixed at compile time.
Ultimately, if the program calling this decoding function has very specific requirements for handling such errors then it can always leave the corresponding bit of the ignore flag unset, which will cause the decoder to stop and return a fatal error along with the precise location of the error in the input stream. The calling code can then do its own processing of the bytes involved before calling the decoder again with the start of the input positioned after the erroneous bytes to resume regular decoding.
Note that within the decoder we can easily treat invalid bytes differently depending not only on their value but also on where they were encountered. For example, if a 0xFF byte is found within a multi-byte sequence in place of a continuation byte in the range 0x80 - 0xBF, then we might want to completely discard it. On the other hand, if 0xFF occurs on its own within regular ASCII text, then we might choose to replace it with the replacement character instead.
Finally, before we move on, note that the number of bytes that were in the partially completed multi-byte sequence that we discarded can be calculated with (multibyte_size - state), before resetting state to zero.
Important!
Writing multiple bytes per iteration
Be aware that modifying the code so that it inserts multiple output values during one iteration of the main loop would require changes to the memory management code. Currently that code only enlarges the buffer when it's completely full, so by extension it only guarantees that there is space for at least one more output value.
Making this change is left as an exercise for the reader. Failure to do it correctly will result in code which works most of the time but fails in subtle ways depending on the nature of the input data and the exact position of any invalid bytes in it. Have fun!
if (state > 0) {
PRINTF ("Continuation byte 0x%02x at string pos %u\n", CUR_BYTE, i);
codepoint_build = codepoint_build << 6;
codepoint_build |= CUR_BYTE & 0x3F;
state--;
flag_byte_done = 1;
Here we process correctly encountered continuation bytes in the same way as we did in the validator, pushing the lower six bits in to codepoint_build after shifting what we already had left by six bits.
/*
* Finished decoding a multibyte? Then check that the codepoint is valid.
*/
if (state==0) {
/*
* Overly long encoding?
*/
flag_overlong=0;
if (codepoint_build < (multibyte_size == 2 ? 0x80 : (1 << (8 - multibyte_size + 6 * (multibyte_size - 2))))) {
flag_overlong = 1;
}
Previously we used a series of explicit checks to decide whether to set flag_overlong or not.
That was arguably more readable than the algorithmic approach shown above, as we could see the specific values that were being tested for, 0x80, 0x800, and 0x10000. However we now want to process sequences up to 6 bytes which would add two more distinct if statements, and at this point the calculation above starts to look more compact and easier to verify as being correct. After all, the discrete values being tested become longer as the number of bytes increases, making it somewhat less obvious if one of them was accidentally changed during code editing.
PRINTF ("%sultibyte, (%u bytes), codepoint 0x%x, (decimal %u) ending at string pos %u\n", (flag_overlong==1 ? "Overlong m" : "M"),
multibyte_size, codepoint_build, codepoint_build, i);
if (flag_overlong == 1) {
if ((params->ignore & 0x02) == 0) {
params->error_len=multibyte_size;
STORE_ERRPOS
return (2);
}
params->error_flags |= 0x02;
}
There is a slight difference in the handling of overlong encodings as fatal errors here, as we now set error_len to the size of the multi-byte token before returning.
If we are ignoring overlong encoding errors then all we do is set the corresponding bit of error_flags and continue - the code further below will happily process the decoded token.
if (codepoint_build > 0x10FFFF) {
PRINTF ("Codepoints > 0x10FFFF are invalid.\n");
if ((params->ignore & 0x08) == 0) {
params->error_len=multibyte_size;
STORE_ERRPOS
return (8);
}
params->error_flags |= 0x08;
}
if (codepoint_build >= 0xD800 && codepoint_build <= 0xDFFF) {
PRINTF ("Codepoint encodes a UTF-16 surrogate, and is therefore invalid in UTF-8.\n");
if ((params->ignore & 0x04) == 0) {
params->error_len=multibyte_size;
STORE_ERRPOS
return (4);
}
params->error_flags |= 0x04;
}
Likewise with codepoints beyond 0x10FFFF and the UTF-16 surrogates. In the case of fatal errors, we set error_len as well as the error position; otherwise we just set the soft error flags in error_flags and continue.
CESU-8
Of course, it's possible that the input might decode to produce two sequential codepoints that together represent a valid surrogate pair in UTF-16. An otherwise valid UTF-8 stream that intentionally contains such pairs of codepoints is known as CESU-8.
Although it would be possible to further decode such surrogate pairs directly within the UTF-8 decoder, this would complicate the codebase somewhat. A typical approach might involve the following steps:
- Set a flag whenever a high surrogate, (0xD800 - 0xDBFF), is encountered in a valid position.
- Store that codepoint in a temporary variable, but don't write it to the output buffer.
Then, whenever the above mentioned flag is already set, do the following:
- Check whether the codepoint decoded from the input is a valid low surrogate, (0xDC00 - 0xDFFF).
- If it is, then reconstruct a single codepoint from the two surrogates, according to the UTF-16 specification.
- Write the reconstructed codepoint to the output buffer.
Implementing this is left as an exercise for the reader.
*(params->buffer_decoded+params->buffer_decoded_len++)=codepoint_build;
}
}
At this point we have successfully reconstructed a multi-byte token, and can write it to the output buffer.
If any of errors 2, 4, or 8 are being ignored then the token may have been constructed from invalid input, but we still write what we have to the output buffer and do not discard it.
This also closes the (state > 0) conditional block.
if (flag_byte_done == 0) {
If the previous code didn't set flag_byte_done, then either we were not processing a multi-byte token, or we were doing so but then aborted its reconstruction after seeing a non-continuation byte.
if (leading_ones == 1) {
PRINTF ("Continuation byte without multibyte start byte\n");
if ((params->ignore & 0x20) == 0) {
params->error_len=1;
STORE_ERRPOS
return (32);
}
params->error_flags |= 0x20;
/*
* If error 0x20 is being ignored, then unexpected continuation bytes are silently dropped and the decoding continues.
*/
flag_byte_done=1;
}
If the code path reaches here and the current byte is a continuation byte then it must have occurred outside of a valid sequence, otherwise the code above would have set flag_byte_done.
Just like for the previous errors that we've tested for, if the corresponding bit in ignore is not set then we return with an error code, having set error_len to the appropriate length, which in this case is one.
In the case that this error is set to be ignored then we flag it in error_flags and also set flag_byte_done so that the spurious continuation byte isn't further processed by the following code.
Of course, instead of simply dropping the continuation byte we could easily insert the replacement character in the output with just one extra line of code:
*(params->buffer_decoded+params->buffer_decoded_len++)=0xfffd;
if (leading_ones > params->max_token_len) {
PRINTF ("Invalid byte 0x%02x in UTF-8 stream at position %u\n", CUR_BYTE, i);
if ((params->ignore & 0x01) == 0) {
params->error_len=1;
STORE_ERRPOS
return (1);
}
params->error_flags |= 0x01;
flag_byte_done=1;
}
Now we check for multi-byte introducers that would introduce a sequence that exceeded the maximum length that we configured in max_token_len.
If this error is ignored, then we silently drop the introducer and continue decoding. Any following continuation bytes that were part of such a longer sequence will be processed independently on subsequent iterations of the loop by the code in the previous block, and the behaviour of the routine in that case will depend on whether error 0x20 is being ignored or not.
}
if (flag_byte_done == 0) {
Re-check the flag as we don't want to run the code below if we've already done something useful with the input.
if (leading_ones > 1) {
flag_byte_done = 1;
state = multibyte_size = leading_ones;
/*
* Decrement state by one, as the number of continuation bytes is one less than the total.
*/
state--;
/*
* The leading ones are always followed by a zero.
* Therefore, the number of lower bits that contain codepoint data is (7 - leading_ones).
* We can create the necessary bitmask by right shifting 0xFF by leading_ones places.
*/
PRINTF ("Multibyte start at string pos %u\n", i);
codepoint_build = (CUR_BYTE & (0xFF >> leading_ones));
}
}
If none of the conditions up to now have matched the input byte, then it's either an introducer for a multi-byte token or a regular ASCII character.
This block of code deals with the case of it being a multi-byte introducer, and is identical to the equivalent code in the validator.
We can assume that such an introducer found here is in a valid place to begin a new multi-byte sequence, because error 0x10, (16), would have been returned if such a byte had occurred within an existing sequence, (where we would be expecting a continuation byte instead). If the 0x10 bit is set in ignore then, instead of returning an error, the code will have dropped the existing sequence and will now allow this new sequence introducer to be parsed as if it were simply within other ASCII text.
As before, the number of leading set bits directly indicates the size of the token in bytes, so we store this value in both multibyte_size and state. The value in state will be decremented each time we process a byte and will therefore serve as our inner loop counter. The value in multibyte_size will not be decremented and will be referenced in places we need to know the overall size of the token, such as checking for over-long encodings after successfully decoding it, or otherwise for setting error_len.
if (flag_byte_done == 0) {
/*
* Must be a regular ASCII character
*/
PRINTF ("Regular ASCII character at string pos %u\n", i);
flag_byte_done=1;
*(params->buffer_decoded+params->buffer_decoded_len++)=CUR_BYTE;
}
Regular ASCII characters are simply written to the output buffer directly, with no further processing except for casting the byte value to 32 bits.
if (flag_byte_done == 0) {
PRINTF ("Logic error in utf-8 parser!\n");
exit(1);
}
}
This closes the main i loop, after the same code integrity check that we saw in the validator.
if (state != 0) {
PRINTF ("End of string within multibyte character - expecting %u more bytes\n", state);
if ((params->ignore & 0x40) == 0) {
params->error_len=multibyte_size-state;
i--;
STORE_ERRPOS
return (64);
}
params->error_flags |= 0x40;
}
If the input ends unexpectedly in the middle of a multibyte token, this is considered an error just as it was in the validator.
Like the other error conditions, it can be ignored and treated as non-fatal by setting the appropriate bit, (bit 6), in ignore.
However, there is also a subtle but significant change in the error handling here compared to the validator when this condition is treated as a fatal error. Previously, the position returned by STORE_ERRPOS was one byte beyond the end of the input, indicating the position of the next expected continuation byte. This made sense for the validator because it only returned the position of the error and not the number of bytes that were involved.
Since here in the decoder we are also returning a value for error_len, it makes more sense to store the position of the last actual byte in the incomplete sequence. Doing this will allow us to easily highlight the entire incomplete sequence in a special error printing routine that we'll see shortly.
PRINTF ("%salid UTF-8 string of %u bytes, decoded to %u codepoints\n", (params->error_flags == 0 ? "V" : "Parsable but strictly inv"),
params->buffer_encoded_len, params->buffer_decoded_len);
The summary information is slightly more detailed than before, now including the number of decoded codepoints as well as an indication of whether any non-fatal errors occurred and were ignored.
#define OVER_ALLOC (outbuf_size - params->buffer_decoded_len * sizeof(uint32_t))
#define DECODED_BYTES (params->buffer_decoded_len * sizeof(uint32_t))
PRINTF ("Total output size is %lu, allocated buffer is %u. Over-allocation was %lu bytes.\n", DECODED_BYTES, outbuf_size, OVER_ALLOC);
if (OVER_ALLOC > 32 * 1024) {
PRINTF ("Reducing output buffer to %lu... ", DECODED_BYTES);
new_outbuf=realloc(params->buffer_decoded, DECODED_BYTES);
if (new_outbuf == NULL) {
/*
* The re-allocation is always going to be a _reduction_, and never an increase.
* It's possible that on a memory-constrained system a large buffer could not be re-allocated and the call to realloc()
* might fail. In this case we can still return normally because the original oversized buffer still exists, but we
* set bit 7 in the error flags to indicate that this non-critical error occurred.
*/
PRINTF ("failed!?\n");
params->error_flags |= 128;
return (0);
}
PRINTF ("OK!\n");
params->buffer_decoded=new_outbuf;
}
return (0);
At this point we have the required decoded output in the output buffer, and could simply return to the calling function.
However, in some cases the amount of memory allocated to the output buffer will be considerably more than is actually used. This can happen because, unless the function was called with an accurate sizing hint for the output buffer, an estimated size was calculated assuming the worst possible case: an input stream consisting entirely of single-byte tokens encoding ASCII values.
For input that only contains the occasional non-ASCII codepoint, the overhead is negligible. On modern systems where memory is plentiful, it might not actually matter at all. If the decoded output is going to be further processed or written to disk immediately then the excessive memory allocation is only required for a brief period of time anyway.
Nevertheless, for completeness we might as well see how to reduce the allocation to the actual size of the data stored, and that is what this block of code does.
Two new macros are defined here:
OVER_ALLOC is the amount of memory allocated but not used.
DECODED_BYTES is the size in bytes of the useful data.
We use these values to print some statistics, then proceed to call realloc() with the new size if the difference is greater than 32K. This value of 32K is completely arbitrary and simply serves as a small optimisation to avoid the overhead of the reallocation where the savings are minimal.
Since we are reducing the allocation rather than increasing it, the call to realloc() is unlikely to fail. However, if it did, the existing allocation would remain valid, so we can still return successfully. In that case, we set bit 7 of error_flags to indicate to the caller that this happened.
Note that, in contrast to the uses of the other bits in error_flags, we set bit 7 here regardless of whether the corresponding bit is set in ignore or not. This is the exception that was mentioned earlier when we introduced the new error codes.
This handling of bit 7 makes sense, as a failed realloc() here is always a non-fatal error, and can always be ignored. In contrast, the other four places where memory allocation errors can occur in this function are always fatal and can never be ignored.
The calling code doesn't even need to check bit 7 in error_flags or take any special action, as the returned data in the output buffer is the same either way. It's a purely informational flag.
#undef OVER_ALLOC
#undef CUR_BYTE
#undef STORE_ERRPOS
#undef MAX_OUTBUFFER_SIZE
#undef REALLOC_INCREASE
}
Finally, we undefine the compiler macros that are only used locally within this function, and at this point we're done.
Encoder source code with annotations
Here we go with the annotated source code for the encoder function!
int utf8_encode(struct st_utf8_params * params)
{
#define MAX_OUTBUFFER_SIZE (256 * 1024 * 1024) /* 256 Mb */
#define HIGH_BITS(i) ((0xFF << (8-i)) & 0xFF)
#define MAX_MAX_TOKEN_LEN 6
As with the decoder, we're passing all of the parameters in a single structure.
We also have three defines, two of which are new:
MAX_OUTBUFFER_SIZE is the same as it was for the decoder
HIGH_BITS(i) calculates the high bits to set in the introducer byte of an i byte sequence
MAX_MAX_TOKEN_LEN is the maximum length acceptable for max_token_len
The two new macros probably deserve a more detailed explanation.
HIGH_BITS() returns a byte with i leading set bits and all of the lower bits reset to zero. This value can then be logically or'ed with some of the highest bits of the codepoint value to generate the correct sequence introducer.
For example, the codepoint 0x567 is 10101100111 in binary, and is correctly encoded in UTF-8 as a two byte sequence. Therefore, we would want to combine the following two values:
11000000 - result of HIGH_BITS(2)
00010101 - top five bits of the codepoint
11010101 - resulting introducer byte
MAX_MAX_TOKEN_LEN is a compile-time setting to limit the largest value of max_token_len that the function will accept at runtime.
Although the vast majority of strictly non-compliant but nevertheless quasi-UTF-8 streams don't use sequences beyond six bytes, the encoding logic below could easily cope with seven and eight byte sequences as well. However, to do this it would require some values to be represented in 64 bits rather than 32 bits.
For this reason, and since the value of MAX_MAX_TOKEN_LEN is used in more than one place in the code, we define its value as a macro rather than hard-coding it as 6.
This provides an easy path for future code changes that would allow the program to be compiled with larger values of MAX_MAX_TOKEN_LEN, and to only use 64-bit data types in such a configuration.
Making this change is left as an exercise for the reader.
unsigned int cur_alloc; Amount of memory currently allocated to the output buffer
unsigned int i; Loop counter
unsigned int j; Secondary loop counter
unsigned int mb; Tertiary loop counter, (‘multi-byte’)
unsigned int new_alloc; Proposed size for output buffer when re-allocating memory
unsigned int max_codepoint[MAX_MAX_TOKEN_LEN + 1]; Array of lowest codepoint values that are invalid for each token length
unsigned char * new_outbuf; Temporary pointer used by the memory re-allocation code
params->error_flags=0;
params->error_len=0;
params->error_pos=0;
params->buffer_encoded_len=0;
All of the return values in the supplied structure are set to a default of zero.
Note that error_len is fairly meaningless for the encoder, as errors are always either zero or one codepoint long.
if (params->buffer_decoded_len == 0 || params->max_token_len < 2 || params->max_token_len > MAX_MAX_TOKEN_LEN) {
return (256);
}
Here we do an initial check that the supplied parameters are sensible.
If the supplied buffer is empty, or the maximum token length to encode is nonsensical, then we return immediately.
if (params->buffer_decoded_len > MAX_OUTBUFFER_SIZE / params->max_token_len) {
return (128);
}
cur_alloc=params->buffer_decoded_len * params->max_token_len;
params->buffer_encoded=malloc(cur_alloc);
if (params->buffer_encoded==NULL) {
return (128);
}
This is the memory allocation code.
In contrast to the decoder function where we used some logic to minimise memory usage, here we take a simple approach of just allocating enough memory to cover the worst case of every input codepoint requiring a token of the maximum permitted length.
Whilst it certainly would be possible to re-size the output buffer dynamically, and even use the mix of input bytes already processed to estimate the requirements of the data that is left, in reality modern systems are not usually memory constrained to the point where the added code complexity and processing at runtime would make it worthwhile.
If the memory requirement exceeds MAX_OUTBUFFER_SIZE or allocation fails, we return immediately with error code 128.
Of course, if the calling function is aware that the highest codepoint value in the input will only require a token of two or three bytes rather than the maximum of four permitted in a standards compliant UTF-8 stream, then it can also invoke the encoder with params->max_token_len set to the smaller value and memory usage will be reduced in that case.
for (i=2; i <= params->max_token_len; i++) {
max_codepoint[i] = (1 << (7-i + 6*(i-1)));
}
max_codepoint[0]=0;
max_codepoint[1]=128;
Next we build a lookup table in max_codepoint[] which actually contains the first invalid codepoint for each multibyte token length. These values are obviously one more than the actual maximum valid codepoint in each case, but the name max_codepoint was convenient here.
A 32-bit unsigned integer is sufficient to hold the calculated value for a six-byte token, which is the largest commonly found even in sequences which are not strictly UTF-8 compliant.
In the unlikely case of needing to process seven and eight byte tokens, then max_codepoint[] should be declared as a 64-bit type such as uint64_t.
Additionally, the constant 1 should be cast to a uint64_t in the calculation above:
max_codepoint[i] = ((uint64_t)1 << (7-i + 6*(i-1)));
#define CUR_CHAR params->buffer_decoded[i]
CUR_CHAR is analogous to CUR_BYTE in the decoder, it's simply the current input value being processed.
for (i=0; i < params->buffer_decoded_len; i++) {
This is the main loop over the input buffer.
if (CUR_CHAR > 0x10FFFF) {
if ((params->ignore & 0x08) == 0) {
params->error_len=1;
params->error_pos=i;
return (8);
}
params->error_flags |= 0x08;
}
If we encounter a codepoint above 0x10FFFF, then we either return with a fatal error code 8 or simply set bit 3 in error_flags and continue, depending on how bit 3 of ignore was set.
if (CUR_CHAR >= 0xD800 && CUR_CHAR <= 0xDFFF) {
if ((params->ignore & 0x04) == 0) {
params->error_len=1;
params->error_pos=i;
return (4);
}
params->error_flags |= 0x04;
}
Likewise for the codepoints representing the UTF-16 surrogates, except that here we return error code 4, and it's bit 2 of error_flags and ignore.
for (j=1; j <= params->max_token_len; j++) {
if (CUR_CHAR < max_codepoint[j]) {
break ;
}
}
Here we loop through the max_codepoint[] array checking the current codepoint value against the previously calculated maximum codepoints for each token length.
If CUR_CHAR is within the bounds of a token length that we're configured to allow, then the loop exits with j set to that length value.
Otherwise the loop will exit with j == (max_token_len + 1), which we check for next.
if (j==params->max_token_len+1) {
PRINTF ("Input codepoint 0x%x, (%u), exceeds maximum of 0x%x, (%u), encodable with %d bytes\n", CUR_CHAR, CUR_CHAR,
max_codepoint[params->max_token_len]-1, max_codepoint[params->max_token_len]-1, params->max_token_len);
params->error_pos=i;
params->error_len=1;
return (1);
}
If the codepoint requires more bytes to encode than we configured with the supplied value of max_token_len, then this is a fatal error which cannot be ignored, so we return error code 1.
PRINTF ("Input codepoint of 0x%x, (%u), encodable with %u bytes: ", CUR_CHAR, CUR_CHAR, j);
Output some general debugging information about the encoding.
if (j==1) {
params->buffer_encoded[params->buffer_encoded_len++]=CUR_CHAR;
PRINTF ("%d\n", params->buffer_encoded_len);
}
If the codepoint can be represented by a single byte token, we just write that value to the output buffer and increase buffer_encoded_len.
if (j > 1) {
params->buffer_encoded[params->buffer_encoded_len++]=(CUR_CHAR >> (6 * (j - 1))) | HIGH_BITS(j);
PRINTF ("Bytes: %02x", params->buffer_encoded[params->buffer_encoded_len-1]);
for (mb = j-1; mb > 0; mb--) {
params->buffer_encoded[params->buffer_encoded_len++]=(((CUR_CHAR >> (6 * (mb - 1))) & 0x3f) | 0x80) ;
PRINTF (" %02x", params->buffer_encoded[params->buffer_encoded_len-1]);
}
PRINTF ("\n");
}
The multiple byte case is only slightly more complicated.
The first byte is the combination of the HIGH_BITS() value that we discussed earlier, logically or'ed with however many of the most significant bits of codepoint data can fit in the rest of the token introducer. To get those bits, we simply shift the codepoint value right by six bits for each continuation byte, which will always be one less than the total sequence length. This gives the formula used in the code above: (6 * (j - 1)).
The second and subsequent bytes are all regular continuation bytes, so they each have bit 7 set, bit 6 reset, and contain six bits of codepoint data in bit 0 - bit 5. These are calculated in a loop which counts down from one less than the total sequence length to one. The formula used for the bit shifting is the same as before and uses the loop counter minus one as its argument. This ensures that the final byte of the sequence uses the lowest bits of codepoint data unshifted.
For example, for a four byte sequence we will have three continuation bytes. The loop will start at three and count down to one, with the values 2, 1, and 0 being passed to the bit-shifting formula. This results in the original codepoint value being shifted by 12 bits, 6 bits, and then 0 bits for the three continuation bytes.
These shifted values are then logically and'ed with 0x3f to ensure that bit 6 is not erroneously set, and finally or'ed with 0x80 to set bit 7 as required to indicate a valid continuation byte.
As each byte is written to the output buffer, buffer_encoded_len is increased by one.
This closes the main loop; once we've finished, we undefine CUR_CHAR as we won't need it any more.
#define OVER_ALLOC (cur_alloc - params->buffer_encoded_len)
PRINTF ("Total output size is %u, allocated buffer is %u. Over-allocation was %u bytes.\n", params->buffer_encoded_len, cur_alloc, OVER_ALLOC);
if (OVER_ALLOC > 32 * 1024) {
PRINTF ("Reducing output buffer to %u... ", params->buffer_encoded_len);
new_outbuf=realloc(params->buffer_encoded, params->buffer_encoded_len);
if (new_outbuf == NULL) {
PRINTF ("failed!?\n");
params->error_flags |= 128;
return (0);
}
PRINTF ("OK!\n");
params->buffer_encoded=new_outbuf;
}
Now that we have finished parsing and encoding the input, we know the exact size of the encoded version and can reduce the memory allocation to match that value.
This code is almost identical to the corresponding code in the decoder.
#undef OVER_ALLOC
#undef HIGH_BITS
#undef MAX_OUTBUFFER_SIZE
return (0);
}
Finally we undefine the other compiler macros used in this function and return zero to the caller to indicate success.
This is probably the most interesting special case from a coding perspective.
What we want to do is read the encoded byte sequence one byte at a time, starting from the last byte and working backwards. The output should be the set of decoded codepoints also in reverse order.
Whilst single byte ASCII characters can just be read directly in reverse and produce the expected result, the decoding logic for multi-byte tokens will need to be different.
When reading a valid UTF-8 stream in reverse, the continuation bytes of any multi-byte tokens will appear before the sequence introducer. Therefore, we won't know until the end of the sequence how many bytes it should contain. The six bits that we extract from each continuation byte will belong in increasingly higher order positions in the codepoint being built up, so they can be written directly to the correct position even without knowing the sequence length when we start.
Error checking also needs to change. Some of the failure modes possible when reading the stream forwards no longer exist - continuation bytes are now valid at any point - and there are some new error conditions too, such as finding a sequence introducer without first seeing any continuation bytes.
Obviously if a stream is invalid when reading it in reverse, it would also be invalid when reading forwards. The difference is that the error is detected in a different way. For example, when reading forwards a truncated multi-byte token would be detected by finding a non-continuation byte where another continuation byte was expected. When reading backwards the same error will be detected by reading a sequence introducer that represents a larger number of continuation bytes than we have already read. It's the same part of the stream that is in error, but the way it's detected is different.
The annotated source code follows. Since the memory management code here is identical to the same code in the regular decoding function, it's not repeated here for brevity. The commentary starts at the opening of the main i loop, and the initial buffer enlargement code within it has also been replaced with a placeholder comment.
for (i=0; i < params->buffer_encoded_len; i++) {
/*
* Code to perform bounds checking and enlargement of the output buffer has
* been removed for brevity. Refer to the corresponding code in utf8_decode().
*/
flag_byte_done=0;
bitops=CUR_BYTE;
for (leading_ones=0; (bitops & 0x80)==0x80; leading_ones++, bitops=bitops << 1) ;
We start the same way as before, counting the leading set bits in the current byte.
First we will check for a number of conditions that would complete a multi-byte sequence that has already been started, (after encountering an initial continuation byte).
if (leading_ones == 0) {
PRINTF ("ASCII byte 0x%02x encountered during multi-byte sequence at string pos %u\n", CUR_BYTE, i);
if ((params->ignore & 0x200) == 0) {
params->error_len=1;
STORE_ERRPOS
return (512);
}
params->error_flags |= 0x200;
state=0;
}
An ASCII character would be invalid here, since when reading backwards a multi-byte sequence needs to end with the correct sequence introducer. Although we could just reject anything without the correct number of leading bits in a generic way using a single bit in the error bitfield, it's more useful to identify this specific case with its own dedicated error code. In this case, we've used 512, since bit 9 is the lowest unused error flag bit.
This error can be ignored by setting bit 9 of ignore. In that case, the token being constructed is discarded, and the current ASCII byte will be processed as a regular character by the code later on.
if (leading_ones == 1) {
if (state == params->max_token_len - 1) {
PRINTF ("Continuation byte forms a sequence longer than the configured maximum length\n");
params->error_len=state;
STORE_ERRPOS;
return(1024);
}
codepoint_build |= ((CUR_BYTE & 0x3F) << state * 6);
state++;
flag_byte_done=1;
}
A continuation byte during a multi-byte sequence simply adds its lower bits to the codepoint value being built up.
The position is easy to calculate: because we read the stream backwards, we encounter the least significant 6-bit group first, and subsequent groups belong in progressively more significant bit positions. Therefore, the number of bit places to shift is simply state multiplied by six.
If we've already encountered enough continuation bytes to form a sequence of the maximum length permitted by max_token_len when combined with the correct introducer, then finding another continuation byte is considered a fatal error.
if (leading_ones > 1 && leading_ones != state + 1) {
if ((params->ignore & 0x800) == 0) {
params->error_len=1;
STORE_ERRPOS
return (2048);
}
params->error_flags |= 0x800;
state=0;
flag_byte_done=1;
}
Encountering a sequence introducer that doesn't match the number of continuation bytes already seen is another error condition.
If this error is ignored, we discard the codepoint that was being constructed and do no further processing of the current byte, but otherwise continue decoding the stream.
if (leading_ones > 1 && leading_ones == state + 1) {
Now we deal with the case of a corresponding sequence introducer after the correct number of leading continuation bytes.
This will lead to a correctly re-constructed codepoint, although that codepoint might still be outside of the permitted range.
codepoint_build |= ((CUR_BYTE & (0xFF >> leading_ones)) << 6 * state);
state++;
flag_byte_done=1;
First, we complete the codepoint being built up, by adding the appropriate low bits from the sequence introducer to the value that we already have.
Then we increase state by one so that it holds the actual correct length of the whole sequence, and set flag_byte_done, since the current byte was at least valid within the broad terms of UTF-8 encoding logic, and we certainly don't want to process it in any other way.
flag_overlong=0;
if (codepoint_build < (state == 2 ? 0x80 : (1 << (8 - state + 6 * (state - 2))))) {
flag_overlong = 1;
}
PRINTF ("%sultibyte, (%u bytes), codepoint 0x%x, (decimal %u) ending at string pos %u\n", (flag_overlong==1 ? "Overlong m" : "M"),
state, codepoint_build, codepoint_build, i);
if (flag_overlong == 1) {
if ((params->ignore & 0x02) == 0) {
params->error_len=state;
STORE_ERRPOS
return (2);
}
params->error_flags |= 0x02;
}
Next we set a flag if the token was an overlong encoding, and deal with that case in the same way that we did in the regular decoder function.
This error can be ignored, in which case we continue to process the token.
if (codepoint_build > 0x10FFFF) {
PRINTF ("Codepoints > 0x10FFFF are invalid.\n");
if ((params->ignore & 0x08) == 0) {
params->error_len=state;
STORE_ERRPOS
return (8);
}
params->error_flags |= 0x08;
}
if (codepoint_build >= 0xD800 && codepoint_build <= 0xDFFF) {
PRINTF ("Codepoint encodes a UTF-16 surrogate, and is therefore invalid in UTF-8.\n");
if ((params->ignore & 0x04) == 0) {
params->error_len=state;
STORE_ERRPOS
return (4);
}
params->error_flags |= 0x04;
}
Likewise, checks for high codepoints and those that encode the UTF-16 surrogates use the same code as before.
/*
* Write the reconstructed codepoint to the output buffer.
*/
*(params->buffer_decoded+params->buffer_decoded_len++)=codepoint_build;
state=0;
}
Unless we errored out due to one of the above situations, we write the codepoint to the output buffer.
In this case, flag_byte_done has already been set, so none of the rest of the code below within the loop will run and the function will continue with the next input byte.
This closes the state > 0 conditional. The following code deals with bytes received when not already decoding a multi-byte token, or when one of errors 512 or 2048 above causes the processing of a multi-byte token to stop and the current byte to be re-interpreted on its own merits.
if (state == 0 && flag_byte_done==0) {
If we're not dealing with an existing multi-byte sequence, and we haven't already processed this byte...
if (leading_ones == 1) {
codepoint_build = (CUR_BYTE & 0x3F);
state=1;
flag_byte_done=1;
}
A continuation byte at this point would introduce a new multi-byte token, so we store its lower six bits in codepoint_build, set state, and set flag_byte_done.
if (leading_ones > 1) {
PRINTF ("Sequence introducer encountered without continuation bytes.\n");
if ((params->ignore & 0x800) == 0) {
params->error_len=1;
STORE_ERRPOS
return (2048);
}
params->error_flags |= 0x800;
flag_byte_done=1;
}
A lone sequence introducer is meaningless and therefore an error.
If ignored, we just completely skip the byte and write nothing to the output buffer for it.
if (leading_ones == 0) {
PRINTF ("Regular ASCII character at string pos %u\n", i);
flag_byte_done=1;
*(params->buffer_decoded+params->buffer_decoded_len++)=CUR_BYTE;
}
If we've reached here and have an ASCII byte, then it's in a valid location and we just write it to the output buffer.
Close the second main conditional for (state == 0 && flag_byte_done==0).
if (state != 0) {
PRINTF ("End of string within multibyte character.\n");
if ((params->ignore & 0x40) == 0) {
params->error_len=state;
i--;
STORE_ERRPOS
return (64);
}
params->error_flags |= 0x40;
}
PRINTF ("%salid UTF-8 string of %u bytes, decoded to %u codepoints\n", (params->error_flags == 0 ? "V" : "Parsable but strictly inv"),
params->buffer_encoded_len, params->buffer_decoded_len);
Finally we check for the input ending within a multi-byte sequence.
This code is almost identical to the same code in the regular decoding function, except that the value of error_len is just set to state instead of being calculated as (multibyte_size - state).