EXOTIC SILICON
“Hardcoding passwords, and a mirrored RAID that changes on its own”
Jay shows us some interesting aspects of the softraid code
Reckless Guide
Part 5
This week, we're continuing our journey through the OpenBSD kernel source, but we're leaving behind the console and framebuffer for something rather more abstract as Jay delves into the softraid code in search of things to modify.
This article is part of a series - check out the index.
Hacking the softraid code
One wrong move and his entire disk could be toast!
Have you got the courage to read along with us?
Password-less and keydisk-less full disk encryption
Let's start by adding a simple but useful feature to the softraid code, whilst we learn how it works.
Full disk encryption is fairly easy to set up during the installation of OpenBSD, and I'm going to assume that readers are already familiar with the procedure for doing so by creating a softraid volume using the bioctl command.
After all, this is an article about modifying the softraid code, so you really need to have at least used it to understand what we'll be doing here.
Note that this guide was originally written with reference to OpenBSD 7.0, which did not include guided set-up of full disk encryption in the installer.
Additionally, the issue mentioned below, of boot failures when using a single disk to store both a keydisk partition and the encrypted volume, was also later fixed.
Now, there are two options for providing the encryption key. Either we can use a passphrase, which is then used by a key derivation routine to produce the actual encryption key, or alternatively we can use a keydisk. A keydisk is simply a RAID partition that stores a single encryption key, rather than a volume of encrypted storage.
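To make the key derivation step concrete, here is a minimal userland sketch using pkcs5_pbkdf2() from OpenBSD's libutil. It's purely illustrative: the salt and round count are invented for the demo, the key goes to stdout rather than into any softraid structure, and current softraid volumes typically use the bcrypt_pbkdf KDF rather than PBKDF2.
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <util.h>	/* pkcs5_pbkdf2(); link with -lutil */

int
main(void)
{
	const char *pass = "foobar";
	uint8_t salt[16] = { 0 };	/* softraid stores a real random salt in its metadata */
	uint8_t key[32];
	int i;

	/* 8192 rounds is an arbitrary value chosen for this demo */
	if (pkcs5_pbkdf2(pass, strlen(pass), salt, sizeof(salt),
	    key, sizeof(key), 8192) != 0)
		return (1);
	for (i = 0; i < (int)sizeof(key); i++)
		printf("%02x", key[i]);
	printf("\n");
	return (0);
}
Deriving an encryption key from a passphrase - an illustrative sketch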
Sometimes, however, we might want to set up an encrypted volume even though there is no particular need for security. The most obvious case would be testing the softraid code itself, but you might also own a laptop which is rarely taken outside of the home or office. In this situation, it's inconvenient to have to type a passphrase on every boot, but it would be nice to be able to enable the passphrase prompt whenever you travelled with the machine.
You might think that you could simply set up the encrypted volume using a keydisk, then make a second small RAID partition on the same disk as the main storage volume and copy the keydisk data to it. That way, the machine should be able to find the key and boot automatically, and when you wanted extra security, you could just erase the on-board copy of the keydisk and travel with a copy on an external USB flash drive. Unfortunately, last time I tried this, although the installation completed without problems, the machine wouldn't boot as it was unable to find the key when it was stored on the same disk as the encrypted volume. Besides, a keydisk and a passphrase address different security requirements, so it would be nice to keep passphrase entry as an option.
The solution was to set up the softraid crypto volume using a passphrase, and then later hard-code that passphrase into the bootloader. On boot, the machine would effectively type its own passphrase, and everything would work as expected. The normal behaviour of prompting for the passphrase to be entered on the keyboard could easily be restored before taking the laptop out of the office for use on the move.
Hard-coding a passphrase
As you might expect, the code to prompt for a passphrase and read it from the keyboard is fairly simple. It's located in /usr/src/sys/lib/libsa/softraid.c, in the function sr_crypto_passphrase_decrypt().
If we didn't know where to find it, a search across the entire kernel source tree for the ‘Passphrase’ prompt would quickly enlighten us:
# grep -r Passphrase: /usr/src/sys/
Finding the source code we're looking for
Reading the code, we can see that it's basically two nested loops. The outer loop prompts for, accepts, and checks a passphrase entered on the keyboard, jumping to the label ‘done:’ if it's correct. The inner loop near the top of the function simply loops round adding the entered characters to a buffer, adding a null terminator when the user hits enter. At this point, variable i will contain the length of the passphrase, and the character array passphrase[] will contain the passphrase itself.
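As a rough illustration of that inner loop, the following standalone program mimics it using stdio. This is a sketch of the shape only: the real function uses the libsa console primitives rather than stdio, controls echoing itself, and handles keys such as backspace.
#include <stdio.h>

int
main(void)
{
	char passphrase[64];
	int c, i = 0;

	printf("Passphrase: ");
	fflush(stdout);
	while ((c = getchar()) != EOF && c != '\n') {
		if (i < (int)sizeof(passphrase) - 1)
			passphrase[i++] = c;	/* append the character to the buffer */
	}
	passphrase[i] = 0;	/* null terminate when the user hits enter */
	printf("length %d: %s\n", i, passphrase);
	return (0);
}
A userland approximation of the passphrase input loop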
In principle, all we need to do to hard-code our passphrase is to comment out the inner loop, store each character and a trailing 0x00 byte in the array elements, set variable i to the length of the password, and assign a value to c in order to silence a compiler warning about an unused variable.
c=0;	/* silence the compiler warning about the now unused variable */
i=6;	/* the length of the hard-coded passphrase */
passphrase[0]='f';
passphrase[1]='o';
passphrase[2]='o';
passphrase[3]='b';
passphrase[4]='a';
passphrase[5]='r';
passphrase[6]=0;	/* trailing null terminator */
Hard-coding a passphrase in the softraid code
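An equivalent sketch uses a string constant and memcpy(), which is available in libsa. Here ‘passphrase’, ‘i’ and ‘c’ are the existing locals in sr_crypto_passphrase_decrypt(), and ‘hardcoded’ is a name invented for this example:
static const char hardcoded[] = "foobar";

memcpy(passphrase, hardcoded, sizeof(hardcoded));	/* copies the trailing 0x00 too */
i = sizeof(hardcoded) - 1;	/* length, excluding the null terminator */
c = 0;				/* silence the unused variable warning */
An alternative way of hard-coding the passphrase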
Handy hint!
Making changes easier to revert
We could have completely removed the declaration for variable ‘c’ in the code referenced above rather than setting it to a dummy value of zero to silence the compiler warning. However, since we are only commenting out the code that uses it, it's good practice to leave the definition in place.
The reason for this is that in more complex code, a bug could easily be introduced when restoring such a variable definition if the wrong type is used, such as substituting a signed type for an unsigned one.
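To see why, consider this contrived standalone example, (the variable names are invented), where the same bounds check behaves differently after the signedness of the declaration changes:
#include <stdio.h>

int
main(void)
{
	int len_signed = -1;		/* e.g. an error return value */
	unsigned int len_unsigned = -1;	/* wrongly re-declared: -1 wraps to UINT_MAX */

	if (len_signed < 512)
		printf("signed: -1 < 512, so the bounds check catches the error\n");
	if (len_unsigned < 512)
		printf("never printed: the error value silently passes the check\n");
	return (0);
}
How a change of signedness can silently break a bounds check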
Installing the modified code
If you've rushed ahead, compiled the new code, and rebooted, you'll probably have been disappointed to find that you were still prompted to enter a passphrase.
To write the new code to the boot blocks, where it will be loaded from during boot, we first need to perform a few additional steps.
First, we run ‘make’ and ‘make install’ in /usr/src/sys/arch/amd64/stand/boot, or the corresponding directory for the architecture if it isn't amd64:
# cd /usr/src/sys/arch/amd64/stand/boot
# make
# make install
Compile the bootcode, and install it... Somewhere...
Important note!
This does NOT write the bootcode to the boot sectors on the boot disk.
Running ‘make install’ in the boot directory writes a copy of the second stage bootloader to /usr/mdec/boot. The /usr/mdec/ directory is traditionally a location that contains copies of binary code used for bootstrapping. However, these are not used directly during the boot process, as bootloaders often lack the ability to parse a standard filesystem due to code size limitations. Instead, the files are usually written to specific sectors on the boot media, or in the case of network booting with pxeboot, served over the network using the TFTP protocol.
So we now have our modified version of softraid.c compiled into /usr/mdec/boot, but the boot blocks on our disk haven't been updated.
To do this, we run installboot, and specify the softraid device that contains the root filesystem. This might sound counter-intuitive, as at least the initial bootstrap code will obviously have to be read directly from the un-encrypted physical device, since no code to access encrypted softraid volumes has yet been loaded. However, installboot is clever enough to figure this out itself, and will write the boot blocks to the correct disk:
# installboot -v sd3
Using / as root
installing bootstrap on /dev/rsd3c
using first-stage /usr/mdec/biosboot, second-stage /usr/mdec/boot
sd3: softraid volume with 1 disk(s)
sd3: installing boot loader on softraid volume
/usr/mdec/boot is 6 blocks x 16384 bytes
sd0d: installing boot blocks on /dev/rsd0c, part offset 144
master boot record (MBR) at sector 0
partition 3: type 0xA6 offset 64 size 937697921
/usr/mdec/biosboot will be written at sector 64
Copy the bootcode from /usr/mdec/ to the actual boot blocks
You might be wondering where exactly the contents of /usr/mdec/boot are being written. When installing to a normal, un-encrypted volume, the file is written to /boot on the root filesystem. However, if the un-encrypted physical disk, (sd0 in this example), only has a RAID partition in its disklabel and no FFS filesystem partitions, then clearly it must be written elsewhere. The encrypted root partition, (sd3a in this example), doesn't contain a /boot file, so where is it?
The output from installboot above tells us that it has been written at partition offset 144, which is block 144 of the physical disk, not block 144 of the RAID partition. The RAID partition, in this case, starts at offset 64, so /usr/mdec/boot has been copied to a point 80 blocks into the RAID partition.
We can see that this is really the case by comparing a hex dump of those disk sectors, (144 blocks × 512 bytes per block gives the byte offset of 73728 passed to the -s option), with a hex dump of /usr/mdec/boot:
# hexdump -C -s 73728 /dev/sd0c | head -20
# hexdump -C /usr/mdec/boot | head -20
Compare the contents of the boot blocks with the contents of /usr/mdec/boot
These blocks form a reserved area of the RAID partition on the physical disk, before the start of the space where the actual data of the softraid volume is stored. The code that performs these operations is in /usr/src/usr.sbin/installboot/i386_softraid.c, and reading it we can see an interesting design decision. As well as writing the raw /usr/mdec/boot image to the reserved sectors, the code also manually creates a miniature FFS1 filesystem. This means that the code in the partition boot record, which normally loads the /usr/mdec/boot image from a regular filesystem on an un-encrypted disk, doesn't need to be changed to read this copy in the reserved sectors of the RAID partition.
The code in the function sr_install_bootldr() writes several values to a dinode structure, which some readers might not be familiar with. This is defined in /usr/src/sys/ufs/ufs/dinode.h. At the end of the function we see various calls to sym_set_value(); these set the values which will eventually be hard-coded into the PBR, so that it doesn't need to actually parse the filesystem itself. This patching is done to an image of the PBR in memory by the function pbr_set_symbols() in /usr/src/usr.sbin/installboot/i386_installboot.c. The assembler source for the PBR code is in /usr/src/sys/arch/amd64/stand/biosboot/biosboot.S.
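As a rough sketch of the idea, the following standalone program fills in just enough of a ufs1_dinode, (the size and the direct block pointers), for a reader of the structure to locate a six-block boot image. The disk addresses and other values here are placeholders invented for the demo; the real sr_install_bootldr() sets more fields and computes real locations.
#include <sys/types.h>
#include <ufs/ufs/dinode.h>	/* struct ufs1_dinode */
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct ufs1_dinode din;
	int i, nblocks = 6;	/* /usr/mdec/boot was 6 blocks x 16384 bytes above */

	memset(&din, 0, sizeof(din));
	din.di_mode = 0100644;			/* a regular file */
	din.di_nlink = 1;
	din.di_size = nblocks * 16384;		/* size of the boot image */
	for (i = 0; i < nblocks; i++)
		din.di_db[i] = 80 + i * 32;	/* placeholder addresses, 32 sectors per block */
	printf("fake inode: %llu bytes in %d direct blocks\n",
	    (unsigned long long)din.di_size, nblocks);
	return (0);
}
Sketching the miniature filesystem trick with a hand-filled dinode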
In any case, now that we have correctly installed the modified boot code, subsequent reboots will no longer ask us for our passphrase.
Convenient switching between two different versions of /boot
We can keep several versions of the boot code in /usr/mdec/ and re-run installboot whenever we want to switch between them.
Assuming that we remembered to back up the original source file to softraid.c.dist, we could do something like this:
# cd /usr/mdec
# mv boot boot.custom
# cd /usr/src/sys/lib/libsa
# mv softraid.c softraid.c.custom
# mv softraid.c.dist softraid.c
# touch softraid.c
Update the timestamp, as the restored file will be older than our modified version, and make wouldn't otherwise re-compile it.
# cd /usr/src/sys/arch/amd64/stand/boot
# make
# make install
Switch between several custom versions of the boot code
Now we have the original /usr/mdec/boot, and our custom /usr/mdec/boot.custom which has the passphrase hard-coded.
To restore the normal behaviour of prompting for the passphrase on every boot, we just need to run installboot again:
# installboot -v sd3
Restore the original boot code, after following the steps above
Or to have our boot-time passphrase entered automatically, we can install our custom bootcode again:
# installboot -v sd3 /usr/mdec/biosboot /usr/mdec/boot.custom
Re-install our custom boot code
Important!
Security concerns:
This example of hard-coding a passphrase in the bootloader is really intended as an exercise in learning about the softraid crypto code. Whilst it does have a few practical use-cases, the implications of storing your passphrase un-encrypted on the disk should be obvious.
Anybody with physical access to the machine can read data from the encrypted volume, and also recover the original plaintext password, (which you might have used elsewhere).
It's also worth pointing out that when overwriting the boot code with the original version it should not be assumed that the previous version of the boot code has actually been overwritten on the media. In the case of a magnetic disk, recovery would likely be impossible without custom firmware, and difficult even with it. In the case of an SSD however, we suspect that the bare flash memory ICs could practically be removed and read directly, with equipment typically available in a large university laboratory. Recovery of the plaintext passphrase from an SSD in this scenario seems plausible.
Tweaking RAID 1 mirror sets - preamble
The softraid code in its entirety is fairly complex, but we can quickly get a feel for how it works by looking at one of the simpler RAID disciplines, RAID 1 mirroring.
Obviously any significant local changes to the code should be thoroughly tested before being used on a production machine, to minimise the risk of data corruption, so we'll stick to a broad overview here and leave interested readers to explore deeper themselves.
The code that implements the actual RAID disciplines is contained in a set of files in /usr/src/sys/dev/, one file for each discipline. They all follow a similar format, and during initialisation they set function pointers in an sr_discipline structure to point to their own discipline-specific functions for volume creation, read/write, and so on.
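The dispatch pattern is easy to model in isolation. Here is a trivial standalone illustration of discipline-specific functions being called through function pointers; the structure and names below are invented for the demo, and the real sr_discipline structure has many more members.
#include <stdio.h>

struct discipline {
	const char *name;
	int (*rw)(const char *op);	/* discipline-specific read/write handler */
};

static int
raid1_rw(const char *op)
{
	printf("RAID 1 discipline handling a %s\n", op);
	return (0);
}

int
main(void)
{
	struct discipline d = { "RAID 1", raid1_rw };

	d.rw("read");	/* dispatched through the function pointer */
	d.rw("write");
	return (0);
}
Modelling the function pointer dispatch used by the RAID disciplines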
In the case of RAID 1, any writes to the RAID volume need to be written to all of the functioning disks, (except any hot spares). For reads, however, we actually have a choice. We could read from any one of the disks in the mirror, perhaps optimising for performance, or we could read the same data from multiple disks and check that each copy of the data read back matches. If we have three working disks in the mirror rather than just two, we could even potentially correct bad data from one disk using an averaging algorithm.
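For the three-disk case, a bitwise two-out-of-three majority vote is one simple way to implement such correction. The following toy example, which is not softraid code, demonstrates the technique on a single byte:
#include <stdio.h>

/* A bit is set in the result if it is set in at least two of the inputs. */
static unsigned char
vote(unsigned char a, unsigned char b, unsigned char c)
{
	return ((a & b) | (a & c) | (b & c));
}

int
main(void)
{
	unsigned char good = 0x42, corrupted = 0x00;

	/* the two good copies out-vote the corrupted one, printing 0x42 */
	printf("voted byte: 0x%02x\n", vote(good, corrupted, good));
	return (0);
}
Correcting a corrupted byte with a two-out-of-three majority vote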
Currently, the code in softraid_raid1.c reads from each disk in a round-robin fashion, with each successive read being done from the next disk.
This highlights one of the potential limitations of a RAID 1 mirror, in that if any one of the individual disks ever returns bad data without detecting it, then that bad data will happily be passed through the softraid layer and eventually to userland, even though a good copy of the same data may exist on another disk. If the bad data is then written back to the array, it will be written to all of the disks in the mirror, and silent data corruption will result.
An urban myth about mirror sets
Urban myth alert!
RAID 1 mirrors don't usually cross-check data on each disk!
Unfortunately, many users wrongly believe that RAID 1 mirrors habitually read from all of the disks and verify the integrity of the data by comparing all of the copies. Whilst that certainly can be done with some setups, it's not behaviour that can be assumed. In any case, doing so would also likely result in worse read performance than a single non-RAID'ed disk.
It can easily be demonstrated that RAID 1 mirrors as implemented by softraid on OpenBSD don't behave in this way, by creating a small RAID 1 mirror and deliberately writing different data to each of the component disks.
For demonstration purposes, this can easily and conveniently be done using virtual vnode disks to avoid the need for several physical disks.
First, we'll mount a ramdisk on /mirror, create two empty files of 256 MB each to hold the images of each vnode disk, then create the vnode disks themselves:
# mkdir /mirror
# mount_mfs -s 1g swap /mirror
# dd if=/dev/zero of=/mirror/0 bs=1m count=256
# dd if=/dev/zero of=/mirror/1 bs=1m count=256
# vnconfig vnd0 /mirror/0
# vnconfig vnd1 /mirror/1
Creating two vnode disks on a ramdisk
With our two new virtual devices created - vnd0 and vnd1 - we can now create RAID partitions on them. Then we proceed to create the actual RAID array, noting down the number of the sd device that it attaches as.
# echo "d: 524288 0 RAID" | disklabel -R vnd0 /dev/stdin
# echo "d: 524288 0 RAID" | disklabel -R vnd1 /dev/stdin
# bioctl -c 1 -l vnd0d,vnd1d softraid0
softraid0: RAID 1 volume attached as sdX
Creating the RAID partitions and the softraid volume
Next, we write some bytes to the softraid volume, taking care to substitute the correct sd device as noted above.
# echo "FOOBAR" > /dev/sdX
Writing data to the softraid volume
Now we can search for the bytes that we just wrote to the softraid device, on one of the raw vnd partitions.
# dd if=/dev/vnd0d 2> /dev/null | hexdump -C | grep FOOBAR
00042000 46 4f 4f 42 41 52 0a 00 00 00 00 00 00 00 00 00 |FOOBAR..........|
Finding the bytes on one of the raw vnd partitions
The next step is to convert the hex byte offset 0x42000 into a decimal block number.
The hex value 0x42000 is 270336 in decimal, and since each block contains 512 bytes, we divide by 512 to get 528.
This calculation is easily done using the desk calculator program /usr/bin/dc.
# echo 16i42000 200/p | dc
528
Converting the hex byte offset into a decimal block number using dc
With this information in hand, we can now observe the same bytes on each of the vnd devices:
# dd if=/dev/vnd0c skip=528 count=1 | hexdump -C
00000000 46 4f 4f 42 41 52 0a 00 00 00 00 00 00 00 00 00 |FOOBAR..........|
# dd if=/dev/vnd1c skip=528 count=1 | hexdump -C
00000000 46 4f 4f 42 41 52 0a 00 00 00 00 00 00 00 00 00 |FOOBAR..........|
Reading the data from the raw vnd devices that make up the RAID mirror
Now, we write different data to each of the raw vnd devices:
echo "Disk zero" | dd of=/dev/vnd0c seek=528
echo "Disk one" | dd of=/dev/vnd1c seek=528
Writing different data to each raw vnd partition
Finally, we can read the RAID mirror in the normal way, and observe the read data changing:
# dd if=/dev/sdXc count=1 bs=16 | hexdump -C
1+0 records in
1+0 records out
00000000 44 69 73 6b 20 7a 65 72 6f 0a 00 00 00 00 00 00 |Disk zero.......|
00000010
# dd if=/dev/sdXc count=1 bs=16 | hexdump -C
1+0 records in
1+0 records out
00000000 44 69 73 6b 20 6f 6e 65 0a 00 00 00 00 00 00 00 |Disk one........|
00000010
# dd if=/dev/sdXc count=1 bs=16 | hexdump -C
1+0 records in
1+0 records out
00000000 44 69 73 6b 20 7a 65 72 6f 0a 00 00 00 00 00 00 |Disk zero.......|
00000010
# dd if=/dev/sdXc count=1 bs=16 | hexdump -C
1+0 records in
1+0 records out
00000000 44 69 73 6b 20 6f 6e 65 0a 00 00 00 00 00 00 00 |Disk one........|
00000010
Reading back variable data from the RAID device
With this example, we can clearly see that the data is being read from the RAID partitions in a round-robin fashion.
In the next section, we'll take a closer look at the code that does this.
Fun fact!
Manual page discrepancy
The manual page for bioctl clearly states that the -s option, to read the passphrase for a softraid crypto volume from /dev/stdin, cannot be used when creating a new volume. However, it can be used, as the trivial example below, which follows on from the RAID 1 examples above, clearly shows:
# dd if=/dev/zero of=/mirror/2 bs=1m count=256
# vnconfig vnd2 /mirror/2
# echo "d: 524288 0 RAID" | disklabel -R vnd2 /dev/stdin
# echo "foobar␍foobar" | bioctl -s -c C -l vnd2d softraid0
softraid0: CRYPTO volume attached as sdY
Back to tweaking RAID 1 mirror sets
As we mentioned above, the code in softraid_raid1.c reads from each disk sequentially. This can be seen by looking at the function sr_raid1_rw(). The structure of this section of code is really quite awful for reasons that I'll explain shortly, but let's try to understand it anyway. Basically, any read or write operation revolves around a structure called a ‘workunit’, which holds just about everything needed to perform that operation. Buried deep within this workunit structure is a flag that indicates whether this particular operation is indeed a read or a write. This is what is being tested for when we see
xs->flags & SCSI_DATA_IN
in the source code.
Next we come to a bizarre part of the code, in which any read operation is performed within a for loop that is hard-coded to iterate exactly once. The code would be much clearer and more logical if the read and write operations were handled entirely separately using a large ‘if’ or ‘switch’ statement testing
xs->flags & SCSI_DATA_IN.
To make things even less intuitive, even though the outer for loop only iterates once, the code creates a loop using ‘goto ragain’ to read from the next disk in the array if the disk just attempted is offline, rebuilding, or marked as a hot spare.
To be fair, this looks to me like C code written in the style of assembly, and once you've read through it a few times it becomes fairly obvious what it's doing. However this style of coding does tend to make testing any local changes more time-consuming, because you've got to be especially careful that you haven't broken a subtle but important behaviour of the code that wasn't immediately obvious.
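To make that control flow concrete, here is a standalone model of the read path's retry loop using the same ‘goto ragain’ shape. The status values, variable names and retry limit are invented for the demo; the real code consults each chunk's metadata status and counts online chunks.
#include <stdio.h>

#define ONLINE	0
#define OFFLINE	1

int
main(void)
{
	int status[2] = { OFFLINE, ONLINE };	/* pretend that chunk 0 is down */
	int counter = 0, nchunks = 2, rt = 0, chunk;

ragain:
	chunk = counter++ % nchunks;	/* round-robin selection */
	if (status[chunk] != ONLINE) {
		if (rt++ < nchunks)	/* retry with the next chunk */
			goto ragain;
		printf("no readable chunk\n");
		return (1);
	}
	printf("read serviced by chunk %d\n", chunk);
	return (0);
}
A simplified model of the read retry loop in sr_raid1_rw()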
Although we want to tweak the read code, the write case is actually simpler to look at first. The variable ‘ios’ is set to the number of chunks in the array, so in the common case of a mirror set over two disks, this would be set to 2. The for loop then iterates over the values 0 and 1, running the code in the else part of the outer if statement. Error conditions send the code off to the label ‘bad’ using a goto statement, only to exit immediately from the function with a return value of 1. If there was some kind of clean-up to do before exiting, structuring the code like this might make sense. However, there obviously isn't, so it would be clearer to simply replace the gotos with ‘return (1);’, perhaps with a comment explaining the scenario that caused the error.
Handy hint!
Understanding the logic in sr_raid1_rw()
If you're struggling to understand the logic of the code for the sr_raid1_rw() function in softraid_raid1.c, the following code demonstrates the case for writes in a simplified form:
#include <stdio.h>
#include <stdlib.h>

int main()
{
	int i;

	for (i = 0; i < 4; i++) {
		if (0 == 1) {
			printf("Not reached\n");
		} else {
			printf("Switch of loop iteration: %d\n", i);
			switch (arc4random_uniform(6)) {
			case 0:
			case 1:
			case 3:
			case 4:
				break;		/* jumps to 'after switch' */
			case 2:
				continue;	/* jumps back to the for statement */
			default:
				goto foobar;
			}
			printf("After switch, still within if/else block\n");
		}
		printf("Outside of if block, end of loop iteration %d\n", i);
	}
	printf("Normal program termination\n");
	return (0);

foobar:
	printf("Abnormal program termination\n");
	return (1);
}
If you're tempted to make any changes to the sr_raid1_rw() function, you might want to experiment with making equivalent changes to the code above first to see if it produces the output you're expecting.
For now, let's turn our attention back to the actual sr_raid1_rw() code. A successful write results in a break from the switch statement, and the code below it is executed for that iteration of the for loop, thereby queuing a write to that particular disk. If the disk is offline or a hot spare, the code jumps immediately to the next iteration of the loop without executing the code below, due to the continue statement.
Programming exercise
Suggested programming exercise:
The combination of using a continue within a switch statement which is itself within an if statement, a loop constructed with if and goto, and the main if statement controlling whether we are reading or writing being within the for loop instead of outside of it, makes the flow of the whole sr_raid1_rw() function somewhat non-intuitive.
Re-write and clean up this function to make its logic clearer, whilst maintaining the same functionality.
In the case of a read, the value of ‘ios’ is always set to one, so the for loop always iterates exactly once. Every time this function is called to perform a read, a counter variable is incremented and the value of ‘chunk’ is set to this value modulo the number of disks in the array.
This obviously implements the round-robin reading behaviour that we saw above.
By changing the formula used to derive the value of ‘chunk’, we can favour some disks over others, or exclude some disks from read requests altogether. This might be useful if we have an array which pairs an SSD with a magnetic disk. If the SSD is known to be disk zero, and the HD is disk one, then we could just change the code to read from sv_chunks[rt] instead of sv_chunks[chunk], (rt is the retry counter, which starts at zero for every request), and it would prefer reading from the SSD, only reading from the HD if the SSD was off-line, rebuilding, or marked as a hot spare.
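A standalone model of that tweak, again with invented names, shows the effect: reads always start at chunk zero, and the retry logic provides the fall-back to chunk one.
#include <stdio.h>

#define ONLINE	0
#define OFFLINE	1

/* Pick the lowest numbered online chunk, modelling indexing by ‘rt’
 * combined with the existing retry loop. */
static int
pick_chunk(const int *status, int nchunks)
{
	int rt;

	for (rt = 0; rt < nchunks; rt++)
		if (status[rt] == ONLINE)
			return (rt);
	return (-1);	/* no usable chunk */
}

int
main(void)
{
	int both_up[2] = { ONLINE, ONLINE };
	int ssd_down[2] = { OFFLINE, ONLINE };

	printf("both online: read from chunk %d\n", pick_chunk(both_up, 2));
	printf("SSD offline: read from chunk %d\n", pick_chunk(ssd_down, 2));
	return (0);
}
Preferring the SSD: reads fall back to the second disk only when needed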
Programming exercise
Suggested programming exercise:
Implement reading the same data blocks from all of the chunks in a RAID 1 array, comparing them, and reporting an error if they are not identical. If there are three disks in the array, errors on a single disk can be corrected.
Remember that sectors which have not yet been written can be expected to return different data from each disk. Think of ways to cope with this.
And on that note, we'll wrap up this week's installment. Don't forget to back up your data if you're going to experiment further in private study!
Summary
This week, we've looked at some of the softraid code in the OpenBSD kernel. We started by modifying the boot code to enter a passphrase automatically, then looked in more detail at how the bootloader is installed when booting from a softraid crypto volume. Next, we demonstrated that the RAID 1 code reads from each disk in the mirror set in sequence when reading from the RAID volume. Finally, we examined some of the code that implements this logic, and saw how we could modify it to read from a specific member of the mirror.
Next week we'll take a break from kernel code to discuss disk partitioning, including some little known nuances of BSD disklabels.
IN NEXT WEEK'S INSTALLMENT, WE'LL BE DISCOVERING INTERESTING DETAILS AND HIDDEN COMPLEXITY WITHIN DISKLABELS. SEE YOU THERE!