EXOTIC SILICON
“Jumping down the /dev/null rathole to see what's inside”
Adding new memory special devices to the kernel
We recently had a need to implement a new pseudo device that would generate an endless stream of data consisting of one particular byte repeated indefinitely.
0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69 0x69# echo i > /dev/fill
It is conceptually similar to the /dev/zero device, except that this new device repeats back the byte most recently written to it instead of always outputting 0x00.
Let's take a look at how the memory-based device special files are implemented on OpenBSD, add our new device to them, discover and remove some un-needed legacy code, throw in a version of /dev/full which OpenBSD is currently lacking, and finally for good measure, look at some of the differences between OpenBSD and NetBSD in this area.
# mknod -m 0644 /dev/fill c 2 3
Motivating requirement
The main reason for implementing this new device was to provide a source of non-null data to fill raw disk partitions for testing purposes. Whilst we could use /dev/random, this is limited by the speed of random number generation which, although fast, is not as fast as reading from /dev/zero.
New device specifications - meet /dev/fill
# cat /dev/fill
cat: /dev/fill: No medium found
Initial state of the new device
Our new device is intentionally very simple, intended for use from the command line and in shell scripts.
Before we've set its initial byte value, reads from it should return ENOMEDIUM.
# echo A > /dev/fill
Priming /dev/fill
We can prime it with a simple redirected echo command from the shell.
Only the first character is used; any further data is discarded.
Now, reads from the new device return an endless string of A characters:
# dd if=/dev/fill bs=1m count=128 | hexdump -C
00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
*
128+0 records in
128+0 records out
134217728 bytes transferred in 0.347 secs (386577590 bytes/sec)
08000000
Ideally, we want the output to be produced as efficiently as possible. By way of comparison, generating equivalent data using /usr/bin/jot is more than an order of magnitude slower, at about 13.5 MB/s against about 387 MB/s:
# jot -b A -s '' 0 | dd bs=16k count=8192 | hexdump -C
00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
*
8192+0 records in
8192+0 records out
134217728 bytes transferred in 9.971 secs (13460574 bytes/sec)
08000000
Too slow!
Looking at /dev/null and /dev/zero
It might not be obvious to the uninitiated exactly where the null and zero pseudo devices are actually implemented in the source tree. In order to answer this question, it helps to know that these are part of a larger set of special devices known as ‘memory’ special files, that also includes /dev/mem, and /dev/kmem, as well as the deprecated aperture driver and the historic /dev/io.
On OpenBSD, the memory special files are implemented in architecture-specific code, (as they were in 4.4BSD-Lite). Other modern BSD systems now tend to implement them in a more general fashion, FreeBSD having changed way back in 2000, and NetBSD also now using a lot of machine independent code.
We'll look specifically at the amd64 platform here, although the parts of the code that we're interested in are actually almost completely identical between all of the currently supported architectures.
The specific file of interest is sys/arch/amd64/amd64/mem.c, and the function that actually does the work is mmrw.
All of the memory special device files are character devices with major number 2, (at least on OpenBSD and most modern BSD systems - non-BSD systems such as Linux will likely use a different major number).
The minor numbers are shown in this table:
Device    Minor number
mem       0
kmem      1
null      2
zero      12
Minor numbers for various memory special devices
Reading the code for mmrw, we can see that the minor device number is checked using a switch statement, and then the relevant I/O is performed.
This kind of I/O within the kernel relies heavily on the uiomove function, and the struct uio structure. Both of these are described in the manual page for uiomove.
In the case of null, we don't want to do anything except ensure that the I/O request completes. Essentially the function just returns 0 to indicate success, however we first check to see if this is a write operation, (as opposed to a read), and if so, set uio_resid to zero. This ensures that data is drained from the input buffer, and that the operation doesn't just hang. Read operations simply return no data.
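The effect of zeroing the residual count can be modelled in a few lines of userland C. This is only a sketch: struct sim_uio is a cut-down, hypothetical stand-in for the kernel's struct uio, and sim_null_rw and sim_transferred are names invented for the illustration rather than anything found in mem.c.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct uio. */
enum sim_rw { SIM_READ, SIM_WRITE };

struct sim_uio {
	enum sim_rw rw;		/* direction of the transfer */
	size_t resid;		/* bytes still to be transferred */
};

/* Model of the null case: succeeds in both directions. */
int
sim_null_rw(struct sim_uio *uio)
{
	if (uio->rw == SIM_WRITE)
		uio->resid = 0;	/* pretend every byte was consumed */
	/* On reads resid is left alone, so no data is reported. */
	return 0;
}

/* Number of bytes the caller would see as transferred. */
size_t
sim_transferred(size_t requested, const struct sim_uio *uio)
{
	return requested - uio->resid;
}
```

A write of 100 bytes reports all 100 as transferred, whereas a read of 100 bytes reports zero, which the caller interprets as end of file.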
The zero device is handled slightly differently for writes, but essentially does the same thing, skipping over and discarding any input data supplied.
Reads from the zero device are somewhat more interesting, as we will use a similar technique for our new 'fill' device. Basically a single page is allocated, and its address is stored in the global variable zeropage. This entire page is permanently filled with 0x00 bytes, ready for use whenever we need them, and all this happens exactly once - the first time that the zero device is read from. The zero page is never freed, so second and subsequent calls to this code can be serviced very quickly.
Adding our new device, /dev/fill
We'll use the same concept of a single page of memory allocated once and never freed to implement the fill device. The main difference is that single byte writes to fill will be copied to every byte of that page. Reads will work the same way as they do for the zero device.
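Before touching the kernel, the intended behaviour can be prototyped in userland. The following is a minimal sketch under stated assumptions: SIM_PAGE_SIZE is an assumed page size, sim_fillpage, sim_fill_write and sim_fill_read are hypothetical names, and a return value of -1 stands in for the ENOMEDIUM error that the real device will return.

```c
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096	/* assumed page size for this sketch */

static unsigned char *sim_fillpage;	/* NULL until the first write */

/* Prime the device: only data[0] is used. Returns 0, or -1 on failure. */
int
sim_fill_write(const unsigned char *data, size_t len)
{
	if (len == 0)
		return 0;
	if (sim_fillpage == NULL) {
		/* allocated once, then kept for later reads */
		sim_fillpage = malloc(SIM_PAGE_SIZE);
		if (sim_fillpage == NULL)
			return -1;
	}
	memset(sim_fillpage, data[0], SIM_PAGE_SIZE);
	return 0;
}

/* Copy up to one page per call; returns bytes copied, -1 if unprimed. */
long
sim_fill_read(unsigned char *dst, size_t len)
{
	size_t c;

	if (sim_fillpage == NULL)
		return -1;	/* stands in for ENOMEDIUM */
	c = len < SIM_PAGE_SIZE ? len : SIM_PAGE_SIZE;
	memcpy(dst, sim_fillpage, c);
	return (long)c;
}
```

As with the zero device, a read never returns more than one page at a time, so larger requests are satisfied by repeated passes over the same page.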
First we need to decide on an appropriate minor number for it. We'll use minor device number 3. This is an arbitrary choice and we could just as easily use any unused slot, although it's always worth checking that you're not about to re-purpose a major-minor pair that was recently used for an entirely different function and subsequently removed or one that is already in use on a different architecture.
To make the device file we use mknod in the normal way:
# mknod -m 0644 /dev/fill c 2 3
# ls -l /dev/fill
crw-r--r-- 1 root wheel 2, 3 Feb 21 14:59 /dev/fill
If we try to do anything with the newly created device before writing any code to support it, unsurprisingly, we get an error. In this case it's ENXIO.
This is returned from the function mmopen, which performs some initial validation that the supplied minor number is indeed valid, returning EPERM or ENXIO if not.
# cat /dev/fill
cat: /dev/fill: Device not configured
Accessing a device file that has no code in the kernel to support it.
So the first thing we need to do is to add a new case label to the switch statement in mmopen:
case 3:
At this point, if we re-compiled and booted into the new kernel, we'd still get ENXIO, but it would be coming from mmrw instead of mmopen.
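The kind of check that mmopen performs can be pictured with a small userland model. This is a sketch rather than the kernel source: sim_mmopen is a hypothetical name, only the minor numbers from the table above plus our new one are listed, and the EPERM path that guards mem and kmem is deliberately left out.

```c
#include <errno.h>

/*
 * Hypothetical userland model of the minor number check in mmopen.
 * Returns 0 if the minor number is accepted, ENXIO otherwise.
 */
int
sim_mmopen(int minor_no)
{
	switch (minor_no) {
	case 0:		/* mem */
	case 1:		/* kmem */
	case 2:		/* null */
	case 3:		/* fill, the newly added label */
	case 12:	/* zero */
		return 0;
	default:
		return ENXIO;
	}
}
```

Without the case 3 label, minor number 3 falls through to the default branch, which is where the 'Device not configured' message seen earlier comes from.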
Before we add the new code to mmrw, we need to declare a global variable analogous to zeropage:
caddr_t fillpage;
Now in mmrw we need a loop counter variable:
int i;
And finally the main bulk of the new code itself:
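/* minor device 3 is /dev/fill */
case 3:
	if (uio->uio_rw == UIO_WRITE) {
		/* allocate once; M_ZERO avoids exposing stale memory */
		if (fillpage == NULL)
			fillpage = malloc(PAGE_SIZE, M_TEMP,
			    M_WAITOK | M_ZERO);
		/* take the first byte written as the fill value */
		error = uiomove(fillpage, 1, uio);
		if (error)
			return (error);
		/* replicate it across the rest of the page */
		for (i = 1; i < PAGE_SIZE; i++)
			fillpage[i] = fillpage[0];
		/* discard any remaining input */
		uio->uio_resid = 0;
		return (0);
	}
	/* reads fail until the device has been primed */
	if (fillpage == NULL)
		return (ENOMEDIUM);
	c = ulmin(iov->iov_len, PAGE_SIZE);
	error = uiomove(fillpage, c, uio);
	continue;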
Note that on writes, we read a single byte from the buffer containing the written data, then promptly discard the rest by setting uio_resid=0.
If we didn't do this, then the routine would be called repeatedly, reading one byte at a time and using it to fill the buffer pointed to by fillpage. This would result in us using the last byte written as the fill pattern instead of the first.
This would be particularly undesirable. Consider the following command issued from the shell:
# echo foobar > /dev/fill
Rather than setting the fill pattern to the last visible ASCII character, 'r' (0x72), we would get the trailing newline (0x0a).
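The difference between the two behaviours is easy to demonstrate with a userland model of the write path. This is a sketch only: sim_fill_byte is a hypothetical function, and its discard_rest flag plays the role of setting uio_resid to zero after the first byte has been taken.

```c
#include <stddef.h>

/*
 * Model of a write of len bytes to the fill device.
 * With discard_rest set, the first byte is used and the remainder
 * dropped. Without it, one byte is consumed per pass until the
 * input runs out, so the final byte wins.
 * Returns the resulting fill byte, or -1 if nothing was written.
 */
int
sim_fill_byte(const char *data, size_t len, int discard_rest)
{
	size_t resid = len;
	int fill = -1;

	while (resid > 0) {
		fill = (unsigned char)data[len - resid];
		resid--;
		if (discard_rest)
			resid = 0;	/* like uio->uio_resid = 0 */
	}
	return fill;
}
```

Calling sim_fill_byte("foobar\n", 7, 1) yields 'f', whereas sim_fill_byte("foobar\n", 7, 0) yields the newline.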
This is all we need to make our new pseudo device functional. It can now be used as described two sections above.
Adding eject support to /dev/fill
Since we return ENOMEDIUM if fill is read from before being written to, some readers might be wondering if we could implement ejecting of an already primed fill device to return it to its initial state.
Well, we can. This is probably of little practical value beyond a programming exercise, but we'll take a look at it anyway.
Those of you who have read the article on using eject with USB flash drives will know that eject uses the MTIOCTOP ioctl with either the MTOFFL command, (in the case of an eject), or MTRETEN, (in the case of media re-loading using eject -t).
This is a magnetic tape ioctl, and not surprisingly the defines for it are not available in mem.c by default. To make them available, we need to add the following include to mem.c:
#include <sys/mtio.h>
Now we can add code to the mmioctl function to handle it, instead of just returning ENODEV:
if (minor(dev) == 3 && cmd == MTIOCTOP &&
    ((struct mtop *)data)->mt_op == MTOFFL) {
	free(fillpage, M_TEMP, PAGE_SIZE);
	fillpage = NULL;
	return (0);
}
Note that we don't need to concern ourselves with the possibility of freeing fillpage before it's been allocated, as the kernel free routine explicitly checks for NULL and returns immediately in this case.
This code only handles off-lining of the fill device, since reloading it as would be done with eject -t doesn't really make sense in that there is no logical action to assign to such a function.
Of course, whilst the code above works, it has something of a shortcoming in as much as it allows any user with read access to the device to eject it. This is almost certainly not a good idea, so we'll add a check for root capabilities and return EACCES if called as a regular user:
if (minor(dev) == 3 && cmd == MTIOCTOP &&
    ((struct mtop *)data)->mt_op == MTOFFL) {
	if (suser(p) != 0) {
		return (EACCES);
	}
	free(fillpage, M_TEMP, PAGE_SIZE);
	fillpage = NULL;
	return (0);
}
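For testing the ioctl handler without relying on eject itself, the same request can be issued from a small C program. The following is a sketch under assumptions: offline_device is a hypothetical helper, the device is opened read-only, and mt_count is set to 1 as an assumption, although our handler never looks at it.

```c
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mtio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Open path and issue MTIOCTOP with the MTOFFL operation.
 * Returns 0 on success, or -1 with errno describing the failure.
 */
int
offline_device(const char *path)
{
	struct mtop op;
	int fd, rv, saved;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -1;
	memset(&op, 0, sizeof(op));
	op.mt_op = MTOFFL;	/* take the 'medium' offline */
	op.mt_count = 1;	/* assumed; ignored by our handler */
	rv = ioctl(fd, MTIOCTOP, &op);
	saved = errno;		/* close() may overwrite errno */
	close(fd);
	errno = saved;
	return rv;
}
```

With the root check in place, calling offline_device("/dev/fill") as a regular user would be expected to fail with errno set to EACCES, and a subsequent read as any user after a successful call should once again return ENOMEDIUM.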
Vestigial references to /dev/io
Whilst poking around the code in sys/arch/amd64/amd64/, I noticed that the function iskmemdev in conf.c not only checks for minor numbers 0, and 1, but also for 14.
The character device with major number 2, and minor number 14, was /dev/io. This device was intended as a way to access the I/O bus of the i386 processor, (which is separate to the memory bus on this architecture).
Code to support a related concept of three separate devices with minor numbers 14, 15, and 16, for different sized I/O operations was actually present in 4.4BSD-Lite, but disabled by default. Although that code did make its way into very early versions of NetBSD, it never really saw the light of day before being removed.
This mysterious device didn't even gain a manual page there until after its functionality was reduced to merely enabling such I/O operations, and handed over to the i386_iopl system call instead.
Amusingly, the reference to this long obsolete device in the iskmemdev function in OpenBSD seems to have been copied to conf.c on other architectures where /dev/io was never implemented in the first place.
The full device
Readers who are familiar with Linux, NetBSD, or FreeBSD will likely know that these systems implement another special device in the shape of /dev/full. The full device is similar to the zero device on reads, but always returns ENOSPC on writes.
As of OpenBSD 7.2-release, such a full device has not yet been implemented on OpenBSD. Since we're here looking at the mem.c code, let's implement it!
Once again, we'll need to allocate a minor device number to our new device. We'll use 5.
# mknod -m 0666 /dev/full c 2 5
Creating the device file
As before, we also need to add a new case label to the switch statement in mmopen:
case 5:
In mmrw, we'll just add the same case 5: before the existing case 12: so that the /dev/zero code runs when accessing /dev/full as well.
The actual code is super trivial, just two lines immediately within the test for uio_rw == UIO_WRITE:
if (minor(dev)==5)
return (ENOSPC);
Testing the new code from the shell, it works as expected:
# dd if=/dev/zero of=/dev/full
dd: /dev/full: No space left on device
1+0 records in
0+0 records out
0 bytes transferred in 0.000 secs (0 bytes/sec)
Don't be fooled into thinking that something is wrong by testing this functionality using the shell redirection:
# echo foobar > /dev/full
#
No error message?
If you were expecting an error message, you've fallen into the trap.
In fact, the shell did receive ENOSPC, as we can tell by looking at the shell variable $?:
# echo foobar > /dev/full
# echo $?
1
A non-zero exit code indicates an error. Compare with a successful write:
# echo foobar > /tmp/foo
# echo $?
0
But if we fill a 'real' filesystem, we get a verbose error message:
# echo foobar > /tmp/foo
/tmp: write failed, file system is full
This is not a bug in our code!
It's simply due to the fact that the above error message is not generated by the shell at all, but rather within the ufs code, specifically in sys/ufs/ffs/ffs_alloc.c. This obviously doesn't run when dealing with a non-ufs filesystem.
“But! But! On a Linux system I do get an error writing to /dev/full using the shell redirection, so there must be something wrong with our code!”
Nope. Most likely on the Linux system you are using the bash shell, rather than the default ksh on OpenBSD. The bash shell does indeed produce its own error message on the standard error output in this situation. Using bash on OpenBSD, we would see both errors printed.
Of course, we could add a uprintf call to our code to produce similar output to that which the ufs code generates, but since such an outcome is fully expected when using /dev/full, doing so seems rather superfluous.
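The error that the shell receives silently can also be observed directly from C, which removes any doubt about what the kernel is returning. This is a minimal sketch, with write_probe being a hypothetical helper rather than part of any existing tool.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write a single byte to path and report what happened:
 * 0 if the write succeeded, otherwise the errno value set by
 * open() or write().
 */
int
write_probe(const char *path)
{
	int fd, result = 0;

	fd = open(path, O_WRONLY);
	if (fd == -1)
		return errno;
	if (write(fd, "x", 1) == -1)
		result = errno;	/* save before close() can change it */
	close(fd);
	return result;
}
```

On a kernel with the new code, write_probe("/dev/full") would be expected to return ENOSPC, whilst write_probe("/dev/null") returns 0.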
Differences with NetBSD
Whereas the code to implement the memory special devices is separate for each hardware architecture on OpenBSD, the corresponding code in NetBSD is in sys/dev/mm.c, with various defines in sys/sys/conf.h.
Although the layout of the code is somewhat different between the two systems, the way /dev/null and /dev/zero are actually implemented is effectively the same. This shouldn't come as any surprise, as the only significant change since 4.4BSD-Lite is the way that discarding any data written to /dev/null is optimised by setting the residual I/O value to zero, rather than actually reading it out.
Summary
Today we've seen how the memory-based device special files /dev/null and /dev/zero are implemented in the OpenBSD kernel, added a new device /dev/fill to them, and also implemented a version of /dev/full, as is commonly found on other modern BSD and Linux systems.