ExoticSilicon.com - adding /dev/fill and /dev/full to the OpenBSD kernel

Now, reads from the new device return an endless string of A characters:

# dd if=/dev/fill bs=1m count=128 | hexdump -C

00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|

128+0 records in

128+0 records out

134217728 bytes transferred in 0.347 secs (386577590 bytes/sec)

08000000

Ideally, we want the output to be produced as efficiently as possible. By way of comparison, generating equivalent data using /usr/bin/jot is an order of magnitude slower:

# jot -b A -s '' 0 | dd bs=16k count=8192 | hexdump -C

00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|

8192+0 records in

8192+0 records out

134217728 bytes transferred in 9.971 secs (13460574 bytes/sec)

08000000

Too slow!

It might not be obvious to the uninitiated exactly where the null and zero pseudo devices are actually implemented in the source tree. To answer this question, it helps to know that these are part of a larger set of special devices known as ‘memory’ special files, which also includes /dev/mem and /dev/kmem, as well as the deprecated aperture driver and the historic /dev/io.

On OpenBSD, the memory special files are implemented in architecture-specific code, (as they were in 4.4BSD-Lite). Other modern BSD systems now tend to implement them in a more general fashion, FreeBSD having changed way back in 2000, and NetBSD also now using a lot of machine independent code.

We'll look specifically at the amd64 platform here, although the parts of the code that we're interested in are actually almost completely identical between all of the currently supported architectures.

The specific file of interest is sys/arch/amd64/amd64/mem.c, and the function that actually does the work is mmrw.

All of the memory special device files are character devices with major number 2, (at least on OpenBSD and most modern BSD systems - non-BSD systems such as Linux will likely use a different major number).

The minor numbers are shown in this table:

Device	Minor number
mem	0
kmem	1
null	2
zero	12

Minor numbers for various memory special devices

Reading the code for mmrw, we can see that the minor device number is checked using a switch statement, and then the relevant I/O is performed.

This kind of I/O within the kernel relies heavily on the uiomove function, and the struct uio structure. Both of these are described in the manual page for uiomove.

In the case of null, we don't want to do anything except ensure that the I/O request completes. Essentially the function just returns 0 to indicate success, however we first check to see if this is a write operation, (as opposed to a read), and if so, set uio_resid to zero. This ensures that data is drained from the input buffer, and that the operation doesn't just hang. Read operations simply return no data.
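
As a rough user-space model of that logic, and not the kernel source itself, the null case can be sketched as follows. The names struct mini_uio and null_rw are made up for illustration; they stand in for struct uio and the relevant branch of mmrw.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct uio fields that matter here. */
struct mini_uio {
	size_t resid;		/* bytes still to be transferred */
	int is_write;		/* non-zero for a write request */
};

/* Model of the null device: writes are swallowed, reads return nothing. */
int
null_rw(struct mini_uio *uio)
{
	assert(uio != NULL);
	if (uio->is_write)
		uio->resid = 0;	/* pretend all of the data was consumed */
	return 0;		/* success in both directions */
}
```

After a write the residual count is zero, so the caller sees the whole buffer as written. After a read it is untouched, which the caller reports as zero bytes read, in other words end-of-file.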

The zero device is handled slightly differently for writes, but essentially does the same thing, skipping over and discarding any input data supplied.

Reads from the zero device are somewhat more interesting, as we will use a similar technique for our new 'fill' device. Basically a single page is allocated, and its address is stored in the global variable zeropage. This entire page is permanently filled with 0x00 bytes, ready for use whenever we need them, and all this happens exactly once - the first time that the zero device is read from. The zero page is never free'd, so second and subsequent calls to this code can be serviced very quickly.
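
The allocate-once idea can be sketched in user space like this. It's only a model under stated assumptions: MODEL_PAGE_SIZE and zero_read are invented names, calloc stands in for the kernel allocator, and memcpy stands in for uiomove.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumed page size for this sketch */

static unsigned char *zeropage;	/* allocated on first read, never freed */

/* Fill dst with len zero bytes, copying at most one page per pass. */
int
zero_read(unsigned char *dst, size_t len)
{
	assert(dst != NULL);
	if (zeropage == NULL) {
		zeropage = calloc(1, MODEL_PAGE_SIZE);
		if (zeropage == NULL)
			return -1;
	}
	while (len > 0) {
		size_t c = len < MODEL_PAGE_SIZE ? len : MODEL_PAGE_SIZE;

		memcpy(dst, zeropage, c);
		dst += c;
		len -= c;
	}
	return 0;
}
```

Only the first call pays for the allocation. Every later call just copies from the same page, however large the request.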

We'll use the same concept of a single page of memory allocated once and never free'd to implement the fill device. The main difference is that single byte writes to fill will be copied to every byte of that page. Reads will work the same way as they do for the zero device.

First we need to decide on an appropriate minor number for it. We'll use minor device number 3. This is an arbitrary choice, and we could just as easily use any unused slot. It's always worth checking, though, that you're not about to re-purpose a major-minor pair that was recently used for an entirely different function and subsequently removed, or one that is already in use on a different architecture.

To make the device file we use mknod in the normal way:

# mknod -m 0644 /dev/fill c 2 3

# ls -l /dev/fill

crw-r--r-- 1 root wheel 2, 3 Feb 21 14:59 /dev/fill

If we try to do anything with the newly created device before writing any code to support it, unsurprisingly, we get an error. In this case it's ENXIO.

This is returned from the function mmopen, which checks that the supplied minor number is valid, returning EPERM or ENXIO if not.

So the first thing we need to do is to add a new case label to the switch statement in that function:

case 3:

At this point, if we re-compiled and booted in to the new kernel, we'd still get ENXIO, but it would be coming from mmrw instead of mmopen.
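
To make the shape of that check concrete, here is a simplified model of it. This is not the real mmopen: the name mm_open_model is hypothetical, and the additional checks that can produce EPERM are left out, so only the ENXIO path is shown.

```c
#include <assert.h>
#include <errno.h>

/*
 * Simplified model of the minor number check in mmopen.
 * Checks that can yield EPERM in the real function are omitted.
 */
int
mm_open_model(int minor_no)
{
	assert(minor_no >= 0);
	switch (minor_no) {
	case 0:		/* mem */
	case 1:		/* kmem */
	case 2:		/* null */
	case 3:		/* fill, the new label */
	case 12:	/* zero */
		return 0;
	default:
		return ENXIO;
	}
}
```

Without the case 3 label, minor 3 would fall through to the default branch, which is exactly the ENXIO we saw when first touching the new device file.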

Before we add the new code to mmrw, we need to declare a global variable analogous to zeropage:

caddr_t fillpage;

Now in mmrw we need a loop counter variable:

int i;

And finally the main bulk of the new code itself:

	/* minor device 3 is /dev/fill */
	case 3:
		if (uio->uio_rw == UIO_WRITE) {
			if (fillpage == NULL)
				fillpage = malloc(PAGE_SIZE, M_TEMP, M_WAITOK);
			uiomove(fillpage, 1, uio);
			for (i = 1; i < PAGE_SIZE; i++)
				*(fillpage + i) = *(fillpage);
			uio->uio_resid = 0;
			return (0);
		}
		if (fillpage == NULL)
			return (ENOMEDIUM);
		c = ulmin(iov->iov_len, PAGE_SIZE);
		error = uiomove(fillpage, c, uio);
		continue;

Note that on writes, we read a single byte from the buffer containing the written data, then promptly discard the rest by setting uio_resid=0.

If we didn't do this, then the routine would be called repeatedly, reading one byte at a time and using it to fill the buffer pointed to by fillpage. This would result in us using the last byte written as the fill pattern instead of the first.

This would be particularly undesirable. Consider the following command issued from the shell:

# echo foobar > /dev/fill

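With the code as written, the fill pattern becomes the first byte, ascii character 'f', (0x66). Without the discard, we wouldn't even get the last visible character 'r', (0x72), but rather the trailing newline, (0x0a).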
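
The difference between the two behaviours can be demonstrated with a small user-space model. This is a sketch only: fill_write_model and MODEL_PAGE_SIZE are made-up names, and the loop imitates the routine being re-entered once per remaining byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumed page size for this sketch */

/*
 * Model of the write path: consume one byte per pass and replicate it
 * across the page. With discard_rest set, the remaining input is dropped
 * after the first pass, much as setting uio_resid to zero does.
 * Returns the resulting fill byte.
 */
unsigned char
fill_write_model(unsigned char *page, const char *data, size_t len,
    int discard_rest)
{
	assert(page != NULL && data != NULL && len > 0);
	while (len > 0) {
		memset(page, (unsigned char)*data, MODEL_PAGE_SIZE);
		data++;
		len--;
		if (discard_rest)
			len = 0;
	}
	return page[0];
}
```

Feeding it the seven bytes of "foobar\n" gives 'f' when the rest is discarded, and the newline when it isn't.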

This is all we need to make our new pseudo device functional. It can now be used as described two sections above.
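
For completeness, here is a sketch of how a program might drive the device: write one pattern byte, then read back as much as it wants. The function name fill_prime_and_read is our own, and the code assumes a kernel patched as described plus write permission on the device node.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one pattern byte to the device, then read n bytes back into out.
 * Returns 0 on success, -1 on any failure.
 */
int
fill_prime_and_read(const char *path, unsigned char pattern,
    unsigned char *out, size_t n)
{
	int fd;
	size_t done = 0;

	assert(path != NULL && out != NULL);
	if ((fd = open(path, O_RDWR)) == -1)
		return -1;
	if (write(fd, &pattern, 1) != 1) {
		close(fd);
		return -1;
	}
	while (done < n) {
		ssize_t r = read(fd, out + done, n - done);

		if (r <= 0) {
			close(fd);
			return -1;
		}
		done += (size_t)r;
	}
	close(fd);
	return 0;
}
```

On a system without the device, or without the new kernel code, the function simply returns -1.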

Since we return ENOMEDIUM if fill is read from before being written to, some readers might be wondering if we could implement ejecting of an already primed fill device to return it to its initial state.

Well, we can. This is probably of little practical value beyond a programming exercise, but we'll take a look at it anyway.

Those of you who have read the article on using eject with usb flash drives will know that eject uses the MTIOCTOP ioctl with either the MTOFFL command, (in the case of an eject), or MTRETEN, (in the case of media re-loading using eject -t).

This is a magnetic tape ioctl, and not surprisingly the defines for it are not included in mem.c by default. To make them available, we need to add the following include to mem.c:

#include <sys/mtio.h>

Now we can add code to the mmioctl function to handle it, instead of just returning ENODEV:

	if (minor(dev) == 3 && cmd == MTIOCTOP &&
	    ((struct mtop *)data)->mt_op == MTOFFL) {
		free(fillpage, M_TEMP, PAGE_SIZE);
		fillpage = NULL;
		return (0);
	}

This code only handles off-lining of the fill device, since reloading it as would be done with eject -t doesn't really make sense in that there is no logical action to assign to such a function.

Of course, whilst the code above works, it has something of a shortcoming in as much as it allows any user with read access to the device to eject it. This is almost certainly not a good idea, so we'll add a check for root capabilities and return EACCES if called as a regular user:

	if (minor(dev) == 3 && cmd == MTIOCTOP &&
	    ((struct mtop *)data)->mt_op == MTOFFL) {
		if (suser(p) != 0) {
			return (EACCES);
		}
		free(fillpage, M_TEMP, PAGE_SIZE);
		fillpage = NULL;
		return (0);
	}
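
From user space, the request that this handler responds to looks roughly like the following. It's a minimal sketch of issuing MTIOCTOP with MTOFFL by hand rather than a copy of what eject does internally. The function name fill_eject is hypothetical, and the mt_count value is just a placeholder since the handler above never looks at it.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mtio.h>

/*
 * Ask the device at path to go offline.
 * Returns 0 on success, -1 on failure.
 */
int
fill_eject(const char *path)
{
	struct mtop op;
	int fd, rv;

	assert(path != NULL);
	if ((fd = open(path, O_RDONLY)) == -1)
		return -1;
	op.mt_op = MTOFFL;	/* the only operation the handler accepts */
	op.mt_count = 1;	/* placeholder, not examined by the handler */
	rv = ioctl(fd, MTIOCTOP, &op);
	close(fd);
	return rv == -1 ? -1 : 0;
}
```

Called as a regular user against the version with the suser check, the ioctl should fail with EACCES. Called as root, a following read from the device should fail with ENOMEDIUM until it's written to again.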

Whilst poking around the code in sys/arch/amd64/amd64/, I noticed that the function iskmemdev in conf.c not only checks for minor numbers 0, and 1, but also for 14.

The character device with major number 2, and minor number 14, was /dev/io. This device was intended as a way to access the I/O bus of the i386 processor, (which is separate to the memory bus on this architecture).

Code to support a related concept of three separate devices with minor numbers 14, 15, and 16, for different sized I/O operations was actually present in 4.4BSD-Lite, but disabled by default. Although that code did make its way in to very early versions of NetBSD, it never really saw the light of day before being removed.

This mysterious device didn't even gain a manual page there until after its functionality was reduced to merely enabling such I/O operations, and handed over to the i386_iopl system call instead.

Amusingly, the reference to this long obsolete device in the iskmemdev function in OpenBSD seems to have been copied to mem.c on other architectures where /dev/io was never implemented in the first place.

Readers who are familiar with Linux, NetBSD, or FreeBSD will likely know that these systems implement another special device in the shape of /dev/full. The full device is similar to the zero device on reads, but always returns ENOSPC on writes.

As of OpenBSD 7.2-release, such a full device has not yet been implemented on OpenBSD. Since we're here looking at the mem.c code, let's implement it!

Once again, we'll need to allocate a minor device number to our new device. We'll use 5.

# mknod -m 0666 /dev/full c 2 5

Creating the device file

As before, we also need to add a new case label to the switch statement in mmopen:

case 5:

In mmrw, we'll just add the same case 5: before the existing case 12: so that the /dev/zero code runs when accessing /dev/full as well.

The actual code is super trivial, just two lines immediately within the test for uio_rw == UIO_WRITE:

	if (minor(dev) == 5)
		return (ENOSPC);
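
Put together, the shared write path for the two devices behaves roughly like this user-space model. The name zero_full_write_model is invented for the purpose, and the real zero code discards its input by slightly different means, as mentioned earlier.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Model of the shared write path for minors 5 (full) and 12 (zero).
 * Returns 0 or an errno value, and updates *resid like uio_resid.
 */
int
zero_full_write_model(int minor_no, size_t *resid)
{
	assert(resid != NULL);
	switch (minor_no) {
	case 5:		/* full: falls through into the zero code */
	case 12:	/* zero */
		if (minor_no == 5)
			return ENOSPC;	/* nothing is consumed */
		*resid = 0;		/* zero discards the data */
		return 0;
	default:
		return ENXIO;
	}
}
```

Reads need no such distinction, which is why stacking the two case labels is enough to make /dev/full return zeros.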

Testing the new code from the shell, it works as expected:

# dd if=/dev/zero of=/dev/full

dd: /dev/full: No space left on device

1+0 records in

0+0 records out

0 bytes transferred in 0.000 secs (0 bytes/sec)
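
A program can observe the same error directly from write. The following sketch reports the errno value of a one-byte write. The helper name write_errno is our own.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to write a single byte to path.
 * Returns 0 if the write succeeded, otherwise the errno value from
 * whichever call failed (open or write).
 */
int
write_errno(const char *path)
{
	int fd, saved = 0;

	assert(path != NULL);
	if ((fd = open(path, O_WRONLY)) == -1)
		return errno;
	if (write(fd, "x", 1) == -1)
		saved = errno;
	close(fd);
	return saved;
}
```

With the patched kernel and the device node in place, calling this with "/dev/full" should return ENOSPC, whereas "/dev/null" should return 0.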

Don't be fooled in to thinking that something is wrong by testing this functionality using the shell redirection:

# echo foobar > /dev/full

No error message?

If you were expecting an error message, you've fallen in to the trap.

In fact, the shell did receive ENOSPC, as we can tell by looking at the shell variable $?:

# echo foobar > /dev/full

# echo $?

A non-zero exit code indicates an error. Compare with a successful write:

# echo foobar > /tmp/foo

# echo $?

But if we fill a 'real' filesystem, we get a verbose error message:

# echo foobar > /tmp/foo

/tmp: write failed, file system is full

This is not a bug in our code!

It's simply due to the fact that the above error message is not generated by the shell at all, but rather within the ufs code, specifically in sys/ufs/ffs/ffs_alloc.c. This obviously doesn't run when dealing with a non-ufs filesystem.

If the same command does produce an error message on a Linux system, that isn't down to a difference in the kernel. Most likely on the Linux system you are using the bash shell, rather than the default ksh on OpenBSD. The bash shell does indeed produce its own error message on the standard error output in this situation. Using bash on OpenBSD, we would see both errors printed.

Of course, we could add a uprintf call to our code to produce similar output to that which the ufs code generates, but since such an outcome is fully expected when using /dev/full, doing so seems rather superfluous.

Whereas the code to implement the memory special devices is separate for each hardware architecture on OpenBSD, the corresponding code in NetBSD is in sys/dev/mm.c, with various defines in sys/sys/conf.h.

Although the layout of the code is somewhat different between the two systems, the way /dev/null and /dev/zero are actually implemented is effectively the same. This shouldn't come as any surprise, as the only significant change since 4.4BSD-Lite is the way that discarding any data written to /dev/null is optimised by setting the residual I/O value to zero, rather than actually reading it out.