Loading bars simulator - Part four: Creating matching audio waveforms
Material covered in this part
Creating an audio waveform to match the video output
Writing an audio file
The wav file format
The au file format
Writing the 256 × 192 framebuffer data to a raw data file
Making use of the final output
Creating an audio waveform to match the video output
Essentially, all we need to do in order to create a matching audio waveform is to write functions that correspond to video_silence, video_leader, and video_byte_out.
The actual waveform is obviously only mono and only 1 bit deep, so we can write a single-channel, 8-bit audio file at any arbitrary sample rate that suits us. The only slight difficulty is that the timings of the various pulses do not work out to a whole number of samples at any commonly used audio sampling rate. For example, the pulses of the pilot tone last for 2168 T-states, which is 2168/3500000 of a second. At a 48 kHz sampling rate, this works out to 29.73 samples; at 44.1 kHz, 27.32 samples.
Of course, this doesn't mean that we can't accurately reproduce the timing of the original waveform at these sampling rates; we most certainly can.
Although we describe the signal as a square wave, no true square wave really exists: at every 'edge' of the signal, a real waveform will actually oscillate around the high or low level for a brief period. The correct way to accurately represent the timing of the original waveform would be to calculate and use those intermediate values as our samples.
This is fairly complicated however, and for our purposes a simpler approach is sufficient. In the case of the pilot tone pulses, we simply represent them as 30 samples instead of 29.73. This rounding error, combined with the similar but much smaller rounding error in the timing of the video output, means that the audio is approximately 0.00899 seconds longer than the video per second of leader: 1615 pulses of 30 samples is 48450 samples, or about 1.00937 seconds of audio, whereas 1615 pulses of exactly 2168 T-states is about 1.00038 seconds of video. In practice, this doesn't seem to cause a noticeable problem with synchronization.
We use a variable 'samples' in the main function to hold the current position in the audio output buffer. This is broadly equivalent to beampos, except that as we are generating a continuous stream of audio data which will eventually be written to a single file, there is no need for audio functions equivalent to output_frame and set_beam_pixel; we just write the sample values as bytes to the buffer and increment the position stored in 'samples'. We allocate two buffers for audio data: a small one to hold the wav header which we will eventually write, and a larger one to hold the actual audio data:
int samples;
unsigned char *audiobuffer;
unsigned char *audioheader;
audiobuffer=malloc(48000*70);
if (audiobuffer==NULL) { printf ("Unable to allocate memory for audiobuffer.\n"); return(1); }
audioheader=malloc(44);
if (audioheader==NULL) { printf ("Unable to allocate memory for audioheader.\n"); return(1); }
samples=0;
An audio data buffer large enough to hold 70 seconds' worth of audio should be enough. The longest possible time for the data bytes part of the waveform would occur when writing a constant stream of 1 bits, each of which is encoded as two pulses of 1710 T-states. 6912 × 8 × 2 × 1710 / 3500000 ≈ 54 seconds, leaving 16 seconds for any pilot tones, silent intervals, and the header.
We'll represent the two possible states of the waveform as sample values 32 and 223. We could use 0 and 255, but using 32 and 223 has two advantages. Firstly, a square wave at maximum peak-to-peak output will sound very loud and somewhat distorted when reproduced on most hardware, although be aware that the square wave output that we create using 32 and 223 will still be fairly loud, so set any mixer volume controls appropriately. Secondly, if the audio data is interpreted as signed rather than unsigned, 223 becomes -33, so we still end up with a playable waveform, albeit at a lower volume. If we used 0 and 255, then reproducing the waveform as signed would result in almost complete silence.
The audio equivalent to the video_silence function, audio_silence, simply writes 32 for every sample:
/* Output mute audio for the specified number of t-states. */
/* Each t-state represents 48000/3500000 samples, or approximately 0.0137 samples per t-state. */
int audio_silence(unsigned char * audiobuffer, int samples, long t)
{
int n;
for (n=0; n<(t*48000/3500000); n++) { *(audiobuffer+samples+n)=32; }
return (samples+n);
}
The audio_leader function is almost identical to its video counterpart:
/* Each high/low pulse lasts for 2168 t-states. */
/* 2168*48000/3500000=29.7, so each high/low pulse should last for approximately 29.7 samples. We use 30. */
/* 3500000/2168 gives 1614.39 pulses per second, so we use 1614. */
/* Rounding error means that the audio is approximately 0.00899 seconds longer than the video per second of leader. */
int audio_leader(unsigned char * audiobuffer, int samples, int duration_seconds)
{
int n,m,s;
m=0;
s=0;
for (n=0; n<(((duration_seconds*1614)|1)*30); n++) {
if ((n % 30)==0) { m=1-m; }
*(audiobuffer+samples+n)=32+191*m;
}
m=1-m;
/* The duration of the sync tone is 667 t-states for the first pulse, and 735 t-states for the second pulse. */
/* In samples, this works out as 9.1 and 10.1. We use 9 and 10, a total of 19, with the high/low switching point at 9. */
/* The timing difference versus the video output is virtually non-existent. */
for (s=0; s<19; s++) {
if (s==9) { m=1-m; }
*(audiobuffer+samples+n+s)=32+191*m;
}
return (samples+n+s);
}
Finally, audio_byte_out is similar in principle to video_byte_out, but we have an additional complication in that the pulse durations are far enough away from whole numbers of samples that we can't successfully use a fixed sample count for them. The 1710 T-states of the pulses for a binary 1 work out to approximately 23.45 samples at a 48000 Hz sampling rate, and the 855 T-states of a binary 0 work out to about 11.73.
We solve this problem by alternating between outputting either 23 or 24 samples for a binary 1, and for a binary 0, we use a repeating sequence of 11, 12, 12. The loss of synchronization between the video and the audio is only just about noticeable by the end of the loading sequence. If you watch the last few lines of the monochrome bitmap very closely, you'll see that the plotting of the pixel data and the raster bars are both very slightly ahead of the audio. The transition from loading the monochrome bitmap to loading the attribute bytes is also just about detectable. The degree of loss of synchronization depends on the mixture of 0 and 1 bits in the image data. Data consisting of all zeros shows almost no loss of synchronization at all.
However, for all practical purposes, even in the worst case of all binary 1s the slight loss of sync is not really a problem. It certainly doesn't detract from the overall effect.
The way that we alternate between the two possible values for each sample count is simply to use a static variable for each possible line state, which is incremented each time we output a pulse for a bit in that state. We take its value modulo 2 or modulo 3, and if it's zero, we adjust by one the number of samples that we output.
/* Add the tones for each bit of "byte", to the audio buffer starting at offset "sample", and return the new offset. */
/* The duration of a binary 1 is 1710 t-states per pulse, and of a binary 0 is 855 t-states. */
/* For a binary 1, 1710*48000/3500000=23.5, so we alternate between 23 and 24. */
/* For a binary 0, 855*48000/3500000=11.7, so we use a sequence of 11,12,12. */
int audio_byte_out(unsigned char * audiobuffer, int samples, unsigned char byte)
{
static int step_space=0; /* Used to alternate the audio output of a zero bit between 11 and 12. */
static int step_mark=0; /* Used to alternate the audio output of a one bit between 23 and 24. */
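/* The remainder of the function, reconstructed as a sketch from the description above. */
/* The bit order (most significant bit first) and the high-before-low pulse order are */
/* assumptions based on the standard ZX Spectrum tape format. */
int bit, pulse, n, d;
for (bit=7; bit>=0; bit--) {
for (pulse=0; pulse<2; pulse++) {
/* Choose the sample count for this pulse, adjusting by one whenever the counter for this line state is a multiple of 2 or 3. */
if ((byte>>bit)&1) { d=((step_mark++ % 2)==0) ? 23 : 24; }
else { d=((step_space++ % 3)==0) ? 11 : 12; }
for (n=0; n<d; n++) { *(audiobuffer+samples+n)=(pulse==0) ? 223 : 32; }
samples+=d;
}
}
return (samples);
}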
Now we add calls to the audio functions wherever we already have calls to the corresponding video functions, so the main part of our main function looks something like this:
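(The full listing isn't reproduced here, but the pattern is straightforward. As a rough sketch, assuming for illustration that the video functions track their position in beampos in the same way that the audio functions use 'samples', and using a hypothetical header array holding the 17 bytes of the header block, each video call gains a matching audio call:)
beampos=video_leader(beampos, 5); /* Hypothetical signature. */
samples=audio_leader(audiobuffer, samples, 5);
for (n=0; n<17; n++) {
beampos=video_byte_out(beampos, header[n]); /* header[] is hypothetical. */
samples=audio_byte_out(audiobuffer, samples, header[n]);
}
beampos=video_silence(beampos, 3500000); /* One second of silence between blocks. */
samples=audio_silence(audiobuffer, samples, 3500000);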
And there we have it! The complete audio-visual experience has been simulated. All we need to do now is to write the audio data to a file in a suitable format.
Writing an audio file
The two most useful file formats for uncompressed pcm data are wav and au. The wav format is not particularly convenient to work with, but it's useful to see how to write it, as it's widely supported by other software. The au format is considerably simpler and probably just as well supported.
In either case, we'll need to declare another integer variable in our main function for the output file descriptor:
int fd_out;
The wav file format
The wav file format is built on the RIFF file format, which is a chunked format with its own outer header encompassing the actual wav header. It's somewhat tedious for a simple single-chunk audio file, because the length of the data has to be stored separately in both headers.
Let's look at the layout of the header first; a sketch of the code to write our audio data as a wav file follows the description.
The wav audio header is made up of two parts. First, a 12-byte RIFF header: a literal "RIFF", followed by four bytes giving the length of the rest of the file (the total file length minus eight), followed by a literal "WAVE". Next, a 24-byte subheader: a literal "fmt ", then four bytes giving the length of this header chunk, which are 16, 0, 0, 0, then two bytes to indicate PCM, 1, 0, another two bytes to indicate mono, 1, 0, four bytes giving the number of samples per second, 128, 187, 0, 0, another four bytes giving the number of bytes per second, which in our case is the same, 128, 187, 0, 0, two bytes giving the number of bytes per sample, which are 1, 0, in our case, and finally two bytes giving the number of bits in each sample, 8, 0.
The data chunk follows immediately, beginning with four bytes containing the literal "data", followed by four bytes containing the number of samples.
This is then finally followed by the PCM data itself, which is supposed to be padded to an even number of bytes; if the sample count is odd, a pad byte is appended after the data without being counted in the chunk length.
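Putting that together, here is a minimal sketch of the code to build the header and write the file. It assumes memcpy from string.h, the POSIX open, write, and close calls from fcntl.h and unistd.h, and a hypothetical output filename:
unsigned char *h;
long riff_length;
int n;
h=audioheader;
riff_length=36+samples+(samples & 1); /* Total file length minus the first eight bytes, including any pad byte. */
memcpy(h, "RIFF", 4);
for (n=0; n<4; n++) { *(h+4+n)=(riff_length>>(8*n)) & 0xff; } /* Little-endian. */
memcpy(h+8, "WAVEfmt ", 8);
*(h+16)=16; *(h+17)=0; *(h+18)=0; *(h+19)=0; /* Length of the fmt chunk: 16 bytes. */
*(h+20)=1; *(h+21)=0; /* Format 1: PCM. */
*(h+22)=1; *(h+23)=0; /* One channel. */
*(h+24)=128; *(h+25)=187; *(h+26)=0; *(h+27)=0; /* 48000 samples per second. */
*(h+28)=128; *(h+29)=187; *(h+30)=0; *(h+31)=0; /* 48000 bytes per second. */
*(h+32)=1; *(h+33)=0; /* One byte per sample. */
*(h+34)=8; *(h+35)=0; /* Eight bits per sample. */
memcpy(h+36, "data", 4);
for (n=0; n<4; n++) { *(h+40+n)=(samples>>(8*n)) & 0xff; } /* Data length in bytes. */
fd_out=open("output.wav", O_WRONLY|O_CREAT|O_TRUNC, 0644); /* Hypothetical filename. */
if (fd_out==-1) { printf ("Unable to open output file.\n"); return(1); }
write(fd_out, audioheader, 44);
write(fd_out, audiobuffer, samples);
if (samples & 1) { write(fd_out, "", 1); } /* Pad the data chunk to an even length. */
close(fd_out);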
The au file format
In contrast to the wav file header, the au file header is quite simple. It's basically a set of six four-byte words, with all values stored in big-endian format. The first word is a literal ".snd", then we have the offset to the start of the audio data, the length of the audio data in bytes, a word that indicates the format of the audio data, which is 2 for 8-bit signed PCM, a word indicating the sampling rate, and a word indicating the number of channels. This is then followed by arbitrary padding until the start of the audio data, which is wherever we set it to be in the second word of the header.
The following sketch will write an audio file in au format from our audio data buffer; it assumes the same POSIX open, write, and close calls as the wav example, and a hypothetical output filename:
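unsigned long au_words[6];
unsigned char au_header[24];
int n, w;
au_words[0]=0x2e736e64; /* The literal ".snd". */
au_words[1]=24; /* Offset to the start of the audio data: no padding after the header. */
au_words[2]=samples; /* Length of the audio data in bytes. */
au_words[3]=2; /* Format 2: 8-bit signed linear PCM. */
au_words[4]=48000; /* Sampling rate. */
au_words[5]=1; /* Number of channels. */
for (w=0; w<6; w++) { /* Store each word in big-endian byte order. */
for (n=0; n<4; n++) { au_header[(w*4)+n]=(au_words[w]>>(24-(8*n))) & 0xff; }
}
fd_out=open("output.au", O_WRONLY|O_CREAT|O_TRUNC, 0644); /* Hypothetical filename. */
if (fd_out==-1) { printf ("Unable to open output file.\n"); return(1); }
write(fd_out, au_header, 24);
write(fd_out, audiobuffer, samples);
close(fd_out);
Because format 2 is signed, our unsigned sample values of 32 and 223 will be interpreted as 32 and -33, giving the quieter but still perfectly playable waveform described earlier.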
Writing the 256 × 192 framebuffer data to a raw data file
It's quite easy to write the bitmap data, along with the fake attribute bytes that we generate by averaging the red, green, and blue color values, to a data file in the layout of the ZX Spectrum framebuffer. We just need to allocate some memory for another buffer:
unsigned char *databuffer;
databuffer=malloc(6912);
if (databuffer==NULL) { printf ("Unable to allocate memory for databuffer.\n"); return(1); }
Then add an extra line at the end of each of the loops that write out the bitmap data and the color attributes:
*(databuffer+n)=*(input_bitmap+m);
*(databuffer+6144+n)=false_attribute;
And finally write the block of 6912 bytes to disk, along the lines of the following sketch (again assuming the same POSIX calls and a hypothetical output filename):
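fd_out=open("output.scr", O_WRONLY|O_CREAT|O_TRUNC, 0644); /* Hypothetical filename. */
if (fd_out==-1) { printf ("Unable to open output file.\n"); return(1); }
write(fd_out, databuffer, 6912);
close(fd_out);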
Rendering of our fake color attributes on the real machine
If this raw data is loaded into a program that understands such image files, we can see how our fake attribute bytes would be displayed by the real machine. In some cases, the effect is quite interesting, especially with images that have large areas of pixels that are near black or near white.
Our color averaging algorithm turns areas of near black into an attribute byte of 0, which specifies black for both possible pixel values, so we do indeed get an area of pure black when reproduced with the 15-color palette. White areas create an attribute byte of 63, which specifies white for both ink and paper, albeit with the brightness bit switched off, so it's not as bright as a 24-bit RGB value of 255, 255, 255, but it nevertheless looks quite good.
Other colors don't really have an obvious or particularly visually pleasing mapping from 24-bit RGB to the 15-color palette of the ZX Spectrum, but at least areas of similar color are mostly reproduced as blocks of a single color, unless they happen to straddle the threshold between two colors. Of course, unlike with our true 24-bit image, the dithering from the monochrome bitmap does show through, and this helps to fill in some details.
Making use of the final output
After all this programming, we can finally sit back and enjoy the end results of our work. To produce a nice demonstration of the code's capabilities, I prepared the following two 256 × 192 images. The first is 1-bit monochrome, and the second is 24-bit RGB color:
1-bit
24-bit
Invoking the raster bar generation program with these images as input produced 2144 output frames and 42.88 seconds of audio. Here are a few samples of the output frames:
This video file is approximately 27.9 MB. It uses the nut container format and the lossless ffv1 video codec, with single-channel, 8-bit linear pcm audio.
The video content has been converted from RGB to YUV, and subsampled at 4:2:2, to better approximate the visual effect of PAL color encoding on the video signal, as it would have been seen on the original hardware connected to a television set.
Project summary
In this project, we've learned a lot about the timings of video signals, and how to simulate an effect from an old home computer to a high degree of accuracy.
We've also been creative and produced an interesting video effect that wouldn't have been possible on the original hardware, whilst retaining an authentic feel. We've seen how to parse ppm file headers, and write two different audio file formats. All this has been possible with under 500 lines of C source code, a total of about 20 kilobytes, and using only a few library functions from the operating system.