Video shuttle - Part one: Background information and theory
Material covered in this part
Background information and theory
Approaches to the problem, and what won't work
How to successfully implement it
Details we will overlook
Introduction
For the third installment of this real programming series, we'll create another visual effect. This time we'll tackle video rewind and shuttle effects, creating artificial noise bars over our set of background images. This is a useful effect in video production for showing time advancing, or an instant replay of material just shown. Since the size and movement of the noise bars are relative to the speed and direction of the tape, they're a useful subconscious visual cue as to how quickly things are being sped up.
As always, our code will be written from first principles, in pure procedural C without recourse to third-party libraries. We'll focus on making the effect realistic by simulating the real process that creates it. This will help to ensure that we reproduce some of the more subtle nuances which would likely be lost if we simply tried to create the effect visually. Altogether, the source code will only be about 500 lines, or about 20 kilobytes. Let's get going!
Background information and theory
First, we need to establish what exactly we are trying to re-create. I'm going to assume that you're at least somewhat familiar with the concepts of analogue videotape recorders. If you've never used, or seen the output from one at all, then you might want to do some of your own research to get a foundation first before continuing.
Assuming that you have used this equipment, either in a domestic home or professional setting, you'll know that with a few exceptions, most videotape recorders will display noise of varying degrees on the picture, if the playback is at anything other than normal speed. Unless steps have been taken to reduce or inhibit the noise, as the tape is shuttled forwards and backwards at increasingly higher speeds, the noise bars become thinner and there are more of them.
Domestic VCRs will usually display actual noise, snow, or shash, and mute the audio whilst in shuttle. Professional equipment will often display a solid white bar, and leave the audio unmuted. We can easily re-create both effects in the same program. To do so, we need to understand where the effect comes from and by extension, what we will need to simulate to re-create it.
The vast majority of cassette videotape recording formats that have enjoyed widespread use since the late 1970s have used a track layout known as helical scan, where each field of the video signal is recorded as a diagonal stripe across the width of the tape. Open-reel video recorders may use other scanning techniques, some of which impart their own interesting artifacts to the picture, specific to that format. For the purposes of this project, though, we'll concentrate on modelling the process of reading back the video signal from a tape recorded with helical scan tracks.
We can visualise the recording as a series of diagonal lines across the tape. The video heads sweep the same sort of path in a similar diagonal motion, picking up whatever signal is underneath them at the time. In early systems, there was a space of unrecorded tape between each track, to reduce cross-talk interference from adjacent signals. This was somewhat wasteful of tape, so in later systems the unused space was reduced or even eliminated altogether and instead the angles of the heads were offset from each other with different azimuths, again to reduce crosstalk. Both approaches effectively achieve the same end result, in that at normal playback speed, each head only sees the signals it's intended to see. For our simulation we can happily use the simpler unrecorded guard band approach and achieve the desired result. We don't particularly need to simulate different head azimuths.
The key point to take home, with regards to what causes the noise bars, is that at normal playback speed the video heads will move diagonally across the tape directly in line with a single recorded track. By the time the active head has reached the top of the tape and read the last line, the tape will have advanced just enough such that the beginning of the next track is in place for the next head to read it. All of the lines of the field being replayed will come from the same field recorded on the tape. At any other playback speed, this is not going to happen.
Even if the tape is simply paused, the video heads will not be able to read a single track in its entirety. The reason for this is that at normal playback speed, the tape is moving forward linearly in the same direction as the heads are spinning against it, albeit an order of magnitude more slowly. Since the absolute position of the tape changed whilst the rotating video heads swept once across it during the original recording of any particular field, the diagonal line that needs to be followed has effectively been tightened up, and is now presented at a steeper angle than it would be if the same head arrangement had recorded a single field on to a stationary piece of tape. Obviously the position and adjustment of the head drum is optimised for the case of normal speed playback. If we now imagine that the tape was somehow moving forward just as fast as the video heads were spinning horizontally (which is basically impossible in practice, as the head speed might typically be in the order of, say, five metres per second), then the video head in contact with the tape would effectively be moving straight upwards, as the linear motions of head and tape would have cancelled each other out. In this scenario, clearly the heads would be catching signals from multiple sequential fields and assembling a single field from a few lines of each of them, with noise in between.
So when we shuttle the tape forward at, for example, five times normal speed, the picture we see at any point is actually made up of parts of several consecutive fields. The noise bars come from reading the unrecorded spaces between tracks on the tape. Or to look at it another way, the noise bars are equivalent to the black space between frames on a cinema film.
All we need to do to accurately re-create the visual effect is to simulate the movement of the tape and the movement of the video heads, and calculate exactly what signal - or lack thereof - the heads are reading at any particular instant.
A practical approach to the problem
OK, theory is all very well and good, but in practice we come up against several barriers to accurately simulating all of this.
Firstly, trying to compute the actual signal level being read off of a tape, even if we had all of the specifications to hand, is extremely complicated. We would even need to take into account friction of the head drum against the tape, wear on the heads, and all sorts of other factors that are beyond being practical for a simple video effect. More importantly, though, most of the digital video that we might want to apply the effect to is likely not in a format that ever had an analogue equivalent.
Whilst in some cases we might actually want to convert our digital video to standard definition pixel dimensions to get an authentic retro look, it would also be nice to have the option to use this effect natively for high definition material. If the intended use is just to give the viewer a visual cue as to the passage of time being sped up, slowed, reversed or paused, rather than to create a period videotape look, we want to remain in HD.
To solve these problems, what we need to do is to invent our own 'virtual' recording format, imagining that we're recording 1080 line HD, or whatever the source material happens to be, as an analogue signal, and then simulate reading it back complete with noise bars and interference as required.
Why simple solutions won't produce the desired effect
At this point, you might be asking whether all of this complexity is really necessary. Surely we could just overlay random noise over the picture, and vary the number and width of the bars until it 'looks right'. Add in a bit of jitter or other distortion, and the effect is done, right?
Wrong. Feel free to try it, and you will discover that it's harder than it seems to create a believable effect.
Take as an example the case of playback at double speed. If you simply convert some digital material shot at 30 fps, speeding it up by a factor of two, then assuming that you keep the output as 30 fps, your output file will just consist of every second frame from the input, so you'll have input frame 1, input frame 3, input frame 5, and so on. Frames 2, 4, and 6 from the input will have been dropped. You can certainly add a band of noise in the middle of the screen, but the output from a real analogue videotape would consist of different source fields above and below the noise, whereas your digital approximation will show the same field above and below. Believe it or not, this will be noticeable in cases where the original material has large objects moving on screen, because they will appear to be 'torn' when the effect is created authentically, but not with the simple digital approach.
Also, although at higher shuttle speeds the general case is simply that as the tape accelerates the noise bars become thinner and that there are more of them, the effect is somewhat more complicated at slower playback speeds between zero, (paused), and two. When the machine is running at these speeds, we often see almost full fields of noise, alternating with a clear picture, and when normal forward playback speed is reached, there can also be a few frames which suffer interference.
These subtle details might appear trivial, but ignoring them will result in an effect that just doesn't quite look right to the trained eye.
The good news is that by approaching the task in the way that we are going to, as a simulation of the actual tape reading process, a lot of these nuances don't need much particular attention and end up just working.
How we're going to implement it
Our input will be a set of sequentially numbered ppm files for the video, and an audio file in au format. We'll output another set of sequentially numbered ppm files, and another au file with the completed effect. Obviously there may be more or fewer ppm files input than output, depending on whether we are speeding up or slowing down the material. The output audio file may also be correspondingly longer or shorter.
Although these are file formats that we've used in previous projects, up to now we've mostly read and written files with specific, hard-coded dimensions and other properties. In this project, we'll need to be a bit more flexible to make it useful, as our input files are likely to be more varied. Since there will be a lot of computation going on to simulate the effect, the program will likely run somewhat slowly, especially during development, before it's optimised for performance and whilst it still has a lot of debugging code in it. For this reason, too, it's useful to be able to supply low resolution inputs in order to run tests more quickly.
As well as the audio and video input, we'll also require some sort of control input, to indicate at which points the simulated tape should speed up and slow down. This can be a plain text file.
We need to deal gracefully with situations such as reading past the end of the input, missing input frames, or input frames with mis-matched dimensions.
Obviously, the linear position of the tape can be represented by an integer, but it's slightly more complicated than it would be if we were dealing with just audio. In that case, there would be a direct relationship between the current offset into the digital audio data, the current sample that we were processing, and the simulated tape position. We could just count samples.
With a sequence of video frames, it's a bit more complicated. We can't just use the frame number, because we need an order of magnitude more granularity than that. Plus, we need to consider the movement of the tape not just once for every frame, but at the start of every line of video. Since we want to use integer maths rather than floating point, the way to represent the tape position will be SCALE_FACTOR × (frame number × height of image + current line), where SCALE_FACTOR is an arbitrary constant. During development of the program, I used a SCALE_FACTOR of 10, which was sufficiently granular for most effects, but not really enough for smooth slow motion. Changing it to 20 or 40 proved to be sufficient for almost every use case.
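As a minimal sketch of that representation, the formula might be written as a small C function. The name tape_position, the argument order and the choice of long are assumptions made purely for illustration; they are not taken from the finished program.

```c
#include <assert.h>

#define SCALE_FACTOR 10L  /* the development value mentioned in the text */

/* Hypothetical helper: position of the simulated tape when it is squarely
   aligned to read `line` of `frame`, for an image `height` lines tall.
   Integer maths only, no floating point. */
static long tape_position(long frame, long height, long line)
{
    assert(frame >= 0 && height > 0);
    assert(line >= 0 && line < height);
    return SCALE_FACTOR * (frame * height + line);
}
```

One thing to keep in mind with this sketch is range: on a platform where long is 32 bits, a scale of 10 and a height of 1080 would overflow a little before frame 200,000, so a wider type may be worth considering for long sequences or larger scale factors.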
You might be wondering why, as well as multiplying the frame number by the height of the image and our new scale factor, we add the current line. Remember that by definition, at normal playback speed, the tape's linear position advances by one line's worth, that is SCALE_FACTOR units, for each scan line we read and write out, so we need to subtract the current line again when we work out which recorded frame that line comes from. 'Tape position' is a bit of a misnomer, because what we're interested in is not the physical linear position of the tape at any one point, but which recorded frame is passing beneath us. Since the tracks are diagonal, each end of them is at a different physical linear offset on the tape.
An example might make this clearer. Imagine that we are dealing with HD video at 1920 × 1080 pixels, and we've set the SCALE_FACTOR to 10. The horizontal pixel dimension is unimportant to us, we only need to concern ourselves with the 1080 scanning lines. If the tape is playing at normal speed, and we're about to read line 0 of frame 5, then the tape position will be 10 × (5 × 1080 + 0) = 54000. Knowing this number, the tape position, and which scan line we want to read out, line 0, we can calculate which of the input frames it should come from: (54000 / 10 - 0 ) / 1080 = 5. When we come to scan the next line, line 1, the tape has advanced by SCALE_FACTOR units, and the formula becomes: (54010 / 10 - 1) / 1080 = 5. Note that the result is exactly 5, there is no remainder, no fractional part.
This is to be expected, because we are simulating playback at normal speed. Everything should be perfect, with no noise, and we can just copy the input frames to the output. The last line of frame 5 will give us (64790 / 10 - 1079) / 1080 = 5, and the first line of frame 6 will give us (64800 / 10 - 0)/1080 = 6.
So we can now compute which input frame we are reading, given the current tape position and current scanning line.
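The lookup can be sketched in C roughly as follows. This is only an illustration: the struct and function names are hypothetical, and the formula has been rearranged from (position / SCALE_FACTOR - line) / height to (position - SCALE_FACTOR × line) / (SCALE_FACTOR × height). The two are algebraically the same, but the second form stays in scaled units, so integer division doesn't throw away the fine position before we've had a chance to look at the remainder. It also assumes forward motion only, so the offset never goes negative.

```c
#include <assert.h>

#define SCALE_FACTOR 10L  /* the development value mentioned in the text */

/* Hypothetical result type for the lookup. */
struct track_hit {
    long frame;      /* whole part: the recorded frame at or just behind the head */
    long remainder;  /* 0 when squarely on a track, otherwise how far past it,
                        in scaled tape units */
};

/* Which input frame is under the head, given the tape position and the
   scan line being read out of an image `height` lines tall. */
static struct track_hit locate_track(long tape_pos, long line, long height)
{
    struct track_hit hit;
    long offset = tape_pos - SCALE_FACTOR * line;  /* remove the head's own travel */
    long span = SCALE_FACTOR * height;             /* scaled units per recorded frame */

    assert(height > 0 && offset >= 0);             /* forward play only in this sketch */
    hit.frame = offset / span;
    hit.remainder = offset % span;
    return hit;
}
```

Fed the normal-speed figures from the example, every line of frame 5 comes back as frame 5 with a remainder of zero, and position 64800 at line 0 comes back as frame 6.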
Now let's consider the same tape moving at three times normal speed. If we start at the same position reading line 0 of frame 5, then by the time we get to line 540, the tape will have advanced SCALE_FACTOR × 540 × 3, which is 16200, rather than just SCALE_FACTOR × 540. So the new tape position will be 54000 + 16200 = 70200, and putting this number into the formula above gives us: (70200 / 10 - 540) / 1080 = 6 exactly. By line 1000 we get: (84000 / 10 - 1000) / 1080 ≈ 6.85, and by line 1079 we get (86370 / 10 - 1079) / 1080 ≈ 6.9981.
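If you'd like to check these figures without a calculator, a sketch along the following lines will do it in integer maths. Both function names are hypothetical, the speed is assumed to be a whole-number multiple of normal speed (a real control file would presumably need finer steps), and the frame position is reported in thousandths of a frame, truncated, simply to avoid floating point.

```c
#include <assert.h>

#define SCALE_FACTOR 10LL  /* the development value mentioned in the text */

/* Tape position on reaching `line`, having started line 0 at `start_pos`
   with the tape running at `speed` times normal (whole multiples only). */
static long long position_at_line(long long start_pos, long long line,
                                  long long speed)
{
    assert(line >= 0 && speed >= 0);
    return start_pos + SCALE_FACTOR * line * speed;
}

/* The frame formula, scaled up to thousandths of a frame and truncated,
   so that 6000 means frame 6 exactly and 6851 means roughly 6.85. */
static long long frame_thousandths(long long tape_pos, long long line,
                                   long long height)
{
    assert(height > 0);
    return (tape_pos - SCALE_FACTOR * line) * 1000 / (SCALE_FACTOR * height);
}
```

Starting from 54000 at three times speed, line 540 gives a position of 70200 and 6000 thousandths, line 1000 gives 6851, and line 1079 gives 6998.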
Does this seem wrong to you? Are you thinking that, if we are moving at three times normal speed, we should be almost at frame 8 by now, given that we started at 5 and 5 + 3 = 8? Surely something is wrong with the calculations, and we shouldn't only just be at frame 7 by this point?
Actually, the calculations are completely fine.
To realise why, we have to consider two things. Firstly, the movement of the video heads on the drum, and secondly the unrecorded space between the tracks.
Remember that our 1080 lines are numbered 0 to 1079. It might seem logical that line 1080 would be the same as line 0 on the next frame, but it's not:
If we continued reading an extra line of the current frame, our formula would give us: ((86400/10)-1080)/1080=7.00, so we would be reading frame 7 square on.
If we read the first line from the next frame, the same formula gives us: ((86400/10)-0)/1080=8.00, so we would be reading frame 8 square on.
This is exactly how it should be! Imagine a real videotape recorder for a moment. As one of the heads is at the top of the head drum, reading the final piece of the signal from frame 7, the opposite head will be at the bottom of the head drum, positioned probably about 9 cm (if this is a domestic VCR) further along the video tape, perfectly aligned with the start of frame 8.
Realise that, unlike an audio signal which is continuous, a video signal is made up of discrete frames. With an audio signal, we simply look at each sample in sequence, 1, 2, 3, and so on. With a video signal, we stay on the same frame for however many lines there are, in this example 1080, then abruptly move to the next frame in the sequence. The frame number that we get out of our formula should always be an exact round integer, if we are squarely aligned with one particular track on the simulated tape. At any moment that the division leaves a remainder, we are between two of the recorded tracks, and reading a signal from the unrecorded guard band. Returning to the example above, where we start on frame 5, but the tape is moving at three times the normal playback speed, by the time we reach line 268, the tape is at position 54000 + (268 × 10 × 3) = 62040, and the formula looks like this: (62040 / 10 - 268) / 1080 ≈ 5.50. At this point, we are almost exactly half way between the recorded tracks of frame 5 and frame 6. We would certainly be reading noise at this point.
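One way to put a number on 'between two tracks' is to measure how far the head is from the nearest recorded track, in scaled tape units. The sketch below is an assumption about how that might be done rather than the method the finished program uses; the function name is hypothetical and, as before, only forward motion is considered.

```c
#include <assert.h>

#define SCALE_FACTOR 10L  /* the development value mentioned in the text */

/* Distance from the head to the nearest recorded track, in scaled tape
   units. 0 means squarely on a track; the largest possible value is half
   of SCALE_FACTOR * height, i.e. the middle of the guard band. */
static long track_distance(long tape_pos, long line, long height)
{
    long span = SCALE_FACTOR * height;             /* scaled units per frame */
    long offset = tape_pos - SCALE_FACTOR * line;  /* remove the head's own travel */
    long past;

    assert(height > 0 && offset >= 0);             /* forward play only */
    past = offset % span;                          /* distance past the track behind */
    return (past < span - past) ? past : span - past;
}
```

For the line 268 example, the distance comes out at 5360 units against a maximum of 5400, which agrees with the head being almost exactly half way between frames 5 and 6. At line 1079 of the same pass the distance is only 20 units, because the head has nearly arrived at frame 7.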
So now we have our basic formula. In principle, all we need to do is to loop round for each line of each frame, calculate whether we are close enough to a signal to display it, or whether we should just display noise, and output the result to a new ppm file.
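To make the shape of that loop concrete, here is a rough sketch for a single output frame. It doesn't touch any pixels or files; it only records, for each output line, which input frame the line would be copied from, or a marker meaning 'noise'. Everything here is an assumption for illustration: the function name, the NOISE_LINE marker, the whole-number speed, and in particular the tolerance parameter, which stands in for the question of how close to a track we have to be.

```c
#include <assert.h>

#define SCALE_FACTOR 10L     /* the development value mentioned in the text */
#define NOISE_LINE   (-1L)   /* hypothetical marker: this line shows noise */

/* For one output frame, fill source_frame[0..height-1] with the input frame
   each line would be read from, or NOISE_LINE. `tolerance` is how far from
   a track, in scaled units, we still accept a signal. Returns the tape
   position at which the next output frame starts. Forward play only. */
static long map_frame_lines(long start_pos, long speed, long height,
                            long tolerance, long *source_frame)
{
    long span = SCALE_FACTOR * height;  /* scaled units per recorded frame */
    long pos = start_pos;
    long line;

    assert(height > 0 && speed >= 0 && tolerance >= 0 && start_pos >= 0);
    for (line = 0; line < height; line++) {
        long offset = pos - SCALE_FACTOR * line;  /* remove the head's own travel */
        long frame = offset / span;
        long past = offset % span;                /* distance past that frame's track */

        if (past <= tolerance)
            source_frame[line] = frame;           /* on, or just past, a track */
        else if (span - past <= tolerance)
            source_frame[line] = frame + 1;       /* just short of the next track */
        else
            source_frame[line] = NOISE_LINE;      /* in the guard band */

        pos += SCALE_FACTOR * speed;              /* the tape moves on for the next line */
    }
    return pos;
}
```

With a tolerance of zero and the figures used earlier, normal speed maps every line to frame 5 and finishes at position 64800, whilst three times speed maps line 0 to frame 5, line 268 to noise and line 540 to frame 6, finishing at 86400, which is where frame 8 begins.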
Of course, there are a lot of small details to consider, such as exactly how close to a whole integer do we need to be in order to read a signal, and how exactly to generate the random noise, and of course there is the small detail of handling the audio, too. Before all of that, though, we need to write some basic functions to handle the input and output.
Details that we'll ignore
There are a couple of details that I'm deliberately going to ignore in this project, and in the interests of clarity and completeness, I'll mention them here.
Firstly, overscan. In a real analogue television signal, there are areas of overscan, as we saw in the last project where we actually needed to consider them when creating our output. In this project, we are not trying to accurately simulate any particular analogue television system; the accuracy lies in simulating the tracking of the video heads across the tape. As such, although it would be fairly trivial to include an overscan area, I don't see the point in doing this. It might even detract from the overall visual effect, as the noise bars would from time to time be located in the overscan area and therefore not be visible.
Secondly, interlace. Just about every modern analogue standard definition television system uses an interlaced signal, and this creates subtle effects when noise only falls in one of the two fields at any point on the screen. There is no single clear-cut, obvious, and mathematically correct way to re-create this in a progressively scanned video sequence, and since the focus is on creating an effect to use on new footage rather than creating a retro look, I consider it to be somewhat redundant.
Summary so far
In this first part, we've introduced the basic concepts and worked out a formula that will be central to our simulation. In part two we'll start writing the supporting functions that we will need to handle I/O, specifically reading and parsing the input data.