[Coco] Update on new Coco 3 game engine
Richard Goedeken
Richard at fascinationsoftware.com
Tue Aug 13 01:26:44 EDT 2013
Hello Coco fans!
In my long-term quest to write a side-scrolling arcade/adventure game for my
daughter, I began earlier this year with one of the hardest parts: building a
fast enough graphics engine to handle the scrolling and sprites. I figured
that if I couldn't get this working well enough, then there would be no point
in doing all of the work creating the game elements. I'm writing this email
because this graphics code is nearly complete, and I wanted to share some of
the many interesting things that I learned about the Coco and the 6809 during
this development.
In the design of the graphics engine, there are many decisions to be made
which trade off between performance and visual quality. The one major
advantage that the Coco has over the other 8-bit micros of the era is the
large available memory pool of 512k. I wanted to use this to my advantage as
much as possible, and you will see it in some of the choices that I made.
I decided to use double buffering, which is very common, to eliminate tearing
and flashing artifacts. This requires twice as much memory usage, and also
requires us to redraw twice as many background pixels as we otherwise would.
For example, consider the case in which the screen background is moving at 1
byte (2 pixels) per frame horizontally. Buffer 0 is drawn at a starting point
of (0,0). For the next frame, buffer 1 is drawn at (1,0). For the following
frame we will switch back to buffer 0. We need to draw the new pixels for
this screen buffer at a starting point of (2,0). We have already drawn this
buffer at (0,0), so we only need to add two columns of bytes (4 pixels) on the
right side to paint in the missing part of the screen. So we must draw a
column 4 pixels wide, even though we have only moved 2 pixels from the
previous frame. This is because each buffer only gets updated every other frame.
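The catch-up effect described above can be sketched in a few lines of Python. This is only an illustrative model, not the engine's code; the function and variable names are made up for the example, and `None` stands for "this buffer has never been drawn, so it needs a full redraw".

```python
NUM_BUFFERS = 2  # double buffering: one front buffer, one back buffer

def simulate(speed_bytes_per_frame, frames):
    """Return (frame, buffer index, origin x in bytes, byte columns redrawn)."""
    last_x = [None] * NUM_BUFFERS  # scroll position each buffer was last drawn at
    log = []
    x = 0
    for frame in range(frames):
        buf = frame % NUM_BUFFERS
        # A buffer must catch up on everything scrolled since it was last drawn.
        cols = None if last_x[buf] is None else abs(x - last_x[buf])
        log.append((frame, buf, x, cols))
        last_x[buf] = x
        x += speed_bytes_per_frame
    return log

for entry in simulate(1, 4):
    print(entry)
```

At 1 byte per frame, every frame after the first two redraws 2 byte columns (4 pixels), because each buffer is two frames stale when its turn comes around.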
I also decided to use the 256-byte wide screen mode. This increases memory
usage for the screens by 60%, but it gives us some good advantages:
1. Pixel location calculations are greatly simplified (no need to multiply
   by 160).
2. We do not need to clip sprites on the sides when we draw them, because
   it's okay to draw a little offscreen.
3. Background block redrawing can be faster and more consistent in time
   between frames by always drawing the full width of the blocks.
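To make the first advantage concrete, here is a small Python comparison of the two address calculations. It assumes 2 pixels per byte (the 16 color mode) and a 320-pixel-wide visible screen, which is what 160 bytes per row implies; the function names are hypothetical.

```python
BYTES_PER_ROW_NARROW = 160  # 320 pixels at 2 pixels per byte
BYTES_PER_ROW_WIDE = 256    # 256-byte wide screen mode

def offset_narrow(x_pixel, y):
    """Byte offset in a 160-byte-wide screen: needs a real multiply."""
    return y * BYTES_PER_ROW_NARROW + (x_pixel >> 1)

def offset_wide(x_pixel, y):
    """Byte offset in a 256-byte-wide screen: y is simply the high byte."""
    x_byte = x_pixel >> 1
    if not 0 <= x_byte < BYTES_PER_ROW_WIDE:
        raise ValueError("x is outside the 256-byte row")
    return (y << 8) | x_byte

# Extra screen memory needed by the wide mode, as a fraction.
extra_memory = BYTES_PER_ROW_WIDE / BYTES_PER_ROW_NARROW - 1  # 0.6, i.e. 60%
```

On the 6809 the wide form maps naturally onto the 16-bit D register, with the row number in A and the byte column in B, so no multiplication is needed at all.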
I really wish the GIME designers had provided for byte-level horizontal screen
positioning. It is extremely unfortunate that it can only set the horizontal
scroll position in 2-byte (4 pixel) increments. The only way to make it
scroll smoothly with this constraint is to scroll A) very fast, and B) at a
constant speed. Some games (Crystal City) do this and it looks impressive,
but this scrolling is faster than I want, and I also would like to vary the
scrolling speed. Slower scrolling is too jerky with 2-byte positioning. In
software, we can do 2-pixel scrolling by using a pair of screen planes (even
and odd) for each of the front and back buffers. One screen plane is offset
by one byte, and we choose which plane to display on the monitor when we are
flipping front/back buffers in the vertical interrupt by looking at the lowest
bit of the X screen start position in bytes. The penalty for this finer
scrolling is doubling the video memory usage, and about 30% more time to draw
the background pixels.
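The plane selection can be sketched as follows. This is a model of the idea only: the names are invented for the example, and it does not show the actual GIME register writes.

```python
def select_plane(x_start_bytes):
    """Split a byte-granular X start position into (plane, hardware offset).

    plane:     0 = even plane, 1 = odd plane (drawn one byte offset)
    hw_offset: horizontal scroll value in the hardware's 2-byte units
    """
    plane = x_start_bytes & 1        # lowest bit picks the plane
    hw_offset = x_start_bytes >> 1   # remaining bits go to the hardware
    return plane, hw_offset

for x in range(4):
    plane, hw = select_plane(x)
    # The two parts always recombine to the requested byte position.
    print(x, plane, hw, hw * 2 + plane)
```

The hardware handles the coarse 2-byte steps and the choice of plane supplies the missing single-byte step, which is why both planes have to be kept up to date.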
So that's the background scrolling engine. I posted a demo on this list a few
months ago. I recently rewrote the block drawing functions with an improved
copying algorithm, so that it is now interrupt friendly (this is required for
sound), and also a little bit faster.
Regarding performance, the amount of time which passes between one field of
the NTSC video output from the Coco and the next is 16.7 milliseconds. This
is our "time budget". To achieve 60fps operation, we must draw/erase/redraw
everything necessary, as well as read input keyboard/joystick state and do
physics calculations in less than this time. Similarly, to run at 30fps we
need to finish all these calculations in less than 33.3 ms. One thing that I
realized is that the computational workload for the game can vary greatly
depending upon number of objects on the screen, the positions of the objects,
whether the background is scrolling and by how much, etc. Rather than try to
achieve a constant frame rate (at which every frame will be bound by the worst
case), it is better to support a variable frame rate. This is a common
technique used in modern games, and in fact I even noticed this in the new
Pikmin 3 game for the Wii U. Since we already use double-buffering, this can
be supported with a small penalty when doing the physics calculations. So my
game engine does this: I track the number of 60 Hz fields which pass between
frame updates, and use this value for updating the game state ('physics'
calculations). For example, all objects will move at 3x their nominal speed
if there were 3 field durations which passed between the last pair of frame
updates. For simplicity and performance, I only support 1x, 2x, and 3x field
times for the variable frame rate. If it takes more than 3 field durations to
calculate a frame, then the game will appear to 'lag' or slow down.
Otherwise, it will just get a little choppier as it slows down, but will
appear to run at the same perceptual speed.
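A minimal Python model of this variable frame rate scheme is shown below. It assumes that a finished frame is only displayed at the next field boundary, and it treats the field time as exactly 1/60 s; the names are hypothetical, and the real engine counts fields in its interrupt handler rather than measuring milliseconds.

```python
import math

FIELD_MS = 1000.0 / 60.0  # about 16.7 ms per video field
MAX_FIELDS = 3            # only 1x, 2x and 3x field times are supported

def fields_for_frame(work_ms):
    """Number of fields that pass while a frame taking work_ms is prepared."""
    return max(1, math.ceil(work_ms / FIELD_MS))

def step(position, nominal_speed, fields_elapsed):
    """Advance an object by its per-field speed, scaled by elapsed fields."""
    scale = min(fields_elapsed, MAX_FIELDS)  # beyond 3 fields the game lags
    return position + nominal_speed * scale

print(fields_for_frame(10.0))  # fits in one field
print(fields_for_frame(20.0))  # spills into a second field
print(step(0, 2, 3))           # moves 3x the nominal distance
print(step(0, 2, 5))           # clamped at 3x, so the game appears to slow down
```

Clamping the scale factor keeps the physics update cheap (three known multipliers) at the cost of visible slowdown in the rare case where a frame takes more than three fields.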
The performance of the scrolling engine is pretty good. Here is a table which
shows the number of milliseconds required to update the background (in terms
of bytes for horizontal scrolling, and rows for vertical scrolling):
Time (millisec)     -8    -6    -4    -2     0     2     4     6     8
----------------------------------------------------------------------
Horizontal        12.3   9.7   7.2   4.3   1.4   4.3   7.2   9.8  12.3
Vertical          12.0  10.2   8.4   5.0   1.4   5.0   8.3  10.1  12.0
----------------------------------------------------------------------
The total overhead of the engine with no objects running is 1.4 milliseconds.
This includes reading 2 axes of one joystick and all the screen redraw logic.
One thing that I noticed is that IRQ overhead of the 6809 is really high. The
horizontal interrupt is a killer. The overhead for even an FIRQ is 21 cycles
(10 cycles to enter, 5 for the LBRA at $FExx, and 6 for the RTI). The fastest
routine that I can come up with to handle both VSync and sound is a minimum of 45
cycles in 9 instructions, and I would probably need more than this to
dynamically update the screen based on row number. So we need a minimum of 66
cycles for this interrupt routine, and here's the kicker: the horizontal
interrupt signals arrive only 114 cycles apart. Therefore, using the
horizontal interrupt will occupy a minimum of 58% of all clock cycles,
regardless of frame rate. This is too steep for me, so I will not use this
and won't be able to split up the screen into horizontal regions, like Nick is
doing for Popstar Pilot. I'll run the sound at a lower frequency off of the
12-bit timer, and turn off this interrupt source when the sound is not playing.
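The arithmetic behind the 58% figure, written out in Python using only the cycle counts quoted above (the constant names are just labels for this example):

```python
FIRQ_ENTRY = 10        # cycles to enter the FIRQ
LBRA_VECTOR = 5        # cycles for the LBRA at $FExx
RTI_EXIT = 6           # cycles for the RTI
HANDLER_MIN = 45       # fastest VSync + sound handler body
CYCLES_PER_LINE = 114  # cycles between horizontal interrupts

FIRQ_OVERHEAD = FIRQ_ENTRY + LBRA_VECTOR + RTI_EXIT  # 21 cycles

def hsync_cpu_fraction(handler_cycles=HANDLER_MIN):
    """Fraction of all CPU cycles consumed by servicing every scan line."""
    return (FIRQ_OVERHEAD + handler_cycles) / CYCLES_PER_LINE

print(FIRQ_OVERHEAD + HANDLER_MIN)            # 66 cycles per interrupt
print(round(hsync_cpu_fraction() * 100))      # about 58 percent
```

Even an empty handler would cost 21 of every 114 cycles, a little over 18%, before doing any useful work.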
During the last few months I've made a lot of progress on the sprite portion
of the graphics engine. I believe that my design for this is novel, and it is
about as fast as it could be. My goal here was maximum theoretical
performance. Again, I traded off memory consumption for speed. Part of the
challenge with drawing sprites is that the 16 color mode packs 2 pixels into a
single byte. If you want to support sprites with 1-pixel wide features, you
must mask the background bytes with a logical AND, and then OR/ADD the results
with the sprite pixels before writing back to the screen. The fastest
general-purpose sprite routines that I can write require 3720 cycles to erase
and write (while saving background data for later erasing) a 16x16 sprite.
This works out to 14.5 cycles per pixel (assuming that all 256 pixels are drawn).
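For readers who have not written a masked sprite routine, the per-byte operation looks like this in Python. It is a sketch of the general technique, not the 6809 routine itself. It assumes a mask convention in which a set nibble (F) means "keep the background pixel" and a clear nibble means "overwrite it", and that the sprite byte is zero wherever the mask keeps the background.

```python
def blit_byte(background, keep_mask, sprite):
    """AND away the pixels being replaced, then OR in the sprite pixels."""
    return (background & keep_mask) | sprite

def draw_row(screen, offset, keep_masks, pixels):
    """Draw one sprite row; return the saved background for later erasing."""
    saved = bytes(screen[offset:offset + len(pixels)])
    for i, (mask, pix) in enumerate(zip(keep_masks, pixels)):
        screen[offset + i] = blit_byte(screen[offset + i], mask, pix)
    return saved

screen = bytearray([0x12, 0x34])
# First byte: replace only the left pixel. Second byte: replace both pixels.
saved = draw_row(screen, 0, keep_masks=[0x0F, 0x00], pixels=[0xA0, 0xBB])
print(screen.hex(), saved.hex())

CYCLES_PER_PIXEL = 3720 / (16 * 16)  # about 14.5 for a full 16x16 sprite
```

Erasing is then just a matter of copying the saved bytes back to the same offset.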
To achieve the maximum possible performance with my sprite engine, I wrote a
sprite compiler. This software is a large and complex Python script, which
reads sprite data from a file and writes out near-optimal 6809 assembly for
drawing and erasing sprites on the screen. It basically paints them, byte by
byte. Even though the sprite compiler includes a lot of crazy optimizations,
the performance gains that I get on a cycle-per-pixel basis are relatively
small and mostly attributable to two techniques: 1) I don't need to AND mask
the bytes/words which will get completely overwritten, and 2) I can minimize
foreground pixel loads by grouping together writes with the same byte/word
values. For the few sprites with which I've been testing, the compiled sprite
code takes an average of 12.9 cycles/pixel to erase and draw, which is only
12% faster than the general purpose routine, but the big gain comes from the
fact that we only draw and erase the bytes which contain non-transparent
pixels. When we look at the overall time consumed (rather than cycles per
pixel), the new sprite engine turns out to be much faster than the general
purpose routine. For example, I can draw+erase a 15-pixel diameter ball in
under 2000 cycles. This is much faster than the general purpose routine,
which would take 3720. I can draw+erase nice outlined 8x16 numeric characters
in 1000 cycles or less each.
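To illustrate the two techniques named above, here is a deliberately tiny toy sprite compiler in Python. It is not DynoSprite's compiler and makes several simplifying assumptions: the sprite is given as rows of hex digits with `.` for transparent, the left pixel sits in the high nibble, X points at the sprite's top-left screen byte, rows are 256 bytes apart, only byte (not word) operations are emitted, and no background is saved for erasing.

```python
from collections import defaultdict

ROW_STRIDE = 256  # bytes per screen row (256-byte wide mode)

def compile_sprite(rows):
    """Return a list of 6809 assembly lines that draw the sprite at X."""
    full = defaultdict(list)  # byte value -> offsets that are fully overwritten
    partial = []              # (offset, keep_mask, value) for half-opaque bytes
    for y, row in enumerate(rows):
        if len(row) % 2:
            raise ValueError("row width must be an even number of pixels")
        for bx in range(len(row) // 2):
            left, right = row[2 * bx], row[2 * bx + 1]
            if left == "." and right == ".":
                continue  # fully transparent byte: emit no code at all
            offset = y * ROW_STRIDE + bx
            hi = 0 if left == "." else int(left, 16)
            lo = 0 if right == "." else int(right, 16)
            value = (hi << 4) | lo
            if "." in (left, right):
                keep = 0xF0 if left == "." else 0x0F
                partial.append((offset, keep, value))
            else:
                full[value].append(offset)
    asm = []
    # Technique 2: load each distinct value once, then store it everywhere.
    for value, offsets in sorted(full.items()):
        asm.append(f" LDA #${value:02X}")
        asm.extend(f" STA {off},X" for off in offsets)
    # Technique 1 applies above (no AND mask); only these bytes need masking.
    for off, keep, value in partial:
        asm += [f" LDA {off},X", f" ANDA #${keep:02X}",
                f" ORA #${value:02X}", f" STA {off},X"]
    asm.append(" RTS")
    return asm

print("\n".join(compile_sprite([".AA.", "AAAA"])))
```

For this two-row example the fully opaque bytes cost one load and two stores, while each half-transparent byte costs four instructions, which shows where the per-pixel savings come from.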
So I'm happy with the performance. As I mentioned before, the tradeoff is
increased memory consumption. For a general-purpose engine, you would use
probably 2 bytes per pixel to store sprite data (each sprite object would have
2 copies to get single-pixel positioning, and each copy would contain a mask
byte and a foreground pixel byte for each screen byte). For my engine, the
generated machine code for drawing and erasing the sprites varies, but comes
out to about 6 bytes per pixel if you only need byte-level positioning (i.e.,
for letters/numbers), or 9 bytes per pixel if you want pixel-level
positioning. It's a pretty heavy memory penalty, but I think it's worth the
speed. The maximum sprite size is about 62x32, and the cool thing is that the
sprites can be any shape or size, and will be optimized for just the pixels
which get written to the screen.
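To put the memory figures in perspective, here is the arithmetic for a hypothetical fully opaque 16x16 sprite, using the approximate bytes-per-pixel numbers quoted above (the actual generated code size varies per sprite):

```python
PAGE_SIZE = 8 * 1024  # one 8k memory page

def sprite_bytes(pixels, bytes_per_pixel):
    """Approximate storage needed for one sprite."""
    return pixels * bytes_per_pixel

PIXELS = 16 * 16
general = sprite_bytes(PIXELS, 2)         # mask + data, two shifted copies
compiled_byte = sprite_bytes(PIXELS, 6)   # compiled, byte-level positioning
compiled_pixel = sprite_bytes(PIXELS, 9)  # compiled, pixel-level positioning

print(general, compiled_byte, compiled_pixel)
print(PAGE_SIZE // compiled_pixel)  # such sprites that fit in one 8k page
```

So the compiled form costs roughly 3 to 4.5 times the memory of the general-purpose data format for the same sprite.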
With the high memory consumption (and a scrolling graphics aperture which can
move anywhere in the physical RAM space), it is desirable to abstract the 8k
memory page (de)allocation and mapping. So, one of the very first modules of
code that I wrote is a simple virtual memory manager which tracks the 8k pages
which are allocated by different parts of the engine. It automatically moves
them when the screen aperture moves to overlap with an in-use block.
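One way such a page manager could be structured is sketched below in Python. This is a guess at the general shape, not the engine's actual design: the class, its methods, and the first-fit policy are all assumptions, and a real implementation would also copy the 8k of data and update the MMU mappings when a page is relocated.

```python
PAGE_SIZE = 8 * 1024
TOTAL_PAGES = (512 * 1024) // PAGE_SIZE  # 64 physical pages in 512k

class PageManager:
    def __init__(self, aperture):
        self.aperture = set(aperture)  # physical pages under the screen buffers
        self.owner = {}                # physical page -> owner tag
        self.moves = []                # (old_page, new_page) relocations done

    def _free_page(self):
        """First physical page that is neither owned nor under the screen."""
        for page in range(TOTAL_PAGES):
            if page not in self.owner and page not in self.aperture:
                return page
        raise MemoryError("no free 8k page")

    def allocate(self, owner):
        page = self._free_page()
        self.owner[page] = owner
        return page

    def move_aperture(self, new_aperture):
        """Move the screen aperture, relocating any in-use pages it now covers."""
        self.aperture = set(new_aperture)
        for page in sorted(p for p in self.owner if p in self.aperture):
            new_page = self._free_page()
            self.owner[new_page] = self.owner.pop(page)
            self.moves.append((page, new_page))  # real code: copy 8k, remap

pm = PageManager(aperture=range(0, 4))
sprites = pm.allocate("sprites")  # lands on the first page past the screen
pm.move_aperture(range(4, 8))     # the screen scrolls onto that page
print(sprites, pm.moves, pm.owner)
```

Because physical page numbers can change underneath their owners, callers in a design like this would need to refer to pages through a handle or lookup table rather than caching the physical number.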
I'm really excited about this graphics/game engine, because it is sufficiently
generalized that it could be used for a lot of great games in addition to
platformers. It's not suitable for every genre, but it would work well for
several different game types. I would love to do a top-down racer like Micro
Machines. If it were simple enough, it could look beautiful running at
60fps. This engine is also suitable for horizontal shoot-em-ups and top-down
or isometric RPGs, graphic adventures, or arcade action games. The sprite
functionality could be extracted separately from the background scrolling
engine and could be used in any type of game.
I also came up with a name for this engine: I call it DynoSprite. I have a
few more weeks of work to do on a demo that I will release to show the sprite
functionality. With any luck I should have something cool to show you soon.
Richard