[Coco] Update on new Coco 3 game engine

Richard Goedeken Richard at fascinationsoftware.com
Tue Aug 13 01:26:44 EDT 2013


Hello Coco fans!

In my long-term quest to write a side-scrolling arcade/adventure game for my 
daughter, I began earlier this year with one of the hardest parts: building a 
fast enough graphics engine to handle the scrolling and sprites.  I figured 
that if I couldn't get this working well enough, then there would be no point 
in doing all of the work creating the game elements.  I'm writing this email 
because this graphics code is nearly complete, and I wanted to share some of 
the many interesting things that I learned about the Coco and the 6809 during 
this development.

In the design of the graphics engine, there are many decisions to be made 
which trade off between performance and visual quality.  The one major 
advantage that the Coco has over the other 8-bit micros of the era is the 
large available memory pool of 512k.  I wanted to use this to my advantage as 
much as possible, and you will see it in some of the choices that I made.

I decided to use double buffering, which is very common, to eliminate tearing 
and flashing artifacts.  This requires twice as much memory usage, and also 
requires us to redraw twice as many background pixels as we otherwise would.  
For example, consider the case in which the screen background is moving at 1 
byte (2 pixels) per frame horizontally.  Buffer 0 is drawn at a starting point 
of (0,0).  For the next frame, buffer 1 is drawn at (1,0).  For the following 
frame we will switch back to buffer 0.  We need to draw the new pixels for 
this screen buffer at a starting point of (2,0). We have already drawn this 
buffer at (0,0), so we only need to add two columns of bytes (4 pixels) on the 
right side to paint in the missing part of the screen.  So we must draw a 
column 4 pixels wide, even though we have only moved 2 pixels from the 
previous frame. This is because each buffer only gets updated every other frame.
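
As a rough illustration of that bookkeeping (a sketch only, not the engine's actual code; the function and variable names are made up for this example), a few frames can be simulated to see how many new byte columns each buffer needs when it is reused:

```python
def simulate(frames, speed_bytes=1, num_buffers=2):
    """Return (frame, buffer, x_start, new_columns) for each frame.

    new_columns is None the first time a buffer is used, since it then
    needs a full redraw rather than an incremental one.
    """
    last_x = [None] * num_buffers   # x start at which each buffer was last drawn
    x = 0                           # current scroll position, in bytes
    result = []
    for frame in range(frames):
        buf = frame % num_buffers
        new_cols = None if last_x[buf] is None else x - last_x[buf]
        last_x[buf] = x
        result.append((frame, buf, x, new_cols))
        x += speed_bytes
    return result

for frame, buf, x, cols in simulate(4):
    print(frame, buf, x, cols)
```

With a scroll speed of 1 byte per frame, frame 2 reuses buffer 0 at x=2 and has to paint 2 byte columns (4 pixels), matching the example above.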

I also decided to use the 256-byte wide screen mode.  This increases memory 
usage for the screens by 60%, but it gives us some good advantages:  1. Pixel 
location calculations are greatly simplified (no need to multiply by 160).  2. 
We do not need to clip sprites on the sides when we draw them, because it's 
okay to draw a little offscreen.  3. Background block redrawing can be faster 
and more consistent in time between frames by always drawing the full width of 
the blocks.
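
To make the first point concrete, here is a small sketch (hypothetical helper names, written in Python rather than 6809 assembly) comparing the two address calculations.  With a 256-byte stride the row number simply becomes the high byte of the 16-bit offset:

```python
def offset_256(x_byte, y):
    """Byte offset into a 256-byte-wide screen: row in the high byte, column in the low byte."""
    assert 0 <= x_byte < 256
    return (y << 8) | x_byte

def offset_160(x_byte, y):
    """Byte offset into a 160-byte-wide screen: needs a real multiply by 160."""
    return y * 160 + x_byte

# Extra memory cost of the wider rows, as a percentage.
extra_percent = (256 - 160) * 100 // 160

print(offset_256(5, 3), offset_160(5, 3), extra_percent)
```

On the 6809 the 256-byte case amounts to loading the row into A and the column into B, so that D holds the offset with no arithmetic at all.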

I really wish the GIME designers had provided for byte-level horizontal screen 
positioning.  It is extremely unfortunate that it can only set the horizontal 
scroll position in 2-byte (4 pixel) increments.  The only way to make it 
scroll smoothly with this constraint is to scroll A) very fast, and B) at a 
constant speed.  Some games (Crystal City) do this and it looks impressive, 
but this scrolling is faster than I want, and I also would like to vary the 
scrolling speed.  Slower scrolling is too jerky with 2-byte positioning.  In 
software, we can do 2-pixel scrolling by using a pair of screen planes (even 
and odd) for each of the front and back buffers.  One screen plane is offset 
by one byte, and we choose which plane to display on the monitor when we are 
flipping front/back buffers in the vertical interrupt by looking at the lowest 
bit of the X screen start position in bytes.  The penalty for this finer 
scrolling is doubling the video memory usage, and about 30% more time to draw 
the background pixels.
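
A minimal sketch of the plane selection, assuming a non-negative X start position in bytes and assuming the odd plane is the one drawn with the one-byte offset (the names here are invented for illustration):

```python
def select_plane(x_bytes):
    """Split a byte-level X position into (plane, coarse hardware offset).

    plane  : 0 = even plane, 1 = odd plane (chosen from the lowest bit)
    coarse : horizontal offset in the 2-byte units the hardware supports
    """
    plane = x_bytes & 1
    coarse = x_bytes >> 1
    return plane, coarse

for x in range(4):
    print(x, select_plane(x))
```

Stepping x_bytes through 0, 1, 2, 3 alternates between the two planes while the coarse offset only advances on every second step, which is what gives the effective 1-byte (2-pixel) scroll resolution.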

So that's the background scrolling engine.  I posted a demo on this list a few 
months ago.  I recently rewrote the block drawing functions with an improved 
copying algorithm, so that it is now interrupt friendly (this is required for 
sound), and also a little bit faster.

Regarding performance, the amount of time which passes between one field of 
the NTSC video output from the Coco and the next is 16.7 milliseconds.  This 
is our "time budget".  To achieve 60fps operation, we must draw/erase/redraw 
everything necessary, as well as read input keyboard/joystick state and do 
physics calculations in less than this time.  Similarly, to run at 30fps we 
need to finish all these calculations in less than 33.4ms.  One thing that I 
realized is that the computational workload for the game can vary greatly 
depending upon number of objects on the screen, the positions of the objects, 
whether the background is scrolling and by how much, etc.  Rather than try to 
achieve a constant frame rate (at which every frame will be bound by the worst 
case), it is better to support a variable frame rate.  This is a common 
technique used in modern games, and in fact I even noticed this in the new 
Pikmin 3 game for the Wii U.  Since we already use double-buffering, this can 
be supported with a small penalty when doing the physics calculations.  So my 
game engine does this: I track the number of 60 Hz fields which pass between 
frame updates, and use this value for updating the game state ('physics' 
calculations).  For example, all objects will move at 3* their nominal speed 
if there were 3 field durations which passed between the last pair of frame 
updates.  For simplicity and performance, I only support 1x, 2x, and 3x field 
times for the variable frame rate.  If it takes more than 3 field durations to 
calculate a frame, then the game will appear to 'lag' or slow down.  
Otherwise, it will just get a little choppier as it slows down, but will 
appear to run at the same perceptual speed.
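
The idea can be sketched as follows (illustrative Python with made-up names, not the engine's 6809 code).  The number of elapsed fields is clamped to the supported 1x to 3x range and used to scale each object's movement:

```python
MAX_FIELDS = 3  # only 1x, 2x and 3x field times are supported

def step_multiplier(fields_elapsed):
    """Clamp the measured field count to the supported range."""
    return max(1, min(fields_elapsed, MAX_FIELDS))

def advance(position, nominal_speed, fields_elapsed):
    """Move an object by its per-field speed times the clamped field count."""
    return position + nominal_speed * step_multiplier(fields_elapsed)

print(advance(10, 2, 1))  # frame took 1 field
print(advance(10, 2, 3))  # frame took 3 fields
print(advance(10, 2, 5))  # frame took 5 fields: clamped, so the game lags
```

When a frame takes more than 3 fields, the object moves no further than it would in 3, which is the visible slowdown described above.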

The performance of the scrolling engine is pretty good.  Here is a table which 
shows the number of milliseconds required to update the background (in terms 
of bytes for horizontal scrolling, and rows for vertical scrolling):

Time (millisec)  -8    -6    -4    -2     0     2     4     6     8
-------------------------------------------------------------------
Horizontal     12.3   9.7   7.2   4.3   1.4   4.3   7.2   9.8  12.3
Vertical       12.0  10.2   8.4   5.0   1.4   5.0   8.3  10.1  12.0
-------------------------------------------------------------------

The total overhead of the engine with no objects running is 1.4 milliseconds.  
This includes reading 2 axes of one joystick and all the screen redraw logic.  
One thing that I noticed is that IRQ overhead of the 6809 is really high.  The 
horizontal interrupt is a killer.  The overhead for even an FIRQ is 21 cycles 
(10 cycles to enter, 5 for the LBRA at $FExx, and 6 for the RTI).  The fastest 
routine that I can come up with to handle both VSync and sound is a minimum of 45 
cycles in 9 instructions, and I would probably need more than this to 
dynamically update the screen based on row number.  So we need a minimum of 66 
cycles for this interrupt routine, and here's the kicker: the horizontal 
interrupt signals arrive only 114 cycles apart.  Therefore, using the 
horizontal interrupt will occupy a minimum of 58% of all clock cycles, 
regardless of frame rate.  This is too steep for me, so I will not use this 
and won't be able to split up the screen into horizontal regions, like Nick is 
doing for Popstar Pilot.  I'll run the sound at a lower frequency off of the 
12-bit timer, and turn off this interrupt source when the sound is not playing.
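
The arithmetic behind that 58% figure, spelled out as a small sketch using only the cycle counts quoted above (the constant and function names are just for this example):

```python
FIRQ_ENTRY_CYCLES = 10     # cycles to enter the FIRQ
LBRA_CYCLES = 5            # the LBRA at $FExx
RTI_CYCLES = 6             # return from interrupt
HANDLER_CYCLES = 45        # minimal VSync + sound handler body
CYCLES_PER_SCANLINE = 114  # spacing between horizontal interrupts

def hsync_cpu_fraction(handler_cycles=HANDLER_CYCLES):
    """Return (cycles per interrupt, fraction of all CPU cycles consumed)."""
    total = FIRQ_ENTRY_CYCLES + LBRA_CYCLES + RTI_CYCLES + handler_cycles
    return total, total / CYCLES_PER_SCANLINE

total, fraction = hsync_cpu_fraction()
print(f"{total} cycles per interrupt, {fraction:.0%} of the CPU")
```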

During the last few months I've made a lot of progress on the sprite portion 
of the graphics engine.  I believe that my design for this is novel, and it is 
about as fast as it could be.  My goal here was maximum theoretical 
performance.  Again, I traded off memory consumption for speed.  Part of the 
challenge with drawing sprites is that the 16 color mode packs 2 pixels into a 
single byte.  If you want to support sprites with 1-pixel wide features, you 
must mask the background bytes with a logical AND, and then OR/ADD the results 
with the sprite pixels before writing back to the screen.  The fastest 
general-purpose sprite routines that I can write require 3720 cycles to erase 
and write (while saving background data for later erasing) a 16x16 sprite.  
This works out to 14.5 cycles per pixel (assuming that all 256 pixels are drawn).
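
For readers unfamiliar with the technique, the per-byte masking step looks roughly like this (a Python sketch with hypothetical names; it assumes the left pixel lives in the high nibble, and that mask bits are set wherever the background should show through):

```python
def draw_masked(screen, offset, mask, fg):
    """Draw one sprite byte over the background and return the saved background byte."""
    saved = screen[offset]                # keep for erasing later
    screen[offset] = (saved & mask) | fg  # AND out the sprite's pixels, OR in its colors
    return saved

screen = bytearray([0x12])
# Sprite byte: left pixel is color $A, right pixel is transparent.
saved = draw_masked(screen, 0, mask=0x0F, fg=0xA0)
print(hex(screen[0]), hex(saved))

cycles_per_pixel = 3720 / (16 * 16)  # roughly 14.5
```

Erasing is then just a matter of writing the saved bytes back to the same offsets.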

To achieve the maximum possible performance with my sprite engine, I wrote a 
sprite compiler.  This software is a large and complex Python script, which 
reads sprite data from a file and writes out near-optimal 6809 assembly for 
drawing and erasing sprites on the screen.  It basically paints them, byte by 
byte.  Even though the sprite compiler includes a lot of crazy optimizations, 
the performance gains that I get on a cycle-per-pixel basis are relatively 
small and mostly attributable to two techniques: 1) I don't need to AND mask 
the bytes/words which will get completely overwritten, and 2) I can minimize 
foreground pixel loads by grouping together writes with the same byte/word 
values.  For the few sprites with which I've been testing, the compiled sprite 
code takes an average of 12.9 cycles/pixel to erase and draw, which is only 
12% faster than the general purpose routine, but the big gain comes from the 
fact that we only draw and erase the bytes which contain non-transparent 
pixels.  When we look at the overall time consumed (rather than cycles per 
pixel), the new sprite engine turns out to be much faster than the general 
purpose routine.  For example, I can draw+erase a 15-pixel diameter ball in 
under 2000 cycles.  This is much faster than the general purpose routine, 
which would take 3720.  I can draw+erase nice outlined 8x16 numeric characters 
in 1000 cycles or less each.
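
To give a flavor of what a sprite compiler does, here is a heavily simplified sketch.  It is not the actual script: it only handles drawing (no background saving or erasing), works one byte at a time, and implements only the first technique above (skipping the AND mask for fully opaque bytes).  The input format, the function name, the use of X as the screen pointer, and the left-pixel-in-high-nibble layout are all assumptions made for the example:

```python
def compile_sprite(rows, stride=256):
    """Emit 6809 assembly lines that draw a sprite relative to the X register.

    rows: list of rows of pixel values, each 0-15 or None for transparent;
          every row must have an even number of pixels.
    """
    lines = []
    for y, row in enumerate(rows):
        for px in range(0, len(row), 2):
            left, right = row[px], row[px + 1]
            if left is None and right is None:
                continue  # fully transparent byte: emit nothing at all
            offset = y * stride + px // 2
            value = ((left or 0) << 4) | (right or 0)
            if left is not None and right is not None:
                # Fully opaque byte: no need to read or mask the background.
                lines.append(f"    LDA  #${value:02X}")
            else:
                # Half-transparent byte: keep the background nibble.
                mask = 0x0F if right is None else 0xF0
                lines.append(f"    LDA  {offset},X")
                lines.append(f"    ANDA #${mask:02X}")
                lines.append(f"    ORA  #${value:02X}")
            lines.append(f"    STA  {offset},X")
    return lines

asm = compile_sprite([[None, 5, 7, 7, None, None]])
print("\n".join(asm))
```

For this one-row example the first byte needs the load/mask/OR/store sequence, the second is a plain load-immediate and store, and the third (fully transparent) byte produces no code, which is where the big saving comes from.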

So I'm happy with the performance.  As I mentioned before, the tradeoff is 
increased memory consumption.  For a general-purpose engine, you would use 
probably 2 bytes per pixel to store sprite data (each sprite object would have 
2 copies to get single-pixel positioning, and each copy would contain a mask 
byte and a foreground pixel byte for each screen byte).  For my engine, the 
generated machine code for drawing and erasing the sprites varies, but comes 
out to about 6 bytes per pixel if you only need byte-level positioning (i.e., 
for letters/numbers), or 9 bytes per pixel if you want pixel-level 
positioning.  It's a pretty heavy memory penalty, but I think it's worth the 
speed.  The maximum sprite size is about 62x32, and the cool thing is that the 
sprites can be any shape or size, and will be optimized for just the pixels 
which get written to the screen.
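
As a back-of-the-envelope comparison using the approximate per-pixel figures above (a sketch only; the dictionary keys and function name are invented, and real compiled sizes vary from sprite to sprite):

```python
BYTES_PER_PIXEL = {
    "general": 2,         # mask + foreground, two shifted copies
    "compiled_byte": 6,   # compiled code, byte-level positioning
    "compiled_pixel": 9,  # compiled code, pixel-level positioning
}

def sprite_bytes(pixels, engine):
    """Approximate storage for a sprite with the given number of pixels."""
    return pixels * BYTES_PER_PIXEL[engine]

for engine in BYTES_PER_PIXEL:
    print(engine, sprite_bytes(16 * 16, engine))
```

For a fully opaque 16x16 sprite this gives roughly 512 bytes for the general-purpose data versus about 1.5k or 2.3k of generated code.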

With the high memory consumption (and a scrolling graphics aperture which can 
move anywhere in the physical RAM space), it is desirable to abstract the 8k 
memory page (de)allocation and mapping.  So, one of the very first modules of 
code that I wrote is a simple virtual memory manager which tracks the 8k pages 
which are allocated by different parts of the engine.  It automatically moves 
them when the screen aperture moves to overlap with an in-use block.
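
The general shape of such a manager might look like the sketch below.  This is purely illustrative: the class, its methods, and the owner tags are hypothetical, and it only tracks page ownership, whereas a real implementation would also copy the 8k of data and update the MMU mapping.

```python
PAGE_SIZE = 8 * 1024
NUM_PAGES = 512 * 1024 // PAGE_SIZE  # 64 physical pages in 512k

class PageManager:
    def __init__(self):
        self.owner = [None] * NUM_PAGES  # physical page -> owner tag, or None if free
        self.aperture = set()            # pages currently covered by the screen

    def _free_page(self):
        for page in range(NUM_PAGES):
            if self.owner[page] is None and page not in self.aperture:
                return page
        raise MemoryError("no free 8k pages")

    def allocate(self, tag):
        """Claim a free page outside the screen aperture for the given owner."""
        page = self._free_page()
        self.owner[page] = tag
        return page

    def move_aperture(self, first_page, page_count):
        """Move the screen aperture, relocating any in-use pages it now overlaps.

        Returns a dict mapping old page number -> new page number.
        """
        assert 0 <= first_page and first_page + page_count <= NUM_PAGES
        self.aperture = set(range(first_page, first_page + page_count))
        moved = {}
        for page in sorted(self.aperture):
            tag = self.owner[page]
            if tag is not None:
                self.owner[page] = None
                new_page = self._free_page()
                self.owner[new_page] = tag
                moved[page] = new_page
        return moved

pm = PageManager()
page = pm.allocate("sprites")
print(page, pm.move_aperture(0, 4))
```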

I'm really excited about this graphics/game engine, because it is sufficiently 
generalized that it could be used for a lot of great games in addition to 
platformers.  It's not suitable for every genre, but it would work well for 
several different game types.  I would love to do a top-down racer like Micro 
Machines.  If it were simple enough, it could look beautiful running at 
60fps.  This engine is also suitable for horizontal shoot-em-ups and top-down 
or isometric RPG, graphic adventure, or arcade action games.  The sprite 
functionality could be extracted separately from the background scrolling 
engine and could be used in any type of game.

I also came up with a name for this engine: I call it DynoSprite.  I have a 
few more weeks of work to do on a demo that I will release to show the sprite 
functionality.  With any luck I should have something cool to show you soon.

Richard


