[Coco] Optimizing 6809 Assembly Code: Part 2 – Speedup Storing Data – Unrolling Loops

Sat Sep 16 12:02:15 EDT 2017

On Saturday 16 September 2017 10:43:45 Glen Hewlett wrote:

> Thanks Darren,
>
> Simon Jonassen commented the same thing on Facebook.  I updated Part 2
> of the series to include this method, I think pointing out the use of
> negative numbers for the indexing is also a valuable tip.  I’ll add
> that info too and update Part 2 again.  :)
>
> Please keep the ideas flowing…. This is great stuff!
>
> Cheers,
> Glen
>
> > On Sep 16, 2017, at 1:17 AM, Darren A <mechacoco at gmail.com> wrote:
> >
> > On Fri, Sep 15, 2017 at 8:44 PM, Glen Hewlett wrote:
> >> Hi Again,
> >>
> >> I just posted Part 2 of my series of blogs about optimizing 6809
> >> assembly language programs.
> >>
> >> https://nowhereman999.wordpress.com/2017/09/15/optimizing-6809-assembly-code-part-2-speedup-storing-data-unrolling-loops/
> >
> > In your first unrolled loop example, you could have saved a fair
> > amount of time by not using the auto-increment index mode.  You only
> > need to increment X once per loop using ABX.  The ,X++ adds 3 cycles
> > to the base count.  Using 5-bit displacements (-16 to +14) keeps the
> > code size small and only adds 1 cycle to the base count. You will
> > also have one occurrence of ,X which adds nothing.
> >
> >    LDX    #$4000+16
> >    LDU    #$0000
> >    LDD    #$0020
> >
> > !   STU    -16,X
> >    STU    -14,X
> >    STU    -12,X
> >    STU    -10,X
> >    STU    -8,X
> >    STU    -6,X
> >    STU    -4,X
> >    STU    -2,X
> >    STU    ,X
> >    STU    2,X
> >    STU    4,X
> >    STU    6,X
> >    STU    8,X
> >    STU    10,X
> >    STU    12,X
> >    STU    14,X
> >    ABX
> >    DECA
> >    BNE    <
> >
> >
> > - Darren
> >
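
As a rough check of Darren's numbers, here is a small C sketch that
tallies the cycles for 16 stores done each way.  The per-mode extras (3,
1 and 0 cycles) come from his note; the base cost of 5 cycles for an
indexed STU and 3 cycles for ABX are assumptions taken from the usual
6809 timing tables, and the names are made up for illustration.  The
loop-control overhead (DECA/BNE) is the same either way and is left out.

```c
/* Cycle tally for 16 STU stores (32 bytes), two ways.
   BASE_STU and ABX_CYCLES are assumed values, not taken from the post. */
enum {
    BASE_STU   = 5,  /* assumed base cost of an indexed STU */
    AUTOINC2   = 3,  /* extra cycles for ,X++ (from the post) */
    OFFSET5    = 1,  /* extra cycles for a 5-bit displacement (from the post) */
    NO_OFFSET  = 0,  /* extra cycles for plain ,X (from the post) */
    ABX_CYCLES = 3   /* assumed cost of ABX */
};

/* 16 stores, each one advancing X with ,X++ */
int cycles_autoinc(void)
{
    return 16 * (BASE_STU + AUTOINC2);
}

/* 15 stores with 5-bit offsets, one plain ,X store, then a single ABX */
int cycles_offsets(void)
{
    return 15 * (BASE_STU + OFFSET5) + (BASE_STU + NO_OFFSET) + ABX_CYCLES;
}
```

Under those assumptions the tally is 128 cycles against 98 for every 32
bytes stored.
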
Much of this has been known to me over the years.  Where I found
noticeable improvements could be made was in the C compiler's
pre-assembly output.  About 90% of the time a conditional is preceded by
a SEX (sign extend) instruction, but where the value being tested is
byte sized, it's just time wasted.

I also watched how the compiler handled the >># and <<# shifts (where #
is the shift count), because it always assumes integer sized (16-bit)
values.  For any shift in excess of 8 bits, I reduced the shift count by
8 and did the first 8 bits of the shift with

    tfr a,b
    clra

or

    tfr b,a
    clrb

depending on the direction of the shift: the first pair for a right
shift, the second for a left shift.
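
To show why the byte move is a safe stand-in for the first 8 bits of the
shift, here is a minimal C sketch for unsigned 16-bit values.  The
function names are hypothetical, and it models the idea only, not the
compiler's actual output.

```c
#include <stdint.h>

/* Generic 16-bit right shift, one bit per pass, for comparison. */
uint16_t shr_generic(uint16_t v, unsigned n)
{
    while (n--)
        v >>= 1;
    return v;
}

/* Right shift by n (8..15): move the high byte to the low byte and
   clear the high byte (the tfr a,b / clra pair), then shift n - 8. */
uint16_t shr_bytewise(uint16_t v, unsigned n)
{
    uint16_t d = (uint16_t)(v >> 8);         /* B = A, A = 0 */
    return (uint16_t)(d >> (n - 8));
}

/* Left shift by n (8..15): move the low byte to the high byte and
   clear the low byte (the tfr b,a / clrb pair), then shift n - 8. */
uint16_t shl_bytewise(uint16_t v, unsigned n)
{
    uint16_t d = (uint16_t)((v & 0xFFu) << 8);   /* A = B, B = 0 */
    return (uint16_t)((uint32_t)d << (n - 8));
}
```

For example, shr_bytewise(0xABCD, 12) gives 0x000A, the same as twelve
single-bit shifts, but only four of them remain after the byte move.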

The rzsz-3.3.6 I built many years ago is around 175 cps faster because
of those optimizations and the incorporation of a table-lookup CRC
calculation routine, which saves many cycles.

It is still not fast enough to keep up with a 9600 baud circuit.  The
table-lookup CRC calcs were part of that speedup; if I had done the next
logical thing, I would have thrown out the byte-by-byte CRC call and
replaced it with a call that handles a 256-byte block at a time.  That
would have removed 255 calls to the CRC routine per 256 bytes handled,
and might have brought its speed up enough to keep up with a 9600 baud
circuit.
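
For anyone who wants to try it, here is a minimal C sketch of the two
call styles.  It is not the rzsz source: the function names are
hypothetical, and the polynomial (0x1021, the CRC-16 commonly used with
XMODEM/ZMODEM) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t crc_table[256];

/* Build the 256-entry lookup table for polynomial 0x1021 (assumed). */
void crc_init(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
        crc_table[i] = crc;
    }
}

/* Byte-at-a-time update: one call per byte handled. */
uint16_t crc_update_byte(uint16_t crc, uint8_t byte)
{
    return (uint16_t)((crc << 8) ^ crc_table[(crc >> 8) ^ byte]);
}

/* Block-at-a-time update: one call per block, with the loop inside. */
uint16_t crc_update_block(uint16_t crc, const uint8_t *buf, size_t len)
{
    while (len--)
        crc = (uint16_t)((crc << 8) ^ crc_table[(crc >> 8) ^ *buf++]);
    return crc;
}
```

Both routines produce the same CRC; the block version just pays the
call and return overhead once per block instead of once per byte, which
is where the saving would come from.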

If anyone cares, that is the next "improvement" that needs to be done to
rzsz.  Source code is on my web page.

I never tried to script the shift changes, so all of that was done by
hand on the output of the cc2 stage of the build, and repeated every
time I changed a byte in the source code, which made it rather time
intensive.  Typically one good build an evening, and they were loooong
evenings.

Cheers, Gene Heskett
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>