[Coco] Optimizing 6809 Assembly Code: Part 2 – Speedup Storing Data – Unrolling Loops
Gene Heskett
gheskett at shentel.net
Sat Sep 16 12:02:15 EDT 2017
On Saturday 16 September 2017 10:43:45 Glen Hewlett wrote:
> Thanks Darren,
>
> Simon Jonassen commented the same thing on Facebook. I updated Part 2
> of the series to include this method. I think pointing out the use of
> negative numbers for the indexing is also a valuable tip, so I’ll add
> that info too and update Part 2 again. :)
>
> Please keep the ideas flowing…. This is great stuff!
>
> Cheers,
> Glen
>
> > On Sep 16, 2017, at 1:17 AM, Darren A <mechacoco at gmail.com> wrote:
> >
> > On Fri, Sep 15, 2017 at 8:44 PM, Glen Hewlett wrote:
> >> Hi Again,
> >>
> >> I just posted Part 2 of my series of blogs about optimizing 6809
> >> assembly language programs.
> >>
> >> https://nowhereman999.wordpress.com/2017/09/15/optimizing-6809-assembly-code-part-2-speedup-storing-data-unrolling-loops/
> >
> > In your first unrolled loop example, you could have saved a fair
> > amount of time by not using the auto-increment index mode. You only
> > need to increment X once per loop using ABX. The ,X++ adds 3 cycles
> > to the base count. Using 5-bit displacements (the offsets -16 to +14
> > here; the mode itself covers -16 to +15) keeps the code size small
> > and only adds 1 cycle to the base count. You will also have one
> > occurrence of ,X which adds nothing.
> >
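> >         LDX     #$4000+16   ; X points 16 bytes into the first 32-byte chunk
> >         LDU     #$0000      ; U holds the 16-bit fill value
> >         LDD     #$0020      ; A = 0 (DECA wraps, so 256 passes), B = 32 (bytes per pass)
> >
> > !       STU     -16,X
> >         STU     -14,X
> >         STU     -12,X
> >         STU     -10,X
> >         STU     -8,X
> >         STU     -6,X
> >         STU     -4,X
> >         STU     -2,X
> >         STU     ,X
> >         STU     2,X
> >         STU     4,X
> >         STU     6,X
> >         STU     8,X
> >         STU     10,X
> >         STU     12,X
> >         STU     14,X
> >         ABX                 ; X = X + B, step to the next 32-byte chunk
> >         DECA
> >         BNE     <           ; back to the ! label until A reaches 0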
> >
> >
> > - Darren
> >
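
For anyone who wants to see the arithmetic behind Darren's suggestion, here is a rough sketch in C that just adds up the cycles for one 32-byte pass. The adders (+3 for ,X++, +1 for a 5-bit offset, +0 for ,X) are the ones quoted above. The base figures (5 cycles for an indexed STU, 3 for ABX) are from my memory of the 6809 instruction tables, and the "before" loop is assumed to be 16 copies of STU ,X++, so treat both as assumptions. DECA/BNE cost the same either way and are left out.

```c
/* Cycle figures: the adders come from the post above; the base values
   are assumptions taken from memory of the MC6809 instruction tables. */
enum {
    STU_INDEXED_BASE = 5,  /* STU indexed, before the addressing-mode adder */
    ADD_POSTINC2     = 3,  /* ,X++            */
    ADD_OFFSET5      = 1,  /* n,X with a 5-bit offset */
    ADD_NO_OFFSET    = 0,  /* ,X              */
    ABX_CYCLES       = 3
};

/* Assumed "before" version: every store is STU ,X++ */
int cycles_autoinc(int stores)
{
    return stores * (STU_INDEXED_BASE + ADD_POSTINC2);
}

/* Darren's version: all but one store use a 5-bit offset,
   one store is plain ,X, and a single ABX advances X. */
int cycles_offsets(int stores)
{
    return (stores - 1) * (STU_INDEXED_BASE + ADD_OFFSET5)
         + (STU_INDEXED_BASE + ADD_NO_OFFSET)
         + ABX_CYCLES;
}
```

Under those assumptions, cycles_autoinc(16) is 128 and cycles_offsets(16) is 98, so about 30 cycles saved on each of the 256 passes.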

Much of this has been known to me over the years. Where I found noticeable
improvements could be made was in the C compiler's pre-assembly output.
90% of the time a conditional is preceded by a SEX (sign extend)
instruction, but where the conditional is byte-sized, it's just time
wasted.

I also watched how the compiler handled the >> and << shifts by a
constant, because it always assumes integer-sized values. Any shift in
excess of 8 bits gets reduced by 8 bits, and the first 8-bit shift is
done with
        tfr a,b
        clra
or
        tfr b,a
        clrb
depending on the direction of the shift.
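
To show why the byte move is a safe stand-in for the first 8 single-bit shifts, here is a small C model of the D register (A high, B low). It is only a sketch: the struct and function names are made up for illustration, it covers unsigned values only (a signed right shift would need sign fill instead of clra), and the single-bit sequences in the comments are the usual 6809 idiom rather than anything taken from the compiler's actual output.

```c
#include <stdint.h>

/* Hypothetical model of the 6809 D register: A = high byte, B = low byte. */
typedef struct { uint8_t a, b; } reg_d;

static reg_d load_d(uint16_t v)
{
    reg_d d = { (uint8_t)(v >> 8), (uint8_t)(v & 0xFF) };
    return d;
}

static uint16_t read_d(reg_d d)
{
    return (uint16_t)((d.a << 8) | d.b);
}

/* Unsigned right shift by n (0..15). */
uint16_t shr16(uint16_t v, unsigned n)
{
    reg_d d = load_d(v);
    if (n >= 8) {                 /* tfr a,b ; clra  replaces 8 shifts */
        d.b = d.a;
        d.a = 0;
        n -= 8;
    }
    while (n--) {                 /* remaining bits: lsra ; rorb */
        uint8_t carry = d.a & 1;
        d.a >>= 1;
        d.b = (uint8_t)((d.b >> 1) | (carry << 7));
    }
    return read_d(d);
}

/* Left shift by n (0..15). */
uint16_t shl16(uint16_t v, unsigned n)
{
    reg_d d = load_d(v);
    if (n >= 8) {                 /* tfr b,a ; clrb  replaces 8 shifts */
        d.a = d.b;
        d.b = 0;
        n -= 8;
    }
    while (n--) {                 /* remaining bits: aslb ; rola */
        uint8_t carry = (uint8_t)(d.b >> 7);
        d.b = (uint8_t)(d.b << 1);
        d.a = (uint8_t)((d.a << 1) | carry);
    }
    return read_d(d);
}
```

A shift by 11, say, then costs one byte move plus three single-bit shift pairs instead of eleven pairs.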

The rzsz-3.3.6 I built many years ago is around 175 cps faster because of
those optimizations and the incorporation of a table-lookup CRC
calculation routine. Many cycles faster.

It is still not fast enough to keep up with a 9600 baud circuit. The
table-lookup CRC calcs were part of that speedup, and the next logical
step would have been to throw out the byte-by-byte CRC call and replace
it with a call that handles a 256-byte block at a time. That would have
removed 255 calls to the CRC routine per 256 bytes handled, and might
have brought the speed up enough to keep up with a 9600 baud circuit.
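
As a rough illustration of the difference between the two call styles, here is a sketch in C. It is not the rzsz code: the function names are invented, and the polynomial (0x1021, the common CRC-16/XMODEM form with a zero start value) is an assumption on my part. The point is only that the block version runs the same table lookup in a loop inside one call, so the call/return overhead is paid once per buffer instead of once per byte.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t crc_table[256];

/* Build the lookup table for the assumed polynomial 0x1021, MSB first. */
void crc_init(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        crc_table[i] = crc;
    }
}

/* Byte-by-byte style: the caller makes one call per byte. */
uint16_t crc_update_byte(uint16_t crc, uint8_t c)
{
    return (uint16_t)((crc << 8) ^ crc_table[((crc >> 8) ^ c) & 0xFF]);
}

/* Block style: one call per buffer, with the loop inside the routine. */
uint16_t crc_update_block(uint16_t crc, const uint8_t *buf, size_t len)
{
    while (len--)
        crc = (uint16_t)((crc << 8) ^ crc_table[((crc >> 8) ^ *buf++) & 0xFF]);
    return crc;
}
```

Call crc_init() once, then crc_update_block(0, buf, 256) gives the same result as 256 calls to crc_update_byte() chained together.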

If anyone cares, that is the next "improvement" that needs to be done to
rzsz. Source code is on my web page.

I never tried to script the shift changes, so all of that was done by
hand on the output of the cc2 stage of the build, and repeated every time
I changed a byte in the source code, so it was rather time-intensive.
Typically one good build an evening, and they were loooong evenings.

Cheers, Gene Heskett
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>