[Coco] Mod10 Suggestions

Sat Feb 18 23:53:50 EST 2017

It would be 8 BNEs actually. It's executed even for the last loop.

BNE is 3 cycles and DECB is 2 cycles so 40 cycles total.

You can also save a cycle for each "temporary" reference by just using 
RESULT as the temporary instead of using the stack. It's one byte longer 
but one cycle faster as long as RESULT is in range of an 8 bit offset 
from PC. That would be 2 cycles gained per iteration for a total of 16 
cycles. It's faster to use the stack if a PCR access to result would 
need a 16 bit offset.

On 2017-02-18 09:15 PM, Dave Philipsen wrote:
> How much speed would you gain by completely eliminating 8 DECBs and 7
> BNEs?:
>
> ORG $1200
> CCD     RMB 16
> RESULT  RMB 1
>
> START   LEAX CCD+16,PCR
> CLRA
>
> LOOP    ADDA ,-X
>         DAA
>         PSHS A
>         LDA ,-X
>         LSLA
>         CMPA #10
>         BLO LOOP2
>         SUBA #9
> LOOP2   ADDA ,S+
>         DAA
>
>         ADDA ,-X
>         DAA
>         PSHS A
>         LDA ,-X
>         LSLA
>         CMPA #10
>         BLO LOOP3
>         SUBA #9
> LOOP3   ADDA ,S+
>         DAA
>
>         ADDA ,-X
>         DAA
>         PSHS A
>         LDA ,-X
>         LSLA
>         CMPA #10
>         BLO LOOP4
>         SUBA #9
> LOOP4   ADDA ,S+
>         DAA
>
>         ADDA ,-X
>         DAA
>         PSHS A
>         LDA ,-X
>         LSLA
>         CMPA #10
>         BLO LOOP5
>         SUBA #9
> LOOP5   ADDA ,S+
>         DAA
>
>         ADDA ,-X
>         DAA
>         PSHS A
>         LDA ,-X
>         LSLA
>         CMPA #10
>         BLO LOOP6
>         SUBA #9
> LOOP6   ADDA ,S+
>         DAA
>
>         ADDA ,-X
>         DAA
>         PSHS A
>         LDA ,-X
>         LSLA
>         CMPA #10
>         BLO LOOP7
>         SUBA #9
> LOOP7   ADDA ,S+
>         DAA
>
>         ADDA ,-X
>         DAA
>         PSHS A
>         LDA ,-X
>         LSLA
>         CMPA #10
>         BLO LOOP8
>         SUBA #9
> LOOP8   ADDA ,S+
>         DAA
>
>         ADDA ,-X
>         DAA
>         PSHS A
>         LDA ,-X
>         LSLA
>         CMPA #10
>         BLO LOOP9
>         SUBA #9
> LOOP9   ADDA ,S+
>         DAA
>
>         ANDA #$0F
>         STA RESULT,PCR
> ENDPGM  RTS
> END START
>
> On 2/18/2017 8:22 PM, William Mikrut wrote:
>> Which is the beauty of this project.
>>
>> Clearly there are at least 3 ways to do this...each with a slightly
>> different outcome.
>>
>> Some optimization for size,speed... or both.
>>
>> There is a wealth of information and experience here from everone and I
>> truly appreciate all the input!
>>
>> I can't wait to start the next project and see where it leads!!
>>
>>
>>
>> On Feb 18, 2017 8:10 PM, "L. Curtis Boyle" <curtisboyle at sasktel.net>
>> wrote:
>>
>>> I was just going to mention that if speed is more important, doing an
>>> leas
>>> -1,s before the loop, and then just a sta ,a /adda ,s (instead of pshs
>>> a/add ,s+), and then a final leas 1,s after the loop is done would be
>>> a bit
>>> longer, but a bit faster.
>>>
>>> L. Curtis Boyle
>>> curtisboyle at sasktel.net
>>>
>>> TRS-80 Color Computer Games website
>>> http://www.lcurtisboyle.com/nitros9/coco_game_list.html
>>>
>>>
>>>
>>>> On Feb 18, 2017, at 7:41 PM, Dave Philipsen <dave at davebiz.com> wrote:
>>>>
>>>> That's pretty well optimized!  Have you ever considered the difference
>>> between optimizing for size and optimizing for speed?  So, for
>>> instance, if
>>> you weren't necessarily constrained for size but you knew you were
>>> going to
>>> process a list of jillions of cc numbers would you write it differently?
>>>> Dave Philipsen
>>>>
>>>>> On Feb 18, 2017, at 5:06 PM, William Mikrut <wmikrut72 at gmail.com>
>>> wrote:
>>>>> Some slight re ordering of the code and it works perfectly!
>>>>> 48 Bytes total, Less 17 for storage -- 31 program bytes to get the job
>>> done.
>>>>> My original code was 61 program bytes... down to half the size and
>>>>> does
>>> the
>>>>> exact same thing.
>>>>> Absolutely amazing!
>>>>>
>>>>>
>>>>> ORG $1200
>>>>> CCD     RMB 16
>>>>> RESULT  RMB 1
>>>>>
>>>>> START   LEAX CCD+16,PCR
>>>>> CLRA
>>>>>        LDB #8
>>>>>
>>>>>
>>>>> LOOP    ADDA ,-X
>>>>>        DAA
>>>>>        PSHS A
>>>>>        LDA ,-X
>>>>>        LSLA
>>>>>        CMPA #10
>>>>>        BLO LOOP2
>>>>>        SUBA #9
>>>>> LOOP2   ADDA ,S+
>>>>>        DAA
>>>>>
>>>>>        DECB
>>>>>        BNE LOOP
>>>>>
>>>>>
>>>>>
>>>>>        ANDA #$0F
>>>>>        STA RESULT,PCR
>>>>> ENDPGM  RTS
>>>>> END START
>>>>>
>>>>>> On Sat, Feb 18, 2017 at 1:03 PM, William Mikrut <wmikrut72 at gmail.com>
>>> wrote:
>>>>>> You are right -- I looked at is closer.
>>>>>> One thing I need to do is reverse the order of operations.
>>>>>>
>>>>>> The LSLA is performed first.
>>>>>> First I need to store the byte and LSLA the next byte.
>>>>>>
>>>>>> Otherwise if I flip it from left to right:
>>>>>> (LEAX CCD,PCR
>>>>>> ...
>>>>>> LDA ,X+
>>>>>> ...
>>>>>> ADDA ,X+)
>>>>>>
>>>>>> it works perfectly.
>>>>>>
>>>>>>
>>>>>>> On Sat, Feb 18, 2017 at 11:35 AM, William Astle <lost at l-w.ca> wrote:
>>>>>>>
>>>>>>> Take a closer look. It only does the LSLA on every other digit. It
>>> does
>>>>>>> *two* digits  per loop, just like Brett's version.
>>>>>>>
>>>>>>> You can easily pretend all numbers are 16 digits by right justifying
>>> the
>>>>>>> numbers in your buffer and padding with zeros.
>>>>>>>
>>>>>>>
>>>>>>>> On 2017-02-18 10:06 AM, William Mikrut wrote:
>>>>>>>>
>>>>>>>> I like how this works from right to left.
>>>>>>>> The only issue is the LSLA on every number.
>>>>>>>>
>>>>>>>> The algo is to double every other number, starting with the right
>>> most
>>>>>>>> digit, and sub 9 if the result is 10 or more.
>>>>>>>>
>>>>>>>> Now if the number is always 16 digits, Brett's 16 bit word seems
>>>>>>>> the
>>>>>>>> easiest way to go.
>>>>>>>> If the number is 13 digits long the 16 bit word method won't work,
>>> but I
>>>>>>>> am
>>>>>>>> happy to pretend all numbers are 16 digits!
>>>>>>>>
>>>>>>>> I am going to try to include a couple things you showed me into
>>> Brett's
>>>>>>>> 16
>>>>>>>> bit chunk method and try a slightly different routine!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Feb 18, 2017 at 10:22 AM, William Astle <lost at l-w.ca>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On 2017-02-18 12:43 AM, msmcdoug wrote:
>>>>>>>>> Actually I'm surprised noone has suggested bcd arithmetic on the
>>> result
>>>>>>>>>> to eliminate divide by 10 loop
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> BCD would certainly give a predictable overall cycle count. It
>>>>>>>>> would
>>>>>>>>> require a significantly different approach, though. The only
>>> register
>>>>>>>>> you
>>>>>>>>> can use for BCD arithmetic is A and DAA is only useful after
>>>>>>>>> ADDA or
>>>>>>>>> ADCA.
>>>>>>>>>
>>>>>>>>> I had thought about using BCD but had initially dismissed it
>>>>>>>>> due to
>>>>>>>>> possible complexity. However, upon reflection, the extra cycles to
>>> use
>>>>>>>>> BCD
>>>>>>>>> would probably be less than the average cycle time of the modulus
>>> loop
>>>>>>>>> combined or checking for digit overflow during the loop.
>>>>>>>>>
>>>>>>>>> I think you could use code that looks something like the following
>>> which
>>>>>>>>> is based off Mr. Mikrut's most recent posted code. (warning:
>>>>>>>>> mailer
>>>>>>>>> code™
>>>>>>>>> follows so it may have errors)
>>>>>>>>>
>>>>>>>>>        ORG $1200
>>>>>>>>> CCD     RMB 16
>>>>>>>>> RESULT  RMB 1
>>>>>>>>> START   LEAX CCD+16,PCR
>>>>>>>>>        CLRA
>>>>>>>>>        LDB #8
>>>>>>>>> LOOP    PSHS A
>>>>>>>>>        LDA ,-X
>>>>>>>>>        LSLA
>>>>>>>>>        CMPA #10
>>>>>>>>>        BLO LOOP2
>>>>>>>>>        SUBA #9
>>>>>>>>> LOOP2   ADDA ,S+
>>>>>>>>>        DAA
>>>>>>>>>        ADDA ,-X
>>>>>>>>>        DAA
>>>>>>>>>        DECB
>>>>>>>>>        BNE LOOP
>>>>>>>>>        ANDA #$0F
>>>>>>>>>        STA RESULT,PCR
>>>>>>>>> ENDPGM  RTS
>>>>>>>>>
>>>>>>>>> I'm using the stack for a temporary storage location instead of
>>>>>>>>> something
>>>>>>>>> PCR relative for code size reasons. You could use the "RESULT
>>> variable
>>>>>>>>> for
>>>>>>>>> the temporary to eliminate stack usage. That would probably be
>>> slightly
>>>>>>>>> faster at the expense of two more code bytes. This is one of those
>>>>>>>>> size/speed trade-offs.
>>>>>>>>>
>>>>>>>>> DAA has to be used after every addition and only applies to A.
>>> Using BCD
>>>>>>>>> means we can eliminate the mod 10 loop and just mask off the upper
>>> digit
>>>>>>>>> (BCD stores two decimal digits in a byte). That gives a constant
>>> time
>>>>>>>>> for
>>>>>>>>> the "mod 10" result and also only takes 2 bytes (and 2 cycles).
>>>>>>>>>
>>>>>>>>> I have also eliminated the STATUS variable and just store the
>>> result.
>>>>>>>>> You
>>>>>>>>> can test RESULT for non-zero trivially so there's no need for a
>>> separate
>>>>>>>>> STATUS value.
>>>>>>>>>
>>>>>>>>> By my calculation, this version is 32 bytes, requires 1 byte of
>>> stack
>>>>>>>>> space, 17 bytes of data space, and runs in a maximum of 351 cycles
>>> (and
>>>>>>>>> a
>>>>>>>>> minimum of 336 cycles if none of the doubled digits goes above 9).
>>> For
>>>>>>>>> this
>>>>>>>>> analysis, I've assumed 8 bit offsets for the PCR references. 16
>>>>>>>>> bit
>>>>>>>>> offsets
>>>>>>>>> in PCR mode are quite a bit more expensive (4 extra cycles and 1
>>> extra
>>>>>>>>> byte).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Coco mailing list
>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>
>>>>>>>>>
>>>>>>> --
>>>>>>> Coco mailing list
>>>>>>> Coco at maltedmedia.com
>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>
>>>>>>
>>>>> --
>>>>> Coco mailing list
>>>>> Coco at maltedmedia.com
>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>
>>>> --
>>>> Coco mailing list
>>>> Coco at maltedmedia.com
>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>
>>>
>>> --
>>> Coco mailing list
>>> Coco at maltedmedia.com
>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>
>
>