[Coco] Mod10 Suggestions

Dave Philipsen dave at davebiz.com
Sat Feb 18 23:57:38 EST 2017


Yeah, I think the BNE is one less cycle if the branch isn't taken, right?

Dave


On 2/18/2017 10:53 PM, William Astle wrote:
> It would be 8 BNEs actually. It's executed even for the last loop.
>
> BNE is 3 cycles and DECB is 2 cycles so 40 cycles total.
>
> You can also save a cycle for each "temporary" reference by just using 
> RESULT as the temporary instead of using the stack. It's one byte 
> longer but one cycle faster as long as RESULT is in range of an 8 bit 
> offset from PC. That would be 2 cycles gained per iteration for a 
> total of 16 cycles. It's faster to use the stack if a PCR access to 
> result would need a 16 bit offset.
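
The cycle arithmetic above can be tallied with a short Python sketch. The DECB and BNE counts are the ones quoted in this message. The four counts for the temporary accesses are not given in the thread; they are assumptions taken from the usual 6809 timing tables (8 bit PCR offsets assumed) and should be checked against a datasheet.

```python
ITERATIONS = 8  # the loop handles two digits per pass, 16 digits in all

# Cycle counts as quoted in the thread.
DECB_CYCLES = 2
BNE_CYCLES = 3

# Assumed counts (not stated in the thread) for the temporary accesses.
PSHS_A = 6          # PSHS A
ADDA_S_POSTINC = 6  # ADDA ,S+
STA_PCR8 = 5        # STA RESULT,PCR with an 8 bit offset
ADDA_PCR8 = 5       # ADDA RESULT,PCR with an 8 bit offset

# Cycles spent on loop control, which full unrolling removes.
loop_overhead = ITERATIONS * (DECB_CYCLES + BNE_CYCLES)

# Cycles gained by using RESULT as the temporary instead of the stack.
saved_per_iteration = (PSHS_A - STA_PCR8) + (ADDA_S_POSTINC - ADDA_PCR8)
saved_total = ITERATIONS * saved_per_iteration
```

Under those assumptions the sketch gives the 40 cycles of loop overhead and the 2 cycles per iteration (16 in total) mentioned above.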
>
>
> On 2017-02-18 09:15 PM, Dave Philipsen wrote:
>> How much speed would you gain by completely eliminating 8 DECBs and 7
>> BNEs?:
>>
>>         ORG $1200
>> CCD     RMB 16
>> RESULT  RMB 1
>>
>> START   LEAX CCD+16,PCR
>>         CLRA
>>
>> LOOP    ADDA ,-X
>>         DAA
>>         PSHS A
>>         LDA ,-X
>>         LSLA
>>         CMPA #10
>>         BLO LOOP2
>>         SUBA #9
>> LOOP2   ADDA ,S+
>>         DAA
>>
>>         ADDA ,-X
>>         DAA
>>         PSHS A
>>         LDA ,-X
>>         LSLA
>>         CMPA #10
>>         BLO LOOP3
>>         SUBA #9
>> LOOP3   ADDA ,S+
>>         DAA
>>
>>         ADDA ,-X
>>         DAA
>>         PSHS A
>>         LDA ,-X
>>         LSLA
>>         CMPA #10
>>         BLO LOOP4
>>         SUBA #9
>> LOOP4   ADDA ,S+
>>         DAA
>>
>>         ADDA ,-X
>>         DAA
>>         PSHS A
>>         LDA ,-X
>>         LSLA
>>         CMPA #10
>>         BLO LOOP5
>>         SUBA #9
>> LOOP5   ADDA ,S+
>>         DAA
>>
>>         ADDA ,-X
>>         DAA
>>         PSHS A
>>         LDA ,-X
>>         LSLA
>>         CMPA #10
>>         BLO LOOP6
>>         SUBA #9
>> LOOP6   ADDA ,S+
>>         DAA
>>
>>         ADDA ,-X
>>         DAA
>>         PSHS A
>>         LDA ,-X
>>         LSLA
>>         CMPA #10
>>         BLO LOOP7
>>         SUBA #9
>> LOOP7   ADDA ,S+
>>         DAA
>>
>>         ADDA ,-X
>>         DAA
>>         PSHS A
>>         LDA ,-X
>>         LSLA
>>         CMPA #10
>>         BLO LOOP8
>>         SUBA #9
>> LOOP8   ADDA ,S+
>>         DAA
>>
>>         ADDA ,-X
>>         DAA
>>         PSHS A
>>         LDA ,-X
>>         LSLA
>>         CMPA #10
>>         BLO LOOP9
>>         SUBA #9
>> LOOP9   ADDA ,S+
>>         DAA
>>
>>         ANDA #$0F
>>         STA RESULT,PCR
>> ENDPGM  RTS
>>         END START
>>
>> On 2/18/2017 8:22 PM, William Mikrut wrote:
>>> Which is the beauty of this project.
>>>
>>> Clearly there are at least 3 ways to do this...each with a slightly
>>> different outcome.
>>>
>>> Some optimization for size, speed... or both.
>>>
>>> There is a wealth of information and experience here from everyone and I
>>> truly appreciate all the input!
>>>
>>> I can't wait to start the next project and see where it leads!!
>>>
>>>
>>>
>>> On Feb 18, 2017 8:10 PM, "L. Curtis Boyle" <curtisboyle at sasktel.net>
>>> wrote:
>>>
>>>> I was just going to mention that if speed is more important, doing an
>>>> leas -1,s before the loop, and then just a sta ,s / adda ,s (instead
>>>> of pshs a / adda ,s+), and then a final leas 1,s after the loop is done
>>>> would be a bit longer, but a bit faster.
>>>>
>>>> L. Curtis Boyle
>>>> curtisboyle at sasktel.net
>>>>
>>>> TRS-80 Color Computer Games website
>>>> http://www.lcurtisboyle.com/nitros9/coco_game_list.html
>>>>
>>>>
>>>>
>>>>> On Feb 18, 2017, at 7:41 PM, Dave Philipsen <dave at davebiz.com> wrote:
>>>>>
>>>>> That's pretty well optimized!  Have you ever considered the difference
>>>>> between optimizing for size and optimizing for speed?  So, for
>>>>> instance, if you weren't necessarily constrained for size but you knew
>>>>> you were going to process a list of jillions of cc numbers, would you
>>>>> write it differently?
>>>>>
>>>>> Dave Philipsen
>>>>>
>>>>>> On Feb 18, 2017, at 5:06 PM, William Mikrut <wmikrut72 at gmail.com> wrote:
>>>>>>
>>>>>> Some slight reordering of the code and it works perfectly!
>>>>>> 48 bytes total, less 17 for storage -- 31 program bytes to get the
>>>>>> job done.
>>>>>> My original code was 61 program bytes... down to half the size and it
>>>>>> does the exact same thing.
>>>>>> Absolutely amazing!
>>>>>>
>>>>>>
>>>>>>        ORG $1200
>>>>>> CCD     RMB 16
>>>>>> RESULT  RMB 1
>>>>>>
>>>>>> START   LEAX CCD+16,PCR
>>>>>>        CLRA
>>>>>>        LDB #8
>>>>>>
>>>>>>
>>>>>> LOOP    ADDA ,-X
>>>>>>        DAA
>>>>>>        PSHS A
>>>>>>        LDA ,-X
>>>>>>        LSLA
>>>>>>        CMPA #10
>>>>>>        BLO LOOP2
>>>>>>        SUBA #9
>>>>>> LOOP2   ADDA ,S+
>>>>>>        DAA
>>>>>>
>>>>>>        DECB
>>>>>>        BNE LOOP
>>>>>>
>>>>>>
>>>>>>
>>>>>>        ANDA #$0F
>>>>>>        STA RESULT,PCR
>>>>>> ENDPGM  RTS
>>>>>>        END START
>>>>>>
>>>>>>> On Sat, Feb 18, 2017 at 1:03 PM, William Mikrut <wmikrut72 at gmail.com> wrote:
>>>>>>>
>>>>>>> You are right -- I looked at it closer.
>>>>>>> One thing I need to do is reverse the order of operations.
>>>>>>>
>>>>>>> The LSLA is performed first.
>>>>>>> First I need to store the byte and LSLA the next byte.
>>>>>>>
>>>>>>> Otherwise if I flip it from left to right:
>>>>>>> (LEAX CCD,PCR
>>>>>>> ...
>>>>>>> LDA ,X+
>>>>>>> ...
>>>>>>> ADDA ,X+)
>>>>>>>
>>>>>>> it works perfectly.
>>>>>>>
>>>>>>>
>>>>>>>> On Sat, Feb 18, 2017 at 11:35 AM, William Astle <lost at l-w.ca> wrote:
>>>>>>>>
>>>>>>>> Take a closer look. It only does the LSLA on every other digit. It
>>>>>>>> does *two* digits per loop, just like Brett's version.
>>>>>>>>
>>>>>>>> You can easily pretend all numbers are 16 digits by right justifying
>>>>>>>> the numbers in your buffer and padding with zeros.
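
The padding idea can be checked with a small reference model in Python. This is only a sketch: the names luhn_sum and right_justify are made up for illustration, and the sum follows the usual Luhn convention in which the rightmost digit is added as is and doubling starts with the digit to its left, which is also what the reordered loop later in the thread does.

```python
def luhn_sum(digits):
    """Return the Luhn sum of a sequence of digit values (ints 0-9)."""
    total = 0
    # Walk from right to left; position 0 is the rightmost digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:      # every second digit, counted from the right
            digit *= 2
            if digit >= 10:        # same effect as adding the two digits
                digit -= 9
        total += digit
    return total


def right_justify(digits, width=16):
    """Left-pad with zeros so the number sits at the right end of the buffer."""
    if len(digits) > width:
        raise ValueError("number is longer than the buffer")
    return [0] * (width - len(digits)) + list(digits)
```

Leading zeros contribute nothing whether or not they are doubled, and positions are counted from the right, so a 13 digit number padded to 16 digits yields the same sum as the unpadded one.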
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 2017-02-18 10:06 AM, William Mikrut wrote:
>>>>>>>>>
>>>>>>>>> I like how this works from right to left.
>>>>>>>>> The only issue is the LSLA on every number.
>>>>>>>>>
>>>>>>>>> The algo is to double every other number, starting with the
>>>>>>>>> rightmost digit, and sub 9 if the result is 10 or more.
>>>>>>>>>
>>>>>>>>> Now if the number is always 16 digits, Brett's 16 bit word seems the
>>>>>>>>> easiest way to go.
>>>>>>>>> If the number is 13 digits long the 16 bit word method won't work,
>>>>>>>>> but I am happy to pretend all numbers are 16 digits!
>>>>>>>>>
>>>>>>>>> I am going to try to include a couple things you showed me into
>>>>>>>>> Brett's 16 bit chunk method and try a slightly different routine!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Feb 18, 2017 at 10:22 AM, William Astle <lost at l-w.ca> wrote:
>>>>>>>>>
>>>>>>>>>> On 2017-02-18 12:43 AM, msmcdoug wrote:
>>>>>>>>>>> Actually I'm surprised no one has suggested bcd arithmetic on the
>>>>>>>>>>> result to eliminate divide by 10 loop
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> BCD would certainly give a predictable overall cycle count. It
>>>>>>>>>> would require a significantly different approach, though. The only
>>>>>>>>>> register you can use for BCD arithmetic is A and DAA is only useful
>>>>>>>>>> after ADDA or ADCA.
>>>>>>>>>>
>>>>>>>>>> I had thought about using BCD but had initially dismissed it due to
>>>>>>>>>> possible complexity. However, upon reflection, the extra cycles to
>>>>>>>>>> use BCD would probably be less than either the average cycle time
>>>>>>>>>> of the modulus loop or the cost of checking for digit overflow
>>>>>>>>>> during the loop.
>>>>>>>>>>
>>>>>>>>>> I think you could use code that looks something like the following,
>>>>>>>>>> which is based off Mr. Mikrut's most recent posted code. (warning:
>>>>>>>>>> mailer code™ follows so it may have errors)
>>>>>>>>>>
>>>>>>>>>>        ORG $1200
>>>>>>>>>> CCD     RMB 16
>>>>>>>>>> RESULT  RMB 1
>>>>>>>>>> START   LEAX CCD+16,PCR
>>>>>>>>>>        CLRA
>>>>>>>>>>        LDB #8
>>>>>>>>>> LOOP    PSHS A
>>>>>>>>>>        LDA ,-X
>>>>>>>>>>        LSLA
>>>>>>>>>>        CMPA #10
>>>>>>>>>>        BLO LOOP2
>>>>>>>>>>        SUBA #9
>>>>>>>>>> LOOP2   ADDA ,S+
>>>>>>>>>>        DAA
>>>>>>>>>>        ADDA ,-X
>>>>>>>>>>        DAA
>>>>>>>>>>        DECB
>>>>>>>>>>        BNE LOOP
>>>>>>>>>>        ANDA #$0F
>>>>>>>>>>        STA RESULT,PCR
>>>>>>>>>> ENDPGM  RTS
>>>>>>>>>>
>>>>>>>>>> I'm using the stack for a temporary storage location instead of
>>>>>>>>>> something PCR relative for code size reasons. You could use the
>>>>>>>>>> "RESULT" variable for the temporary to eliminate stack usage. That
>>>>>>>>>> would probably be slightly faster at the expense of two more code
>>>>>>>>>> bytes. This is one of those size/speed trade-offs.
>>>>>>>>>>
>>>>>>>>>> DAA has to be used after every addition and only applies to A. Using
>>>>>>>>>> BCD means we can eliminate the mod 10 loop and just mask off the
>>>>>>>>>> upper digit (BCD stores two decimal digits in a byte). That gives a
>>>>>>>>>> constant time for the "mod 10" result and also only takes 2 bytes
>>>>>>>>>> (and 2 cycles).
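
A rough Python model of the BCD approach may help in checking the assembly by hand. It is a sketch under assumptions: adda_daa imitates ADDA followed by DAA using the commonly documented 6809 DAA correction rules, mod10_bcd mirrors the reordered loop posted later in the thread (undoubled digit first), and both function names are invented for illustration. Digits are taken to be stored as values 0-9, one per byte.

```python
def adda_daa(acc, operand):
    """Model ADDA followed by DAA on an 8 bit accumulator.

    Assumes the documented 6809 DAA rules; any carry out of the byte is dropped.
    """
    raw = acc + operand
    half_carry = ((acc & 0x0F) + (operand & 0x0F)) > 0x0F  # H flag from ADDA
    carry = raw > 0xFF                                      # C flag from ADDA
    result = raw & 0xFF
    low, high = result & 0x0F, result >> 4
    correction = 0
    if half_carry or low > 9:
        correction |= 0x06
    if carry or high > 9 or (high > 8 and low > 9):
        correction |= 0x60
    return (result + correction) & 0xFF


def mod10_bcd(digits):
    """Mirror the two-digits-per-pass loop on an even-length digit buffer."""
    a = 0                                # CLRA
    x = len(digits)                      # X points just past the buffer
    for _ in range(len(digits) // 2):    # LDB #8 ... DECB / BNE LOOP
        x -= 1
        a = adda_daa(a, digits[x])       # ADDA ,-X / DAA
        saved = a                        # PSHS A
        x -= 1
        a = digits[x] << 1               # LDA ,-X / LSLA
        if a >= 10:                      # CMPA #10 / BLO LOOP2
            a -= 9                       # SUBA #9
        a = adda_daa(a, saved)           # ADDA ,S+ / DAA
    return a & 0x0F                      # ANDA #$0F keeps the low BCD digit
```

The running total can exceed 99 and wrap, but wrapping a two digit BCD value only discards hundreds, so the low digit that the mask keeps is still the sum modulo 10.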
>>>>>>>>>>
>>>>>>>>>> I have also eliminated the STATUS variable and just store the
>>>>>>>>>> result. You can test RESULT for non-zero trivially so there's no
>>>>>>>>>> need for a separate STATUS value.
>>>>>>>>>>
>>>>>>>>>> By my calculation, this version is 32 bytes, requires 1 byte of
>>>>>>>>>> stack space, 17 bytes of data space, and runs in a maximum of 351
>>>>>>>>>> cycles (and a minimum of 336 cycles if none of the doubled digits
>>>>>>>>>> goes above 9). For this analysis, I've assumed 8 bit offsets for the
>>>>>>>>>> PCR references. 16 bit offsets in PCR mode are quite a bit more
>>>>>>>>>> expensive (4 extra cycles and 1 extra byte).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Coco mailing list
>>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>>
>>>>>>>>>>
>>
>>
>
>


