[Coco] Devastated. Long term OVCC project falls short

Walter Zambotti zambotti at iinet.net.au
Fri Oct 4 21:52:30 EDT 2019


> I wonder why the assembly version for Linux would be that much slower than the Windows version.  There’s got to be something weird going on there – there should not be a big difference between the two running essentially the same native code (at least on the same class of processor).

I am certain the reason is that I chose to develop this on a platform that provided easy assembly IDE debugging using Visual Studio.  There is little support for assembly under VS Code on any platform!

If you aren't aware, ASM under VS uses the MS ABI.  This ABI is native and efficient under Windows but is not native to Linux. So I think the gcc compiler places a wrapper around the C functions that the assembly has to call, and the assembly has to call them often because it relies on high-level functions like MemRead/MemWrite.
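
A minimal sketch of that boundary, assuming GCC or Clang on x86-64 Linux. The names DEMO_MSABI, device_read and mem_read_ms are hypothetical and not part of OVCC. In the MS x64 convention the first integer argument arrives in RCX, whereas System V uses RDI. RSI, RDI and XMM6 to XMM15 are callee-saved under the MS convention but may be clobbered by a System V callee, so a function marked ms_abi that calls ordinary C code has to preserve them itself. That extra work on every call is a plausible source of the overhead, though it has not been measured here.

```c
/* Hypothetical macro: the attribute is only meaningful on x86-64
   builds that are not already using the MS ABI natively. */
#if defined(__x86_64__) && !defined(_WIN64)
#define DEMO_MSABI __attribute__((ms_abi))
#else
#define DEMO_MSABI
#endif

/* Ordinary C function using the platform's default calling convention.
   noinline keeps the call (and the ABI switch) visible in the output. */
static __attribute__((noinline)) unsigned char device_read(unsigned short address)
{
	return (unsigned char)(address & 0xFF);
}

/* Entry point intended for assembly written against the MS ABI.
   The compiler generates the register saves needed to call device_read. */
unsigned char DEMO_MSABI mem_read_ms(unsigned short address)
{
	return device_read(address);
}
```

Calling mem_read_ms(0x1234) from C returns 0x34 regardless of the convention in use; the compiler picks the right registers from the declaration. Compiling with gcc -O2 -S and reading the prologue of mem_read_ms is one way to see what is saved around the call.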

I could replace those with assembly as well, but they in turn call the Pak memory read functions, which in turn call the various device functions that are triggered when device-mapped memory is referenced.  Replacing those is not possible because they are dynamically loaded at run time and there is no way for the compiler to be aware of the MSABI requirements in advance.
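
A rough sketch of that call chain, assuming a pak module exposes its read routine through a function pointer that is filled in at load time. Everything here (pak_read_fn, pak_read, fake_cart_read, load_pak, pak_mem_read_ms, DEMO_MSABI and the 0xFF fallback) is a hypothetical stand-in; OVCC's actual loader is not shown in this message. The pointer type carries the default ABI, so an ms_abi caller has to switch conventions each time it calls through it.

```c
#include <stddef.h>

/* Hypothetical macro, same idea as MSABI in tcc1014mmu.c. */
#if defined(__x86_64__) && !defined(_WIN64)
#define DEMO_MSABI __attribute__((ms_abi))
#else
#define DEMO_MSABI
#endif

/* Read routine supplied by a module; it uses the default ABI because
   the module is compiled without any knowledge of the assembly core. */
typedef unsigned char (*pak_read_fn)(unsigned short address);

/* Unset until a module is loaded. */
static pak_read_fn pak_read = NULL;

/* Stand-in for a routine that would come from a loaded module. */
static unsigned char fake_cart_read(unsigned short address)
{
	return (unsigned char)(address >> 8);
}

/* Stand-in for the run-time loader: the target is only known here. */
void load_pak(void)
{
	pak_read = fake_cart_read;
}

/* MS ABI entry point for the assembly core. The indirect call below
   crosses from the MS convention into the default one. */
unsigned char DEMO_MSABI pak_mem_read_ms(unsigned short address)
{
	if (pak_read == NULL)
		return 0xFF;	/* hypothetical value when no pak is loaded */
	return pak_read(address);
}
```

Before load_pak() runs, pak_mem_read_ms(0x1200) returns the fallback 0xFF; afterwards it returns 0x12 from the stand-in routine.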

Here are those functions from tcc1014mmu.c

#if defined(_WIN64)
#define MSABI
#else
#define MSABI __attribute__((ms_abi))
#endif

unsigned char MemRead8(unsigned short address)
{
	if (address<0xFE00)
	{
		if (MemPageOffsets[MmuRegisters[MmuState][address >> 13]] == 1)
			return(MemPages[MmuRegisters[MmuState][address >> 13]][address & 0x1FFF]);
		return(PackMem8Read(MemPageOffsets[MmuRegisters[MmuState][address >> 13]] + (address & 0x1FFF)));
	}
	if (address>0xFEFF)
		return (port_read(address));
	if (RamVectors)	//Address must be $FE00 - $FEFF
		return(memory[(0x2000 * VectorMask[CurrentRamConfig]) | (address & 0x1FFF)]);
	if (MemPageOffsets[MmuRegisters[MmuState][address >> 13]] == 1)
		return(MemPages[MmuRegisters[MmuState][address >> 13]][address & 0x1FFF]);
	return(PackMem8Read(MemPageOffsets[MmuRegisters[MmuState][address >> 13]] + (address & 0x1FFF)));
}

unsigned char MSABI MemRead8_s(unsigned short address)
{
	if (address<0xFE00)
	{
		if (MemPageOffsets[MmuRegisters[MmuState][address >> 13]] == 1)
			return(MemPages[MmuRegisters[MmuState][address >> 13]][address & 0x1FFF]);
		return(PackMem8Read(MemPageOffsets[MmuRegisters[MmuState][address >> 13]] + (address & 0x1FFF)));
	}
	if (address>0xFEFF)
		return (port_read(address));
	if (RamVectors)	//Address must be $FE00 - $FEFF
		return(memory[(0x2000 * VectorMask[CurrentRamConfig]) | (address & 0x1FFF)]);
	if (MemPageOffsets[MmuRegisters[MmuState][address >> 13]] == 1)
		return(MemPages[MmuRegisters[MmuState][address >> 13]][address & 0x1FFF]);
	return(PackMem8Read(MemPageOffsets[MmuRegisters[MmuState][address >> 13]] + (address & 0x1FFF)));
}

As you can see, the only difference between them is the ms_abi __attribute__. The C 6309 instructions call the non-ABI version and the assembly 6309 instructions call the ABI version. This applies on Linux only; on Windows MSABI is defined as nothing.
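
Since the two bodies are identical, one possible way to keep them from drifting apart is to write the logic once in a static inline helper and expose two thin entry points. This is only a sketch, not how tcc1014mmu.c is organised: read8_body, DemoRead8, DemoRead8_s, demo_ram and DEMO_MSABI are hypothetical, and the body is a simplified stand-in for the real MMU lookup.

```c
/* Hypothetical macro, same idea as MSABI above. */
#if defined(__x86_64__) && !defined(_WIN64)
#define DEMO_MSABI __attribute__((ms_abi))
#else
#define DEMO_MSABI
#endif

/* Simplified stand-in for the emulated address space. */
static unsigned char demo_ram[0x10000];

/* Shared logic, written once. static inline lets the compiler fold it
   into each entry point, so no extra call is added. */
static inline unsigned char read8_body(unsigned short address)
{
	return demo_ram[address];
}

/* Default-ABI entry point, for the C 6309 core. */
unsigned char DemoRead8(unsigned short address)
{
	return read8_body(address);
}

/* MS-ABI entry point, for the assembly 6309 core. */
unsigned char DEMO_MSABI DemoRead8_s(unsigned short address)
{
	return read8_body(address);
}
```

Both entry points return the same byte for the same address; only the calling convention at the entry differs.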

In relation to the 8/16/32/64-bit performance penalties, unfortunately there are over 11,000 lines of code. That's a big change to make for a maybe!  You can see an example of one of the functions in another reply to this thread.

Walter

On 10/5/19 12:19 AM, James Ross wrote:
> Walter said:
>> On Windows the assembly 6309 CPU runs about 1-1.5% SLOWER!
>> On Linux the assembly 6309 CPU runs about 45%-50% SLOWER!!!
> Wow, that is something else!!
>
> Good job on all your work on OVCC.  I have not had the time to play with it yet, but I hope to someday in the near future.  And kudos on making the OVCC GitHub site public.
>
> Those are quite surprising results, for sure.  I see why you were taken aback.
>
> I wonder why the assembly version for Linux would be that much slower than the Windows version.  There’s got to be something weird going on there – there should not be a big difference between the two running essentially the same native code (at least on the same class of processor).
>
> I possibly should not comment on the ASM code as I have not delved into it enough to be giving anyone advice.  But ... here goes, ha!
>
> A tidbit I have read about in more than one place, I believe, is that there is a penalty for accessing partial registers (i.e. 16bit and possibly 8bits) in 32/64 bit modes.  Whether or not that would account for all the slowdown from the compiled C, I don’t know.  But you should be able to find information on that somewhat easily.
>
> I may or may not be right about this (I think I am, but I am not certain!!), but if a person wants to squeeze the absolute best performance at the assembly level, don’t deal with (i.e. r/w and process) any data values that are not the same width as the data bus.  And make sure that all your data is aligned at the same data width.
>
> So, in 32bit mode, do everything in 32bits.  In 64bit mode, do everything in 64bits.   That would include instead of using a byte array (8 bits) for the CoCo RAM/ROM use 32bit or 64bit word arrays instead, to represent each byte.
>
> If I ever have the time, I’ll play with this some.
>
> james
>

