[Coco] [Color Computer] the final stage of optimization

jdiffendaffer jdiffendaffer at yahoo.com
Tue Nov 24 14:09:51 EST 2009


A hash is a data key normally involving a mathematical formula where the formula attempts to create a unique value for a specific data item.  If that makes any sense.  It does not have to be just math.  It can combine part actual data, part math, or whatever yields a more unique value for the data set.

You have to be aware of potential pitfalls of a given formula.
If you just add the ASCII values together you end up with the same value for the following strings:
Array
yarrA
rAray

You also have very different strings that could have the same total unless you use very large numbers.
Odds are the above names won't happen with programmer generated names but program generated names could be a huge problem.
var012
var021
var102
var120
var201
var210

Using a CRC may work better since it has very low odds of repeating but there is still the potential for a collision (in this case two strings having the same CRC).  One flaw in CRC is it depends on strings being the same length.  If strings can vary in length you will probably have more collisions.  All collisions result in comparing the actual file records.  If you add a cache then disk reads are reduced considerably but not eliminated.  I say that because collisions will always be the same records but the number may exceed the size of your cache.

A packed string reduces the number of bits to the minimum required for the required alphanumeric characters.  
If your application only needed upper case letters then you only need enough bits to represent 26 values.  5 bits.  Then the bits from different letters are shifted over so they are packed together.
Numeric only is 10 values, 4 bits.
If you need upper and lower case, 52 values. 6 bits.
Alpha numeric (upper, lower, numeric) 62.  6 bits.
Add special characters and you are out to 7 or 8 bits.  

If you could squeeze the first few characters into a string, 5 bits saves 3 bits per character.  Over a 32 bit number that's 12 bits or more than 2 more characters squeezed in.
6 bits saves 2 bits and more than 1 extra character fits in 32 bits.

The packed string is easy to implement and may be faster to calculate, but it will result in more disk hits than a CRC unless it represents more bytes of the string.  This is especially true if there is a lot of repetition at the front of variable names.  If names are very different... then it's very good.
Just remember, packed strings only works with strings and the more characters you must support the less return for the effort.

In your case... you have to look at the data do decide what works best for you.


---- quote ----------
James (it's nice to know your name),

The program is a decompiler, unpacker, decoder, however you want to say it. Basic09 compiles into intermediate code, and I am using that to reconstruct the source code.

I am still working at understanding exactly what a hash is, but I think I get it. I was thinking about an idea to create an array for the literals that included a field that would contain the numeric value of a string. That value would be the total of the numeric values of each character in the string. I'm not sure if that's what you mean by "packed ascii" or not, but it sounds like it might be the same thing. Would it be feasible, though? What are the chances that 2 different strings, even of differing lengths, could have the same value? I haven't started working on testing that idea yet, but I'm planning on it. If I could do that, then I could make the array for the literals only include 2 fields, one byte and one integer. Of course, dealing with real values would still be a trick, since a real uses 5 bytes, and I'm not sure how to reduce that to 2.

Wayne 




More information about the Coco mailing list