I'm looking for something like a checksum for a chess board with pieces in specific places. I'm looking to see if a dynamic programming or memoized solution is viable for an AI chess player. The unique identifier would be used to easily check if two boards are equal or to use as indices in the arrays. Thanks for the help.
An extensively used checksum for board positions is the Zobrist signature.
It's an almost unique index number for any chess position, with the requirement that two similar positions generate entirely different indices. These index numbers are used for fast, space-efficient transposition tables and opening books.
You need a set of randomly generated bitstrings:
one for each piece at each square;
one to indicate the side to move;
four for castling rights;
eight for the file of a valid en-passant square (if any).
If you want to get the Zobrist hash code of a certain position, you have to xor all random numbers linked to the given feature (details: here and Correctly Implementing Zobrist Hashing).
E.g., for the starting position:
[Hash for White Rook on a1] xor [White Knight on b1] xor ... ( all pieces )
... xor [White castling long] xor ... ( all castling rights )
XOR allows a fast incremental update of the hash key during make / unmake of moves.
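The scheme above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the piece letters, the fixed seed, the square numbering (0 = a1 ... 63 = h8) and the helper names zobrist and apply_quiet_move are made up for the example and are not taken from any particular engine.

```python
import random

PIECES = "PNBRQKpnbrqk"  # assumed encoding: white uppercase, black lowercase

rng = random.Random(2024)  # fixed seed so the keys are reproducible (assumption)
PIECE_KEYS = {(p, sq): rng.getrandbits(64) for p in PIECES for sq in range(64)}
SIDE_KEY = rng.getrandbits(64)                      # XORed in when Black is to move
CASTLE_KEYS = {c: rng.getrandbits(64) for c in "KQkq"}
EP_KEYS = [rng.getrandbits(64) for _ in range(8)]   # one per en-passant file


def zobrist(board, black_to_move=False, castling="KQkq", ep_file=None):
    """board maps a square index (0 = a1 ... 63 = h8) to a piece letter."""
    h = 0
    for sq, piece in board.items():
        h ^= PIECE_KEYS[(piece, sq)]
    if black_to_move:
        h ^= SIDE_KEY
    for c in castling:
        h ^= CASTLE_KEYS[c]
    if ep_file is not None:
        h ^= EP_KEYS[ep_file]
    return h


def apply_quiet_move(h, piece, src, dst):
    """Incrementally update a hash for a non-capturing move of `piece`."""
    return h ^ PIECE_KEYS[(piece, src)] ^ PIECE_KEYS[(piece, dst)] ^ SIDE_KEY
```

Because XOR is its own inverse, calling apply_quiet_move again with src and dst swapped restores the previous hash, which is what makes unmake cheap.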
Usually 64 bits are used as a standard size in modern chess programs (see The Effect of Hash Signature Collisions in a Chess Program).
You can expect to encounter a collision in a 32-bit hash after evaluating about √(2^32) = 2^16 positions. With a 64-bit hash, you can expect a collision after about 2^32, or 4 billion, positions (birthday paradox).
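To put rough numbers on the birthday-paradox estimate, here is a small sketch using the standard approximation p ≈ 1 - exp(-n^2 / (2N)), where N is the number of possible hash values. The function name collision_probability is made up for this example.

```python
import math


def collision_probability(n_positions, hash_bits):
    """Birthday-paradox approximation: 1 - exp(-n^2 / (2 * 2^bits))."""
    exponent = (n_positions ** 2) / (2.0 * 2 ** hash_bits)
    return -math.expm1(-exponent)  # equals 1 - exp(-exponent)
```

At n = 2^16 positions with a 32-bit hash, and at n = 2^32 positions with a 64-bit hash, the approximation gives a bit under 40%, so collisions start to become likely at about that scale.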
If you're looking for a checksum, the usual solution is Zobrist Hashing.
If you're looking for a true unique-identifier, the usual human-readable solution is Forsyth notation.
For a non-human-readable unique identifier, you can store the type/color of the piece on each square using four bits. Throw in another 3 bits for the en-passant square, 4 bits for which castlings are still allowed, and one bit for whose turn it is, and you end up with exactly 33 bytes (64*4 + 3 + 4 + 1 = 264 bits) for each board setup.
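A minimal sketch of that 33-byte packing, assuming a made-up piece numbering (PIECE_CODES) and a hypothetical pack_position helper. How "no en-passant square" should be represented in the 3 bits is left open by the description, so the sketch simply takes a file number.

```python
# Hypothetical 4-bit codes: 0 = empty, 1-6 white pieces, 7-12 black pieces.
PIECE_CODES = {None: 0, "P": 1, "N": 2, "B": 3, "R": 4, "Q": 5, "K": 6,
               "p": 7, "n": 8, "b": 9, "r": 10, "q": 11, "k": 12}


def pack_position(squares, ep_file, castling_mask, white_to_move):
    """squares: 64 entries (piece letter or None); returns 33 bytes."""
    value = 0
    for piece in squares:                            # 64 squares x 4 bits = 256 bits
        value = (value << 4) | PIECE_CODES[piece]
    value = (value << 3) | (ep_file & 0b111)         # 3 bits: en-passant file
    value = (value << 4) | (castling_mask & 0b1111)  # 4 bits: castling rights
    value = (value << 1) | int(white_to_move)        # 1 bit: side to move
    return value.to_bytes(33, "big")                 # 264 bits in total
```

Two positions are then equal exactly when their 33-byte strings are equal, and the bytes object can be used directly as a dictionary key.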
You can use a checksum such as MD5 or SHA: just pass your chessboard cells as text, like:
TKBQKBHT
........
........
........
tkbqkbht
Then compute the checksum of the generated text.
The checksums of two different boards will differ, with no relation between the values. At that point, building a unique string (or array of bits) directly may be the best way:
TKBQKBHT........................tkbqkbht
Because it is unique too and is easy to compare with others.
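For what it's worth, the checksum variant is only a few lines with Python's standard hashlib module. The board_checksum name and the row-list input format are assumptions made for this sketch.

```python
import hashlib


def board_checksum(rows):
    """rows: list of strings, one per rank, e.g. 'TKBQKBHT' or '........'."""
    text = "".join(rows)
    return hashlib.sha256(text.encode("ascii")).hexdigest()
```

The joined text is itself already a unique key, so the digest mainly buys a fixed length.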
If two games achieve the same configuration through different moves or move orders, they should still be "equal". e.g. You shouldn't have to distinguish between which pawn is in a particular location, as long as the location is the same. You don't seem to really want to hash, but to uniquely and correctly distinguish between these board states.
One method is to use a 64x12 square-by-piecetype membership matrix. You can store this as a bit vector and then compare vectors for the check. e.g. the first 64 addresses in the vector might show which locations on the board contain pawns. The next 64 show locations which contain knights. You could let the first 6 sections show membership of white pieces and the final 6 show membership of black pieces.
Binary membership matrix pseudocode:
bool[] memberships = zeros(64*12);
def move(piece,location,oldlocation):
    memberships(piece,location) = 1;
    memberships(piece,oldlocation) = 0;
move(pawn,a3,a2);
This is cumbersome because you have to be careful how you implement it. e.g. make sure there is only one king maximum for each player. The advantage is that it only takes 768 bits to store a state.
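As a rough Python sketch of the binary membership idea, an arbitrary-precision integer can stand in for the 768-bit vector. The piece-type numbering (PAWN = 0) and the helper names are hypothetical.

```python
NUM_SQUARES = 64
PAWN = 0  # hypothetical piece-type index in the range 0..11


def bit_index(piece, square):
    """Position of the (piece, square) flag inside the 768-bit vector."""
    return piece * NUM_SQUARES + square


def place(state, piece, square):
    """Return a new state with the membership bit set."""
    return state | (1 << bit_index(piece, square))


def move(state, piece, new_square, old_square):
    """Clear the old membership bit and set the new one."""
    state &= ~(1 << bit_index(piece, old_square))
    return state | (1 << bit_index(piece, new_square))
```

Two states reached by different move orders compare equal with a plain == on the integers.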
Another way is a length-64 integer vector representing vectorized addresses for the board locations. In this case, the first 8 addresses might represent the state of the first row of the board.
Non-binary membership matrix pseudocode:
half[] memberships = zeros(64);
memberships[8] = 1; // white pawn at location a2
memberships[0] = 2; // white rook at location a1
...
memberships[62] = 11; // black knight at location g8
memberships[63] = 12; // black rook at location h8
The nice thing about the non-binary vector is you don't have as much freedom to accidentally assign multiple pieces to one location. The downside is that it is now larger to store each state. Larger representations will be slower to do equality comparisons on. (In my example, assuming each vector location stores a 16-bit half-word, we get 64*16=1024 bits to store one state, compared to the 768 bits for the binary vector.)
Either way, you'd probably want to enumerate each piece and board location.
enumerate piece {
empty = 0;
white_pawn = 1;
white_rook = 2;
...
black_knight = 11;
black_rook = 12;
}
enumerate location {
a1 = 0;
...
}
And testing for equality is just comparing two vectors together.
There are 64 squares. There are twelve different figures in chess that can occupy a square plus the possibility of no figure occupying it. Makes 13. You need 4 bits to represent those 13 (2^4 = 16). So you end up with 32 bytes to unambiguously store a chess board.
If you want to ease handling you can store 64 bytes instead, one byte per square, as bytes are easier to read and write.
EDIT: I've read some more on chess and have come to the following conclusion: Two boards are only the same if all previous boards since the last capture or pawn move are also the same. This is because of the threefold repetition rule. If the board looks exactly the same for the third time in a game, a draw can be claimed. So in spite of seeing the same board in two matches, it may be considered unfortunate in one match to make a certain move, so as to avoid a draw, whereas in the other match there is no such danger.
It is up to you how you want to go about it. You would need a unique identifier of variable length due to the variable number of previous boards to store. Well, maybe you take it easy, turn a blind eye to this and just store the last five moves to detect directly repetitive moves that could lead to a third repetition of positions, this being the most frequently occurring reason.
If you want to store moves with the board: There are 64x63=4032 conceivable moves (12 bits necessary), but many of them are illegal, of course. By my count there are 1792 geometrically possible moves (1456 along ranks, files and diagonals plus 336 knight jumps; A1->A2 is legal, A1->D2 is illegal, for instance), which would fit in 11 bits. I would still go for the 12 bits, however, to make interpretation as easy as possible by storing 0/1 for A1->A2 and 62/63 for H7->H8.
Then there is the 50 moves rule. You don't have to store moves here. Only the number of moves since last capture or pawn move from 0 to 50 (that's enough; it doesn't matter whether it's 50, 51 or more). So another six bits for this.
At last: Black's or white's move? Enpassantable pawn? Castlingable rook? Some additional bits for this (or extension of the 13 occupancies to save some bits).
EDIT again: So if you want to use the board to compare with other matches, then "two boards are only the same if all previous boards since the last capture or pawn move are also the same" applies. If you only want to detect repetition of positions in the same game, however, then you should be fine by just using the 15 occupancies x 64 squares plus one bit for whose move it is.
I would like to implement a hash function that goes into a cache memory. Initially, I have 20 bits of input and I need to hash this input into 7 bits.
My cache is 128x4.
I have tried different hash functions, but the results were not very good (I get 60% hit rate). I was thinking of using the MD5 algorithm, but maybe something is better. I read an implementation of MD5 online, but I did not get it.
It seems like a perfectly distributed hash could actually be undesirable, here. It offers the possibility of mapping nearby addresses into the same set.
Perhaps what you want to do is hash 17 bits down to 4, and map the three low-order bits straight through so as to guarantee a minimum distance between instances of the same set.
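One way to read that suggestion, sketched in Python for clarity (the real thing would be a handful of XOR gates). The XOR-fold used for the upper 17 bits is just one possible hash chosen for illustration, and cache_set_index is a made-up name.

```python
def cache_set_index(address):
    """Map a 20-bit address to a 7-bit set index (0..127)."""
    low = address & 0b111              # 3 low-order bits pass straight through
    high = (address >> 3) & 0x1FFFF    # remaining 17 bits
    folded = 0
    while high:                        # XOR-fold the 17 bits down to 4
        folded ^= high & 0xF
        high >>= 4
    return (folded << 3) | low
```

Any eight consecutive addresses that differ only in their low three bits are then guaranteed to land in eight different sets.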
I need a function which will generate three numbers so I can use them as RGB pattern for my SVG.
While this is simple, I also need to make sure I'm not using the same color twice.
How exactly do I do that? Generate one number at a time with simple rand (seed time active) and then what? I don't want to exclude a number, but maybe the whole pattern?
I'm kind of lost here.
To be precise, by first calling of this function I will get for example 218 199 154 and by second I'll get 47 212 236 which definitely are two different colors. Any suggestions?
Also I think a struct with int r, int g, int b would be suitable for this?
Edit: Colors should be different to the human eye. Sorry for not mentioning this earlier.
You could use a set to store the generated colors.
First, instantiate a new set.
Then, every time you generate a color, check whether the value is present in your set.
If it is, skip it and retry with a new colour. If not, you can use it, but don't forget to add it to the set afterwards.
This may perform poorly if you need to generate a large number of colours.
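A minimal sketch of the set-based approach, assuming a closure-style generator; make_color_generator and next_color are names invented for the example.

```python
import random


def make_color_generator(seed=None):
    rng = random.Random(seed)
    used = set()  # every (r, g, b) triple handed out so far

    def next_color():
        if len(used) == 256 ** 3:
            raise RuntimeError("all 24-bit colors have been used")
        while True:
            color = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
            if color not in used:      # retry on a repeat
                used.add(color)
                return color

    return next_color
```

Usage would be something like next_color = make_color_generator() followed by r, g, b = next_color() each time a fresh color is needed. Note this only guarantees distinct triples, not colors that look different to the eye.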
The cheapest way to do this would be to use a Bloom filter which is very small memory wise, but leads to occasional false positives (i.e., you will think you have used a colour, but you haven't). Basically, create three random numbers between 0-255, save them however you like, hash them as a triplet and place the hash in the filter.
Also, you might want to throw away the low bits of each channel since it's probably not easy to tell #FFFFF0 versus #FFFFF2.
Here is a simple way:
1. Generate a random integer.
2. Shift it right by 8 bits to keep 24 meaningful bits, and store this integer value.
3. Use the first 8 bits for R, the second group of 8 bits for G,
and the remaining 8 bits for the B value.
For every new random number, shift it right by 8 bits and compare it with all the integer values that you stored before; if none of them matches the new one, use it for the new color (step 3).
Differentiation by the human eye is an interesting topic, because perceptual thresholds vary from one person to another. To achieve it, shift the integer right by 14 bits, take the first 6 bits for R (pad with two 0s to get 8 bits again), the second 6 bits for G, and the last 6 bits for B. If you think that 6 bits are still too fine, decrease to 5, 4, ...
Simple Run with 4 significant bits for each channel:
My random integer is:
0101-1111-0000-1111-0000-1100-1101-0000
I shift it (you can also use division or modulo) to the right by 20 bits:
0000-0000-0000-0000-0000-0101-1111-0000
store this value.
Then get first 4 bits for R second 4 bits for G and last 4 bits for B:
R: 0101
G: 1111
B: 0000
Pad them to make each of them 8 bits.
R: 0101-0000
G: 1111-0000
B: 0000-0000
Use those for your color components.
For each new random number after shifting it compare it with your stored integer values so far. If it is different, then store and use it for color.
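The worked run above translates fairly directly into Python. This is a sketch: quantized_color is a made-up helper, and it assumes the random integer is 32 bits wide.

```python
def quantized_color(rand32, bits=4):
    """Keep the top 3*bits bits of a 32-bit integer; return (key, (r, g, b))."""
    key = (rand32 & 0xFFFFFFFF) >> (32 - 3 * bits)   # right shift, by 20 for bits=4
    mask = (1 << bits) - 1
    r = (key >> (2 * bits)) & mask
    g = (key >> bits) & mask
    b = key & mask
    pad = 8 - bits                                   # pad with low zero bits
    return key, (r << pad, g << pad, b << pad)
```

The returned key is the value to store and compare against earlier keys; with the example integer it is 0101-1111-0000 and the channels come out as 0101-0000, 1111-0000 and 0000-0000.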
One idea would be to use a bit vector to represent the set of colors generated. For 24-bit precision, the bit vector would be 2^24 bits long, which is 16,777,216 bits, or 2 MB. Certainly not a lot, these days, and it would be very fast to look up and insert colors.
Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared once previously?
I can't just load billions of numbers into an array at once and then run a simple nested loop to check whether a number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512 MB. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs to take a bit "special" hash table:
Take a hash table consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in the table and set it. If one encounters a bit which is already set, the number has already occurred in the sequence.
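A sketch of that bit table in Python, with the width made a parameter so it can be tried on small ranges; first_duplicate and value_bits are names invented here. With the default of 32 bits the bytearray really does take 512 MB.

```python
def first_duplicate(numbers, value_bits=32):
    """Return the first value seen twice, or None. Uses 2**value_bits bits."""
    seen = bytearray((1 << value_bits) // 8 or 1)   # one bit per possible value
    for n in numbers:
        byte, mask = n >> 3, 1 << (n & 7)           # byte index and bit within it
        if seen[byte] & mask:
            return n
        seen[byte] |= mask
    return None
```

The numbers argument can be any iterator, for example a generator that reads the file one value at a time, so only the bit table has to fit in memory.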
The important question is whether you want to solve this problem efficiently, or whether you want accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use (as kbrimington suggested you use a standard hash table) the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item itself, for example the least significant bits, let's say 20 bits (1M entries).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
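The two passes might look roughly like this. It is a sketch with assumed names: read_numbers stands for any callable that re-opens the file and yields its numbers, and hash_bits is the number of low bits used as the coarse key.

```python
from collections import Counter


def find_duplicates_two_pass(read_numbers, hash_bits=20):
    """read_numbers() must return a fresh iterator over the file each call."""
    mask = (1 << hash_bits) - 1
    # Pass 1: count by the low hash_bits bits only (small, fixed-size table).
    coarse = [0] * (1 << hash_bits)
    for n in read_numbers():
        if coarse[n & mask] < 2:       # saturate at 2; larger counts add nothing
            coarse[n & mask] += 1
    # Pass 2: exact counts, but only for values whose bucket was hit twice.
    exact = Counter(n for n in read_numbers() if coarse[n & mask] > 1)
    return {n for n, count in exact.items() if count > 1}
```

How much the second table shrinks depends on how many false positives the first pass lets through, so the choice of hash_bits matters.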
How about:
Sort the input using an algorithm that needs only a portion of the input in RAM at a time (an external sort).
Seek duplicates in the output of the first step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.
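A toy version of those two steps in Python. For brevity the sorted runs are kept in lists here, whereas a real external sort would write each run to a temporary file and merge from disk; the function names and chunk_size are assumptions.

```python
import heapq
from itertools import islice


def sorted_chunks(numbers, chunk_size):
    """Yield sorted lists of at most chunk_size items (stand-ins for temp files)."""
    it = iter(numbers)
    while chunk := list(islice(it, chunk_size)):
        yield sorted(chunk)


def duplicates_via_sort(numbers, chunk_size=1_000_000):
    runs = list(sorted_chunks(numbers, chunk_size))
    previous = object()              # sentinel that equals no number
    found = []
    for n in heapq.merge(*runs):     # step 2: scan the merged, sorted stream
        if n == previous and (not found or found[-1] != n):
            found.append(n)
        previous = n
    return found
```

The scan itself only ever compares the current element with the previous one, which is the "2 elements in RAM" part.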
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent 4.3ish billion different numbers and you have "10 billions".
If you were to use a tightly packed set you could represent whether each of the possibilities is present in 512 MB, which can easily fit into current RAM sizes. As a start, this makes it pretty easy to recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated you're getting into having a hashmap that contains only duplicates (using the first 500MB of the ram to tell efficiently IF it should be in the map or not). At a worst case scenario with a large spread you're not going to be able fit that into ram.
Another approach, if the numbers will have an even amount of duplicates, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 255 occurrences of each number.
It's going to be a hack, but it's doable.
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur? Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case i had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets i had a very low memory usage
p.s. another way to store them
1, -9, 12, 16, 20, -30, 52,
when i had no numbers lower than zero
After that i applied various algorithms (described by other posters) here on the reduced data set
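The range-collapsing step could be sketched like this in Python. The compress_runs name and the list-of-pairs output format are just for illustration; the sketch assumes the input is already sorted and says nothing about how repeated values should be stored.

```python
def compress_runs(sorted_numbers):
    """Collapse consecutive integers into [start, end] pairs."""
    result = []
    for n in sorted_numbers:
        last = result[-1] if result else None
        if isinstance(last, list) and n == last[1] + 1:
            last[1] = n                      # extend an existing range
        elif isinstance(last, int) and n == last + 1:
            result[-1] = [last, n]           # two neighbours start a range
        else:
            result.append(n)                 # isolated value so far
    return result
```

Fed the example above (1 to 10, 12, 16, 20 to 50, 52), it produces [[1, 10], 12, 16, [20, 50], 52].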
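#include <stdio.h>
#include <stdlib.h>

/* Apply op to bit b of array a. Overly general, but convenient. */
#define BITOP(a,b,op) \
    ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))

int main(void)
{
    unsigned x = 0;
    int found = 0;
    /* One bit per possible unsigned value (512 MB for 32-bit unsigned).
       calloc zeroes the table; malloc would leave it uninitialized. */
    size_t *seen = calloc((size_t)1 << (8*sizeof(unsigned)-3), 1);
    if (!seen) { perror("calloc"); return 1; }
    while (scanf("%u", &x) == 1) {
        /* Stop at the first number whose bit is already set. */
        if (BITOP(seen,x,&)) { found = 1; break; }
        BITOP(seen,x,|=);
    }
    if (found) printf("duplicate is %u\n", x);
    else printf("no duplicate\n");
    free(seen);
    return 0;
}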
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters
I'm sorry for not going into more detail, but once you are familiar with the concept of MapReduce it will be very clear how to approach the solution
basically we are going to implement two simple functions
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]), if the number of values > 1 then it's a duplicate number
print the duplicates : number, line_index1, line_index2,...
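To make the steps above concrete, here is a single-process Python sketch of the same map / group / reduce flow. It does not use any real MapReduce framework; map_phase, shuffle and reduce_phase are invented names that mimic what a framework would do.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit (number, line_index) pairs, one per input line."""
    for line_index, line in enumerate(lines):
        yield int(line), line_index


def shuffle(pairs):
    """Group values by key, as a MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Return (number, positions) if the number is a duplicate, else None."""
    return (key, values) if len(values) > 1 else None


def find_duplicates(lines):
    groups = shuffle(map_phase(lines))
    return [r for k, v in groups.items() if (r := reduce_phase(k, v))]
```

On a cluster the grouping is done by the framework and the reducers run in parallel, which is where the speed would come from.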
again, this approach can result in very fast execution depending on how your MapReduce framework is set up, and it is highly scalable and very reliable; there are many different implementations of MapReduce in many languages
several top companies offer ready-built cloud computing environments, such as Google, Microsoft Azure, Amazon AWS, ...
or you can build your own and set up a cluster with any provider offering virtual computing environments, paying very low costs by the hour
good luck :)
Another, simpler approach could be to use Bloom filters
AdamT
Implement a BitArray such that the ith byte of this array corresponds to the numbers 8*i to 8*i + 7, i.e. bit 0 of the ith byte is 1 if we have already seen 8*i, bit 1 of the ith byte is 1 if we have already seen 8*i + 1, and so on.
Initialize this bit array with size Integer.Max/8 + 1, and whenever you see a number k, look at bit k%8 of the byte at index k/8: if this bit is already 1, you have seen this number before; otherwise set it to 1.
I never had to do this before and never even thought about this before. How can I store RGB values in the database, or what is the best way of doing so?
I thought of a couple of options. The most obvious one is 3 byte columns to store the R, G and B. (I don't want to go this route.)
Another option is to store it in a 32-bit int column. (I am leaning towards this one.)
Or maybe I am just missing something trivial.
The "wasted" space of 32-bit integer column would allow you to store an alpha channel as well, should the need ever arise for it.
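As a quick illustration of the 32-bit layout, assuming an 0xAARRGGBB byte order (the order is a choice made for this sketch, not something the database imposes), packing and unpacking look like this in Python:

```python
def pack_rgba(r, g, b, a=255):
    """Pack four 0-255 channels into one unsigned 32-bit value (0xAARRGGBB)."""
    return (a << 24) | (r << 16) | (g << 8) | b


def unpack_rgba(value):
    """Return (r, g, b, a) from a value produced by pack_rgba."""
    return ((value >> 16) & 0xFF, (value >> 8) & 0xFF,
            value & 0xFF, (value >> 24) & 0xFF)
```

One thing to watch: with alpha values of 128 or more the packed number exceeds the range of a signed 32-bit column, so either an unsigned/larger column type or a different layout would be needed.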
First and foremost: what are your requirements?
Do you need to retrieve the color and only the color? Do you ever need to query by components? Do you need to search by colorspace distance? Do you need to store colorspace information (Adobe RGB or sRGB)? See also Best Way to represent a color in SQL.
If you're storing these numbers for web design, I would suggest simply using a char(6) and storing a string of hex triplets.
Sure, that's two bytes "wasted" over a 32-bit integer, but if you're not comparing them mathematically in some way and just regurgitating them to a CSS file, for instance, storing as a string will remove the need to translate back and forth.
Not that hex triplets to integers is a tough translation, but doing the easiest thing possible rather than optimizing for a few bytes may be worth considering.
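For reference, the translation in question is about this much code in Python (rgb_to_hex and hex_to_rgb are names made up for the sketch):

```python
def rgb_to_hex(r, g, b):
    """Format three 0-255 channels as a 6-character hex triplet."""
    return f"{r:02X}{g:02X}{b:02X}"


def hex_to_rgb(text):
    """Parse a 6-character hex triplet back into (r, g, b)."""
    value = int(text, 16)
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF
```

So the choice between char(6) and an integer column is mostly about where you want that conversion to live.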
If you're doing something other than web-related work, you may want to consider building in room for more than 8 bits per channel.
RGB values are usually described on the web in the format 0xRRGGBB where RR, GG, and BB are the hex values of R, G, and B. While you may be wasting a bit of space with a 32 bit int, I can't imagine it's much compared to the benefit you'll potentially gain from storing the values in a well-known format.
In case you'd like quick primer on how to go about the conversion, wikipedia's got you covered!
Just store it as a 32 bit value. There is no point in breaking down into 3 fields since you will most likely want all 3 components together all the time.
My guess is to store a 32 bit integer.
However if your SQL operations require each component to be of individual columns (meaning to say you need to compare values of R vs G of another column for example) you will have to separate out the values into individual columns. R, G, B, each 0-255 integer.