Checksum with low probability of false negative - c

At this moment I'm using a simple checksum scheme, that just adds the words in a buffer. Firstly, my question is what is the probability of a false negative, that is, the receiving system calculating the same checksum as the sending system even when the data is different (corrupted).
Secondly, how can I reduce the probability of false negatives? What is the best checksuming scheme for that. Note that each word in the buffer is of size 64 bits or 8 bytes, that is a long variable in a 64 bit system.

Assuming a sane checksum implementation, then the probability of a randomly-chosen input string colliding with a reference input string is 1 in 2n, where n is the checksum length in bits.
However, if you're talking about input that differs from the original by a low number of bits, then the probability of collision is generally much, much lower.

One possibility is to have a look at T. Maxino's thesis titled "The Effectiveness of Checksums for Embedded Networks" (PDF), which contains an analysis for some well-known checksums.
However, usually it is better to go with CRCs, which have additional benefits, such as detection of burst errors.
For these, P. Koopman's paper "Cyclic Redundancy Code (CRC) Selection for Embedded Networks" (PDF) is a valuable resource.

Related

64-bit hash/digest in C

I am trying to find out if there is any API in C for calculating a 64 bit hash.
I found out that some people use top 64 bits of MD5/SHA1 etc. Is it a good approach?
You could try SipHash in its form as a MAC (which requires key management, though). It is particularly well-suited for short input messages and aims at cryptographic strength. A C implementation is also available.
But if you really care about someone actively messing around with your files, you shouldn't restrict yourself to 64 bits of security. 64 bits can be broken even by brute force today, given enough time and resources. You should use SHA-256 or stronger for that. Or let me state it the other way round, blacklisting broken options: don't use MD5 (or MD-anything for that matter). Use SHA-1 only if you can't use SHA-256 for some reason.
Using a hash also has the advantage that you don't need to manage any keys (opposed to using a MAC). You should just keep the hashes you compute in a different place than the files you are about to monitor - otherwise somebody tampering with your files can easily tamper with the checksum, too.
Regarding whether truncating hashes is good or bad
In theory, I can't see why it should be wrong to truncate a let's say 160 bit hash value down to 64 bit, regardless of whether you take the most significant bits or the least significant bits or pick them using any arbitrary pattern. The only reason why this isn't done more often that I can think of is efficiency - why bring the big guns if there are more efficient algorithms for handling the smaller problems.
In what follows, I assume a cryptographically secure hash for this purpose, general-purpose hashes are quite a different topic - they might expose attack surfaces when truncated for all I know.
But, for a cryptographically secure hash, unless the algorithm is broken, we can assume that its output is indistinguishable from that of a uniformly distributed random variable.
If we truncate this value now, we don't offer any further insight into the inner workings of the algorithm. Still, we do weaken the security by the simple fact that brute-forcing (be it collisions or finding pre-images) now takes less time by laws of probability.
For example, finding a collision for a 64 bit hash takes roughly 2^32 attempts on average - says the Birthday Paradoxon. If you truncate your output down to the least significant 32 bits of the original 64 bit hash, then you will find collisions in time roughly 2^16, because you simply ignore the most significant 32 bits and the de-facto uniform distribution does the rest - it's like you started searching for collisions with a 32 bit value in the first place.
It's a bad idea. Hash function values are always meant to be taken as a whole.
For the implied question of "how to calculate a 64 bit hash": what's your intended use? Remember that 64 bits are too few for a crypto-strength hash function.
Use CRC to protect against random changes.
Use HMAC to protect against an attacker changing your files. HMAC uses a secret key that is necessary to generate and verify the tags. The result of an HMAC is as long as the underlying hash function (e.g. 20 bytes for an HMAC-SHA1), but it is frequently truncated. I.e. according to NIST SP 800-107 p.14 64-96 bits should be enough for most applications.
64 bits is small for a hash and usually, hashes are meant to be taken as a whole.
Now, what do you need these 64 bits for ? Answer will depend of expected usage.
Keep in mind that md5 is quite broken nowadays and 64 bits is very low security.
If you just need integrity checking against random changes, then a simple checksum as given in the other answers may be enough.
If you need cryptographic strength to ensure the original content, then 64 bit is too weak. Better use the full value of an unbroken algorithm, i.e. not MD5. SHA1 is still okay, but for longer term security better use SHA256. Or even go further with HMAC, as mentioned in the other answer.
There is nothing wrong with using the truncated value of a cryptographic hash. In fact, SHA224/384 are calculated by calculating a SHA256/512 hash with a different initialization vector and then truncating the result. However, this is only valid for cryptographic hashes. It may be a bad idea for normal checksums and table hashes.
Use OpenSSL's API for the calculations.(www.openssl.org).

64-bit multiplicative hashing

I'm working on fast 64-bit hash. Many existing secure hash functions are too way slow, some non-cryptographic hash functions like FNV are just bad.
Well, I came up with a FNV-like hash:
UINT64 hash=0;
// for each input byte
hash=(hash^(input_byte+1))*HASH_PRIME;
Main question is about HASH_PRIME. Often, we may see a "golden ratio" term for multiplicative hashing.
For 64-bit hash, golden ratio is 0x9e3779b97f4a7c13.
I tested the 32-bit golden ratio for period in PRNG:
DWORD hash=0;
// loop
hash=(hash^1)*0x9e3779b9;
rnd_out=hash>>24;
A good value here may produce the period of 0xFFFFFFFF - i.e. max possible. This golden ratio produces notably smaller period.
or just
DWORD hash=~0;
// loop
hash*=0x9e3779b9;
rnd_out=hash>>24;
And again, a good enough multiplier can produce period of 0x3FFFFFFF bytes. Golden ratio here produces again much shorter period.
Never tested the 64-bit primes - too computationally expensive.
Is period important for my hash? And where to find a good 64-bit HASH_PRIMES and how to test such stuff?
Are you doing this is as an exercise? otherwise I would advise having a looking at well known hash functions as Bob Jenkin's lookup8 and lookup family (http://burtleburtle.net/bob/hash/ ) and Austin Appleby's murmur http://code.google.com/p/smhasher/ (a speed killer and my favorite). Good hash functions are hard to build... and if you are after a rolling type of hash, Rabin fingerprints are hard to beat.
And to make sure that your hashes are decent if you really want to roll your own, use either Appleby and Jenkins hash tests (torture and smhasher )
Not sure about the first two examples. But in the third to get a full period out of the code you need to add an odd number. Otherwise this will have a maximum period of 65537, It could be as low as 3. There may even be a fixed point.
Wherever you got the 0x3FFFFFFF for a good period is not correct. One of the Knuth volumes discusses this in excessive detail.
The multiplier must be of the form 4n+1 and there must be an odd addend

How high do I have to count before I hit an MD5 hash collision?

Never mind why I'm doing this -- this is mainly theoretical.
If I were MD5 hashing string representations of integers, how high would I have to count before two of the hashes collide?
This problem (in generic case) is known as Birthday Paradox
The probability of collision in generic case can be computed easily. However, in your particular case, you have to actually compute (and store!) each MD5.
EDIT #Scott : not really. The Pigeonhole principle (being just a particular case of Birthday problem) would say that having 2^128 possible MD5 values, we surely will have a collision after 1 + 2^128 tries. The birthday paradox says that the probability of collision will be grater than 0.5 for about 2^70 MD5 values.
With these estimates for storage requirements, it's up to you to decide if the problem worth it. By me it does not.
Apparently, one can base a thesis on this very thing (or similar problems, anyway). I haven't read it, but maybe something in Stevens' thesis will help you (it's apparently linked from the Wikipedia article).
In a perfect world, to 1 + 2^128. But I doubt md5 is perfect, I cant give you a number but is guaranteed to be <= 1+ 2^128
Here is a scientific way to find out an estimate of how high you would have to count.
Make MD5 hash that is cut down to say 4 bits. Calculate that (make sure you calculate until you reach say 100 collisions so you get a good average)
Then make the same thing at 8 bits (again, wait for many collisions so you can calculate an average).
Do it again and again until you have averages for 4, 8, 12, 16 bits and then see if you can find a trend. Follow that trend up to 128 bits
You may want to xor all 128 bits to come up with your shorter version. Taking the first or last part may not be the best test.

Finding prime factors to large numbers using specially-crafted CPUs

My understanding is that many public key cryptographic algorithms these days depend on large prime numbers to make up the keys, and it is the difficulty in factoring the product of two primes that makes the encryption hard to break. It is also my understanding that one of the reasons that factoring such large numbers is so difficult, is that the sheer size of the numbers used means that no CPU can efficiently operate on the numbers, since our minuscule 32 and 64 bit CPUs are no match for 1024, 2048 or even 4096 bit numbers. Specialized Big Integer math libraries must be used in order to process those numbers, and those libraries are inherently slow since a CPU can only hold (and process) small chunks (like 32 or 64 bits) at one time.
So...
Why can't you build a highly specialized custom chip with 2048 bit registers, and giant arithmetic circuits, much in the same way that we scaled from 8 to 16 to 32 to 64-bit CPUs, just build one a LOT larger? This chip wouldn't need most of the circuitry on conventional CPUs, after all it wouldn't need to handle things like virtual memory, multithreading or I/O. It wouldn't even need to be a general-purpose processor supporting stored instructions. Just the bare minimum to perform the necessary arithmetical calculations on ginormous numbers.
I don't know a whole lot about IC design, but I do remember learning about how logic gates work, how to build a half adder, full adder, then link together a bunch of adders to do multi-bit arithmetic. Just scale up. A lot.
Now, I'm fairly certain that there is a very good reason (or 17) that the above won't work (since otherwise one of the many people smarter than I am would have already done it) but I am interested in knowing why it won't work.
(Note: This question may need some re-working, as I'm not even sure yet if the question makes sense)
What #cube said, and the fact that a giant arithmetic logic unit would take more time for the logic signals to stabilize, and include other complications in digital design. Digital logic design includes something that you take for granted in software, namely that signals through combinational logic take a small but nonzero time to propagate and settle. A 32x32 multiplier needs to be designed carefully. A 1024x1024 multiplier would not only take a huge amount of physical resources in a chip, but it also would be slower than a 32x32 multiplier (though perhaps faster than a 32x32 multiplier computing all the partial products needed to perform a 1024x1024 multiply). Plus it's not only the multiplier that's the bottleneck: you've got memory pathways. You'd have to spend a bunch of time gathering the 1024 bits from a memory circuit that's only 32 bits wide, and storing the resulting 2048 bits back into the memory circuit.
Almost certainly it's better to get a bunch of "conventional" 32-bit or 64-bit systems working in parallel: you get the speedup w/o the hardware design complexity.
edit: if anyone has ACM access (I don't), perhaps take a look at this paper to see what it says.
Its because this speedup would be only in O(n), but the complexity of factoring the number is something like O(2^n) (with respect to the number of bits). So if you made this überprocessor and factorized the numbers 1000 times faster, I would only have to make the numbers 10 bits larger and we would be back on the start again.
As indicated above, the primary problem is simply how many possibilities you have to go through to factor a number. That being said, specialized computers do exist to do this sort of thing.
The real progress for this sort of cryptography is improvements in number factoring algorithms. Currently, the fastest known general algorithm is the general number field sieve.
Historically, we seem to be able to factor numbers twice as large each decade. Part of that is faster hardware, and part of it is simply a better understanding of mathematics and how to perform factoring.
I can't comment on the feasibility of an approach exactly like the one you described, but people do similar things very frequently using FPGAs:
Crack DES keys
Crack GSM conversations
Open source graphics card
Shamir & Tromer suggest a similar approach, using a kind of grid computing:
This article discusses a new design for a custom hardware
implementation of the sieving step, which
reduces [the cost of sieving, relative to TWINKLE,] to about $10M. The new device,
called TWIRL, can be seen as an extension of the
TWINKLE device. However, unlike TWINKLE it
does not have optoelectronic components, and can
thus be manufactured using standard VLSI technology
on silicon wafers. The underlying idea is to use
a single copy of the input to solve many subproblems
in parallel. Since input storage dominates cost, if the
parallelization overhead is kept low then the resulting
speedup is obtained essentially for free. Indeed, the
main challenge lies in achieving this parallelism efficiently while allowing compact storage of the input.
Addressing this involves myriad considerations, ranging
from number theory to VLSI technology.
Why don't you try building an uber-quantum computer and run Shor's algorithm on it?
"... If a quantum computer with a sufficient number of qubits were to be constructed, Shor's algorithm could be used to break public-key cryptography schemes such as the widely used RSA scheme. RSA is based on the assumption that factoring large numbers is computationally infeasible. So far as is known, this assumption is valid for classical (non-quantum) computers; no classical algorithm is known that can factor in polynomial time. However, Shor's algorithm shows that factoring is efficient on a quantum computer, so a sufficiently large quantum computer can break RSA. ..." -Wikipedia

Why is that data structures usually have a size of 2^n?

Is there a historical reason or something ? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. Even I do mostly only use 2n sized buffers, mostly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure if that's the reason most people use them, more information would be appreciated.
There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8 bit micro-controllers). By using a 2^n buffer in this case, the modulo, is simply a case of bit-masking the upper bits, or in the case of say a 256 byte buffer, simply using an 8-bit index and letting it wraparound.
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?
Cache lines are usually some multiple of 2 (often 32 or 64). Data that is an integral multiple of that number would be able to fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance.. so I think people who design their structures in that way are optimizing for that.
Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise-and or & rather than using a much slower divide-class instruction implementing the % operator.
Looking at an old Intel i386 book, and is 2 cycles and div is 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocation's available directly from the operating system would be (still are) a specific number of pages, and so a power of two would be likely to make the most use of the allocation granularity.
And, as others have noted, programmers like powers of two.
I can think of a few reasons off the top of my head:
2^n is a very common value in all of computer sizes. This is directly related to the way bits are represented in computers (2 possible values), which means variables tend to have ranges of values whose boundaries are 2^n.
Because of the point above, you'll often find the value 256 as the size of the buffer. This is because it is the largest number that can be stored in a byte. So, if you want to store a string together with a size of the string, then you'll be most efficient if you store it as: SIZE_BYTE+ARRAY, where the size byte tells you the size of the array. This means the array can be any size from 1 to 256.
Many other times, sizes are chosen based on physical things (for example, the size of the memory an operating system can choose from is related to the size of the registers of the CPU etc) and these are also going to be a specific amount of bits. Meaning, the amount of memory you can use will usually be some value of 2^n (for a 32bit system, 2^32).
There might be performance benefits/alignment issues for such values. Most processors can access a certain amount of bytes at a time, so even if you have a variable whose size is let's say) 20 bits, a 32 bit processor will still read 32 bits, no matter what. So it's often times more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain amount of bytes (because they can't read memory from, for example, addresses in the memory that are odd). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or 6 of 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Busses (control & data) to access memory structure are often aligned on power 2. The cost of logic implementation in electronics (e.g. CPU) makes for arithmetics in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X is a direct consequence of the server layer attributes of the system sitting below i.e. layer < x. The reason I am stating this stems from some comments I received with regards to my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited & derived from the properties of the system below it i.e. the electronics in the CPU.
I was going to use the shift argument, but could think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= BUFSIZE; // Make sure it stays in the buffer.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.
It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
It's makes a nice, round number in base 2. Just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it wasn't a power of 2 (or something close such as 96=64+32 or 192=128+64), then you could wonder why there's the added precision. Not base 2 rounded size can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.
In hash tables, 2^n makes it easier to handle key collissions in a certain way. In general, when there is a key collission, you either make a substructure, e.g. a list, of all entries with the same hash value; or you find another free slot. You could just add 1 to the slot index until you find a free slot; but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n,h2)=1; then add h2 to the slot index until you find a free slot (with wrap around). If n is a power of 2, finding a h2 that fulfills gcd(n,h2)=1 is easy, every odd number will do.

Resources