What's the Rationale Behind the Design of GUIDs? - theory

I don't understand GUIDs. I mean, I understand that they're a big random number with a trivially small likelihood of duplication, but why do they have such a strictly defined format? (I am referencing most of my information from http://en.wikipedia.org/wiki/Globally_unique_identifier.)
Why not just make GUIDs a completely random 128-bit integer and call it a day? It seems that all the restrictions and rules put upon generating them would reduce the entropy in some ways. Why have some of the bits reserved for format and variant specification? Why specify endianness?
The variant and format specifications are what really puzzle me. If it's just a random identifier, why would a consumer of the GUID care about what format it was in? It's a 128-bit random identifier. Isn't that enough?

Don't confuse randomness and uniqueness.
Without strict formatting on where, for instance, the MAC address is incorporated, two machines could potentially generate the same random identifier at the same time. At that point, it's no longer "globally unique".
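To put the reserved bits in perspective, here is a minimal sketch (not production code; it uses rand() purely for illustration) of how an RFC 4122 version-4 GUID is assembled: 122 bits stay random, and only 6 bits are spent marking the version and variant.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustration-only randomness; a real generator would use a CSPRNG. */
static void fill_random_bytes(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(rand() & 0xFF);
}

/* Build a version-4 (random) UUID per RFC 4122:
 * 122 random bits, 4 version bits, 2 variant bits. */
static void make_uuid_v4(uint8_t uuid[16])
{
    fill_random_bytes(uuid, 16);
    uuid[6] = (uint8_t)((uuid[6] & 0x0F) | 0x40); /* version 4 in the high nibble of byte 6 */
    uuid[8] = (uint8_t)((uuid[8] & 0x3F) | 0x80); /* RFC 4122 variant (10xx) in byte 8 */
}

int main(void)
{
    uint8_t u[16];
    srand((unsigned)time(NULL));
    make_uuid_v4(u);

    /* Print in the usual 8-4-4-4-12 text form. */
    for (size_t i = 0; i < 16; i++) {
        printf("%02x", u[i]);
        if (i == 3 || i == 5 || i == 7 || i == 9)
            putchar('-');
    }
    putchar('\n');
    return 0;
}

Those 6 bits are what let a consumer tell a purely random (version 4) GUID apart from, say, a MAC-plus-timestamp (version 1) GUID, so it knows whether the remaining bits can be interpreted at all.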


What typing system does BASIC use?

I've noticed that nothing I can find gives me a definitive answer to the question above. I first wondered about this when I noticed that you never had to state the type of a variable in QBasic when declaring it, although you could add a suffix to the name of a variable to make sure it was of a particular type.
Also, as some dialects of BASIC are interpreted and others compiled, does this affect the typing system?
There are so many flavors of BASIC, some only historical and some still in use, that it's impossible to give one true answer.
Some of the old BASICs (the line-numbered BASICs) had two data types: String or Integer. The original BASIC that shipped with Apple II computers was an "Integer BASIC." Later BASICs introduced Floating Point, which was often single-precision FP. The BASIC that shipped with the TI-99/4A was an example of an early-80s floating point BASIC.
"Way back when", you would make a string literal with quotes, and a string variable with a $ sigil following the identifier name. Variables that didn't have the $ sigil would usually default to the type of numeric variable that the given flavor of BASIC supported (Integer or Floating Point). GWBasic, for example, would default to floating point unless you specified the % sigil, which meant "Integer". TI Extended BASIC didn't have an integer type, but the floating point numeric type had something like 15 significant digits, if I recall (floating point math errors notwithstanding).
These early BASICs were essentially statically typed, though the distinction was far less useful than in more powerful languages. The choices for data types were few: String, Number (sometimes Int, sometimes FP), and sometimes the ability to specify whether a number was Int or FP. Behind the scenes, some even freely converted between ints and floating point as necessary. Often such behind-the-scenes conversions were not well documented.
But that was the state of affairs in the '80s, when everyone with a home computer was a hobbyist, and standards were loose. Every hardware manufacturer seemed to have their own take on how BASIC should work.
More modern BASICs are more powerful, and allow for tighter control over variable types (when needed).
Earlier dialects of BASIC were always statically typed. Numeric variables, string variables and arrays each required different syntax. Also, the length of names was often limited to just one character. The most commonly used syntax was just V for numeric and V$ for string, and arrays were separately declared with DIM.
Since I haven't programmed in BASIC for a good 15 years, I can't say for sure what is going on in modern dialects.
The enhanced version of BASIC used in MultiValue Database systems uses dynamic typing. This means that the compiler decides how to treat your variable based on the logic and context of the statements.
Anything in double quotes is a string, and any numeric value not in double quotes is a number. If you need to write numeric data away as doubles or floats, there are various format expressions you can apply to your variables to achieve this.
Ultimately everything is saved at database level as an ASCII string. So the developer enforces the type at business logic level, as opposed to the database enforcing it.

Is the use of __int8_t for some small values on a 32-bit machine a useless optimization effort?

I have some values in my functions that are normally < 10. So, if I use __int8_t instead of int to store these values, is that a useless optimization effort?
Not only can this be a useless optimization, but you may cause integer misalignment (i.e., integers not aligned to 4-byte or 8-byte boundaries, depending on the architecture), which may actually slow performance. To get around this, most compilers align integers for the target architecture, so you may not be saving any space at all, because the compiler adds padding (e.g., for stack-based variables and struct members) to properly align larger variables or struct members that follow your 8-bit integer. Since your question looks like it concerns a local variable in a function, and assuming that there is only one such variable and that the function is not recursive, the optimization is most likely unnecessary and you should just use the platform-native integer size (i.e., int).
The optimization you mentioned can be worthwhile if you have very many instances of a particular type and you want to consume less memory, by using 8-bit fields instead of, say, 64-bit fields for small integers in a struct. If this is your intention, be sure to use the correct pragmas and/or compiler switches for your platform, so that structs get properly packed to the smallest size. Another instance where a smaller integer size can come in handy is when arrays, especially large ones, are involved. Finally, keep in mind that __int8_t is a non-portable, implementation-internal name, as the leading double underscore indicates; the standard spelling is int8_t from <stdint.h>.
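To make the struct case concrete, here is a minimal sketch (the struct and field names are made up for illustration) comparing the same record declared with int fields and with int8_t fields; the exact sizes printed depend on your compiler's alignment and padding rules.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical record whose fields never exceed single-digit values. */
struct sample_wide {
    int sensor_id;     /* 4 bytes on a typical 32-bit target */
    int reading;       /* values < 10, but still 4 bytes */
    int status;        /* values < 10, but still 4 bytes */
};

struct sample_narrow {
    int8_t sensor_id;  /* 1 byte each ... */
    int8_t reading;
    int8_t status;     /* ... though padding may still round the
                          struct size up to an alignment boundary */
};

int main(void)
{
    printf("wide:   %zu bytes\n", sizeof(struct sample_wide));   /* typically 12 */
    printf("narrow: %zu bytes\n", sizeof(struct sample_narrow)); /* typically 3 */
    return 0;
}

The saving only materializes when many instances exist (arrays, large allocations); for a lone local variable the compiler will usually keep it in a register or a padded stack slot anyway.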
I hesitate to mention this, since it appears to be only tangentially related to your question, but you can go even smaller than 8-bit ints in structs, for even smaller ranges of values than 0-255. In this case, bit fields become an option - the epitome of micro-optimization. Don't use bit fields unless you have actually measured performance in the time and space domains and are sure that there are significant savings to be had. To dissuade you further, I present some of the pitfalls of using them in the following paragraph.
Bit fields add more complexity to the development process by forcing you to manually order struct members to pack data into as few bytes as possible - a task that is not only arduous, but also often leads to a completely unintuitive struct member ordering. Changes, additions and deletions of member variables often tend to cascade to the members that follow them, causing a maintenance problem down the road. To get around this problem, developers often stick padding and alignment bits into the struct, complicating the process further for all but the original developer of the code. Commenting such code stops being a nicety and often becomes a necessity. There is also the problem of developers forgetting that they are dealing with a bit field within the code and treating it as if it had the entire domain of its underlying type, leading to no end of silent and sometimes difficult-to-debug under/overflows.
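For completeness, a minimal bit-field sketch (the field names are made up for illustration); the deliberately provoked wrap-around at the end is exactly the kind of silent overflow described above.

#include <stdio.h>

/* Hypothetical packed flags/counters; the exact layout is compiler-dependent. */
struct packed_state {
    unsigned int mode    : 2;  /* 0..3 */
    unsigned int error   : 1;  /* 0..1 */
    unsigned int counter : 4;  /* 0..15, even though we only expect 0..9 */
};

int main(void)
{
    struct packed_state s = { 0, 0, 0 };
    s.counter = 15;
    s.counter++;               /* silently wraps to 0: the overflow pitfall */
    printf("sizeof: %zu, counter after wrap: %u\n",
           sizeof(struct packed_state), (unsigned)s.counter);
    return 0;
}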
Given the above guidelines, profile your application to see whether the optimizations just discussed make sense for your particular compiler(s) and platform(s).
It won't save time, but if you have a million of these things, it will save 3 MB of space (and that may save time).

Is there a perfect hash function for the combined input sets of IMEI numbers and MAC addresses? (C implementation)

I'm looking for a hash function that I can use to give uniform unique IDs to devices that connect to our network either using a GSM modem or an ethernet connection.
So for any given device I have either an IMEI number or a MAC address hard-coded that I can use to generate the hash.
I've been researching hash functions for the last few hours, reading up on the different non-cryptographic and cryptographic hashes that I might want to use. My focus is low collisions over performance, as the hash will not be calculated very often.
My front-runners are MD5, FNV-1a, MurmurHash2, Hsieh, and DJB.
Whatever hash I use will have to be implemented in C and will be used on a microcontroller with a tiny processor.
I know that the trick to choosing a good hash function for your needs is knowing what sort of input you're going to be feeding it.
The reason I'm asking this question is that the idea popped into my head that both IMEI and MAC have finite lengths and ranges, so perhaps there exists a fairly simple hash function that can cover the full sets of both and not have collisions. (Thus, a perfect hash function)
An IMEI number is 15 decimal digits long (12-13 bytes in hex?), and a MAC address is 6 bytes. Mulling it over I don't think you would have collisions between the two sets of input numbers, but feel free to correct me if that is wrong. If you did could you do something to prevent it? Add some seed to one of the sets?
Am I on the right track? Is finding perfect hash function for these combined sets possible?
Thanks!
Update
Thanks for the answers and comments. I ended up using the identity function ;) as my hash function, and then also using a mask since there is potential overlap across the sets of numbers.
IMEI, IMEISV, and MAC will all fit in under 7 bytes, so I am storing my values in 7 bytes and then doing a bitwise OR on the first byte with a mask based on which set the number comes from, to ensure they are unique across all sets.
There's no way to make a perfect hash over an unknown, growing input set. You could simply make the field one bit larger than whichever of IMEI or MAC is larger, and use that bit to flag which type of identifier it is, along with the entire IMEI/MAC. Anything smaller will have collisions, but they're probably quite rare.
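A minimal sketch of that tag-plus-value idea in C (the enum names and the 64-bit packing are my own illustration, not the asker's exact 7-byte layout): a 48-bit MAC, a 15-digit IMEI (about 50 bits) and a 16-digit IMEISV (about 54 bits) all fit below the top two bits of a uint64_t, so those two bits can record which set the identifier came from without ever colliding with the payload.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical tag values; any non-overlapping codes would do. */
enum id_kind { ID_MAC = 1, ID_IMEI = 2, ID_IMEISV = 3 };

/* Pack the raw identifier value and its kind into one 64-bit key.
 * The payload never reaches the top two bits, so the tag cannot collide. */
static uint64_t make_device_key(uint64_t raw_id, enum id_kind kind)
{
    return ((uint64_t)kind << 62) | raw_id;
}

int main(void)
{
    uint64_t mac  = 0x001122334455ULL;      /* example 48-bit MAC */
    uint64_t imei = 123456789012345ULL;     /* example 15-digit IMEI */

    printf("%016" PRIx64 "\n", make_device_key(mac,  ID_MAC));
    printf("%016" PRIx64 "\n", make_device_key(imei, ID_IMEI));
    return 0;
}

The resulting key is an injective mapping over both input sets, which is all a "perfect hash" needs to be here; there is no compression, so no collisions are possible.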

Generating totally random numbers without a random function? [duplicate]

Possible Duplicate:
True random number generator
I was talking to a friend the other day and we were trying to figure out if it is possible to generate completely random numbers without the help of a random function. In C, for example, "rand" generates pseudo-random numbers. Or we can use something like "srand( time( NULL ) );" so that the computer reads a number from its clock as the seed value. So if I understand everything I have read so far correctly, then I am pretty sure that no random function actually produces truly random numbers. How would one write a program that generates completely random numbers, and what would the code look like?
Check out this question:
True random number generator
Also, from Wikipedia's entry on pseudorandom numbers:
As John von Neumann joked, "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."
The excellent random.org website provides hardware-based random numbers as well as a number of software interfaces to retrieve these.
This can be used e.g. for genuinely unpredictable seeds or for 'true' random numbers. Since it is a web service, there are limits on the number of draws you can make, so don't try to use it for your graduate school Monte Carlo simulation.
FWIW, I wrapped one of those interfaces in the R package random.
It would look like:
int random = CallHardwareRandomGenerator();
Even with hardware, randomness is tricky. There are things which are physically random (atomic decay is random, but with predictable average rates, so it can be used as a source of random information), and there are things that are physically random enough to make prediction impractical (this is how casinos make money).
There are things that are largely indeterminate (mix up information from key-stroke rate, mouse-movements, and a few things like that), which are a good-enough source of "randomness" for many uses.
Mathematically, we cannot produce randomness, but we can improve distribution and make something harder to predict. Cryptographic PRNGs do a stronger job at this than most, but are more expensive in terms of resources.
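In practice, most operating systems already collect this kind of hardware and timing entropy into a pool and expose it to programs. A minimal C sketch that reads a few bytes from /dev/urandom on a POSIX system (error handling kept to the basics):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned char buf[16];

    /* The kernel mixes interrupt timing, device noise, etc. into this pool. */
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL) {
        perror("fopen /dev/urandom");
        return EXIT_FAILURE;
    }
    if (fread(buf, 1, sizeof buf, f) != sizeof buf) {
        fprintf(stderr, "short read\n");
        fclose(f);
        return EXIT_FAILURE;
    }
    fclose(f);

    for (size_t i = 0; i < sizeof buf; i++)
        printf("%02x", buf[i]);
    putchar('\n');
    return 0;
}

Whether that counts as "truly random" is exactly the philosophical question discussed below, but it is as close as most programs ever need to get.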
This is more of a physics question, I think. If you think about it, nothing is random; it just occurs due to events whose complexity makes them unpredictable to us. A computer is a subsystem just like any other in the universe, and by giving it unpredictable external inputs (RTC, I/O garbage) we can get the same kind of randomness that a roulette wheel gets from varying friction, air resistance, initial impulse and millions of factors that I can't wrap my head around.
There's room for a fair amount of philosophical debate about what "truly random" really even means. From a practical viewpoint, though, even sources we know aren't truly random can be used in ways that produce results that are probably close enough for almost any practical purpose (in particular, at least with current technology, full knowledge of the previously produced bitstream appears to be insufficient to predict the next bit accurately). Most of those do involve a bit of extra hardware though -- for example, it's pretty easy to put a source together from a little bit of Americium out of a smoke detector.
There are quite a few more sources as well, though they're mostly pretty low bandwidth (e.g., collect one bit for each keystroke, based on whether the interval between keystrokes was an even or odd number of CPU clocks -- assuming the CPU clock and keyboard clock are derived from separate crystals). OTOH, you have to be really careful with this -- a fair number of security holes (e.g., in Netscape around v. 4.0 or so) have stemmed from people believing that such sources were a lot more random than they really were.
While there are a number of web sites that produce random numbers from hardware sources, most of them are useless from a viewpoint of encryption. Even at best, you're just trusting your SSL (or TLS) connection to be secure so nobody captured the data you got from the site.

What is an optimal format for saving large amounts of numerical data (GBs) from a C program?

I'm a physicist who normally deals with large amounts of numerical data generated using C programs. Typically, I store everything as columns in ASCII files, but this has led to massively large files. Given that I am limited in space, this is an issue and I'd like to be a little smarter about the whole thing. So ...
Is there a better format than ASCII? Should I be using binary files, or perhaps a custom format from some library?
Should I be compressing each file individually, or the entire directory? In either case, what format should I use?
Thanks a lot!
In your shoes, I would consider the standard scientific data formats, which are much less space- and time-consuming than ASCII, but (while maybe not quite as bit-efficient as pure, machine-dependent binary formats) still offer standard documented and portable, fast libraries to ease the reading and writing of the data.
If you store data in pure binary form, the metadata is crucial to make any sense of the data again (are these numbers single or double precision, or integers, and of what length; what are the arrays' dimensions; etc.), and issues with archiving and retrieving the paired data and metadata can, and in practice occasionally do, make perfectly good datasets unusable -- a real pity and waste.
CDF, in particular, is "a self-describing data format for the storage and manipulation of scalar and multidimensional data in a platform- and discipline-independent fashion" with many libraries and utilities to go with it. As alternatives, you might also consider NetCDF and HDF -- I'm less familiar with those (and such tradeoffs as flexibility vs size vs speed issues) but, seeing how widely they're used by scientists in many fields, I suspect any of the three formats could give you very acceptable results.
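To give a feel for what using one of these libraries looks like, here is a minimal sketch with the NetCDF C API (the file, dimension and variable names are made up, and error handling is reduced to one helper) that writes a 1-D array of doubles along with the metadata needed to read it back anywhere:

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define N 1000

static void check(int status)
{
    if (status != NC_NOERR) {
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(status));
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    double data[N];
    for (int i = 0; i < N; i++)
        data[i] = 0.001 * i;                               /* stand-in for real measurements */

    int ncid, dimid, varid;
    check(nc_create("run42.nc", NC_CLOBBER, &ncid));       /* create a new file */
    check(nc_def_dim(ncid, "sample", N, &dimid));          /* one named dimension */
    check(nc_def_var(ncid, "voltage", NC_DOUBLE, 1, &dimid, &varid));
    check(nc_enddef(ncid));                                /* leave define mode */
    check(nc_put_var_double(ncid, varid, data));           /* write the whole array */
    check(nc_close(ncid));
    return 0;
}

The point is that the type, dimensions and variable names travel with the file, so another machine (or you, years later) can read it without guessing at the layout.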
If you need the files for a long time, because they are important experimental data that prove something for you, don't use binary formats. You will not be able to read them when your architecture changes. That's dangerous; stick to text (yes, ASCII) files.
Choose a compression format that fits your needs. Is compression time an issue? Usually not, but check that for yourself. Is decompression time an issue? Usually yes, if you want to do data analysis on it. Under these conditions I'd go for bzip2. It's quite common nowadays, well tested, and foolproof. I'd compress files individually, since the larger your file, the larger the probability of losses (bit flips, etc.).
A terabyte disk is a hundred bucks. Hard to run out of space these days. Sure, storing the data in binary saves space. But there's a cost: you'll have a lot fewer choices for getting the data out of the file again.
Check what your operating system can do. Windows supports automatic compression on folders, for example; the file content gets compressed by the file system without you having to do anything at all. Compression rates should compete well with raw binary data.
There's a lot of info you didn't include, but should think about:
1.) Are you storing integers or floats? What is the typical range of the numbers?
For example: storing small comma-separated integers in ASCII, such as "1,2,4,2,1", will average 2 bytes per datum, but storing them as binary would require 4 bytes per datum.
If your integers are typically 3 digits, then comma-separated vs binary won't matter much.
On the other hand, storing doubles (8-byte values) will almost certainly be smaller in binary format (see the sketch after this list).
2.) How do you need to access these values? If you are not concerned about access time, compress away! On the other hand, if you need speedy, random access then compression will probably hinder you.
3.) Are some values frequently repeated? Then you may consider a Huffman encoding or a table of "short-cut" values.
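As a rough illustration of point 1, this sketch (the file names are arbitrary) writes the same million doubles once as ASCII text and once as raw binary; the text file typically comes out two to three times larger, depending on the precision you print:

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    double *data = malloc(N * sizeof *data);
    if (data == NULL)
        return EXIT_FAILURE;
    for (long i = 0; i < N; i++)
        data[i] = i * 0.0001;                /* stand-in for real measurements */

    FILE *txt = fopen("data.txt", "w");
    FILE *bin = fopen("data.bin", "wb");
    if (txt == NULL || bin == NULL) {
        free(data);
        return EXIT_FAILURE;
    }

    /* ASCII: roughly 10-25 characters per value, depending on precision. */
    for (long i = 0; i < N; i++)
        fprintf(txt, "%.17g\n", data[i]);
    fclose(txt);

    /* Raw binary: exactly 8 bytes per double, but machine-dependent
       (endianness, no metadata), which is why the self-describing
       formats mentioned in the other answer are the safer long-term bet. */
    fwrite(data, sizeof *data, N, bin);
    fclose(bin);

    free(data);
    return 0;
}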
