Key-indexed search in existence tables - c

I have a doubt about one of my book's statements.
Talking about key-indexed search in a symbol table, at a certain point it says: "If there are no records (but only keys), we can use a bit table. In this case, the symbol table is called an existence table, because we can consider the k-th bit as an indicator of whether the key k is or isn't in the table. For example, using a 313-word table on a 32-bit computer, we can use this method to quickly determine whether a given 4-digit internal telephone number has already been assigned."
Well, I know what a word is, so that existence table should be a 10,016-bit table in that case. But what does that mean? What does the 4-digit telephone number have to do with it? And so, how can you implement a symbol table with key-indexed search when there are no records, only keys?

There are 9000 four-digit numbers (in base 10, decimal), and 10,000 (nonnegative) numbers with at most four digits, so a table with at least 10,000 bits is sufficient to indicate for each of these numbers whether it's present (is bit no. n set or not?). For five-digit numbers (90,000 of them) you'd need a larger table.
Since the bit table can only tell you "yes, we have it" or "no, we haven't", you can't use it if you need any information beyond that. But if that's all you need to know, any injective mapping of keys to indices into the table (array) gives you access to that information, compactly stored. In the case of the telephone numbers, the mapping is trivial: the number itself is the bit index.

You can use a bit table of 10,000 bits (each bit corresponding to one phone number), which fits in 313 32-bit words (10000/32 = 312.5, rounded up to 313).
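A minimal sketch of such an existence table in C (the names are mine, not the book's): 10,000 bits packed into 313 32-bit words, with the 4-digit extension itself used as the bit index.

#include <stdint.h>
#include <stdio.h>

/* 10,000 bits for extensions 0000-9999, packed into 32-bit words:
   (10000 + 31) / 32 = 313 words, i.e. the book's table size. */
static uint32_t table[(10000 + 31) / 32];

/* Mark extension n as assigned. */
void insert(int n) { table[n / 32] |= (uint32_t)1 << (n % 32); }

/* Return nonzero if extension n was already assigned. */
int contains(int n) { return (table[n / 32] >> (n % 32)) & 1; }

int main(void)
{
    insert(4217);
    printf("%d %d\n", contains(4217), contains(4218)); /* prints: 1 0 */
    return 0;
}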

Related

Bayesian Network creating conditional probability table (CPT)

I have trouble understanding where the numbers in the P(A|B,E) table are coming from in the alarm-burglary example. I understand that P(B) and P(E) are chosen from knowledge about the domain. But I do not understand which of the values in the CPT can be chosen and which have to be calculated in order to make the tables valid. I assume that P(J|A) and P(J|¬A) are chosen by expert knowledge? And then it must be the same for P(J|M)... or would these also have to be calculated using given values?
I see a binary example given in the table on page 7 here: https://cseweb.ucsd.edu/~elkan/250A/bayesnets.pdf. They are using the same numbers, but how have they calculated the values 0.95, 0.94, 0.29 and 0.001?
All the values in CPTs must come from somewhere and cannot be calculated from other CPTs. There are two major approaches to getting the numbers:
Have a domain expert specify the numbers.
Have a data set that contains joint realizations of the random variables. Then the numbers within the CPTs can be calculated from the respective frequencies within the data set. Note that this procedure becomes more complicated when not all variables are observed within the data set.
In addition, it is possible to mix approaches 1 and 2.
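As an illustration of approach 2, here is a minimal sketch in C that estimates P(A=1|B,E) for the alarm example from counted joint realizations. The counts are made up for illustration; a real estimator would also handle sparse cells (e.g. with smoothing).

#include <stdio.h>

/* count[b][e][a]: how often the joint realization (B=b, E=e, A=a)
   occurs in a hypothetical, fully observed data set (made-up numbers). */
static const int count[2][2][2] = {
    {{9850, 10}, {28, 12}},  /* B=0: {E=0:{A=0,A=1}, E=1:{A=0,A=1}} */
    {{3, 47}, {1, 19}},      /* B=1 */
};

/* Relative-frequency estimate of P(A=1 | B=b, E=e). */
double p_alarm_given(int b, int e)
{
    int total = count[b][e][0] + count[b][e][1];
    return (double)count[b][e][1] / total;
}

int main(void)
{
    for (int b = 0; b < 2; b++)
        for (int e = 0; e < 2; e++)
            printf("P(A=1 | B=%d, E=%d) = %.3f\n", b, e, p_alarm_given(b, e));
    return 0;
}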

Generate random 10 digit number as part of ID and ensure no duplicates

I have a request as part of an application I am working on to generate a UserId with the following criteria:
It will have a set 4 digit prefix
The next 10 digits will be random
A check-digit will be added on the end
Points 1 and 3 are straightforward, but what would be the best way to generate a 10-digit random number while ensuring that it hasn't already been used?
I don't particularly like the idea of choosing one randomly, seeing if it has been taken and then either accepting or trying again.
My other thought was to generate a list of X numbers in advance (X being a number greater than the number of accounts you expect to be created) and just take the next one off the list as the accounts are created.
Any thoughts?
EDIT:
Let's bring technology into this. If I am using a SQL Server database, is there a way I can make the database do this for me? E.g. enforce a unique constraint and have the database generate the number?
Encryption. See this answer.
What you basically need is to encrypt a sequential ID, thus producing a seemingly random number.
What's especially good about this is that you can do this all client-side, albeit in two consecutive transactions.
The only way to be 100% sure there is no other user ID with the same random digits is to compare the generated ID against all existing users; you have to do it at some point.
The only possibility is to do this on the code side.
But you can save the randomly generated IDs to the user database and compare your new key against them (in your code).
e.g. SELECT * FROM xy WHERE userid = 'newuserid'
If the result is empty, your key was never generated.
Thanks to Anton's answer I found this C# implementation of the skip32 encryption algorithm.
https://github.com/eleven41/Eleven41.Skip32
Using this I can pass in an incrementing database identity integer and get a nice random-looking 9- or 10-digit number from it. So a simple call like
int result = cipher.Encrypt(databaseIdentity);
will give me the number I want. No need for duplicate checking as each will be unique.
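For reference, the underlying idea can be sketched in a few lines of C. This is a toy 4-round Feistel permutation over 32-bit values, not Skip32 itself, and the round function and keys below are made up; but any fixed keyed permutation has the property that matters here: distinct sequential IDs always map to distinct outputs, so no duplicate check is needed.

#include <stdint.h>
#include <stdio.h>

/* Made-up round keys; in practice derive them from a secret. */
static const uint16_t KEYS[4] = {0x3a4f, 0x91c2, 0x5be7, 0x70d1};

/* Toy round function (not cryptographically strong). */
static uint16_t round_fn(uint16_t half, uint16_t key)
{
    uint16_t v = (uint16_t)((half ^ key) * 40503u + 12345u); /* cheap mixing */
    return (uint16_t)((v << 7) | (v >> 9));                  /* 16-bit rotate */
}

/* A Feistel network is invertible, hence a permutation:
   distinct inputs always produce distinct outputs. */
uint32_t feistel_encrypt(uint32_t id)
{
    uint16_t left = (uint16_t)(id >> 16), right = (uint16_t)id;
    for (int r = 0; r < 4; r++) {
        uint16_t tmp = right;
        right = (uint16_t)(left ^ round_fn(right, KEYS[r]));
        left = tmp;
    }
    return ((uint32_t)left << 16) | right;
}

int main(void)
{
    /* Three consecutive IDs come out scattered and unique. */
    for (uint32_t id = 1; id <= 3; id++)
        printf("%u -> %u\n", id, feistel_encrypt(id));
    return 0;
}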

Finding a number appearing again among numbers stored in a file

Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared once previously?
I can't just load billions of numbers into an array in one go and then run a simple nested loop to check whether a number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N):
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered alongside the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below reduces memory consumption to 1/8th of that, 512 MB. On many machines this computation is now possible without resorting to either a persistent hash or the less performant sort-first strategy.
Longer numbers or double-precision numbers are less effective scenarios for the bit-array strategy.
Phimuemue Edit:
Of course one needs a somewhat "special" hash table:
Take a hash table consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in the bitmap and set it. If one encounters a bit which is already set, the number occurred in the sequence already.
The important question is whether you want to solve this problem efficiently or whether you want accuracy.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant fraction of the numbers.
Instead, turn to probabilistic solutions, which have been used in almost any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to yield exact results: use a sufficiently high-resolution Bloom filter. Either use the filter alone to determine whether an element has already been seen, or, if you want perfect accuracy, use the filter to rule out elements you can't possibly have seen and, on a second pass with a standard hash table (as kbrimington suggested), determine the elements you actually see twice.
And if your problem is slightly different (for instance, you know that at least 0.001% of the elements repeat, and you would like to find out approximately how many there are, or you would like a random sample of such elements), then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin and Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, creating a hash table that stores the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item, for example the least significant bits, let's say 20 bits (about 1M buckets).
After the first pass, all items whose counter is > 1 may point to a duplicated item, or be a false positive. Rescan the file, considering only items that may lead to a duplicate (by looking each item up in the first table), and build a new hash table using the real values as keys now, storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
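A rough C sketch of the two passes, with made-up sizes and a hypothetical input file name; to keep it short, the second hash table is replaced by sorting the surviving candidates and scanning adjacent pairs, which serves the same purpose of exact-key counting:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUCKETS (1u << 20)  /* ~1M buckets keyed on the low 20 bits */

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    FILE *f = fopen("numbers.txt", "r");  /* hypothetical input file */
    if (!f) return 1;

    /* Pass 1: saturating 8-bit counter per hash bucket. */
    uint8_t *bucket = calloc(BUCKETS, 1);
    uint32_t x;
    while (fscanf(f, "%u", &x) == 1)
        if (bucket[x & (BUCKETS - 1)] < 255)
            bucket[x & (BUCKETS - 1)]++;

    /* Pass 2: keep only items whose bucket saw more than one entry. */
    rewind(f);
    size_t n = 0, cap = 1024;
    uint32_t *cand = malloc(cap * sizeof *cand);
    while (fscanf(f, "%u", &x) == 1)
        if (bucket[x & (BUCKETS - 1)] > 1) {
            if (n == cap) cand = realloc(cand, (cap *= 2) * sizeof *cand);
            cand[n++] = x;
        }

    /* True duplicates end up adjacent after sorting; print each once. */
    qsort(cand, n, sizeof *cand, cmp_u32);
    for (size_t i = 1; i < n; i++)
        if (cand[i] == cand[i - 1] && (i < 2 || cand[i] != cand[i - 2]))
            printf("duplicate: %u\n", cand[i]);

    fclose(f);
    free(bucket);
    free(cand);
    return 0;
}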
How about:
Sort the input using an algorithm that only needs a portion of the input in RAM at a time (external merge sort, for example).
Seek duplicates in the output of the first step; you'll need space for just 2 elements of the input in RAM at a time to detect repetitions, as the sketch below shows.
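The second step needs only the previous element, since after sorting all duplicates are adjacent. A quick C sketch, assuming the sorted numbers arrive on standard input, one per line:

#include <stdio.h>

int main(void)
{
    unsigned prev = 0, cur;
    int have_prev = 0;

    /* After sorting, duplicates are adjacent: remember just the
       previous value and compare it with the current one. */
    while (scanf("%u", &cur) == 1) {
        if (have_prev && cur == prev) {
            printf("duplicate: %u\n", cur);
            return 0;
        }
        prev = cur;
        have_prev = 1;
    }
    printf("no duplicate\n");
    return 0;
}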
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent about 4.3 billion different numbers and you have "10 billions".
If you were to use a tightly packed bit set you could represent all the possibilities in 512 MB, which easily fits into current RAM sizes. As a start, this pretty easily allows you to recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated, you're getting into having a hash map that contains only duplicates (using the first 512 MB of RAM to tell efficiently whether a number should be in the map or not). In a worst-case scenario with a large spread you're not going to be able to fit that into RAM.
Another approach, if the numbers have a reasonably even spread of duplicates, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 255 occurrences of each number; see the sketch below.
It's going to be a hack, but it's doable.
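A sketch of the packed-counter idea at the small end of that range, 2 bits per value (1 GB total); counts saturate at "3 or more", and widening the field to 8 bits per value gives counts up to 255 as described above:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 2 bits per possible 32-bit value, 4 values per byte:
       2^32 * 2 bits = 1 GB. */
    uint8_t *counts = calloc(1u << 30, 1);
    if (!counts) return 1;

    unsigned x;
    while (scanf("%u", &x) == 1) {
        unsigned shift = (x % 4) * 2;
        unsigned c = (counts[x / 4] >> shift) & 3;
        if (c < 3)  /* saturate so the 2-bit field never overflows */
            counts[x / 4] += (uint8_t)(1u << shift);
    }

    /* Report duplicated values and how often they were seen. */
    for (uint64_t v = 0; v < (1ull << 32); v++) {
        unsigned c = (counts[v / 4] >> ((v % 4) * 2)) & 3;
        if (c >= 2)
            printf("%llu seen %s\n", (unsigned long long)v,
                   c == 2 ? "twice" : "3+ times");
    }
    free(counts);
    return 0;
}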
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it in a hash set, so that if a number occurs again it will automatically be detected as a duplicate.
If the possible range of numbers in the file is not too large, then you can use a bit array to indicate whether each number in the range has appeared.
If the range of the numbers is small enough, you can use a bit field to store whether each number is in there; initialize it with a single scan through the file. It takes one bit per possible number.
With a large range (like int) you'd need to read through the file every time. The file layout may allow for more efficient lookups (e.g. binary search in the case of a sorted array).
If time is not an issue but RAM is, you could read each number and then compare it to each subsequent number by reading from the file, without storing it in RAM. It will take an incredible amount of time, but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4,294,967,296 bits. You start by setting all bits to 0, and every number in the file sets a specific bit. If the bit is already set, then you've found a number that has occurred before. Do you also need to know how often they occur? Still, it would need 536,870,912 bytes (512 MB) at least. That's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What I did: I sorted the numbers as much as I could (I had a time constraint) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become:
[1,10], 12, 16, [20,50], 52, ...
Since in my case I had hundreds of numbers that were very "close" ($a-$b=1), out of a few million sets I had very low memory usage.
P.S. Another way to store them, when I had no numbers lower than zero:
1, -9, 12, 16, 20, -30, 52, ...
(a negative value -n means the preceding number starts a run of n more consecutive values)
After that I applied the various algorithms described by other posters here to the reduced data set; see the sketch below for the range encoding.
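A small sketch of that range compression in C (the function name is mine): a sorted array collapses runs of consecutive numbers into [start,end] pairs.

#include <stdio.h>

/* Print a sorted array as ranges, e.g. 1..10, 12, 16, 20..25, 52
   becomes [1,10] 12 16 [20,25] 52. */
void print_compressed(const int *a, int n)
{
    for (int i = 0; i < n; ) {
        int j = i;
        while (j + 1 < n && a[j + 1] == a[j] + 1)  /* extend the run */
            j++;
        if (j > i) printf("[%d,%d] ", a[i], a[j]);
        else       printf("%d ", a[i]);
        i = j + 1;
    }
    putchar('\n');
}

int main(void)
{
    int a[] = {1,2,3,4,5,6,7,8,9,10,12,16,20,21,22,23,24,25,52};
    print_compressed(a, (int)(sizeof a / sizeof *a));  /* [1,10] 12 16 [20,25] 52 */
    return 0;
}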
#include <stdio.h>
#include <stdlib.h>

/* Set (|=) or test (&) bit b of array a.
   Macro is overly general but I left it 'cos it's convenient. */
#define BITOP(a,b,op) \
 ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))

int main(void)
{
    unsigned x;
    int dup = 0;
    /* One bit per possible unsigned value: 2^32 bits = 512 MB.
       calloc (rather than malloc) so all bits start out clear. */
    size_t *seen = calloc((size_t)1 << (8 * sizeof(unsigned) - 3), 1);
    if (!seen) return 1;

    while (scanf("%u", &x) > 0) {
        if (BITOP(seen, x, &)) { dup = 1; break; }  /* bit already set */
        BITOP(seen, x, |=);                         /* mark x as seen */
    }
    if (dup) printf("duplicate is %u\n", x);
    else printf("no duplicate\n");
    free(seen);
    return 0;
}
This is a simple problem that can be solved very easily (a few lines of code) and very fast (several minutes of execution) with the right tools.
My personal approach would be to use MapReduce.
MapReduce: Simplified Data Processing on Large Clusters
I'm sorry for not going into more detail, but once you get familiar with the concept of MapReduce it becomes very clear how to target the solution.
Basically we are going to implement two simple functions:
Map(key, value)
Reduce(key, values[])
So, all in all:
open the file and iterate through the data
for each number -> Map(number, line_index)
in the Reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]), if the number of values > 1 then it's a duplicate number
print the duplicates: number, line_index1, line_index2, ...
Again, this approach can result in very fast execution depending on how your MapReduce framework is set up; it is highly scalable and very reliable, and there are many different implementations of MapReduce in many languages.
There are several top companies offering ready-built cloud computing environments, like Google, Microsoft Azure, Amazon AWS, ...
Or you can build your own and set up a cluster with any provider offering virtual computing environments, paying very low costs by the hour.
Good luck :)
Another, simpler approach could be to use Bloom filters.
Implement a bit array such that the i-th byte of the array corresponds to the numbers 8*i to 8*i+7, i.e. the first bit of byte i is 1 if we have already seen 8*i, the second bit is 1 if we have already seen 8*i+1, and so on.
Initialize this bit array with size Integer.MAX/8, and whenever you see a number k, set the (k%8)-th bit of the byte at index k/8. If this bit is already 1, it means you have seen this number already.

Is there any benefit to my rather quirky character sizing convention?

I love things that are a power of 2. I celebrated my 32nd birthday knowing it was the last time in 32 years I'd be able to claim that my age was a power of 2. I'm obsessed. It's like being some Z-list Batman villain, except without the colourful adventures and a face full of batarangs.
I ensure that all my enum values are powers of 2, if only for future bitwise operations, and I'm reasonably assured that there is some purpose (even if latent) for doing it.
Where I'm less sure, is in how I define the lengths of database fields. Again, I can't help it. Everything ends up being a power of 2.
CREATE TABLE Person
(
PersonID int IDENTITY PRIMARY KEY
,Firstname varchar(64)
,Surname varchar(128)
)
Can any SQL super-boffins who know about the internals of how stuff is stored and retrieved tell me whether there is any benefit to my inexplicable obsession? Is it more efficient to size character fields this way? Can anyone pop in with an "actually, what you're doing works because ....."?
I suspect I'm just getting crazier in my older age, but it'd be nice to know that there is some method to my madness.
Well, if I'm your coworker and I'm reading your code, I don't have to use SVN blame to find out who wrote it. That's kind of cool. :)
The only relevant powers of two are 512 and 4096, which are the default disk block size and memory page size respectively. If your total row length crosses these boundaries, you might notice disproportionate jumps in performance if you look very closely. For example, if your row is 513 bytes long, you need to read twice as many blocks for a single row as for a row that is 512 bytes long.
The problem is calculating the proper row size, as the internal storage format is not very well documented.
Also, I do not know whether SQL Server actually keeps the rows block-aligned, so you might be out of luck there anyway.
With varchar, you only store the number of characters plus 2 bytes for the length.
Generally, the maximum row size is 8060 bytes:
CREATE TABLE dbo.bob (c1 char(3000), c2 char(3000), c3 char(3000))
Msg 1701, Level 16, State 1, Line 1
Creating or altering table 'bob' failed because the minimum row size would be 9007, including 7 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.
The power of 2 stuff is frankly irrational and that isn't good in a programmer...

Continuous vs. discrete attributes

Could anyone please clarify the difference between continuous and discrete attributes?
Thanks.
I will try to explain with an example:
Suppose a table in your database has a column which stores the temperature of the day, or say of a furnace. The values for that column come from a continuous domain of temperature values.
If the table has a column named gender, then that is discrete, in the sense that only two or maybe three values comprise its domain.
I hope this helps.
cheers
(It's been a long while since I did any pure maths, so take this with a pinch of salt.)
Speaking theoretically, continuous attributes come from an infinite set (e.g. the real numbers; you can make the values as large or small as you need). Discrete attributes come from a finite or countably infinite set (e.g. the integers).
Another way of looking at it is that continuous attributes can have infinitesimally small differences between one value and the next, while discrete attributes always have some limit on the difference between one value and the next.
Practically speaking, continuous attributes would be represented by a floating-point type, whereas discrete attributes would be integers or characters.
Simon Righarts is right, except for his final conclusion.
Since computer memory is always finite, the set of representable values of any type is by definition also always finite, and therefore in computer science there is no such thing as a continuous TYPE (which I think is what you were really asking about, not "continuous attributes"). Well, at least not in the part of computer science that gets applied anywhere in real life.
The classical floating-point type, encoded in 32 bits, has a maximum of 2^32 representable values. The classical floating-point type, encoded in 64 bits, has a maximum of 2^64 representable values. Non-representable values are plain useless and not worth considering. BigInteger types, which take as many bytes as are needed to hold a value, are limited to a maximum of 2^(8*computermemorysize) representable values. All of them are very much finite.
Data can be Descriptive (like "high" or "fast") or Numerical (numbers).
And Numerical Data can be Discrete or Continuous:
Discrete data is counted,
Continuous data is measured
Discrete Data
Discrete Data can only take certain values.
Example 1: the number of students in a class; we can't have half a student.
Example 2: the results of rolling 2 dice can only be the values 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12; we cannot have 2.1 or 3.5.
Continuous Data
Continuous Data can take any value (within a range).
Examples:
A person's height could be any value (within the range of human heights), not just certain fixed heights; the time in a race, which you could measure to fractions of a second; a dog's weight; or the length of a leaf.
Attributes:
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete attributes
Continuous Attribute:
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits
