How do I track random numbers with stateless services - database

I have a design problem that I'm looking for an efficient way to solve:
I have three instances of a single service running. Each instance is totally stateless and exposes a single endpoint /token. When the /token endpoint is called by a client, a random number is returned. The random number is generated by a non-repeating pseudo-random number generator which generates a unique random integer for the first n-times it is called and then repeat the same sequence the next n-times it is called. In other words, it's a repeating cycle of n values. So say n = 20, it'll return unique values within the range of 0 to 20 for the first 20 times it is called.
The problem here is: given that I have three instances of this service running, how do I avoid duplicating random integers since any of the services can't know what random value has been generated by either of them.
Here's what I've done:
I have passed as enviroment variable, a seed value that ensures that all the services are generating the same random sequence.
I have setup a database that each of these services can access remotely. It has a table with a single column map set to a default value of 0
When the client calls the /token endpoint of a service, the service increases the value in the map column by 1 and fetches the resulting value.
I then return the random number this resulting value maps to in the random sequence
Is the above approach efficient ?
Could the services experience a race condition when trying to access the database row ?
Could this problem be solved without a database ?
Suggestions would be really appreciated.
Thanks in advance

Sounds like snowflake could be used -
Instead of regular integers, well they will be integers again but you will need to encode a bit of information inside them, you can construct them so they encode the information bitwise.
So let's say your integers will be 64 bits. You can leave x bits for service, you can leave y bits for seed and z bits for the "real" id.(for example datetime ticks).
Doing this you will have something like 0001 (serviceid) (x=4) 1010 (z=4) (your n=20) 0101....0 (56 bits for datetime ticks).
So now you can identify your generated numbers by serviceId, give them some seed (maybe configuration) and some "real" part.
More on snowflake you can probably find from here.

Related

Bayesian Network creating conditional probability table (CPT)

I have trouble understanding where the numbers in the P(A|B,E) table are coming from in the alarm burglary example. I understand that P(B) and P(E) is chosen from knowledge about the domain. But I do not understand how many of the values in the CPT which can be chosen and which has to be calculated in order to make the tables valid. I assume that the P(J|A) and P(J|¬A) are chosen by expert knowledge? And then it must be the same for P(J|M).. or would these also have to be calculated by using given values?
I see with a binary example which is given here in the table on page 7:
https://cseweb.ucsd.edu/~elkan/250A/bayesnets.pdf, they are using the same numbers, but how have they calculated the values 0.95, 0.94, 0.29 and 0.001?
All the values in CPTs must come from somewhere, and cannot be calculated from other CPTs. There are two major approaches to get the numbers:
Have a domain expert specify the numbers.
Have a data set that contains joint realizations of the random variables. Then the numbers within the CPTs can be calculated from the respective frequencies within the data set. Note that this procedure becomes more complicated when not all variables are observed within the data set.
In addition, it is possible to mix approach 1 and 2.

Named range of consistent random numbers

Background
Following on from a question I asked a while ago about getting an array of different (but not necessarily unique) random numbers to which the answer was this:
=RANDBETWEEN(ROW(A1:A10)^0,10)
To get an array of 10 random numbers between 1 and 10
The Problem
If I create a named range (called "randArray") with the formula above I hoped I would be able to reference randArray a number of times and get the same set of random numbers. Granted, they would change each time I press F9 or update the worksheet -- but change together.
This is what I get instead, two completely different sets of random numbers
I'm not surprised by this behavior but how can I achieve this without using VBA and without putting the random numbers onto the worksheet?
If you're interested
This example is intended to be MCVE. In my actual case, I am using random numbers to estimate Pi. The user stipulates how many random points to apply and gets an accordingly accurate estimation. The problem arises because I also graph the points and when there are a small number of points it's very clear to see that the estimation and the graph don't represent the same dataset
Update
I have awarded the initial bounty to #Michael for providing an interesting and different solution. I am still looking for a complete solution which allows the user to stipulate how many random points to use, and although there might not be a perfect answer I'm still interested in any other possible solutions and more than happy to put up further bounties.
Thank you to everyone who has contributed so far.
This solution generates 10 seemingly random numbers between 1 and 10 that persist for nearly 9 seconds at a time. This allows repeated calls of the same formula to return the same set of values in a single refresh.
You can modify the time frame if required. Shorter time periods allow for more frequent updates, but also slightly increase the extremely unlikely chance that some calls to the formula occur after the cutover point resulting in a 2nd set of 10 random numbers for subsequent calls.
Firstly, define an array "Primes" with 10 different prime numbers:
={157;163;167;173;179;181;191;193;197;199}
Then, define this formula that will return an array of 10 random numbers:
=MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Primes),0),10)+1
Explanation:
We need to build our own random number generator that we can seed with the same value for an amount of time; long enough for the called formula to keep returning the same value.
Firstly, we create a seed: ROUND(NOW(),4) creates a new seed number every 0.0001 days = 8.64 seconds.
We can generate rough random numbers using the following formula:
Random = Seed * 7 mod Prime
https://cdsmith.wordpress.com/2011/10/10/build-your-own-simple-random-numbers/
Ideally, a sequence of random numbers is generated by taking input from the previous output, but we can't do that in a single function. So instead, this uses 10 different prime numbers, essentially starting 10 different random number generators. Now, this has less reliability at generating random numbers, but testing results further below shows it actually seems to do a pretty good job.
ROUND(NOW(),4)*70000 gets our seed up to an integer and multiplies by 7 at the same time
MOD(ROUND(NOW(),4)*70000,Prime) generates a sequence of 10 random numbers from 0 to the respective prime number
ROUND(MOD(ROUND(NOW(),4)*70000,Prime),0) is required to get us back to an integer because Excel seems to struggle with apply Mod to floating point numbers.
=MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Prime),0),10)+1 takes just the value from the ones place (random number from 0 to 9) and shifts it to give us a random number from 1 to 10
Testing results:
I generated 500 lots of 10 random numbers (in columns instead of rows) for seed values incrementing by 0.0001 and counted the number of times each digit occurred for each prime number. You can see that each digit occurred nearly 500 times in total and that the distribution of each digit is nearly equal between each prime number. So, this may be adequate for your purposes.
Looking at the numbers generated in immediate succession you can see similarities between adjacent prime numbers, they're not exactly the same but they're pretty close in places, even if they're offset by a few rows. However, if the refresh is occurring at random intervals, you'll still get seemingly random numbers and this should be sufficient for your purposes. Otherwise, you can still apply this approach to a more complex random number generator or try a different mix of prime numbers that are further apart.
Update 1: Trying to find a way of being able to specify the number of random numbers generated without storing a list of primes.
Attempt 1: Using a single prime with an array of seeds:
=MOD(ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))/10000,4)*70000,1013),0),10)+1
This does give you an even distribution, but it really is just repeating the exact same sequence of 10 numbers over and over. Any analysis of the sample would be identical to analysing =MOD(ROW(1:SampleSize),10)+1. I think you want more variation than that!
Attempt 2: Working on a 2-dimensional array that still uses 10 primes....
Update 2: Didn't work. It had terrible performance. A new answer has been submitted that takes a similar but different approach.
OK, here's a solution where users can specify the number of values in defined name SAMPLESIZE
=MOD(ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize)),4)*10000*163,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*2,4)*10000*211,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*3,4)*10000*17,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*5,4)*10000*179,53),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*7,4)*10000*6101,1013),0),10)+1
It's a long formula, but has good efficiency and can be used in other functions. Attempts at a shorter formula resulted in unusably poor performance and arrays that for some reason couldn't be used in other functions.
This solution combines 5 different prime number generators to increase variety in the generated random numbers. Some arbitrary constants were introduced to try to reduce repeating patterns.
This has correct distribution and fairly good randomness. Repeated testing with a SampleSize of 10,000 resulted in frequencies of individual numbers varying between 960 and 1040 with no overall favoritism. However it seems to have the strange property of never generating the same number twice in a row!
You can achieve this using just standard spreadsheet formulas.
One way is to use the so called Lehmer random number method. It generates a sequence of random numbers in your spreadsheet that stays the same until you change the "seed number", a number you choose yourself and will recreate a different random sequence for each seed number you choose.
The short version:
In cell B1, enter your "seed" number, it can be any number from 1 to 2,147,483,647
In cell B2 enter the formula =MOD(48271*B1,2^31-1) , this will generate the first random number of your sequence.
Now copy this cell down as far as the the random sequence you want to generate.
That's it. For your named range, go ahead and name the range from B2 down as far as your sequence goes. If you want a different set of numbers, just change the seed in B1. If you ever want to recreate the same set of numbers just use the same seed and the same random sequence will appear.
More details in this tutorial:
How to generate random numbers that don't change in Excel and Google Sheets
It's not a great answer but considering the limitation of a volatile function, it is definitely a possible answer to use the IF formula with Volatile function and a Volatile variable placed somewhere in the worksheet.
I used the below formula to achieve the desired result
=IF(rngIsVolatile,randArray,A1:A10)
I set cell B12 as rngIsVolatile. I pasted the screenshots below to see it in working.
When rngIsVolatile is set to True, it picks up new values from randArray:
When rngIsVolatile is set to False, it picks up old values from A1:A10:

Generate random 10 digit number as part of ID and ensure no duplicates

I have a request as part of an application I am working on to generate a UserId with the following criteria:
It will have a set 4 digit prefix
The next 10 digits will be random
A check-digit will be added on the end
Points 1 and 3 are straight-forward, but what would the best way be to generate a 10 digit random number whilst ensuring that it hasn't already been used.
I don't particularly like the idea of choosing one randomly, seeing if it has been taken and then either accepting or trying again.
My other thought was to generate a list of X numbers in advance (X being a number greater than the number of accounts you expect to be created) and just take the next one off the list as the accounts are created.
Any thoughts?
EDIT:
Lets bring technology into this. If I am using a SQL server database, is there a way I can make the database do this for me? E.g. enforce a unique constraint and get the database to generate the number?
Encryption. See this answer.
What you need is basically encrypt a sequential ID, thus producing a seemingly random number.
What's especially good about this is that you can do this all client-side, albeit in two consecutive transactions.
The only way to be 100% sure there is no other userid with the same random digits is to compare the gernerated to all existing users, you have to do it sometime
The only possibility is to do this code-side.
BUT you can save the random generated id to the userdatabase and compare this with your new key (in your code)
e.g. SELECT * FROM xy WHERE userid = 'newuserid'
If the result is null your key never was generated.
Thanks to Anton's answer I found this C# implementation of the skip32 encryption algorithm.
https://github.com/eleven41/Eleven41.Skip32
Using this I can pass in an incrementing database identity integer and get a nice random looking 9/10 digit number from it. So a simple call like
int result = cipher.Encrypt(databaseIdentity);
will give me the number I want. No need for duplicate checking as each will be unique.

What is a good algorithm to check whether or not a number exist in multiple sets without searching them all?

Scenario
Let's say you have multiple databases in 3 zones. Zone A, B, and C. Each zone in different geographical location. At the same time, you have an application that will route username and password based on the geographical location of the user. For example, user A will be redirected to database in Zone A. User B Zone B and so on.
Now, let's say user A moves to a zone B. The application query zone B and won't find anything. Querying zone A and zone C might take some time due to zones are far away, and will have to query all the databases in all zones.
My Question
How can you verify if a string/number exists in multiple sets?
or
How can you verify a row exist in the database before even sending a query?
My Algorithm
This is not perfect, but will give you some idea what I'm trying to do
If we have the database with the following 3 users
foo
bar
foobar
We take the hash of all 3 users, and look for the next prime number if the hash is not prime.
sum = hash(foo).nextPrime() * hash(bar).nextPrime() * hash(foobar).nextPrime()
That sum is shared between all zones. If I want to check foo, I can just take the hash of foo, and look for the next prime, then take the gcd(foo,sum). If it's not equal to one. It means foo exist in some database. If it equal to one, it means foo doesn't exist at all. If I want to add a new username. I can simply do sum = sum * hash(newUserName).nextPrime().
Sum will grow to a point that will be just faster to query all databases.
Do you know a similar algorithm to solve this problem?
One data structure suitable for this application is a Bloom filter.
A Bloom filter is a probabilistic data structure which allows you test whether an item is already in a set. If the test returning false then the item is definitely not in the set (0% false negatives), if true then it may be in the set, but is not guaranteed to be (false positives are possible).
The filter is implemented as a bit array with m bits and a set of k hash functions. To add an item to the array (e.g. a username), hash the item using each of the hash functions and then take the modulo m of each hash value to compute the indexes to set in the bit array. To test if an item is in the set, compute all the hashes and indexes and check that all of the corresponding bits in the array are set to 1. If any of them are zero them the item is definitely not in the set, if all are 1 then the item is most likely in the set, but there is a small chance it may not be, the percentage of false positives can be reduced by using a larger m.
To implement the k hash functions, it is possible to just use the same hashing algorithm (e.g. CRC32, MD5 etc) but append different salts to the username string for each before passing to the hash function, effectively creating "new" hash functions for each salt. For a given m and n (number of elements being added), the optimal number of hash functions is k = (m / n) ln 2
For your application, the Bloom filter bit array would be shared across all zones A B C etc. When a user attempts to login, you could first check in the database of the local zone, and if present then log them in as normal. If not present in the local database, check the Bloom filter and if the result is negative then you know for sure that they don't exist in another zone. If positive, then you still need to check the databases in the other zones (because of the possibility of a false positive), but presumably this isn't a big issue because you would be contacting the other zones in any case to transfer the user's data in the case that it was a true positive.
One down-side of using a Bloom filter is that it is difficult (though not impossible) to remove elements from the set once they have been added.

Should I start user IDs from 1 or 1000 in database? Why?

Should I use high numbers for user IDs in database?
Are there any benefits with starting user_id from 1000 (from <9000 users project) or 10000 for more...?
The advantage of starting user IDs from 1000 (even when you will have fewer than 9,000 IDs) is that they will all have the same number of digits, so that files, for example, suffixed with the UID will sort in numeric order automatically, even if the sorter only uses alphabetic numbering. And you don't have to pad the numbers with leading zeroes to get there.
The converse is that if you only have 1000 users, numbers starting at 1,000,000,000 would look a little silly: 1,000,000,001 then 1,000,000,002 and so on.
For many purposes, therefore, it doesn't matter which you do. A uniform number of digits has some advantages, and that is why a value other than zero or one is often used as the starting point.
not really. I would just start from 1. if you have any needs to put stuff in before one, there are no issues w/ using negative numbers, so you can just do an insert and manually specifiy the id. At my company, we start all the users at one, auto incrementing, and our global admin user is ID 0.
I know this answer comes late, but still there is something to add, imo:
1) The approach to use 1000 as a start id can be of an advantage, e.g. if you do not want to make it obvious how many users you have (in case you make the id visible somewhere in an url or sth), and therefore (or in addition)
2) it can be useful if you want to make ids harder to guess, because usually the first ids belong to admins or moderators, so if you takes any id to start (e.g. 1421), you could just add another security tweak to your db...

Resources