Solr Random Ordering Internals

Solr Random Ordering Internals - solr

I am using Random sorting. For Random sorting I am providing seed at run time which is random value from 1001 to 999999.
And the results I am getting are random but not properly distributed. Lets there are 5 results, then in 100 run, approx 20 times each item should come to top. But it is not happening,. One item is coming at top approx 80 times. How should I use random ordering so as the order is fairly distributed.
Also How Random sorting acutally works. How does seed play role? I mean using this seed, solr generates order. So how does it actually does it.? What algo it uses? How can I smartly change random seed?
It depends on number of documents?

Related

Named range of consistent random numbers

Background
Following on from a question I asked a while ago about getting an array of different (but not necessarily unique) random numbers to which the answer was this:
=RANDBETWEEN(ROW(A1:A10)^0,10)
To get an array of 10 random numbers between 1 and 10
The Problem
If I create a named range (called "randArray") with the formula above I hoped I would be able to reference randArray a number of times and get the same set of random numbers. Granted, they would change each time I press F9 or update the worksheet -- but change together.
This is what I get instead, two completely different sets of random numbers
I'm not surprised by this behavior but how can I achieve this without using VBA and without putting the random numbers onto the worksheet?
If you're interested
This example is intended to be MCVE. In my actual case, I am using random numbers to estimate Pi. The user stipulates how many random points to apply and gets an accordingly accurate estimation. The problem arises because I also graph the points and when there are a small number of points it's very clear to see that the estimation and the graph don't represent the same dataset
Update
I have awarded the initial bounty to #Michael for providing an interesting and different solution. I am still looking for a complete solution which allows the user to stipulate how many random points to use, and although there might not be a perfect answer I'm still interested in any other possible solutions and more than happy to put up further bounties.
Thank you to everyone who has contributed so far.

This solution generates 10 seemingly random numbers between 1 and 10 that persist for nearly 9 seconds at a time. This allows repeated calls of the same formula to return the same set of values in a single refresh.
You can modify the time frame if required. Shorter time periods allow for more frequent updates, but also slightly increase the extremely unlikely chance that some calls to the formula occur after the cutover point resulting in a 2nd set of 10 random numbers for subsequent calls.
Firstly, define an array "Primes" with 10 different prime numbers:
={157;163;167;173;179;181;191;193;197;199}
Then, define this formula that will return an array of 10 random numbers:
=MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Primes),0),10)+1
Explanation:
We need to build our own random number generator that we can seed with the same value for an amount of time; long enough for the called formula to keep returning the same value.
Firstly, we create a seed: ROUND(NOW(),4) creates a new seed number every 0.0001 days = 8.64 seconds.
We can generate rough random numbers using the following formula:
Random = Seed * 7 mod Prime
https://cdsmith.wordpress.com/2011/10/10/build-your-own-simple-random-numbers/
Ideally, a sequence of random numbers is generated by taking input from the previous output, but we can't do that in a single function. So instead, this uses 10 different prime numbers, essentially starting 10 different random number generators. Now, this has less reliability at generating random numbers, but testing results further below shows it actually seems to do a pretty good job.
ROUND(NOW(),4)*70000 gets our seed up to an integer and multiplies by 7 at the same time
MOD(ROUND(NOW(),4)*70000,Prime) generates a sequence of 10 random numbers from 0 to the respective prime number
ROUND(MOD(ROUND(NOW(),4)*70000,Prime),0) is required to get us back to an integer because Excel seems to struggle with apply Mod to floating point numbers.
=MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Prime),0),10)+1 takes just the value from the ones place (random number from 0 to 9) and shifts it to give us a random number from 1 to 10
Testing results:
I generated 500 lots of 10 random numbers (in columns instead of rows) for seed values incrementing by 0.0001 and counted the number of times each digit occurred for each prime number. You can see that each digit occurred nearly 500 times in total and that the distribution of each digit is nearly equal between each prime number. So, this may be adequate for your purposes.
Looking at the numbers generated in immediate succession you can see similarities between adjacent prime numbers, they're not exactly the same but they're pretty close in places, even if they're offset by a few rows. However, if the refresh is occurring at random intervals, you'll still get seemingly random numbers and this should be sufficient for your purposes. Otherwise, you can still apply this approach to a more complex random number generator or try a different mix of prime numbers that are further apart.
Update 1: Trying to find a way of being able to specify the number of random numbers generated without storing a list of primes.
Attempt 1: Using a single prime with an array of seeds:
=MOD(ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))/10000,4)*70000,1013),0),10)+1
This does give you an even distribution, but it really is just repeating the exact same sequence of 10 numbers over and over. Any analysis of the sample would be identical to analysing =MOD(ROW(1:SampleSize),10)+1. I think you want more variation than that!
Attempt 2: Working on a 2-dimensional array that still uses 10 primes....
Update 2: Didn't work. It had terrible performance. A new answer has been submitted that takes a similar but different approach.

OK, here's a solution where users can specify the number of values in defined name SAMPLESIZE
=MOD(ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize)),4)*10000*163,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*2,4)*10000*211,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*3,4)*10000*17,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*5,4)*10000*179,53),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*7,4)*10000*6101,1013),0),10)+1
It's a long formula, but has good efficiency and can be used in other functions. Attempts at a shorter formula resulted in unusably poor performance and arrays that for some reason couldn't be used in other functions.
This solution combines 5 different prime number generators to increase variety in the generated random numbers. Some arbitrary constants were introduced to try to reduce repeating patterns.
This has correct distribution and fairly good randomness. Repeated testing with a SampleSize of 10,000 resulted in frequencies of individual numbers varying between 960 and 1040 with no overall favoritism. However it seems to have the strange property of never generating the same number twice in a row!

You can achieve this using just standard spreadsheet formulas.
One way is to use the so called Lehmer random number method. It generates a sequence of random numbers in your spreadsheet that stays the same until you change the "seed number", a number you choose yourself and will recreate a different random sequence for each seed number you choose.
The short version:
In cell B1, enter your "seed" number, it can be any number from 1 to 2,147,483,647
In cell B2 enter the formula =MOD(48271*B1,2^31-1) , this will generate the first random number of your sequence.
Now copy this cell down as far as the the random sequence you want to generate.
That's it. For your named range, go ahead and name the range from B2 down as far as your sequence goes. If you want a different set of numbers, just change the seed in B1. If you ever want to recreate the same set of numbers just use the same seed and the same random sequence will appear.
More details in this tutorial:
How to generate random numbers that don't change in Excel and Google Sheets

It's not a great answer but considering the limitation of a volatile function, it is definitely a possible answer to use the IF formula with Volatile function and a Volatile variable placed somewhere in the worksheet.
I used the below formula to achieve the desired result
=IF(rngIsVolatile,randArray,A1:A10)
I set cell B12 as rngIsVolatile. I pasted the screenshots below to see it in working.
When rngIsVolatile is set to True, it picks up new values from randArray:
When rngIsVolatile is set to False, it picks up old values from A1:A10:

Generating uniformly distributed random numbers in distributed environment

I have to generate a "unique Random Number" in a Wireless sensor network which works on the principle of Gossiping.
The requirements are:
Each node has to generate a unique Random Number, without
having any shared knowledge of what other nodes have generated.
The Distribution of the generated Random number should be uniform with
respect to each other.
It would be preferable if the range of the generated random number is around 10-16 bits or may be lesser.
The limitations are:
One node has no idea what number the other nodes in the network are generating.
Implementation in C, C++.
I also have the provision of using a unique seed for random number generation. the seed could be any number in the range 0-2^15.
If there is no way of generating such numbers, then it would be helpful if there is any method by which I can meet some of the above requirements.
If you can suggest me some way to achieve this result it would be really helpful.

for this solution to work you must know the total number of nodes in the generation network. let this number be n.
the basic idea is to generate uniformly distributed random numbers on each participating node inside a given interval. the n intervals of the participating nodes must not overlap.
a shared seed does not complicate matters if the total number of nodes does not change and each node can be assigned statically some integer i <= n such that each number is issued exactly once. instead of generating a single random number on each turn, n numbers are generated, and node i takes the i-th number from this series.
however, the overall distribution of random numbers generated will not be uniform unless ...:
you synchronize random number generation.
all intervals have the same size.
for information on random number generation on individual nodes see here.

Generate pseudo sample of population given probabilities

I would like to generate pseudo data that conforms to the distribution of actual sampled data. Looking for an efficient and accurate method in C/Obj-C for iphone development. Currently the occurrance of 60 different categories in 1000 sampled events has been assigned a probability (0-1). I want to generate 1000 new events which conform to the same probabilities.
Clarification {
I have a categorical distribution of set {1,2,...,60}. I understand that samples from this distribution will conform to the probabilities of each category. Therefore I need to take 1000 samples from this distribution. I have determined (thanks to answers so far) that I need to:
Normalize this distribution by summing the values and dividing each
by the sum.
Order them.
Create a CDF by replacing each value with the sum of all previous values.
Then I can generate a uniform random number between 0 and 1, and find the greatest number in the CDF whose value is less than or equal to the number just chosen, and return the category corresponding to this CDF value.
}
Q1. Is this the correct way to solve the problem?
Q2. The caveat still holds that I'm using NSDecimals to store the category probabilities. Are there any libraries available or functions in Cocoa or Math.h, etc. that I can use to do this simply? I'm open to trying new libraries, currently only have Core-Plot and the standard Cocoa libraries in this project. Thanks.

Your problem description is unclear. But it sounds like you're looking for inverse transform sampling.
Basically, you first need to generate a cumulative distribution function (CDF) corresponding to your original data; call it F(x). You then generate uniform random data in the range 0->1, and then transform it using the inverse CDF, i.e F-1(x).

Here's my suggestion. This assumes that when you say "normalized probability" you mean the sum of the probability of all types is 1. (If not, you'll need to rescale so that's the case.)
Make up some order for your 60 types. (Say, alphabetic.)
Generate a random number between 0 and 1. (Call it your "target".)
Create an accumulator, initially at 0.
Loop through your 60 types. For each type:
Add the probability of that type of event to your accumulator.
If your accumulator is >= your target, generate an event of that type and stop.
If you do that 1000 times, I believe you'll get the distribution you're looking for.

Getting random entry from Objectify entity

How can I get a random element out of a Google App Engine datastore using Objectify? Should I fetch all of an entity's keys and choose randomly from them or is there a better way?

Assign a random number between 0 and 1 to each entity when you store it. To fetch a random record, generate another random number between 0 and 1, and query for the smallest entity with a random value greater than that.

You don't need to fetch all.
For example:
countall = query(X.class).count()
// http://groups.google.com/group/objectify-appengine/browse_frm/thread/3678cf34bb15d34d/82298e615691d6c5?lnk=gst&q=count#82298e615691d6c5
rnd = Generate random number [0..countall]
ofy.query(X.class).order("- date").limit(rnd); //for example -date or some chronic indexed field
Last id is your...
(in average you fatch 50% or at lest first read is in average 50% less)
Improvements (to have smaller key table in cache)!
After first read remember every X elements.
Cache id-s and their position. So next time query condition from selected id further (max ".limit(rnd%X)" will be X-1).
Random is just random, if it doesn't need to be close to 100% fair, speculate chronic field value (for example if you have 1000 records in 10 days, for random 501 select second element greater than fifth day).
Other options, if you have chronic field date (or similar), fetch elements older than random date and younger then random date + 1 (you need to know first date and last date). Second select random between fetched records. If query is empty select greater than etc...

Quoted from this post about selecting some random elements from an Objectified datastore:
If your ids are sequential, one way would be to randomly select 5
numbers from the id range known to be in use. Then use a query with an
"in" filter().
If you don't mind the 5 entries being adjacent, you can use count(),
limit(), and offset() to randomly find a block of 5 entries.
Otherwise, you'll probably need to use limit() and offset() to
randomly select one entry out at a time.
-- Josh

I pretty much adapt the algorithm provided Matejc. However, 3 things:
Instead of using count() or the datastore service factory (DatastoreServiceFactory.getDatastoreService()), I have an entity that keep track of the total count of the entities that I am interested in. The reason for this approach is that:
a. count() could be expensive when you are dealing with a lot of objects
b. You can't test the datastore service factory locally...testing in prod is just a bad practice.
Generating the random number: ThreadLocalRandom.current().nextLong(1, maxRange)
Instead of using limit(), I use offset, so I don't have to worry about "sorting."

What's a good shuffle percentage?

I'm basically new to coding for random results, but did some reading and tested out the javascript version of the Fisher-Yates algorithm (as seen on wikipedia), with an ordered list.
I ended up adding code to make sure the array was shuffled differently than its initial order, and also calculated the percentage of how many objects were shuffled to a different position by the algorithm.
So I'm wondering what might be considered a good result. Kind of a generic question. If I shuffled a deck of cards, what would be the least acceptable amount of shuffle? Right now I have mine coded to repeat the algorithm if it comes out less than 25 percent shuffled.
What do you think?

Zero. You can make any number of checks you like to make it feel more random, but even the check for the same order makes your algorithm flawed.

If your shuffle algorithm is correctly implemented and produces a truly random shuffle (modulo your PRNG's randomness, or lack thereof), I wouldn't reshuffle at all. In particular, the fact that you don't accept random configurations that are 25+% similar to your original configuration tells an adversary that they can expect not to see any of those configurations after your shuffling completes.

Thanks for the feedback. All of your answers were relevant.
I'm adding this answer since my notion of shuffling went from random to non random, and now I'm using a hybrid of the two.
With the perfect in-shuffle I can generate multiple lists in different orders (which modulate off the number of items). However, when the number of items is odd, the last number does not get shuffled. So I decided to randomize its position in the shuffled list.
In the process of figuring that out, I made a table generator which displays all the lists possible given a number of items. It's pretty interesting. For example, the number 52 generates 52 columns, while the number 51 only generates 8 columns.