Generating a unique id with 6 characters - handling the case when too many ids are already used - database

In my program you can book an item. Each item has an id made of 6 characters taken from 32 possible characters.
So there are 32^6 possible ids. Every id must be unique.
func tryToAddItem() {
    let id = generateId()
    if !db.contains(id) {
        addItem(id)
    } else {
        // collision: try again with a freshly generated id
        tryToAddItem()
    }
}
For example, say 90% of my ids are already used. Then the probability that I have to call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
That is quite high, and it means 5 database queries against a large table.
When the probability gets that high I want to introduce a prefix "A-xxxxxx".
What is a good condition for that? At what point will I need a prefix?
In my example 90% of the ids were used. What about the rest? Do I just throw them away?
And what about database performance when I call tryToAddItem 5 times? I imagine that is not best practice.

For example, say 90% of my ids are already used. Then the probability that I have to call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
Not quite. Let's represent the number of calls you make with the random variable X, and let's call the probability of an id collision p. You want the probability that you make the call at most five times, or in general at most k times:
P(X≤k) = P(X=1) + P(X=2) + ... + P(X=k)
= (1-p) + (1-p)*p + (1-p)*p^2 +... + (1-p)*p^(k-1)
= (1-p)*(1 + p + p^2 + .. + p^(k-1))
If we expand this out all but two terms cancel and we get:
= 1 - p^k
Which we want to be greater than some probability, x:
1 - p^k > x
Or with p in terms of k and x:
p < (1-x)^(1/k)
where you can adjust x and k for your specific needs.
If you want less than a 50% probability of needing more than 5 calls, then no more than (1-0.5)^(1/5) ≈ 87% of your ids should be taken.
First of all make sure there is an index on the id columns you are looking up. Then I would recommend thinking more in terms of setting a very low probability of a very bad event occurring. For example maybe making 20 calls slows down the database for too long, so we'd like to set the probability of this occurring to <0.1%. Using the formula above we find that no more than 70% of ids should be taken.
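As a quick sanity check, here is a minimal Python sketch of that formula; the function name max_fill_ratio is just illustrative:
def max_fill_ratio(k, x):
    """Largest fraction of ids that may already be taken so that the chance
    of needing more than k generation attempts stays below 1 - x."""
    return (1 - x) ** (1 / k)

# No more than ~87% full for a <50% chance of needing more than 5 calls:
print(max_fill_ratio(5, 0.5))     # ~0.8706
# No more than ~70% full for a <0.1% chance of needing more than 20 calls:
print(max_fill_ratio(20, 0.999))  # ~0.7079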
But you should also consider alternative solutions. Is a one-time remapping of all ids to a larger space a possibility?
Or if adding ids with prefixes is not a big deal then you could generate longer ids with prefixes for all new items going forward and not have to worry about collisions.

Thanks for the response. I looked into alternatives and want to show three possibilities.
First possibility: Create an UpcomingItemIdTable with 200 (more or less) valid itemIds. A background task can calculate them every minute (or whatever interval you need). That way the tryToAddItem action will always get a valid itemId.
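A minimal, runnable sketch of such a pre-generated id pool; the in-memory sets stand in for the items table and the UpcomingItemIdTable, and the 32-character alphabet is just an illustrative choice:
import random
import string

ALPHABET = string.ascii_uppercase + "234567"  # an illustrative 32-character set

def generate_id(length=6):
    return "".join(random.choice(ALPHABET) for _ in range(length))

# In-memory stand-ins for the items table and the UpcomingItemIdTable.
used_ids = set()
id_pool = []

def refill_pool(target_size=200):
    """Background task: keep a pool of pre-validated, unused ids topped up."""
    while len(id_pool) < target_size:
        candidate = generate_id()
        if candidate not in used_ids and candidate not in id_pool:
            id_pool.append(candidate)

def try_to_add_item(item):
    """Booking never has to probe for collisions: it just pops a pooled id."""
    item_id = id_pool.pop()
    used_ids.add(item_id)
    return item_id

refill_pool()
print(try_to_add_item({"name": "example"}))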
Second possibility
Is a one-time remapping of all ids to a larger space a possibility?
In my case yes. I think for other problems the answer will be: it depends.
Third possibility: Try to generate an itemId and, when there is a collision, try again.
Possible collision handling: do some tests up front. Measure the time it takes to generate itemIds when there are already 1,000, 10,000, 100,000, 1,000,000 etc. entries in the table. When the tryToAddItem method needs more than 100 ms (or whatever threshold you prefer), increase the length from 6 to 7, 8, 9 characters.
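A rough Python sketch of such a measurement, using an in-memory set in place of the real table (purely illustrative; a real test should hit the actual database):
import random
import string
import time

ALPHABET = string.ascii_uppercase + "234567"  # same illustrative helper as the pool sketch above

def generate_id(length=6):
    return "".join(random.choice(ALPHABET) for _ in range(length))

def time_generation(existing_count, attempts=1000, length=6):
    """Average time to find a free id when existing_count ids are already taken."""
    taken = {generate_id(length) for _ in range(existing_count)}
    start = time.perf_counter()
    for _ in range(attempts):
        candidate = generate_id(length)
        while candidate in taken:            # retry on collision
            candidate = generate_id(length)
    return (time.perf_counter() - start) / attempts

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, time_generation(n))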
Some thoughts
every request must be atomic
create an index on itemId
Disadvantages for long UUIDs in API: See https://zalando.github.io/restful-api-guidelines/#144
less usable, because...
-cannot be memorized and easily communicated by humans
-harder to use in debugging and logging analysis
-less convenient for consumer facing usage
-quite long: readable representation requires 36 characters and comes with higher memory and bandwidth consumption
-not ordered along their creation history and no indication of used id volume
-may be in conflict with additional backward compatibility support of legacy ids
[...]
TL;DR: For my case every possibility works. As so often, it depends on the problem. Thanks for the input.

Related

Optimizing ID generation in a particular format

I am looking to generate IDs in a particular format. The format is this:
X | {N, S, E, W} | {A-Z} | {YY} | {0-9} {0-9} {0-9} {0-9} {0-9}
The part with "X" is a fixed character, the second part can be any of the 4 values N, S, E, W (North, South, East, West zones) based on the signup form data, the third part is an alphabet from the set {A-Z} and it is not related to the input data in anyway (can be randomly assigned), YY are the last 2 digits of the current year and the last part is a 5 digit number from 00000 to 99999.
I am planning to construct this ID by generating all 5 parts and concatenating the results into a final string. The steps for generating each part:
This is fixed as "X"
This part will be one of "N", "S", "E", "W" based on the input data
Generate a random alphabet from {A-Z}
Last 2 digits of current year
Generate 5 random digits
This format gives me 26 x 10^5 = 2,600,000 unique IDs each year for a particular zone, which is enough for my use case.
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesn't exist in the DB.
Is this strategy good or should I use something else? When the DB has a lot of entries of a particular zone in a particular year, what would be the approximate probability of collision or expected number of DB calls?
Should I instead use, sequential IDs like this:
Start from "A" in part 3 and "00000" in part 5
Increment part 3 to "B", when "99999" has been used in part 5
If I do use this strategy, is there a way I can implement this without looking into the DB to first find the last inserted ID?
Or some other way to generate the IDs in this format. My main concern is that the process should be fast (not too many DB calls)
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesn't exist in the DB.
What if you end up making 10 such DB calls because of this? The problem with randomness is that collisions will occur even though the probability is low. In a production system under high load, doing a check with random data is dangerous.
This format gives me 26 x 10^5 = 2,600,000 unique IDs each year for a particular zone, which is enough for my use case.
Your range is small, no doubt. But you need to see that with m IDs already taken, the probability of a collision on a single attempt is m / (26 x 10^5), which grows quickly as the table fills up.
So, if the id size is not a concern, read about UUIDs, Twitter Snowflake, etc.
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
Using a cache is a good idea. Again, the problem here is persistence. If you are relying on the cache for correctness, keep in mind that Redis can evict keys (e.g. under an LRU policy), so entries may get lost over time.
Here's how I would solve this issue:
So, I would first write a mapping of character ranges.
Ex: N goes from A to F, S from G to M etc.
This ensures that there is some consistency among the zones.
After this, we can do the randomized approach itself but with indexing.
So, suppose there is still a chance of collision. We can significantly reduce its cost.
Make the unique id/hash column in your table indexed.
This means that your search is much faster.
When you want to insert, generate 2 random ids and do a single IN query - something like "select hash from table where hash in (hash1, hash2)". If both are already taken, generate 4 random ids next time and run the same query. If at least one is free, use it. Keep doubling the number of candidates to avoid repeated round trips.
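A minimal, self-contained sketch of that doubling IN-query idea using sqlite3 (the items table, id column, and id format here are made up for the example):
import random
import sqlite3
import string

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY)")   # PRIMARY KEY => indexed

def random_id():
    # X | zone | letter | year | 5 digits, e.g. XNA23 followed by 12345
    return ("XN" + random.choice(string.ascii_uppercase) + "23"
            + "".join(random.choices(string.digits, k=5)))

def pick_free_id(batch_size=2, max_rounds=6):
    """Generate batch_size candidates, check them in one IN query,
    and double the batch size until a free id is found."""
    for _ in range(max_rounds):
        candidates = [random_id() for _ in range(batch_size)]
        placeholders = ",".join("?" * len(candidates))
        taken = {row[0] for row in conn.execute(
            "SELECT id FROM items WHERE id IN (%s)" % placeholders, candidates)}
        for c in candidates:
            if c not in taken:
                return c
        batch_size *= 2   # all candidates collided: try a bigger batch next round
    raise RuntimeError("could not find a free id")

new_id = pick_free_id()
conn.execute("INSERT INTO items (id) VALUES (?)", (new_id,))
print(new_id)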
Again, this is speculative; better approaches may exist.

Suggestions on designing a metric

I am designing a metric to measure when a search term is "ambiguous." A score near to one means that it is ambiguous ("Ajax" could be a programming language, a cleaning solution, a greek hero, a European soccer club, etc.) and a score near to zero means it is pretty clear what the user meant ("Lady Gaga" probably means only one thing). Part of this metric is that I have a list of possible interpretations and frequency of those interpretations from past data and I need to turn this into a number between 0 and 1.
For example: let's say the term is "Cats" -- of a million trials 850,000 times the user meant the furry thing that meows, 80,000 times they meant the musical by that name, and the rest are abbreviations for things each only meant a trivial number of times. I would say this should have a low ambiguity score because even though there were multiple possible meanings, one was by far the preferred meaning. In contrast, let's say the term is "Friends" -- of a million trials 500,000 times the user meant the people who they hang out with all the time, 450,000 times they meant the tv show by that name, and the rest were some other meaning. This should get a higher ambiguity score because the different meanings were much closer in frequency.
TLDR: If I sort the array in decreasing order, I need a way to take arrays which fall off quickly to numbers close to zero and arrays that fall off slower to numbers closer to one. If the array was [1,0,0,0...] this should get a perfect score of 0 and if it was [1/n,1/n,1/n...] this should get a perfect score of 1. Any suggestions?
What you are looking for sounds very similar to the Entropy measure in information theory. It is a measure of how uncertain a random variable is based on the probabilities of each outcome. It is given by:
H(X) = -sum(p(x[i]) * log( p(x[i])) )
where p(x[i]) is the probability of the ith possibility. So in your case, p(x[i]) would be the probability that a certain search phrase corresponded to an actual meaning. In the cats example, you would have:
p(x[0]) = 850,000 / (850,000+80,000) = 0.914
p(x[1]) = 80,000 / (850,000+80,000) = 0.086
H(X) = -(0.914*log2(0.914) + 0.086*log2(0.086)) = 0.423
For the Friends case, you would have: (assuming only one other category)
H(X) = -(0.5*log2(0.5) + 0.45*log2(0.45) + 0.05*log2(0.05)) = 1.234
The higher number here means more uncertainty.
Note that I am using log base 2 in both cases, but if you use a logarithm of the base equal to the number of possibilities, you can get the scale to work out to 0 to 1.
H(X) = -(0.5*log3(0.5) + 0.45*log3(0.45) + 0.05*log3(0.05)) = 0.779
Note also that the most ambiguous case is when all possibilities have the same probability:
H(X) = -(0.33*log3(0.33) + 0.33*log3(0.33) + 0.33*log3(0.33)) = 1.0
and the least ambiguous case is when there is only one possibility:
H(X) = -log(1) = 0.0
Since you want the most ambiguous terms to be near 1, this normalized H(X) can be used directly as your metric.
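A small Python sketch of that normalized-entropy metric (log base = number of possibilities, so the score runs from 0 for a single clear meaning to 1 for equally likely meanings); the function name ambiguity is just illustrative:
import math

def ambiguity(counts):
    """0 = one clear meaning, 1 = all meanings equally likely."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if len(probs) <= 1:
        return 0.0
    # Entropy with log base = number of possibilities normalizes H to [0, 1].
    return -sum(p * math.log(p, len(probs)) for p in probs)

print(ambiguity([850_000, 80_000]))           # "Cats": low ambiguity (~0.42)
print(ambiguity([500_000, 450_000, 50_000]))  # "Friends": higher (~0.78)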

finding a number appearing again among numbers stored in a file

Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared previously?
Well, I can't just load billions of numbers into an array at a stretch and then run a simple nested loop to check whether a number has appeared before.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N).
Use a hash table. Sequentially store the numbers, where the hash key is computed from the number's value. Once you hit a value that is already present, you have found your duplicate.
Author Edit:
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4.3 billion distinct values. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below reduces memory consumption to one bit per possible value, 512 MB. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs a somewhat "special" hash table:
Take a hash table consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each possible number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in the bit table and set it. If one encounters a bit which is already set, the number occurred in the sequence already.
The important question is whether you want to solve this problem efficiently, or whether you want to solve it accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass (with a standard hash table, as kbrimington suggested), determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, creating a hash table that stores the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item itself, for example the least significant bits, let's say 20 bits (about 1M buckets).
After the first pass, all items whose counter is > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking each item up in table one), and build a new hash table using the real values as keys now, storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
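A compact Python sketch of that two-pass idea, using the low 20 bits of each number as the first-pass bucket (a small list stands in for the huge file so the example stays self-contained):
from collections import Counter

numbers = [7, 42, 1_000_003, 42, 99, 7_340_039]  # stand-in for the huge file

# Pass 1: count coarse buckets keyed by the low 20 bits (about 1M buckets).
bucket_counts = Counter(n & 0xFFFFF for n in numbers)

# Pass 2: only numbers whose bucket saw more than one hit can be duplicates.
exact_counts = Counter(n for n in numbers if bucket_counts[n & 0xFFFFF] > 1)
duplicates = [n for n, c in exact_counts.items() if c > 1]
print(duplicates)  # [42] -- 7 and 7_340_039 share a bucket but are weeded out in pass 2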
How about:
Sort the input using an external sorting algorithm that only needs a portion of the input in RAM at a time.
Scan the output of the 1st step for duplicates -- you only need space for 2 adjacent elements in RAM at a time to detect repetitions.
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent about 4.3 billion different numbers and you have "10 billions".
If you were to use a tightly packed set you could represent all the possibilities in 512 MB, which can easily fit into current RAM sizes. As a start, this pretty easily allows you to recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated, you're getting into having a hashmap that contains only duplicates (using the first 512 MB of RAM to tell efficiently IF it should be in the map or not). In a worst-case scenario with a large spread you're not going to be able to fit that into RAM.
Another approach, if the numbers have a fairly even spread of duplicates, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 255 occurrences of each number.
It's going to be a hack, but it's doable.
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If the possible range of numbers in the file is not too large, then you can use a bit array to indicate whether each number in the range has appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With a large range (like the full int range) you would need to read through the file every time. The file layout may allow for more efficient lookups (e.g. binary search in the case of a sorted file).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4,294,967,296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur? Still, it would need 536,870,912 bytes at least (512 MB). It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What I did... I sorted the numbers as much as I could (had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case I had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets I had a very low memory usage.
P.S. another way to store them:
1, -9, 12, 16, 20, -30, 52,
when I had no numbers lower than zero.
After that I applied various algorithms (described by other posters) here on the reduced data set.
#include <stdio.h>
#include <stdlib.h>
/* Test (&) or set (|=) bit b in bit array a.
   Macro is overly general but I left it 'cos it's convenient. */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
    unsigned x = 0;
    int found = 0;
    /* One bit per possible unsigned value: 2^32 bits = 512 MB, zero-initialized. */
    size_t *seen = calloc((size_t)1 << (8 * sizeof(unsigned) - 3), 1);
    if (!seen) return 1;
    while (scanf("%u", &x) == 1) {
        if (BITOP(seen, x, &)) { found = 1; break; }  /* bit already set: duplicate */
        BITOP(seen, x, |=);                           /* mark value as seen */
    }
    if (found) printf("duplicate is %u\n", x);
    else printf("no duplicate\n");
    free(seen);
    return 0;
}
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools.
My personal approach would be to use MapReduce:
MapReduce: Simplified Data Processing on Large Clusters
I'm sorry for not going into more detail, but once you are familiar with the concept of MapReduce it will be very clear how to target the solution.
Basically we are going to implement two simple functions:
Map(key, value)
Reduce(key, values[])
So, all in all (a small sketch follows these steps):
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]), if the number of values is > 1 then it's a duplicate number
print the duplicates: number, line_index1, line_index2, ...
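A toy, single-process Python sketch of those two functions (no real MapReduce framework involved; it just mimics the map and reduce phases on a small in-memory list):
from collections import defaultdict

def map_phase(numbers):
    """Map: emit (number, line_index) for every number in the input."""
    for line_index, number in enumerate(numbers):
        yield number, line_index

def reduce_phase(pairs):
    """Reduce: group positions per number; more than one value means a duplicate."""
    grouped = defaultdict(list)
    for number, line_index in pairs:
        grouped[number].append(line_index)
    return {n: positions for n, positions in grouped.items() if len(positions) > 1}

numbers = [5, 17, 42, 17, 99, 5]          # stand-in for the file contents
print(reduce_phase(map_phase(numbers)))   # {5: [0, 5], 17: [1, 3]}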
Again, this approach can result in very fast execution depending on how your MapReduce framework is set up; it is highly scalable and very reliable, and there are many different implementations of MapReduce in many languages.
There are several top companies offering ready-made cloud computing environments, like Google, Microsoft Azure, Amazon AWS, ...
Or you can build your own and set up a cluster with any provider offering virtual computing environments, paying very low hourly costs.
good luck :)
Another, simpler approach could be to use Bloom filters.
Implement a BitArray such that the ith byte of this array corresponds to the numbers 8*i to 8*(i+1) - 1, i.e. the first bit of the ith byte is 1 if we have already seen 8*i, the second bit is 1 if we have already seen 8*i + 1, and so on.
Initialize this bit array with size Integer.MAX_VALUE/8 and, whenever you see a number k, set bit k%8 of byte k/8 to 1. If this bit is already 1, it means you have seen this number already.

Storing vast amounts of (simple) timeline graph data in a DB

I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.
However, these podcasts could potentially go on for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each individual play in its own row is unrealistic (it would result in 1+ billion rows) as I want to be able to fetch this raw data fast.
So my question is: how do I best go about storing these massive amounts of timeline data?
One idea I had was to use a text/blob column and then comma-separate the plays, each comma representing a new second (in sequence) and then the number for the amount of times that second has been played. So if there's 100,000 plays in second 1 and 90,000 plays in second 2 and 95,000 plays in second 3, then I would store it like this: "100000,90000,95000,[...]" in the text/blob column.
Is this a feasible way to store such data? Is there a better way?
Thanks!
Edit: the data is being tracked to another source and I only need to update the raw graph data every 15min or so. Hence, fast reads is the main concern.
Note: due to nature of this project, each played second will have to be tracked individually (in other words, I can't just track 'start' and 'end' of each play).
The problem with blob storage is that you need to update the entire blob for every change. This is not necessarily a bad thing. Using your format: (100000,90000,...), 7 * 3600 * 3 = ~75K bytes. But that means you're updating that 75K blob for every play, every second.
And, of course, the blob is opaque to SQL, so "what second of what song has the most plays" will be an impossible query at the SQL level (that's basically a table scan of all the data to learn that).
And there's a lot of parsing overhead marshalling that data in and out.
On the other hand: Podcast ID (4 bytes), second offset (2 bytes unsigned allows podcasts up to 18 hrs long), play count (4 bytes) = 10 bytes per second. So, minus any blocking overhead, a 3 hr song is 3600 * 3 * 10 = 108K bytes per song.
If you stored it as a binary blob of longs, vs text, 4 * 3600 * 3 = ~43K.
So, the second/row structure is "only" twice the size (in a perfect world, consult your DB server for details) of a binary blob. Considering the extra benefits this grants you in terms of being able to query things, that's probably worth doing.
The only downside of second-per-row is if you need to do a lot of updates (several seconds at once for one song), that's a lot of UPDATE traffic to the DB, whereas with the blob method, that's likely a single update.
Your traffic patterns will influence that more than anything.
Would it be problematic to use one row per second, storing how many plays that second got?
That means 10K rows, which isn't bad, and you just have to INSERT a row every second with the current data.
EDIT: I would say that that solution is better than doing a comma-separated something in a TEXT column... especially since getting and manipulating the data (which you say you want to do) would be very messy.
I would view it as a key-value problem.
for each second played
Song[second] += 1
end
As a relational database -
song
----
name | second | plays
And some hack pseudo-SQL to start a second:
insert into song(name, second, plays) values("xyz", "abc", 0)
and another to update the second
update song set plays = plays + 1 where name = "xyz" and second = "abc"
A 3-hour podcast would have 11K rows.
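A self-contained sqlite3 sketch of that row-per-second counter, using a single upsert in place of the separate insert and update above (table and column names follow the pseudo-SQL; needs SQLite 3.24+ for ON CONFLICT):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE song (
    name   TEXT,
    second INTEGER,
    plays  INTEGER DEFAULT 0,
    PRIMARY KEY (name, second))""")

def record_play(name, second):
    """Increment the play counter for one second of one podcast."""
    conn.execute("""INSERT INTO song (name, second, plays) VALUES (?, ?, 1)
                    ON CONFLICT(name, second) DO UPDATE SET plays = plays + 1""",
                 (name, second))

record_play("xyz", 42)
record_play("xyz", 42)
print(conn.execute("SELECT plays FROM song WHERE name = ? AND second = ?",
                   ("xyz", 42)).fetchone())  # (2,)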
It really depends on what is generating the data.
As I understand it, you want to implement a map with the second mark as the key and the number of plays as the value.
What are the pieces of the event, unit of work, or transaction you are loading?
Can I assume you have a play event with the podcast name and the start and stop times,
and you want to load it into the map for analysis and presentation?
If that's the case, you can have a table with:
podcastId
secondOffset
playCount
Each event would do an update of the rows between the start and end positions:
update t
set playCount = playCount +1
where podCastId = x
and secondOffset between y and z
followed by an insert to add those rows between start and stop that don't exist yet, with a playCount of 1, unless you preload the table with zeros.
Depending on the DB, you may be able to set up a sparse table where empty columns are not stored, making this more efficient.

Querying for N random records on Appengine datastore

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity of this kind that I put into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
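A rough sketch of that "grab one randomly placed block, then sample from it" idea, written against the old db API used elsewhere in this thread (the model name MyModel and property name rand are assumptions):
import random
# from google.appengine.ext import db   # assuming the App Engine db API

def pick_n_random(n, block_size=100):
    """Fetch one block of consecutive entities starting at a random point in the
    randomized index, then sample n of them locally."""
    r = random.random()
    block = MyModel.all().filter("rand >", r).order("rand").fetch(block_size)
    if len(block) < n:   # fell off the end of the randomized index: wrap around
        block += MyModel.all().order("rand").fetch(block_size - len(block))
    return random.sample(block, n)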
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
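A minimal sketch of that batch get by key, assuming sequential integer ids as described above (the IdGenerator bookkeeping and the retry for deleted ids are omitted):
import random
# from google.appengine.ext import db   # assuming the App Engine db API

def get_n_random(model_class, last_id, n):
    """Pick n random ids in [1, last_id] and fetch them in one batch call."""
    ids = random.sample(range(1, last_id + 1), n)
    keys = [db.Key.from_path(model_class.kind(), i) for i in ids]
    entities = db.get(keys)              # single datastore round trip
    # ids whose entity was deleted (or never written) come back as None
    return [e for e in entities if e is not None]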
Looks like the only method is storing a random integer value in a special property on each entity and querying on that. This can be done quite automatically if you just add an automatically initialized property.
Unfortunately this will require processing all entities once if your datastore is already populated.
It's weird, I know.
I agree with the answer from Steve: there is no way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually work such that the probability of the returned results is evenly distributed. The probability of returning a given entity depends on the gap between its randomly assigned number and the next higher random number. E.g. if the random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "2" 8 times more often than "1".
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share.
I just had the same problem. I decided not to assign IDs to my already existing entries in the datastore and did this instead, as I already had the totalcount from a sharded counter.
This selects "count" entries from "totalcount" entries, sorted by key.
import logging
import random
# from google.appengine.ext import db   # MyEntries is a db.Model; totalcount and count are set elsewhere

# select $count random offsets from the complete set of $totalcount entries
numberlist = random.sample(range(0, totalcount), count)
numberlist.sort()
pagesize = 1000

# init buckets: one bucket per page of `pagesize` keys
buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
for k in numberlist:
    thisb = int(k / pagesize)
    buckets[thisb].append(k - (thisb * pagesize))
logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

# page through results
result = []
baseq = db.Query(MyEntries, keys_only=True).order("__key__")
wq = baseq                                  # current page query starts from the base query
for b, l in enumerate(buckets):
    if len(l) > 0:
        result += [wq.fetch(limit=1, offset=e)[0] for e in l]
    if b < len(buckets) - 1:                # not the last bucket
        lastkey = wq.fetch(1, pagesize - 1)[0]
        wq = baseq.filter("__key__ >", lastkey)
Beware that this to me is somewhat complex, and I'm still not convinced that I don't have off-by-one or off-by-x errors.
And beware that if count is close to totalcount this can be very expensive.
And beware that on millions of rows it might not be possible to do within appengine time boundaries.
If I understand correctly, you need to retrieve N random instances.
It's easy. Just do a keys-only query, call random.choice N times on the resulting list of keys, then fetch the entities by those keys.
keys = MyModel.all(keys_only=True)
n = 5  # 5 random instances
all_keys = list(keys)
result_keys = []
for _ in range(0, n):
    key = random.choice(all_keys)
    all_keys.remove(key)
    result_keys.append(key)
# result_keys now contains 5 random keys; fetch the entities in one batch with db.get(result_keys)
