Querying for N random records on Appengine datastore - google-app-engine

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity of that kind when I put it into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
This works; however, it only returns 1 record, so I need to do N queries. Any ideas on how to make this one query? Thanks.
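For reference, a minimal sketch of the single-record approach described above, assuming a db.Model kind called Item with a FloatProperty named rand (both names are illustrative, not from the original post):

import random
from google.appengine.ext import db

class Item(db.Model):
    # pre-assigned random value, written when the entity is stored
    rand = db.FloatProperty()

def put_item():
    Item(rand=random.random()).put()

def get_one_random_item():
    r = random.random()
    # smallest stored random value greater than r, i.e. "> rand ORDER BY rand ASC LIMIT 1"
    result = Item.all().filter('rand >', r).order('rand').get()
    if result is None:
        # r was larger than every stored value; wrap around to the smallest
        result = Item.all().order('rand').get()
    return result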

"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
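If the clustered-but-cheap variant above is acceptable, here is a minimal sketch (reusing the hypothetical Item model and rand property from the sketch under the question):

import random

def get_n_random_clustered(n, block_size=100):
    # one query for a randomly positioned block, then sample locally
    r = random.random()
    block = Item.all().filter('rand >', r).order('rand').fetch(block_size)
    if len(block) < n:
        # near the top of the range; wrap around to the start of the index
        block += Item.all().order('rand').fetch(block_size - len(block))
    return random.sample(block, min(n, len(block)))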

What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
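A minimal sketch of that batch get, assuming the sequential integer IDs were used as datastore key IDs for a kind called Item and that last_id comes from the IdGenerator described above (names are illustrative):

import random
from google.appengine.ext import db

def get_n_random_by_id(n, last_id):
    # N distinct random ids in [1, last_id]
    ids = random.sample(xrange(1, last_id + 1), n)
    keys = [db.Key.from_path('Item', i) for i in ids]
    # one batch round trip; entries are None for ids that no longer exist
    entities = db.get(keys)
    return [e for e in entities if e is not None]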
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.

Looks like the only method is to store a random value in a dedicated property on each entity and query on that. This can be done almost automatically if you add a property that is initialized with a random value by default.
Unfortunately, if your datastore is already populated, this requires one pass over all existing entities to backfill the property.
It's weird, I know.

I agree with Steve's answer: there is no way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually return results with an evenly distributed probability. The probability of returning a given entity depends on the gap between its randomly assigned number and the next higher assigned number. E.g. if the random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "2" 8 times more often than "1".
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share.

I just had the same problem. I decided not to assign IDs to my already existing entries in the datastore and did the following instead, since I already had the total count from a sharded counter.
This selects "count" entries out of "totalcount" entries, sorted by key.
import logging
import random
from google.appengine.ext import db

# select $count random offsets from the complete set of $totalcount entries
numberlist = random.sample(range(0, totalcount), count)
numberlist.sort()
pagesize = 1000

# init buckets: group the offsets into pages, each offset relative to its bucket start
buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
for k in numberlist:
    thisb = int(k / pagesize)
    buckets[thisb].append(k - (thisb * pagesize))
logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

# page through the results one bucket at a time
result = []
baseq = db.Query(MyEntries, keys_only=True).order("__key__")
wq = baseq  # query for the current bucket
for b, l in enumerate(buckets):
    if len(l) > 0:
        result += [wq.fetch(limit=1, offset=e)[0] for e in l]
    if b < len(buckets) - 1:  # not the last bucket
        # remember the last key of this page and continue after it
        lastkey = wq.fetch(1, pagesize - 1)[0]
        wq = db.Query(MyEntries, keys_only=True).order("__key__").filter("__key__ >", lastkey)
Beware that this is somewhat complex to my eye, and I'm still not convinced that I don't have off-by-one or off-by-x errors.
And beware that if count is close to totalcount this can be very expensive.
And beware that on millions of rows it might not be possible to do this within App Engine's time limits.

If I understand correctly, you need to retrieve N random instances.
It's easy. Just run a keys-only query, pick N keys with random.choice from the resulting list, then fetch the entities by those keys.
import random

keys = MyModel.all(keys_only=True)
n = 5  # 5 random instances
all_keys = list(keys)
result_keys = []
for _ in range(0, n):
    key = random.choice(all_keys)
    all_keys.remove(key)
    result_keys.append(key)
# result_keys now contains 5 random keys; fetch the entities with db.get(result_keys)

Related

Optimizing ID generation in a particular format

I am looking to generate IDs in a particular format. The format is this:
X | {N, S, E, W} | {A-Z} | {YY} | {0-9} {0-9} {0-9} {0-9} {0-9}
The part with "X" is a fixed character, the second part can be any of the 4 values N, S, E, W (North, South, East, West zones) based on the signup form data, the third part is a letter from the set {A-Z} and is not related to the input data in any way (it can be randomly assigned), YY are the last 2 digits of the current year, and the last part is a 5 digit number from 00000 to 99999.
I am planning to construct this ID by generating all 5 parts and concatenating the results into a final string. The steps for generating each part:
This is fixed as "X"
This part will be one of "N", "S", "E", "W" based on the input data
Generate a random letter from {A-Z}
Last 2 digits of current year
Generate 5 random digits
This format gives me 26 x 10^5 = 2,600,000 unique IDs each year for a particular zone, which is enough for my use case.
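A minimal sketch of generating an ID in this format (function and variable names are my own, for illustration only):

import random
import string
from datetime import date

def generate_id(zone):
    # zone comes from the signup form and is one of "N", "S", "E", "W"
    assert zone in ("N", "S", "E", "W")
    letter = random.choice(string.ascii_uppercase)    # random A-Z, unrelated to the input
    year = date.today().strftime("%y")                # last 2 digits of the current year
    digits = "%05d" % random.randint(0, 99999)        # 5 random digits, 00000-99999
    return "X" + zone + letter + year + digits

# generate_id("N") might return something like "XNQ2404217"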
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesn't exist in the DB.
Is this strategy good or should I use something else? When the DB has a lot of entries of a particular zone in a particular year, what would be the approximate probability of collision or expected number of DB calls?
Should I instead use, sequential IDs like this:
Start from "A" in part 3 and "00000" in part 5
Increment part 3 to "B" when "99999" has been used in part 5
If I do use this strategy, is there a way I can implement this without looking into the DB to first find the last inserted ID?
Or some other way to generate the IDs in this format. My main concern is that the process should be fast (not too many DB calls)
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesn't exist in the DB.
What if you end up making 10 such DB calls because of this? The problem with randomness is that collisions will occur even though the probability is low. In a production system under high load, a check based on random data is dangerous.
This format gives me 26 x 10^5 = 2,600,000 unique IDs each year for a particular zone, which is enough for my use case.
Your range is small, no doubt. But you need to see that the probability of colliding with any single existing ID is 1 / (26 * 10^5), which is not negligible, and the overall chance of a collision grows as more IDs are issued.
So, if the hash size is not a concern, read about UUID, Twitter snowflake etc.
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
Using a cache is a good idea. Again, the problem here is persistence: if you rely on the cached data being there, remember that Redis can evict keys (e.g. under an LRU policy), so keys may get lost over time.
Here's how I would solve this issue:
First, I would write a mapping of zones to character ranges, e.g. N goes from A to F, S from G to M, etc. This ensures there is some consistency among the zones.
After this, we can still use the randomized approach, but with indexing. Suppose there is still a chance of collision; we can significantly reduce the number of round trips it costs:
Make the unique hash column in your table indexed, so the lookup is much faster.
When you want to insert, generate 2 random hashes and do a single IN query, something like "select hash from table where hash in (hash1, hash2)". If both are already taken, generate 4 random hashes next time and run the same query; if one is free, use it. Keep growing the batch size exponentially to avoid repeated collisions.
Again this is speculative; better approaches may exist.
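A rough sketch of that doubling strategy, assuming a SQL table named ids with an indexed id column and a DB-API cursor (table, column, and helper names are placeholders, not from the answer):

import random
import string
from datetime import date

def random_candidate(zone):
    # same format as above: X + zone + random letter + YY + 5 digits
    return ("X" + zone + random.choice(string.ascii_uppercase)
            + date.today().strftime("%y") + "%05d" % random.randint(0, 99999))

def allocate_id(cursor, zone, max_rounds=5):
    batch = 2
    for _ in range(max_rounds):
        candidates = [random_candidate(zone) for _ in range(batch)]
        placeholders = ",".join(["%s"] * len(candidates))
        cursor.execute("SELECT id FROM ids WHERE id IN (%s)" % placeholders, candidates)
        taken = {row[0] for row in cursor.fetchall()}
        free = [c for c in candidates if c not in taken]
        if free:
            # a unique constraint on the column should still back this up,
            # since a concurrent insert could race with the check
            return free[0]
        batch *= 2  # every candidate collided; double the batch and retry
    raise RuntimeError("could not allocate an ID; the space may be nearly full")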

Generated unique id with 6 characters - handling when too much ids already used

In my program you can book an item. This item has an id with 6 characters drawn from 32 possible characters.
So there are 32^6 possibilities. Every id must be unique.
func tryToAddItem() {
    if !db.contains(generateId()) {
        addItem()
    } else {
        tryToAddItem()
    }
}
For example 90% of my ids are used. So the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
That is quite high, and it means 5 database queries against a lot of data.
When the probability is so high I want to introduce a prefix, e.g. "A-xxxxxx".
What is a good condition for that? At what point will I need a prefix?
In my example 90% of the ids were used. What about the rest? Do I throw them away?
What about database performance when I call tryToAddItem 5 times? I imagine this is not best practice.
For example 90% of my ids are used. So the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
Not quite. Let's represent the number of calls you make with the random variable X, and let's call the probability of an id collision p. You want the probability that you make the call at most five times, or in general at most k times:
P(X ≤ k) = P(X=1) + P(X=2) + ... + P(X=k)
         = (1-p) + (1-p)*p + (1-p)*p^2 + ... + (1-p)*p^(k-1)
         = (1-p)*(1 + p + p^2 + ... + p^(k-1))
If we expand this out, all but two terms cancel and we get:
         = 1 - p^k
Which we want to be greater than some probability, x:
1 - p^k > x
Or with p in terms of k and x:
p < (1-x)^(1/k)
where you can adjust x and k for your specific needs.
If you want less than a 50% probability of needing more than 5 calls, then no more than (1-0.5)^(1/5) ≈ 87% of your ids should be taken.
First of all, make sure there is an index on the id column you are looking up. Then I would recommend thinking more in terms of setting a very low probability for a very bad event occurring. For example, maybe needing more than 20 calls slows down the database for too long, so we'd like the probability of that to be < 0.1%. Using the formula above we find that no more than about 70% of ids should be taken.
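A quick way to evaluate that threshold for other targets (a small illustrative snippet, not part of the original answer):

def max_fill_fraction(k, x):
    # largest fraction p of ids that may already be taken so that
    # P(success within k calls) = 1 - p**k stays above x
    return (1.0 - x) ** (1.0 / k)

print(max_fill_fraction(5, 0.5))     # ~0.87: <50% chance of needing more than 5 calls
print(max_fill_fraction(20, 0.999))  # ~0.71: <0.1% chance of needing more than 20 calls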
But you should also consider alternative solutions. Is remapping all ids to a larger space one time only a possibility?
Or if adding ids with prefixes is not a big deal then you could generate longer ids with prefixes for all new items going forward and not have to worry about collisions.
Thanks for the response. I searched for alternatives and want to show three possibilities.
First possibility: Create an UpcomingItemIdTable with 200 (more or less) valid itemIds. A background task can calculate them every minute (or however often you need), so the tryToAddItem action will always get a valid itemId.
Second possibility
Is remapping all ids to a larger space one time only a possibility?
In my case yes. I think for other problems the answer will be: it depends.
Third possibility: Try to generate an itemId, and when there is a collision try again.
Possible collision handling: do some tests beforehand. Measure the time to generate itemIds when there are already 1,000, 10,000, 100,000, 1,000,000, etc. entries in the table. When the tryToAddItem method needs more than 100 ms (or whatever threshold you prefer), increase the id length from 6 to 7, 8, 9 characters.
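A minimal Python sketch of that third possibility (the 32-character alphabet, the id_exists and add_item hooks, and the retry limit are assumptions for illustration; it grows the length after repeated collisions rather than after a measured 100 ms):

import random
import string

ALPHABET = string.ascii_uppercase + "234567"  # an assumed 32-character alphabet

def generate_id(length):
    return "".join(random.choice(ALPHABET) for _ in range(length))

def try_to_add_item(id_exists, add_item, length=6, max_attempts=5):
    # retry on collision; fall back to a longer id if the short space is too crowded
    for _ in range(max_attempts):
        candidate = generate_id(length)
        if not id_exists(candidate):   # one DB lookup per attempt
            add_item(candidate)
            return candidate
    # too many collisions at this length: add one character and try again
    return try_to_add_item(id_exists, add_item, length + 1, max_attempts)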
Some thoughts
every request must be atomic
create an index on itemId
Disadvantages for long UUIDs in API: See https://zalando.github.io/restful-api-guidelines/#144
less usable, because...
-cannot be memorized and easily communicated by humans
-harder to use in debugging and logging analysis
-less convenient for consumer facing usage
-quite long: readable representation requires 36 characters and comes with higher memory and bandwidth consumption
-not ordered along their creation history and no indication of used id volume
-may be in conflict with additional backward compatibility support of legacy ids
[...]
TL;DR: For my case every possibility works. As so often, it depends on the problem. Thanks for the input.

Google App Engine - Search API index growth

I would like to know how I can estimate the growth (how much the size increases over a period of time) of an App Engine Search API (FTS) index based on the number of entities inserted and the amount of information. For this I would basically like to know how the index size is calculated (what it depends on). Specifically:
When inserting new entities, is the growth (size) influenced by the number of previously existing entities (i.e. is the growth exponential)? For example, if I have 1,000 entities and I insert 10, the index grows by X bytes. But if I have 100,000 entities and insert 10, will it grow by X or by much more than X (exponentially, let's say 10*X)?
Does the number of fields (properties) influence the size exponentially? For example, if I have entity A with 2 fields and entity B with 4 fields (identical in values, for mathematical simplicity), will the size increase when adding entity B be twice that of entity A, or much more than that?
What other means can I use to find statistical information; are there other tools in the App Engine cloud console, or can I do this programmatically?
Thank you.
You can check the size of a given index by running the code below.
import logging
from google.appengine.api import search

# log the current storage usage of every index
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.storage_usage)

# pseudo code: insert a batch of documents, then measure again
amount_of_items_to_add = 100
for x in range(amount_of_items_to_add):
    search_api_insert(data)   # placeholder for your own insert routine

# rerun the loop to see how much the size increased
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.storage_usage)
This code is obviously not a complete working example, but you should be able to build a simple method that takes some data, inserts it into the Search API, and returns how much the used storage increased.
I have run a number of tests for different numbers of entities and different numbers of indexed properties per entity, and it seems that the growth of the index reported by the API is linear, not exponential.
But the most interesting fact to know is that although the reported size is almost realtime, after deleting documents from the index it may take 12, 24, or even 36 hours to update.

Determine percentage of unused keys in large redis DB

I have a Redis database with many millions of keys in it. Over time, the keys that I have written to and read from have changed, and so there are many keys that I am simply not using any more. Most don't have any kind of TTL either.
I want to get a sense for what percentage of the keys in the Redis database is not in use any more. I was thinking I could use hyperloglog to estimate the cardinality of the number of keys that are being written to, but it seems like a lot of work to do a PFADD for every key that gets written to and read from.
To be clear, I don't want to delete anything yet, I just want to do some analysis on the number of used keys in the database.
I'd start with the scan command to iterate through the keys, and use the object idletime command on each to collect the number of seconds since the key was last used. From there you can generate metrics however you like.
One way, using Redis, would be to use a sorted set with the idletime of the key as its score. The advantage of this over HLL is that you can then say "give me keys idle between x and y seconds ago" by using zrange and/or zrevrange. The results of that you could then use for operations such as deletion, archival, or setting a TTL. With HLL you can't do this.
Another advantage is that, unless you store the result in Redis, there is only a Redis cost when you run it. You don't have to modify your code to do additional operations when accessing keys, for example.
The accuracy of the object's idle time is around ten seconds or so if I recall. But for getting an idea of how many and which keys haven't been accessed in a given time frame it should work fine.
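A minimal sketch of that approach with redis-py (the sorted-set name and the 30-day cutoff are illustrative):

import redis

r = redis.Redis()

# record each key's idle time (seconds since last access) in a sorted set,
# so idle ranges can later be queried with ZRANGEBYSCORE
for key in r.scan_iter(count=1000):
    idle = r.object("idletime", key)
    if idle is not None:
        r.zadd("key-idletimes", {key: idle})

# e.g. keys that have been idle for more than 30 days
stale = r.zrangebyscore("key-idletimes", 30 * 24 * 3600, "+inf")
total = r.dbsize()
print("%d of %d keys (%.1f%%) idle > 30 days" % (len(stale), total, 100.0 * len(stale) / total))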
You can analyse the data in time windows, and use a hyperloglog to estimate the cardinality of the active keys in each time window.
For example, you can use a hyperloglog for each day's analysis:
// for each key that has been read or written in day1
// add it to the corresponding hyperloglog
pfadd key-count-day1 a b
pfadd key-count-day1 c d e
// for each key that has been read or written in day2
// add it to the corresponding hyperloglog
pfadd key-count-day2 a
pfadd key-count-day2 c
In this case, you can get the estimated number of keys that are active in dayN with the hyperloglog whose key is key-count-dayN.
With pfcount, you can get the number of active keys for each day or several days.
// number of active keys in day2: count2
pfcount key-count-day2
// number of active keys in day1 and day2: count-total
pfcount key-count-day1 key-count-day2
With these 2 counts, you can calculate the percentage of keys that are unused since day2: (count-total - count2) / count-total
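In redis-py terms, the same bookkeeping and calculation might look like this (key names follow the example above; assumes at least one key was tracked):

import redis

r = redis.Redis()

def track_access(key, day):
    # call this wherever a key is read or written
    r.pfadd("key-count-%s" % day, key)

count2 = r.pfcount("key-count-day2")
count_total = r.pfcount("key-count-day1", "key-count-day2")
unused_pct = 100.0 * (count_total - count2) / count_total
print("%.1f%% of the keys seen are unused since day2" % unused_pct)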

Getting random entry from Objectify entity

How can I get a random element out of a Google App Engine datastore using Objectify? Should I fetch all of an entity's keys and choose randomly from them or is there a better way?
Assign a random number between 0 and 1 to each entity when you store it. To fetch a random record, generate another random number between 0 and 1, and query for the smallest entity with a random value greater than that.
You don't need to fetch all.
For example:
countall = query(X.class).count()
// http://groups.google.com/group/objectify-appengine/browse_frm/thread/3678cf34bb15d34d/82298e615691d6c5?lnk=gst&q=count#82298e615691d6c5
rnd = generate a random number in [0..countall]
ofy.query(X.class).order("-date").limit(rnd); // for example -date or some other chronologically indexed field
The last entity returned is your random pick.
(On average you fetch 50% of the entities, or at least the first read is on average 50% cheaper.)
Improvements (to keep a smaller key table in the cache):
After the first read, remember every X-th element.
Cache the ids and their positions, so next time you can start the query from the nearest cached id (the limit will be at most rnd % X, i.e. X-1).
Random is just random; if it doesn't need to be close to 100% fair, you can estimate from a chronologically indexed field value (for example, if you have 1000 records over 10 days, for random number 501 select the second element greater than the fifth day).
Quoted from this post about selecting some random elements from an Objectified datastore:
If your ids are sequential, one way would be to randomly select 5 numbers from the id range known to be in use. Then use a query with an "in" filter().
If you don't mind the 5 entries being adjacent, you can use count(), limit(), and offset() to randomly find a block of 5 entries.
Otherwise, you'll probably need to use limit() and offset() to randomly select one entry out at a time.
-- Josh
I pretty much adapted the algorithm provided by Matejc. However, three things:
Instead of using count() or the datastore service factory (DatastoreServiceFactory.getDatastoreService()), I have an entity that keeps track of the total count of the entities that I am interested in. The reason for this approach is that:
a. count() could be expensive when you are dealing with a lot of objects
b. You can't test the datastore service factory locally...testing in prod is just a bad practice.
Generating the random number: ThreadLocalRandom.current().nextLong(1, maxRange)
Instead of using limit(), I use offset, so I don't have to worry about "sorting."

Resources