Getting random entry from Objectify entity - google-app-engine

How can I get a random element out of a Google App Engine datastore using Objectify? Should I fetch all of an entity's keys and choose randomly from them or is there a better way?

Assign a random number between 0 and 1 to each entity when you store it. To fetch a random record, generate another random number between 0 and 1, and query for the smallest entity with a random value greater than that.

You don't need to fetch all.
For example:
countall = query(X.class).count()
// http://groups.google.com/group/objectify-appengine/browse_frm/thread/3678cf34bb15d34d/82298e615691d6c5?lnk=gst&q=count#82298e615691d6c5
rnd = Generate random number [0..countall]
ofy.query(X.class).order("- date").limit(rnd); //for example -date or some chronic indexed field
Last id is your...
(in average you fatch 50% or at lest first read is in average 50% less)
Improvements (to have smaller key table in cache)!
After first read remember every X elements.
Cache id-s and their position. So next time query condition from selected id further (max ".limit(rnd%X)" will be X-1).
Random is just random, if it doesn't need to be close to 100% fair, speculate chronic field value (for example if you have 1000 records in 10 days, for random 501 select second element greater than fifth day).
Other options, if you have chronic field date (or similar), fetch elements older than random date and younger then random date + 1 (you need to know first date and last date). Second select random between fetched records. If query is empty select greater than etc...

Quoted from this post about selecting some random elements from an Objectified datastore:
If your ids are sequential, one way would be to randomly select 5
numbers from the id range known to be in use. Then use a query with an
"in" filter().
If you don't mind the 5 entries being adjacent, you can use count(),
limit(), and offset() to randomly find a block of 5 entries.
Otherwise, you'll probably need to use limit() and offset() to
randomly select one entry out at a time.
-- Josh

I pretty much adapt the algorithm provided Matejc. However, 3 things:
Instead of using count() or the datastore service factory (DatastoreServiceFactory.getDatastoreService()), I have an entity that keep track of the total count of the entities that I am interested in. The reason for this approach is that:
a. count() could be expensive when you are dealing with a lot of objects
b. You can't test the datastore service factory locally...testing in prod is just a bad practice.
Generating the random number: ThreadLocalRandom.current().nextLong(1, maxRange)
Instead of using limit(), I use offset, so I don't have to worry about "sorting."

Related

Optimizing ID generation in a particular format

I am looking to generate IDs in a particular format. The format is this:
X | {N, S, E, W} | {A-Z} | {YY} | {0-9} {0-9} {0-9} {0-9} {0-9}
The part with "X" is a fixed character, the second part can be any of the 4 values N, S, E, W (North, South, East, West zones) based on the signup form data, the third part is an alphabet from the set {A-Z} and it is not related to the input data in anyway (can be randomly assigned), YY are the last 2 digits of the current year and the last part is a 5 digit number from 00000 to 99999.
I am planning to construct this ID by generating all 5 parts and concatenating the results into a final string. The steps for generating each part:
This is fixed as "X"
This part will be one of "N", "S", "E", "W" based on the input data
Generate a random alphabet from {A-Z}
Last 2 digits of current year
Generate 5 random digits
This format gives me 26 x 10^5 = 26, 00, 000 unique IDs each year for a particular zone, which is enough for my use case.
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesnt exist in the DB.
Is this strategy good or should I use something else? When the DB has a lot of entries of a particular zone in a particular year, what would be the approximate probability of collision or expected number of DB calls?
Should I instead use, sequential IDs like this:
Start from "A" in part 3 and "00000" in part 5
Increment part 3 to "B", when "99999" has been used in part 5
If I do use this strategy, is there a way I can implement this without looking into the DB to first find the last inserted ID?
Or some other way to generate the IDs in this format. My main concern is that the process should be fast (not too many DB calls)
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
For handling collisions, I plan to query the database and generate a
new ID if the ID already exists in the DB. This will continue until I
generate an ID which doesnt exist in the DB.
What if you make 10 such DB calls because of this. The problem with randomness is that collisions will occur even though the probability is low. In a production system with high load, doing a check with random data is dangerous.
This format gives me 26 x 10^5 = 26, 00, 000 unique IDs each year for
a particular zone, which is enough for my use case.
Your range is small, no doubt. But you need to see tahat the probability of collision will be 1 / 26 * 10^5 which is not that great!.
So, if the hash size is not a concern, read about UUID, Twitter snowflake etc.
If there's no way around DB calls, should I use a cache like Redis for
making this a little faster? How exactly will this work?
Using a cache is a good idea. Again, the problem here is the persistence. If you are looking for consistency, then Redis uses LRU and keys would get lost in time.
Here's how I would solve this issue:
So, I would first write write a mapper range for characters.
Ex: N goes from A to F, S from G to M etc.
This ensures that there is some consistency among the zones.
After this, we can do the randomized approach itself but with indexing.
So, suppose let's say there is a chance for collision. We can significantly reduce this value.
Make the unique hash in your table as indexable.
This means that your search is much faster.
When you want to insert, generate 2 random hashes and do a single IN query - something like "select hash from table where hash in (hash1,hash2)". If this does not work, next time, you need to generate 4 random hashes and do the same query. If it works , use the hash. Keep increasing the exponential value to avoid collisions.
Again this is speculative, better approcahes may be there.

Need an optimized way to handle combination of entities to improve performance

So, I am working on a feature in a web application. The problem is like this-
I have four different entities. Let's say those are - Item1, Item2, Item3, Item4. There's two phase of the feature. Let's say the first phase is - Choose entities. In the first phase, User will have option to choose multiple items for each entity and for every combination from that choosing, I need to do some calculation. Then in the second phase(let's say Relocate phase) - based on the calculation done in the first phase, for each combination I would have to let user choose another combination where the value of the first combination would get removed to the row of the second combination.
Here's the data model for further clarification -
EntityCombinationTable
(
Id
Item1_Id,
Item2_Id,
Item3_Id,
Item4_Id
)
ValuesTable
(
Combination_Id,
Value
)
So suppose I have following values in both values -
EntityCombinationTable
Id -- Item1_Id -- Item2_Id -- Item3_Id -- Item4_Id
1 1 1 1 1
2 1 2 1 1
3 2 1 1 1
4 2 2 1 1
ValuesTable
Combination_Id -- Value
1 10
2 0
3 0
4 20
So if in the first phase - I choose (1,2) for Item1, (1,2) for Item_2 and 1 for both Item_3 and Item4, then total combination would be 2*2*1*1 = 4.
Then in the second phase, for each of the combination that has value greater than zero, I would have to let the user choose different combination where the values would get relocated.
For example - As only combination with Id 1 and 2 has value greater than zero, only two relocation combination would need to be shown in the second dialog. So if the user choose (3,3,3,3) and (4,4,4,4) as relocation combination in the second phase, then new row will need to be inserted in
EntityCombinationTable for (3,3,3,3) and (4,4,4,4). And values of (1,1,1,1) and (2,2,1,1) will be relocated respectively to rows corresponding to (3,3,3,3) and (4,4,4,4) in the ValuesTable.
So the problem is - each of the entity can have items upto 100 or even more. So in worst case the total number of combinations can be 10^8 which would lead to a very heavy load in database(inserting and updating a huge number rows in the table) and also generating all the combination in the code level would require a substantial time.
I have thought about an alternative approach to not keep the items as combination. Rather keep separate table for each entity. and then make the combination in the runtime. Which also would cause performance issue. As there's a lot more different stages where I might need the combination. So every time I would need to generate all the combinations.
I have also thought about creating key-value pair type table, where I would keep the combination as a string. But in this approach I am not actually reducing number of rows to be inserted rather number of columns is getting reduced.
So my question is - Is there any better approach this kind of situation where I can keep track of combination and manipulate in an optimized way?
Note - I am not sure if this would help or not, but a lot of the rows in the values table will probably have zero as value. So in the second phase we would need to show a lot less rows than the actual number of possible combinations

Objectify - filter() with range and with a 100 records limit does return the same result always?

Objectify filter() with range and with a 100 records limit does return the same result always, is there a way to get 100 random record on a given matching criteria???
Example:
Lets say I have an index on score field (in GAE Entity Leaderboard), where I would like to fetch out records where score range is 0 to 10 with a limit of 100 records I do get the same results out all the time (considering there is no change in the data records). I understand this is purely to do with the index where the data could have been stored and ordered in ASC order and any time you query we might end up with the same result?
Is there way with GAE objectify to fetch out the random 100 records matching the same criteria considering we have 1000s of records b/w 0-100 score range???
Depending on your specific needs, you may be able to use the __scatter__ property built into GAE:
https://code.google.com/p/appengine-mapreduce/wiki/ScatterPropertyImplementation
With how the datastore indexes, no.
The datastore reads through his index, the index is pre-sorted, so it will always return the SAME value.
The only way to get such a "randomizer" effect would be to grab your 1000s of records and randomly select 100 of them, or create an auto-incrementing number for your 1000s of records you want to have, and then grab a random interval of a 100.
EDIT: by the OP's suggestion, another solution would be to pre-fetch a simple count, then randomly generate an offset number to grab random data. (added to the answer for sake of having all solutions at one place)

Web stats: Calculating/estimating unique visitors for arbitary time intervals

I am writing an application which is recording some 'basic' stats -- page views, and unique visitors. I don't like the idea of storing every single view, so have thought about storing totals with a hour/day resolution. For example, like this:
Tuesday 500 views 200 unique visitors
Wednesday 400 views 210 unique visitors
Thursday 800 views 420 unique visitors
Now, I want to be able to query this data set on chosen time periods -- ie, for a week. Calculating views is easy enough: just addition. However, adding unique visitors will not give the correct answer, since a visitor may have visited on multiple days.
So my question is how do I determine or estimate unique visitors for any time period without storing each individual hit. Is this even possible? Google Analytics reports these values -- surely they don't store every single hit and query the data set for every time period!?
I can't seem to find any useful information on the net about this. My initial instinct is that I would need to store 2 sets of values with different resolutions (ie day and half-day), and somehow interpolate these for all possible time ranges. I've been playing with the maths, but can't get anything to work. Do you think I may be on to something, or on the wrong track?
Thanks,
Brendon.
If you are OK with approximations, I think tom10 is onto something, but his notion of random subsample is not the right one or needs a clarification. If I have a visitor that comes on day1 and day2, but is sampled only on day2, that is going to introduce a bias in the estimation. What I would do is to store full information for a random subsample of users (let's say, all users whose hash(id)%100 == 1). Then you do the full calculations on the sampled data and multiply by 100. Yes tom10 said about just that, but there are two differences: he said "for example" sample based on the ID and I say that's the only way you should sample because you are interested in unique visitors. If you were interested in unique IPs or unique ZIP codes or whatever you would sample accordingly. The quality of the estimation can be assessed using the normal approximation to the binomial if your sample is big enough. Beyond this, you can try and use a model of user loyalty, like you observe that over 2 days 10% of visitors visit on both days, over three days 11% of visitors visit twice and 5% visit once and so forth up to a maximum number of day. These numbers unfortunately can depend on time of the week, season and even modeling those, loyalty changes over time as the user base matures, changes in composition and the service changes as well, so any model needs to be re-estimated. My guess is that in 99% of practical situations you'd be better served by the sampling technique.
You could store a random subsample of the data, for example, 10% of the visitor IDs, then compare these between days.
The easiest way to do this is to store a random subsample of each day for future comparisons, but then, for the current day, temporarily store all your IDs and compare them to the subsampled historical data and determine the fraction of repeats. (That is, you're comparing the subsampled data to a full dataset for a given day and not comparing two subsamples -- it's possible to compare two subsamples and get an estimate for the total but the math would be a bit trickier.)
You don't need to store every single view, just each unique session ID per hour or day depending on the resolution you need in your stats.
You can keep these log files containing session IDs sorted to count unique visitors quickly, by merging multiple hours/days. One file per hour/day, one unique session ID per line.
In *nix, a simple one-liner like this one will do the job:
$ sort -m sorted_sid_logs/2010-09-0[123]-??.log | uniq | wc -l
It counts the number of unique visitors during the first three days of September.
You can calculate the uniqueness factor (UF) on each day and use it to calculate the composite (week by example) UF.
Let's say that you counted:
100 visits and 75 unique session id's on monday (you have to store the sessions ID's at least for a day, or the period you use as unit).
200 visits and 100 unique session id's on tuesday.
If you want to estimate the UF for the period Mon+Tue you can do:
UV = UVmonday + UVtuesday = TVmonday*UFmonday + TVtuesday*UFtuesday
being:
UV = Unique Visitors
TV = Total Visits
UF = Uniqueness Factor
So...
UV = (Sum(TVi*UFi))
UF = UV / TV
TV = Sum(TVi)
I hope it helps...
This math counts two visits of the same person as two unique visitors. I think it's ok if the only way you have to identify somebody is via the session ID.

Querying for N random records on Appengine datastore

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every kind that I put into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
Looks like the only method is by storing the random integer value in each entity's special property and querying on that. This can be done quite automatically if you just add an automatically initialized property.
Unfortunately this will require processing of all entities once if your datastore is already filled in.
It's weird, I know.
I agree to the answer from Steve, there is no such way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually work such that the prbability of the returned results is evenly distributed. The probability of returning a given entity depends on the gap of it's randomly assigned number and the next higher random number. E.g. if random numbers 1,2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "2" 8 times more often than "1".
I have fixed this in a slightly more expensice way. If someone is interested, I am happy to share
I just had the same problem. I decided not to assign IDs to my already existing entries in datastore and did this, as I already had the totalcount from a sharded counter.
This selects "count" entries from "totalcount" entries, sorted by key.
# select $count from the complete set
numberlist = random.sample(range(0,totalcount),count)
numberlist.sort()
pagesize=1000
#initbuckets
buckets = [ [] for i in xrange(int(max(numberlist)/pagesize)+1) ]
for k in numberlist:
thisb = int(k/pagesize)
buckets[thisb].append(k-(thisb*pagesize))
logging.debug("Numbers: %s. Buckets %s",numberlist,buckets)
#page through results.
result = []
baseq = db.Query(MyEntries,keys_only=True).order("__key__")
for b,l in enumerate(buckets):
if len(l) > 0:
result += [ wq.fetch(limit=1,offset=e)[0] for e in l ]
if b < len(buckets)-1: # not the last bucket
lastkey = wq.fetch(1,pagesize-1)[0]
wq = baseq.filter("__key__ >",lastkey)
Beware that this to me is somewhat complex, and I'm still not conviced that I dont have off-by-one or off-by-x errors.
And beware that if count is close to totalcount this can be very expensive.
And beware that on millions of rows it might not be possible to do within appengine time boundaries.
If I understand correctly, you need retrieve N random instance.
It's easy. Just do query with only keys. And do random.choice N times on list result of keys. Then get results by fetching on keys.
keys = MyModel.all(keys_only=True)
n = 5 # 5 random instance
all_keys = list(keys)
result_keys = []
for _ in range(0,n)
key = random.choice(all_keys)
all_keys.remove(key)
result_keys.append(key)
# result_keys now contain 5 random keys.

Resources