In the Python version of Google App Engine, how do I find the quartile values of a model with an index on a specific property?

In Google App Engine I have a model with 10K entities with an index on the property foo. What is the most efficient way to find the 1st quartile, 2nd quartile (the median), and the 3rd quartile entities? I can fetch the sorted list of keys and find the three quartile keys programmatically, but downloading all the keys won't scale. What is the more elegant approach?
sortedValues = MyModel.all(keys_only=True).order('foo').fetch(limit=10000)

Have you tried .fetch(1, offset=2500), .fetch(1, offset=5000), and .fetch(1, offset=7500)? The offset argument skips that many results, so each call returns just the entity at that position.
However, the documentation includes the following note, so this approach won't give you O(1) performance.
Note: The query has performance characteristics that correspond linearly with the offset amount plus the limit amount.
(From the App Engine datastore queries documentation.)

Since quartiles are defined in terms of entity ordering, there's unfortunately no way to determine them other than iterating over the entities. As cheeken points out, you can speed things up a little by using an offset argument instead of fetching the intermediate results.
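As a concrete illustration, here is a minimal sketch of the offset approach, assuming MyModel is the db.Model subclass from the question (with an indexed 'foo' property) and that the total entity count (10,000 here) is already known; in practice you would track that count separately.

def quartile_keys(total=10000):
    keys = []
    for offset in (total // 4, total // 2, 3 * total // 4):
        # fetch(limit, offset): skip 'offset' results, return one key.
        # Cost and latency still grow linearly with the offset.
        result = MyModel.all(keys_only=True).order('foo').fetch(1, offset=offset)
        if result:
            keys.append(result[0])
    return keys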

Related

Redis creating keys efficiently

I am trying to store some data in Redis.
Consider the following example.
A person owns different cars in different years.
Currently I am storing keys like this:
cars:johndoe:1991:mercedes model s1 engine v1
cars:johndoe:1992:mercedes model s1 engine v1
cars:jane:1992:BMW model s2 engine v2
cars:foobar:1991:honda model s3 engine v3
The advantage of this is that I can use wildcards in the key pattern to fetch different data, e.g.
1. all cars bought in 1991
keys cars:*:1991:*
2. all cars owned by johndoe
keys cars:johndoe:*
etc.
But as per the Redis documentation, the KEYS command with a wildcard is not efficient, as it scans all keys.
So I decided to use sets for this data.
But to achieve the same results as above, I need many different sets, e.g.
1. sadd cars cars:johndoe:1991:mercedes cars:johndoe:1992:mercedes cars:jane:1992:BMW cars:foobar:1991:honda
2. sadd cars:johndoe cars:johndoe:1991:mercedes cars:johndoe:1992:mercedes
3. sadd cars:jane cars:jane:1992:BMW
4. sadd cars:foobar cars:foobar:1991:honda
5. sadd cars:1991 cars:johndoe:1991:mercedes cars:foobar:1991:honda
This way, I need to add and remove many set members for a single operation.
Is this the only way? Please let me know if there is a more efficient approach.
Thanks in advance.
In the first method, instead of using KEYS you can use SCAN.
You can write a simple Lua script, or in your app logic you can iterate over the SCAN command with the returned cursor and collect all the keys matching a given pattern.
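Here is a minimal sketch of that iteration from application code, assuming the redis-py client and the key scheme from the question.

import redis

r = redis.Redis(host='localhost', port=6379)

def keys_matching(pattern):
    # SCAN walks the keyspace incrementally instead of blocking the server
    # the way KEYS does; keep calling it until the cursor comes back as 0.
    found = []
    cursor = 0
    while True:
        cursor, batch = r.scan(cursor=cursor, match=pattern, count=1000)
        found.extend(batch)
        if cursor == 0:
            break
    return found

# e.g. all cars bought in 1991
cars_1991 = keys_matching('cars:*:1991:*')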
IMO, if you have a lot of criteria to deal with, the first option would be the best fit:
You avoid maintaining multiple SADDs.
You can retrieve based on any condition.
If you have fewer criteria, the second option would be the best fit:
There is no need to iterate with the SCAN command; you will have the keys to look up ready.
Hope this helps.
These are, indeed, the only two approaches - not only for Redis but for any database for that matter. It is the classic space-time tradeoff: either pay with RAM (i.e. store multiple indices [Sets]) to make the search fast, or use CPU (i.e. do an ad-hoc scan).
Redis' philosophy is to use space, because the goal is maximizing performance. This means you want to store your data in the manner you're going to need it. Since in your case you're interested in reading the data along different dimensions, you need to index each of these dimensions appropriately. Redis' Sets are a good choice for that type of index, except for the "make year" datum, which could benefit from being indexed in a Sorted Set to facilitate range searches (over years, but only if needed).
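To make that concrete, here is a minimal sketch of maintaining one index per dimension on write, assuming redis-py 3.x; the idx:* key names are made up for illustration and are not part of the question's scheme.

import redis

r = redis.Redis()

def add_car(owner, year, car_key, details):
    # Primary record for the car itself.
    r.set(car_key, details)
    # One Set per dimension you need to query by.
    r.sadd('idx:owner:%s' % owner, car_key)
    r.sadd('idx:year:%d' % year, car_key)
    # Optional: a Sorted Set scored by year enables range queries over years.
    r.zadd('idx:by_year', {car_key: year})

add_car('johndoe', 1991, 'cars:johndoe:1991:mercedes', 'model s1 engine v1')
# Cars johndoe bought in 1991: intersect the two Set indices.
both = r.sinter('idx:owner:johndoe', 'idx:year:1991')
# Cars bought between 1990 and 1992, across all owners:
in_range = r.zrangebyscore('idx:by_year', 1990, 1992)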

CouchBase view get for multiple ranges

I'm evaluating CouchBase for an application, and trying to figure out something about range queries on views. I know I can do a view get for a single key, multiple keys, or a range. Can I do a get for multiple ranges? i.e. I want to retrieve items with view key 0-10, 50-100, 5238-81902. I might simultaneously need 100 different ranges, so having to make 100 requests to the database seems like a lot of overhead.
As far as I know, there is no way in Couchbase to get values from multiple ranges with one view query. Maybe there are (or will be) features for this in Couchbase N1QL, but I haven't worked with it.
To answer your question: 100 requests will not be a big overhead. Couchbase is quite fast and is designed to handle a lot of operations per second. Also, if your view is correctly designed, it will not be "recalculated" on each query.
Also, there is another way:
1. Determine the minimum and maximum values of your ranges (0..81902 according to your example).
2. Query the view so that it returns only document ids and the value the range is based on, without including the full documents in the result.
3. On the client side, filter the results from the previous step according to your ranges (0-10, 50-100, 5238-81902), and then use getMulti with the document ids that remain (see the sketch below).
I don't know your data structure, so try both ways, test them, and choose the one that best fits your needs.
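Here is a minimal sketch of the client-side filtering step from point 3, assuming the view rows arrive as (value, doc_id) pairs from a single min..max query; the bulk fetch at the end refers to the Python SDK's get_multi, whose exact name and shape vary by SDK and version.

RANGES = [(0, 10), (50, 100), (5238, 81902)]

def in_any_range(value, ranges=RANGES):
    return any(lo <= value <= hi for lo, hi in ranges)

def filter_rows(rows):
    # rows: iterable of (value, doc_id) emitted by the view for 0..81902
    return [doc_id for value, doc_id in rows if in_any_range(value)]

# doc_ids = filter_rows(view_rows)
# docs = bucket.get_multi(doc_ids)  # one bulk fetch instead of many range queries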

Google App Engine storing as list vs JSON

I have a model called User, and a user has a property relatedUsers, which, in its general format, is an array of integers. Now, there will be times when I want to check if a certain number exists in a User's relatedUsers array. I see two ways of doing this:
Use a standard Python list with indexed values (or maybe not) and just run an IN query and see if that number is in there.
Having the key to that User, get back the value for property relatedUsers, which is an array in JSON string format. Decode the string, and check if the number is in there.
Which one is more efficient? Would option 1 cost more reads than option 2? And would option 1's writes cost more than option 2's, since indexing each value costs a write? What if I don't index -- which solution would be better then?
Here are your costs versus capabilities, option by option:
Putting the values in an indexed list will be far more expensive. You will incur the cost of one write for each value in the list, which can explode depending on how many friends your users have. It's possible for this cost explosion to be worse if you have certain kinds of composite indexes. The good side is that you get to run queries on this information: you can query for a list of users who are friends with a particular user, for example.
No extra index or write costs here. The problem is that you lose querying functionality.
If you know that you're only going to be doing checks on the current user's list of friends, by all means go with option 2. Otherwise you might have to look at your design a little more carefully.
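For reference, here is a minimal sketch of the two options with the old db API; the model and property names follow the question but are otherwise assumptions.

import json
from google.appengine.ext import db

class UserV1(db.Model):
    # Option 1: indexed list property - one index write per list element,
    # but membership can be checked with a query.
    relatedUsers = db.ListProperty(int)

class UserV2(db.Model):
    # Option 2: JSON array in an unindexed text property - no extra index
    # writes, but you must fetch the entity and decode it to check membership.
    relatedUsersJson = db.TextProperty()

# Option 1: an equality filter on a list property matches any element.
is_related_v1 = UserV1.all().filter('relatedUsers =', 42).get() is not None

# Option 2: fetch by key, decode, check in Python.
user = UserV2.get_by_key_name('some-user')
is_related_v2 = user is not None and 42 in json.loads(user.relatedUsersJson or '[]')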

Fastest way to perform subset test operation on a large collection of sets with same domain

Assume we have trillions of sets stored somewhere. The domain for each of these sets is the same. It is also finite and discrete. So each set may be stored as a bit field (e.g. 0000100111...) of a relatively short length (e.g. 1024). That is, bit X in the bit field indicates whether item X (of 1024 possible items) is included in the given set or not.
Now, I want to devise a storage structure and an algorithm to efficiently answer the query: what sets in the data store have set Y as a subset. Set Y itself is not present in the data store and is specified at run time.
Now the simplest way to solve this would be to AND the bitfield for set Y with bit fields of every set in the data store one by one, picking the ones whose AND result matches Y's bitfield.
How can I speed this up? Is there a tree structure (index) or some smart algorithm that would allow me to perform this query without having to AND every stored set's bitfield?
Are there databases that already support such operations on large collections of sets?
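For reference, the brute-force check described above amounts to a per-set superset test; a minimal sketch, assuming each bit field is held as a Python int:

def matches(stored, y):
    # y is a subset of 'stored' exactly when ANDing leaves y unchanged.
    return stored & y == y

def scan_all(stored_sets, y):
    return [s for s in stored_sets if matches(s, y)]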
If you can preprocess the sets, the subset relation is representable as a DAG (because you're describing a poset). If the transitive reduction is computed, then I think you can avoid testing all the sets by just performing a DFS starting from the biggest sets and stopping whenever Y is no longer a subset of the current set being visited.
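A minimal sketch of that pruned DFS, assuming the DAG's transitive reduction is given as a children mapping (edges point from each set to its immediate subsets), the roots are the maximal stored sets, and each set is a Python int used as a bit field:

def supersets_via_dag(y, roots, children):
    # Collect every stored set s with y as a subset (y & s == y).
    found, seen, stack = [], set(), list(roots)
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        if y & s != y:
            # y is not a subset of s; every descendant is a subset of s,
            # so y cannot be a subset of any of them either - prune here.
            continue
        found.append(s)
        stack.extend(children.get(s, ()))
    return found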
Depending on the cardinality of the set from which all the sets are drawn, one option might be to build an inverted index mapping from elements to the sets that contain them. Given a set Y, you could then find all sets that have Y as a subset by finding all of the sets that contain each element individually and computing their intersection. If you store the lists in sorted order (for example, by numbering all the sets in your database with values 0, 1, etc.) then you should be able to compute this intersection fairly efficiently, assuming that no one element is contained in too many sets.
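A minimal sketch of that inverted index, again assuming the stored sets are Python ints over a 1024-element domain and are numbered by their position in a list; plain Python sets stand in for the sorted id lists the answer describes:

def build_index(sets, domain_size=1024):
    # index[e] = ids of all stored sets that contain element e
    index = [[] for _ in range(domain_size)]
    for set_id, s in enumerate(sets):
        for e in range(domain_size):
            if (s >> e) & 1:
                index[e].append(set_id)
    return index

def supersets_via_index(y, index, domain_size=1024):
    # Intersect the id lists of every element of y; with sorted lists this
    # could be a k-way merge instead of building Python sets.
    id_sets = [set(index[e]) for e in range(domain_size) if (y >> e) & 1]
    if not id_sets:
        return None  # y is empty, so every stored set qualifies
    result = id_sets[0]
    for ids in id_sets[1:]:
        result &= ids
    return result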
I tend to say that the answer is no, because of the bit field's very low cardinality.
This would be a stretch on a conventional RDBMS given your volume; have you looked at Neo4j, which is based on a graph storage model?
A quick glance makes me think of BDDs, which are somewhat along the lines of the DAG solution. Alternatively, a ZDD.
If an RDBMS was your only option, I would recommend looking at this interesting article on modelling a DAG in SQL:
http://www.codeproject.com/KB/database/Modeling_DAGs_on_SQL_DBs.aspx?msg=3051183
If you can't afford Oracle or MSSQL, have a look at PostgreSQL 9, which supports recursive queries. It has also supported cross joins for quite some time.

Should I denormalize properties to reduce the number of indexes required by App Engine?

One of my queries can take a lot of different filters and sort orders depending on user input. This generates a huge index.yaml file of 50+ indexes.
I'm thinking of denormalizing many of my boolean and multi-choice (string) properties into a single string list property. This way, I will reduce the number of query combinations because most queries will simply add a filter to the string list property, and my index count should decrease dramatically.
It will surely increase my storage size, but this isn't really an issue as I won't have that much data.
Does this sound like a good idea or are there any other drawbacks with this approach?
As always, this depends on how you want to query your entities. For most of the sorts of queries you could execute against a list property like this, App Engine will already include an automatically built index, which you don't have to specify in index.yaml. Likewise, most queries that you'd want to execute that require a composite index, you couldn't do with a list property, or would require an 'exploding' index on that list property.
If you tell us more about the sort of queries you typically run on this object, we can give you more specific advice.
Denormalizing your data to cut back on the number of indices sounds like a good tradeoff. Reducing the number of indices you need means you'll have fewer indices to update (though your one index will receive more updates); it is unclear how this will affect performance on GAE. Size will of course be larger if you leave the original fields in place (since you're copying data into the string list property), but this might not be too significant unless your entity was quite large already.
This is complicated a little bit since the index on the list will contain one entry for each element in the list on each entity (rather than just one entry per entity). This will certainly impact space, and query performance. Also, be wary of creating an index which contains multiple list properties or you could run into a problem with exploding indices (multiple list properties => one index entry for each combination of values from each list).
Try experimenting and see how it works in practice for you (use AppStats!).
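For concreteness, here is a minimal sketch of the string-list encoding being discussed, using the old db API; the Listing model, the property names, and the 'name:value' token format are illustrative assumptions, not from the question.

from google.appengine.ext import db

class Listing(db.Model):
    # Boolean and multi-choice properties flattened into 'name:value' tokens.
    filters = db.StringListProperty()

listing = Listing(filters=['is_public:1', 'color:red', 'size:large'])
listing.put()

# Each extra criterion is just another equality filter on the same list
# property rather than on a separate indexed field, which is how the
# index.yaml combinations collapse. Mind the exploding-index caveat above
# if this list property ends up in a composite index.
q = (Listing.all()
     .filter('filters =', 'is_public:1')
     .filter('filters =', 'color:red'))
results = q.fetch(20)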
"It will surely increase my storage size, but this isn't really an issue as I won't have that much data."
If this is true then you have no reason to denormalize.
