Datastore Index Creation Fails Without Explanation - google-app-engine

I'm trying to create a compound index with a single Number field and a list of Strings field. When I view the status of the index it just has an exclamation mark with no explanation. I assume it is because datastore concludes that it is an exploding index based on this FAQ page: https://cloud.google.com/appengine/articles/index_building#FAQs.
Is there any way to confirm what the actual failure reason is? Is it possible to split the list field into multiple fields based on some size limit and create multiple indexes for each chunk?

You get the exploding indexes problem when you have an index on multiple list/repeated properties. In this case a single entity would generate all combinations of the property values (i.e. an index on (A, B) where A has N entries and B has M entries will generate N*M index entries).
In this case you shouldn't get the exploding index problem since you aren't combining two repeated fields.
There are some other obscure ways in which an index build can fail. I would recommend filing a production ticket so that someone can look into your specific indexes.
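To make the N*M blow-up concrete, here is a tiny standalone Python sketch (the property names are invented for illustration) of why combining two repeated properties in one composite index multiplies index entries:

```python
from itertools import product

# Hypothetical entity with two repeated (list) properties, both
# referenced by a single composite index -- names are made up.
tags = ["a", "b", "c"]    # N = 3 values
regions = ["us", "eu"]    # M = 2 values

# Datastore would write one composite index entry per combination.
index_entries = list(product(tags, regions))
print(len(index_entries))  # N * M = 6
```

With a single repeated property (the asker's case), there is no product to take, so the entry count grows only linearly.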

I believe it was the 1000-item limit per entity for indexes on list properties. I partitioned the property into chunks of 999 (property1, property2, etc.) as needed, and was then able to create indexes for each chunked property successfully.

Related

MongoDB: multiple indexes

Is it possible to create more than one text index and more than one array index in one collection in MongoDB?
I don't mean the single index with multiple fields.
Text indexes: MongoDB doesn't allow you to create multiple single-field text indexes. This makes sense, because a $text search query doesn't name a field ($text search is meant for text indexes). If you really need per-field matching, you can create a compound text index and use the weights option; that way you indicate that queries are most likely to match a particular field of the compound text index.
Ref: Text Indexes MongoDB, $text search documentation
Multi-key indexes:
You create multi-key indexes the same way as other index types; you don't need to pass any special options (as you would for unique or sparse indexes). MongoDB automatically turns an index into a multi-key index if, at index creation time, it finds an array in any document for the field being indexed, or if, at any time after the index is created, an array is inserted into that field in any document. You can check for multi-key indexes by looking for isMultiKey: true on an index in the output of getIndexes() on a collection.
Limitations of multi-key indexes:
Remember not to create a compound index in which both fields are arrays, because it would explode: MongoDB would need to create index keys for the Cartesian product of the two fields. Accordingly, MongoDB throws an error if you try to insert a document with arrays in both fields that are already compound-indexed.
Ref : MultiKey Indexes MongoDB
Adding my comment as an answer for easy future reference.
As far as I remember you can't have multiple text indexes in a single collection. MongoDB allows you to create multiple text indexes, but gives an error when searching. Refer: jira.mongodb.org/browse/SERVER-8127

Efficiently modelling a Feed schema on Google Cloud Datastore?

I'm using GCP/App Engine to build a Feed that returns posts for a given user in descending order of the post's score (a modified timestamp). Posts that are not 'seen' are returned first, followed by posts where 'seen' = true.
When a user creates a post, a Feed entity is created for each one of their followers (i.e. a fan-out inbox model)
Will my current index model result in an exploding index and/or contention on the 'score' index if many users load their feed simultaneously?
index.yaml
indexes:
- kind: "Feed"
  properties:
  - name: "seen"   # Boolean
  - name: "uid"    # The user this feed belongs to
  - name: "score"  # Int timestamp
    direction: desc
# Other entity fields include: authorUid, postId, postType
A user's feed is fetched by:
SELECT postId FROM Feed WHERE uid = abc123 AND seen = false ORDER BY score DESC
Would I be better off prefixing the 'score' with the user id? Would this improve the performance of the score index? e.g. score="{alphanumeric user id}-{unix timestamp}"
From the docs:
You can improve performance with "sharded queries", that prepend a fixed length string to the expiration timestamp. The index is sorted on the full string, so that entities at the same timestamp will be located throughout the key range of the index. You run multiple queries in parallel to fetch results from each shard.
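As a rough sketch of what those docs describe (the shard count, prefix format, and function name here are arbitrary choices for illustration, not an App Engine API):

```python
import random

NUM_SHARDS = 8  # arbitrary; more shards spread writes across the index

def sharded_score(timestamp: int) -> str:
    """Prepend a fixed-length shard prefix so entries with identical
    timestamps are scattered across the index key range."""
    shard = random.randrange(NUM_SHARDS)
    return f"{shard}-{timestamp:020d}"

# At read time, issue one query per shard prefix in parallel
# and merge the results client-side.
prefixes = [f"{s}-" for s in range(NUM_SHARDS)]
print(sharded_score(1700000000))
```

The zero-padded timestamp keeps the string sort order consistent with numeric order within each shard.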
With just 4 entities I'm seeing 44 indexes, which seems excessive.
You do not have an exploding indexes problem; that problem is specific to queries on entities with repeated properties (i.e. properties with multiple values) when those properties are used in composite indexes. From Index limits:
The situation becomes worse in the case of entities with multiple properties, each of which can take on multiple values. To accommodate such an entity, the index must include an entry for every possible combination of property values. Custom indexes that refer to multiple properties, each with multiple values, can "explode" combinatorially, requiring large numbers of entries for an entity with only a relatively small number of possible property values. Such exploding indexes can dramatically increase the storage size of an entity in Cloud Datastore, because of the large number of index entries that must be stored. Exploding indexes also can easily cause the entity to exceed the index entry count or size limit.
The 44 built-in indexes are nothing more than the indexes created for the multiple indexed properties of your 4 entities (probably your entity model has about 11 indexed properties). Which is normal. You can reduce the number by scrubbing your model usage and marking as unindexed all properties which you do not plan to use in queries.
You do however have a potential problem with a high number of index updates in a short time: when a user with many followers creates a post, all those index updates fall in a narrow range (hotspots), which is what the article you referenced addresses. Prepending the score with the follower's user ID (not the post creator's ID, which won't help, since the same number of updates on the same index range happens for one user's posting event regardless of whether sharding is used) should help. The impact of followers reading the post (when the score property is updated) is less of a concern, since it's unlikely that all followers read the post at exactly the same time.
Unfortunately prepending the follower ID doesn't help with the query you intend to do as the result order will be sorted by follower ID first, not by timestamp.
What I'd do:
combine the functionality of the seen and score properties into one: a score value of 0 can be used to indicate that a post was not yet seen, any other value would indicate the timestamp when it was seen. Fewer indexes, fewer index updates, less storage space.
I wouldn't bother with sharding in this particular case:
reading a post takes a bit of time; one follower reading multiple posts won't typically happen fast enough for that follower's index updates to be a serious problem. In the rare worst case an already-read post may appear as unread, which IMHO isn't bad enough to justify sharding
delays in updating the indexes for all followers are again IMHO not a big problem; it may just take a bit longer for the post to appear in a follower's feed
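The combined seen/score idea can be sketched in pure Python, with no Datastore calls (the entry values are made up): a score of 0 marks an unseen post, and any other value is the timestamp at which it was seen.

```python
# Each feed entry: (post_id, score); score == 0 means "not yet seen",
# any other value is the timestamp at which the post was seen.
feed = [("p1", 0), ("p2", 1700000500), ("p3", 0), ("p4", 1700000100)]

# Unseen posts first, then seen posts in descending seen-time order.
unseen = [pid for pid, score in feed if score == 0]
seen = [pid for pid, score in sorted(feed, key=lambda e: e[1], reverse=True)
        if score != 0]

print(unseen + seen)  # ['p1', 'p3', 'p2', 'p4']
```

In Datastore terms, "unseen" becomes a simple equality filter (score = 0), and one property covers what previously took two.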

Google App Engine - What's the recommended way to keep entity amount within limit?

I have some entities of a kind, and I need to keep them within a limited amount by discarding old ones, just like log entry maintenance. Any good approach on GAE to do this?
Options in my mind:
Option 1. Add a Date property to each of these entities. Create a cron job that checks datastore statistics daily. If the count exceeds the limit, query entities of that kind sorted by date, oldest first, and delete them until the size is less than, for example, 0.9 * max_limit.
Option 2. Option 1 requires an additional property with an index. I observed that entity key IDs seem to be increasing, so I'd like to query only keys, sort in ascending order, and delete the ones with smaller IDs. This requires no additional property (date) or index. But I'm seriously worried about whether key IDs are guaranteed to increase monotonically.
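Option 1's pruning step can be expressed as pure logic (the limit, ratio, and entity tuples here are made up; in GAE this would drive a keys-only query plus batch delete):

```python
MAX_LIMIT = 10

def prune(entities, max_limit=MAX_LIMIT, target_ratio=0.9):
    """Given (entity_id, date) pairs, return the IDs to delete so the
    remaining count drops to target_ratio * max_limit."""
    if len(entities) <= max_limit:
        return []
    keep = int(target_ratio * max_limit)
    by_date = sorted(entities, key=lambda e: e[1])  # oldest first
    return [eid for eid, _ in by_date[: len(entities) - keep]]

entries = [(i, 100 + i) for i in range(12)]  # 12 entities, dates increasing
print(prune(entries))  # deletes the 3 oldest: [0, 1, 2]
```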
I think this is a common data maintenance task. Is there any mature way to do it?
By the way, a tiny ad for my app, free and purely for coder's fun! http://robotypo.appspot.com
You cannot assume that the IDs are always increasing. The docs about ID gen only guarantee that:
IDs allocated in this manner will not be used by the Datastore's automatic ID sequence generator and can be used in entity keys without conflict.
The default sort order is also not guaranteed to be sorted by ID number:
If no sort orders are specified, the results are returned in the order they are retrieved from the Datastore.
which is vague and doesn't say that the default order is by ID.
One solution may be to use a rotating counter that keeps track of the first element. When you want to add new entities: fetch the counter, increment it, mod it by the limit, and add a new element with an ID as the value of the counter. This must all be done in a transaction to guarantee that the counter isn't being incremented by another request. The new element with the same key will overwrite one that was there, if any.
When you want to fetch them all, you can manually generate the keys (since they are all known), do a bulk fetch, sort by ID, then split them into two parts at the value of the counter and swap those.
If you want the IDs to be unique, you could maintain a separate counter (using transactions to modify and read it so you're safe) and create entities with IDs as its value, then delete IDs when the limit is reached.
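A minimal sketch of the rotating-counter idea in plain Python (the limit is arbitrary, and the transactional read-increment-write is left abstract; in App Engine it would run inside a transaction):

```python
LIMIT = 5

def next_slot(counter: int, limit: int = LIMIT) -> int:
    """Rotating counter: returns the entity ID to (over)write next.
    Incrementing and mod-ing keeps IDs within a fixed ring of slots."""
    return (counter + 1) % limit

# Writing 7 entities into a ring of 5 slots; the 6th and 7th writes
# overwrite the entities in slots 1 and 2.
counter, written = 0, []
for _ in range(7):
    counter = next_slot(counter)
    written.append(counter)
print(written)  # [1, 2, 3, 4, 0, 1, 2]
```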
You can create a second entity (let's call it A) that keeps a list of the keys of the entities you want to limit, like this (pseudo-code):
class A:
List<Key> limitedEntities;
When you add a new entity, you add its key to the list in A. If the length of the list exceeds the limit, you take the first element of the list and remove the corresponding entity.
Notice that when you add or delete an entity, you should modify the list of entity A in a transaction. Since these entities belong to different entity groups, you should consider using Cross-Group Transactions.
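The bookkeeping entity A behaves like a fixed-size FIFO of keys; in plain Python the same idea can be sketched with a deque (the strings stand in for Datastore keys, and the actual entity delete is only marked by a comment):

```python
from collections import deque

LIMIT = 3
limited_keys = deque(maxlen=LIMIT)  # plays the role of A.limitedEntities

evicted = None
for key in ["k1", "k2", "k3", "k4"]:
    if len(limited_keys) == LIMIT:
        evicted = limited_keys[0]  # delete the corresponding entity here
    limited_keys.append(key)       # maxlen drops the oldest key for us

print(list(limited_keys))  # ['k2', 'k3', 'k4']
```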
Hope this helps!

Avoid default index but keep explicitly defined index in AppEngine?

I have some properties that are only referenced in queries that require composite indices. AppEngine writes all indexed properties to their own special indexes, which requires 2 extra write operations per property.
Is there any way to specify that a property NOT be indexed in its own index, but still be used for my composite index?
For example, my entity might be a Person with properties name and group. The only query in my code is select * from Person where group = <group> and name > <name>, so the only index I really need is with group ascending and name ascending. But right now AppEngine is also creating an index on name and an index on group, which triples the number of write operations required to write each entity!
I can see from the documentation how to prevent a property from being used for indexing at all, but I want to turn off indexing only for a few indexes (the default ones).
From what I understand currently, you can either disable indexing on a property altogether (which includes composite indexes), or you are stuck with having all indexes (automatic indexes + your composite indexes from index.yaml).
There was some discussion about this in the GAE google group and a feature request to do exactly what you are suggesting, but I wasn't able to find it. Will update the answer when I get home and search more.

Can we have a Model with a lot of properties (say 30) while avoiding the exploding indexes pitfall?

I was thinking that maybe you can have the index.yaml only specify certain indexes (not all the possible ones that GAE automatically does for you).
If that's not a good idea, what is another way of dealing with storing a large number of properties, other than storing the extra properties as a serialized object in a blob property?
The new improved query planner should generate optimized index definitions.
Note that you can set a property as unindexed by using indexed=False in python or Entity.setUnindexedProperty in Java.
A few notes:
Exploding indexes happen when you have multiple properties that contain "multiple values", i.e. an entity with multiple list properties AND those properties are listed together in a composite index. In that case an index entry is created for each combination of the list properties' values; in other words, the number of index entries created equals the product of the list properties' sizes. So a list property with 20 entries and another list property with 30 entries would, when both are listed in index.yaml under one compound index, create 600 index entries.
Exploding indexes do not happen for simple (non-list) properties, or if there is only one list property in the entity.
Exploding indexes also do not happen if your index.yaml file does not define a compound index listing at least two list properties in the same index.
If you have a lot of properties and you do not need to query on them, then you can simply put them in a list, or in two parallel lists (to simulate a map), or serialize them. The simplest is two parallel lists: this is done automatically for you if you use Objectify with embedded classes.
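The two-parallel-lists trick can be sketched in a few lines of plain Python (the property names are invented): two list properties carry the keys and values, and the map is rebuilt in application code.

```python
# Two parallel list properties standing in for a map of extra attributes.
prop_names = ["color", "size", "weight"]
prop_values = ["red", "XL", "2kg"]

# Reconstruct the map in application code after loading the entity.
extras = dict(zip(prop_names, prop_values))
print(extras["size"])  # XL
```

Since neither list is referenced in a composite index, storing them this way does not risk an exploding index.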
