Index and index-entry limits in Google App Engine Datastore

I'm having trouble understanding how indexes work in the GAE Datastore. In particular, the limits related to indexes are really unclear to me.
From what I understand, one can create custom indexes in the datastore-indexes.xml file, and the Datastore additionally generates some automatic indexes to match the user's queries.
A first question: does the "Number of indexes" quota limit defined on the quotas page (https://cloud.google.com/appengine/docs/quotas#Datastore) refer only to the custom indexes defined in datastore-indexes.xml, or does it also apply to automatically generated indexes?
Another concept eluding me is the "index entry for a single query".
Assume I don't have multi-valued properties (i.e. no lists) and I have some entities of kind "KindA". I then define two groups of entity properties:
- Group1: properties with arbitrary names and boolean values
- Group2: properties with arbitrary names and double values
In my world any KindA entity can have at most N properties of Group1 and N properties of Group2. For any property P an index table is created, and each entity that has P set adds a row to the P index table (right?). Thus initially any KindA entity will have one entry for each of its at most 2N properties (so at most 2N index entries per entity in total), right?
If this is correct, then it follows that I can only create an entity with a limited number of properties. This is strange, since I've always read that an entity can have unlimited properties (leaving the size limit aside).
However, let's assume now that my application allows users to query for KindA entities using an arbitrarily long sequence of AND filters on properties of Group1 (the boolean ones). Thus one can query something like:
find all entities in KindA where prop1=true AND prop2=true ... AND propM = true
This is a situation in which query only contains equalities and thus no custom indexes are required (https://cloud.google.com/appengine/docs/python/datastore/indexes#Index_configuration).
But what if I want to order by properties of Group2? In this case I need an index for each different query, right (different in terms of the combination of filtering property names)?
On my development server I tried this without specifying any custom index, and GAE generates them for me (however, any time I restart, the previously generated indexes are removed). In this case, how many index entries does a single KindA entity have in a single query index? I'd say 1, because of what the GAE docs say:
The property may also be included in additional, custom indexes declared in your index configuration file (index.yaml). Provided that an entity has no list properties, it will have at most one entry in each such custom index (for non-ancestor indexes) or one for each of the entity's ancestors (for ancestor indexes)
Thus in theory, if N is limited, I'm safe with respect to the "Maximum number of index entries for an entity" limit (https://cloud.google.com/appengine/docs/java/datastore/#Java_Quotas_and_limits). Is that right?
But what about receiving over 200 different queries? Does that lead GAE to automatically generate over 200 custom indexes (one per distinct query)? If so, do these automatically generated indexes count towards the limit on the number of indexes (which is 200)?
If so, then it follows that I can't let users run these (IMHO very basic) queries. Am I misunderstanding something?

First of all, I had to work to understand your question, which I find difficult to follow.
The 200-index limit counts only the composite indexes that you define for your queries (or that the dev app server defines for you automatically). The indexes created on their own for each of your indexed properties are not counted towards this limit.
You are correct about the 2N automatic indexes, one created for every indexed property.
You can have any number of indexed properties in an entity as long as you stay under the 1MB limit per entity. But this really depends on the content of the stored properties.
For the indexes created for you on your indexed properties, you don't really have a hard limit, just an increasing cost: the writes and storage per entity put increase with each added property.
With automatic indexes you are limited to a single sort order. More sort orders require a composite index (your custom index). So if you are already using an equality filter, you need a custom index anyway.
So yes, in your example the dev app server will create a composite index for each query you execute. However, you can reduce these indexes manually by deleting the ones that aren't needed. The query planner can merge different indexes at query time to find your results, as explained here:
https://cloud.google.com/appengine/articles/indexselection
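To give a rough feel for what merging indexes at query time means, here is a minimal Python sketch that answers an equality-only query by intersecting per-property indexes. The index contents and the name `merge_join` are made up for the example, and the real query planner (a zigzag merge join, per the linked article) is more elaborate than this simple walk over sorted lists.

```python
def merge_join(*key_lists):
    """Intersect several sorted lists of entity keys.

    Each list stands in for the rows of one built-in single-property
    index that match one equality filter (e.g. prop1 = true).
    """
    if not key_lists:
        return []
    result = []
    positions = [0] * len(key_lists)
    while all(pos < len(keys) for pos, keys in zip(positions, key_lists)):
        current = [keys[pos] for pos, keys in zip(positions, key_lists)]
        highest = max(current)
        if all(key == highest for key in current):
            # Every index agrees on this key, so the entity matches all filters.
            result.append(highest)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the scans that are behind the largest candidate key.
            positions = [
                pos + 1 if keys[pos] < highest else pos
                for pos, keys in zip(positions, key_lists)
            ]
    return result


# Hypothetical index rows: sorted keys of KindA entities where the property is true.
prop1_true = ["a1", "a3", "a4", "a7"]
prop2_true = ["a2", "a3", "a7", "a9"]
prop3_true = ["a3", "a5", "a7"]

matches = merge_join(prop1_true, prop2_true, prop3_true)
print(matches)
```

No composite index is involved here: each additional AND filter just adds one more sorted list to the intersection.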
Yes, every index definition in your index.yaml counts towards the 200 limit.
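To see how quickly that limit can be reached, here is a small back-of-the-envelope calculation in Python. It assumes the worst case, where every distinct combination of equality-filtered Group1 properties plus one Group2 sort property gets its own composite index and none are pruned in favour of query-time merging. The function name is hypothetical.

```python
from math import comb


def composite_indexes_needed(n_filter_props, n_sort_props):
    """Count distinct (filter property set, sort property) combinations.

    Assumes every query uses at least one equality filter plus exactly one
    sort order, and that each distinct combination gets its own composite
    index (no indexes are removed and merged at query time).
    """
    filter_sets = sum(comb(n_filter_props, k) for k in range(1, n_filter_props + 1))
    return filter_sets * n_sort_props


for n in (3, 5, 6):
    print(n, composite_indexes_needed(n, n))
```

Under these assumptions N = 5 gives 155 combinations and N = 6 gives 378, which is already past 200.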
I have found that you don't really use composite indexes that much once you know how GAE apps can be programmed. You need to balance what users should and shouldn't be able to do. You also need to balance between doing the work in the query and just querying everything and filtering in code (that really depends on the maximum number of entities you can have in that particular kind).
However, if you're trying to make complex queries available to your users, then maybe the datastore is not the right choice.

Related

Datastore efficiency, low level API

Every Cloud Datastore query computes its results using one or more indexes, which contain entity keys in a sequence specified by the index's properties and, optionally, the entity's ancestors. The indexes are updated incrementally to reflect any changes the application makes to its entities, so that the correct results of all queries are available with no further computation needed.
Generally, I would like to know if
datastore.get(listOfKeys); // listOfKeys is a List<Key>
is faster or slower than a query with the index file prepared (with the same results):
Query q = new Query("Kind").setFilter(someFilter); // the filter is optional
My current problem:
My data consists of Layers and Points. Points belong to only one unique layer and have unique ids within a layer. I could load the points in several ways:
1) Have points with a "layer name" property and query with a filter.
- Here I am not sure whether the datastore would have the results prepared, because the layer name changes dynamically.
2) Use only keys. The layer would have to store point ids.
KeyFactory.createKey("Layer", "layer name");
KeyFactory.createKey("Point", "layer name"+"x"+"point id");
3) Use queries without filters: I don't actually need the general kind "Point" and could be more specific: kind would be ("layer name"+"point id")
- What are the costs of creating more kinds? Could this be the fastest way?
Can you actually find out how the datastore works in detail?
"faster or slower than a query with the index file prepared (with the same results)."
Fundamentally a query and a get by key are not guaranteed to have the same results.
Non-ancestor queries are eventually consistent, while getting data by key is strongly consistent.
Your first challenge, before optimizing for speed, is probably ensuring that you're showing the correct data.
The docs explain eventual vs. strong consistency well. It sounds like you have the option of using an ancestor query, which can be strongly consistent. I would also strongly recommend against using the 'name' (which is dynamic) as the entity name; this will cause you an excessive amount of grief.
Edit:
In the interests of being specifically helpful, one option for a working solution based on your description would be:
- Give a unique id (a uuid probably) to each layer, and store the name as a property
- Include the layer key as the parent key for each point entity
- Use an ancestor query when fetching points for a layer (which is strongly consistent)
An alternative option is to store points as embedded entities and only have one entity for the whole layer - depends on what you're trying to achieve.
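The parent-key layout from the list above can be modelled in a few lines of plain Python. This is only a toy stand-in, not Datastore API code: the `Key` class, the `store` dict and `points_in_layer` are hypothetical names, and the "ancestor query" is just a prefix check on key paths.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    """Toy stand-in for a Datastore key: a path of (kind, id) pairs."""
    path: tuple

    def child(self, kind, ident):
        return Key(self.path + ((kind, ident),))

    def has_ancestor(self, ancestor):
        return self.path[:len(ancestor.path)] == ancestor.path


store = {}  # key -> property dict, standing in for the datastore

# The layer gets a stable generated id; its name is just a property.
layer_key = Key((("Layer", str(uuid.uuid4())),))
store[layer_key] = {"name": "roads"}

# Each point uses the layer key as its parent.
for point_id, coords in [(1, (0.0, 1.0)), (2, (2.5, 3.5))]:
    store[layer_key.child("Point", point_id)] = {"xy": coords}

other_layer = Key((("Layer", str(uuid.uuid4())),))
store[other_layer] = {"name": "rivers"}
store[other_layer.child("Point", 1)] = {"xy": (9.0, 9.0)}


def points_in_layer(layer):
    """Mimic an ancestor query: all Point entities under the given layer key."""
    return {
        key: props for key, props in store.items()
        if key.has_ancestor(layer) and key.path[-1][0] == "Point"
    }


# Renaming the layer touches one property; no point keys change.
store[layer_key]["name"] = "highways"
layer_points = points_in_layer(layer_key)
print(len(layer_points))
```

Because the layer name never appears in any key, renaming a layer does not force you to rewrite its points.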

Query with filter on a Key field and another (string) field very slow on App Engine Datastore

I have an entity Kind A with around 1.3 million entities. It has several fields, two of which I use as filters in a Datastore query - K (which is a Key) and S (which is a string).
I perform the following query, and it takes it almost 8 seconds to complete:
SELECT * FROM A WHERE K=KEY('...') AND S='...' LIMIT 1
(I do this by constructing a query with filters in Java, but it is just as slow in GQL.)
Is this reasonable? Can this be improved in any way?
Replacing the key filter with filters on other fields (or removing it completely) speeds things up. Does it make sense to save the key field as a string, and query on that instead of on the original Key type field?
Can you please send your index.yaml file, or at least paste all the indexes for Kind A? My initial feeling is that you don't have a proper index for K and S, so it's doing a full table scan or using another index to project the results. Check out https://developers.google.com/appengine/docs/python/datastore/indexes for an explanation.
If you're still banging your head, I would suggest turning appstats on and determining the slowness from the visuals on your local dev server (https://developers.google.com/appengine/docs/python/tools/appstats).
Finally, I wanted to point out that single property lookups are fast in App Engine, so either K= or S= lookups would be fast. Together, they require special attention (indexing, memcache, etc.).

Query for greater/less than in combination with other field asks for already existing index

I have an index like this for my entity Word:
changed ▲ + type ▲
I've added one entity to the datastore and can successfully query it in the console like this:
KIND Word
changed is a number greater than 0
However if I add a second field to the filter like this:
KIND Word
changed is a number greater than 0
type is a string that is noun
It will fail with the error "You need an index to execute this query."
It seems to me that the exact index it's asking for is already present. Can anyone shed some light on this?
In order to query on two fields you need a combined index, not an individual index on both of them.
Please review this for more information about Datastore indexes:
https://developers.google.com/appengine/articles/indexselection
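One more thing worth checking is the order of the properties in the index. As far as I know from the Datastore index documentation, a composite index can serve an equality filter combined with an inequality filter only when the equality property comes first. The toy model below (plain Python, made-up Word entities, hypothetical helper names) shows why: the matching rows form one contiguous range only in the (type, changed) ordering.

```python
from bisect import bisect_right

# Made-up Word entities: (key, changed, type).
words = [("w1", 0, "noun"), ("w2", 5, "noun"), ("w3", 7, "verb"), ("w4", 9, "noun")]

# Index sorted by type first, then changed (equality property first).
type_then_changed = sorted((t, c, k) for k, c, t in words)
# Index sorted the way the question describes it: changed first, then type.
changed_then_type = sorted((c, t, k) for k, c, t in words)


def matching_positions(rows, type_at, changed_at):
    """Row positions that satisfy type = 'noun' AND changed > 0."""
    return [i for i, row in enumerate(rows)
            if row[type_at] == "noun" and row[changed_at] > 0]


def is_contiguous(positions):
    return positions == list(range(positions[0], positions[-1] + 1))


def range_scan(word_type, min_changed):
    """Answer the query with a single slice of the (type, changed) index."""
    sort_keys = [(t, c) for t, c, _ in type_then_changed]
    start = bisect_right(sort_keys, (word_type, min_changed))
    end = bisect_right(sort_keys, (word_type, float("inf")))
    return [key for _, _, key in type_then_changed[start:end]]


print(is_contiguous(matching_positions(type_then_changed, 0, 1)))
print(is_contiguous(matching_positions(changed_then_type, 1, 0)))
print(range_scan("noun", 0))
```

With this data the first check is true and the second is false, so only the type-first index lets the query be answered by reading one consecutive run of rows.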
The best way to get ALL the indexes you need is to run the queries on your devserver, making sure you hit EVERY query you can run in production.
While you're in development, the dev server will create all the indexes your application needs in your xml/yaml (Java/Python). If you didn't hit a particular query on your devserver, its index is not in your xml/yaml and you'll need to add it manually.
It's a paradigm shift when you're working with the datastore, but you have to understand that, for scalability reasons, the datastore CANNOT do ANY computation. It only returns scans of your table. When you save one Entity in your table, behind the scenes the datastore actually saves it multiple times, once for each index, already sorted for that index. So even if you have an index for parameter 1 AND a separate index for parameter 2, it will not necessarily create the composite index for parameters 1 & 2. So if you don't create that composite index in your xml/yaml, you "force" the datastore to do a join between your tables, which it unfortunately cannot do.
I've made it a best practice to hit EVERY query my code can run on the dev server and then look at the indexes created. Just be careful not to create tons of indexes, since they count against your quota and can end up costing you (if your app has billing enabled, of course).

Google App Engine - NDB - INDEX Questions

I have the following questions about GAE NDB indexes.
I assume you can specify an index via index.yaml or within the model definition using the property option indexed=True. Am I correct? If yes, is one preferred over the other?
Is there a way to add/drop index during the life cycle of the data objects?
Can I specify an index on a structured property field?
If so, can you please let me know the syntax for this?
Thanks in advance
By default, the properties that can be indexed (i.e. those that aren't variants of Blob) are indexed, which means you can filter or sort by them on their own. Adding single-property indexes to index.yaml would be unusual. Setting indexed=False for a property will mean fewer write-operations when saving entities, but will mean filtering or sorting by the property is no longer possible. I'd suggest reading the documentation on indexes.
If you want to filter or sort (in combination) by more than one property, then you need to include them in index.yaml. However, as you run code in the development server, if it requires an index that hasn't yet been specified, then index.yaml will be modified to contain a suitable index for the query being run. Adding indexes manually isn't necessarily something you'll ever have to do.
You can't index an entire StructuredProperty. The properties of a StructuredProperty are individually indexed, and you don't need to think about them any differently than regular properties. If you want to manually specify a multi-property index that includes a sub-property, then you should be able to do so by using 'property.subproperty' (e.g. 'address.city').
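To illustrate the 'property.subproperty' naming, here is a plain-Python model (not NDB code) that flattens a nested record into dotted property names. The `contact` dict and the `flatten` helper are assumptions made for the example; the point is only that each sub-property ends up as its own individually addressable name.

```python
def flatten(entity, prefix=""):
    """Flatten nested dicts into dotted property names, e.g. 'address.city'."""
    flat = {}
    for name, value in entity.items():
        full_name = prefix + name
        if isinstance(value, dict):
            # Recurse into the structured value, extending the dotted prefix.
            flat.update(flatten(value, full_name + "."))
        else:
            flat[full_name] = value
    return flat


# Hypothetical entity with one structured 'address' property.
contact = {"name": "Ann", "address": {"city": "Oslo", "zip": "0150"}}
flat_contact = flatten(contact)
print(sorted(flat_contact))
```

A multi-property index entry would then refer to a name such as 'address.city' just like any top-level property name.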
1) Yes, you can set certain properties as being indexed. Some property types do not allow indexing at all. It's preferable to set the indexes programmatically within each model definition.
2) Although you can drop the index programmatically (i.e. remove indexed=True), I would not recommend it. It will leave your datastore in an inconsistent state.
3) It's not possible to set an index on a structured property. However, you can set a Key relationship between your model and the models in the structured property.
See:
https://developers.google.com/appengine/docs/python/ndb/entities
https://developers.google.com/appengine/docs/python/ndb/properties
"You can specify the usual property options for structured properties (except indexed)."
As I discovered the hard way (see GAE python NDB projection query working in development but not in production), there's a big difference between having an index (and therefore needing an entry in index.yaml) and marking properties as indexed or not indexed. These things are there for different purposes:
Having an index allows you to search, sort, or do projection queries.
Marking properties as indexed or not tells an index which entities to include in the index and which to ignore.
Yes, absolutely, you can add or drop an index at any time:
Update your index.yaml
Then run one of the gcloud commands such as '$ gcloud datastore indexes create index.yaml' or '$ gcloud datastore indexes cleanup index.yaml'
No, you can't create an index on a structured property. See more info here https://cloud.google.com/appengine/docs/standard/python/ndb/entity-property-reference#structured

Should I denormalize properties to reduce the number of indexes required by App Engine?

One of my queries can take a lot of different filters and sort orders depending on user input. This generates a huge index.yaml file of 50+ indexes.
I'm thinking of denormalizing many of my boolean and multi-choice (string) properties into a single string list property. This way, I will reduce the number of query combinations because most queries will simply add a filter to the string list property, and my index count should decrease dramatically.
It will surely increase my storage size, but this isn't really an issue as I won't have that much data.
Does this sound like a good idea or are there any other drawbacks with this approach?
As always, this depends on how you want to query your entities. For most of the sorts of queries you could execute against a list of properties like this, App Engine will already include an automatically built index, which you don't have to specify in index.yaml. Likewise, most queries you'd want to execute that require a composite index either couldn't be done with a list property, or would require an 'exploding' index on that list property.
If you tell us more about the sort of queries you typically run on this object, we can give you more specific advice.
Denormalizing your data to cut back on the number of indices sounds like it could be a good tradeoff. With fewer indices, there are fewer indices to update on each write (though your one index will see more updates); it is unclear how this will affect performance on GAE. Size will of course be larger if you leave the original fields in place (since you're copying data into the string list property), but this might not be too significant unless your entity was quite large already.
This is complicated a little by the fact that the index on the list will contain one entry for each element in the list on each entity (rather than just one entry per entity). This will certainly impact space and query performance. Also, be wary of creating an index which contains multiple list properties, or you could run into a problem with exploding indices (multiple list properties => one index entry for each combination of values from each list).
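The combination rule is easy to check with a small Python sketch. The entity, its property names and the helper `composite_index_entries` are invented for the example; it simply enumerates one row per combination of values, treating a single value as a one-element list.

```python
from itertools import product


def composite_index_entries(entity, index_props):
    """Enumerate the rows one entity contributes to one composite index.

    Single values count as a one-element list; list properties contribute
    one row per combination of their values.
    """
    value_lists = [
        value if isinstance(value, list) else [value]
        for value in (entity[name] for name in index_props)
    ]
    return list(product(*value_lists))


# Hypothetical entity with two list properties and one single-valued property.
entity = {"tags": ["red", "sale", "new"], "sizes": ["s", "m"], "price": 10}
rows_one = composite_index_entries(entity, ["tags", "price"])
rows_two = composite_index_entries(entity, ["tags", "sizes", "price"])
print(len(rows_one), len(rows_two))
```

Here the index with one list property gets 3 rows for this entity, while the index with two list properties gets 3 x 2 = 6, and the count keeps multiplying as lists grow.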
Try experimenting and see how it works in practice for you (use AppStats!).
"It will surely increase my storage size, but this isn't really an issue as I won't have that much data."
If this is true then you have no reason to denormalize.
